Controller hangs perpetually if a Slurm worker is killed via OOM

multimeric commented 1 year ago

Prework

[x] Read and agree to the Contributor Code of Conduct and contributing guidelines.
[x] Confirm that your issue is a genuine bug in the crew.cluster package itself and not a user error, known limitation, or issue from another package that crew.cluster depends on.
[x] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
[x] Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- [x] Runnable: post enough R code and data so any onlooker can create the error on their own computer.
- [x] Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
- [x] Readable: format your code according to the tidyverse style guide.

Description

When you submit a task to the Slurm controller which uses greater memory than allocated by Slurm, the Slurm job itself will fail with OOM, but the controller just hangs forever instead of reporting this fatal error.

Reproducible example

The following will run forever:

library(crew.cluster)
controller = crew_controller_slurm(
    workers = 3,
    script_lines = "module load R",
    slurm_memory_gigabytes_per_cpu = 1,
    slurm_cpus_per_task = 1
  )
controller$start()
controller$push(name = "foo", command = rnorm(10E9))
controller$wait()

Expected result

In the above case, I expect the worker to be submitted, and the command to be attempted, as it currently does. However at the point of the worker being killed, I would expect crew (mirai?) to detect this, and report it as a pipeline failure.

Diagnostic information

Session Info:

``` R version 4.3.0 (2023-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS: /stornext/System/data/apps/R/R-4.3.0/lib64/R/lib/libRblas.so LAPACK: /stornext/System/data/apps/R/R-4.3.0/lib64/R/lib/libRlapack.so; LAPACK version 3.11.0 locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C time zone: Australia/Melbourne tzcode source: system (glibc) attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] crew.cluster_0.1.1 loaded via a namespace (and not attached): [1] mirai_0.9.0 utf8_1.2.3 R6_2.5.1 tidyselect_1.2.0 [5] magrittr_2.0.3 crew_0.4.0 nanonext_0.9.1 glue_1.6.2 [9] tibble_3.2.1 pkgconfig_2.0.3 lifecycle_1.0.3 ps_1.7.5 [13] cli_3.6.1 fansi_1.0.4 processx_3.8.2 vctrs_0.6.2 [17] getip_0.1-3 data.table_1.14.8 compiler_4.3.0 pillar_1.9.0 [21] rlang_1.1.1 ```

wlandau commented 1 year ago

In your situation, I think the code appears to hang because crew re-launches the worker and the task when the worker crashes from an OOM issue. To see this, you could run squeue and see the new worker launch. To detect OOM and other issues, I suggest logging with slurm_log_output and slurm_log_error.

In the general case: to monitor workers, crew relies on the network connections created by NNG/nanonext/mirai. In mirai, @shikokuchuo decided to handle this problem by resending the task, which will then make crew re-launch the worker: https://github.com/shikokuchuo/mirai/issues/20. Back then, I thought I would be able to realistically make this fault tolerance optional, but I no longer think this is feasible based on how auto-scaling works and how it interacts with mirai (c.f. https://github.com/wlandau/crew/issues/79). Not unless I refactor crew to bypass mirai and call nanonext directly.

shikokuchuo commented 1 year ago

We can do a bit better in this case. mirai (current dev version) will now give you the option to terminate such tasks by using mirai::saisei(i, force = TRUE) where i is the index number of the dispatcher URL the task is at. The problematic task will be returned as an 'errorValue' 7 - object closed.

Without the change - the task is perpetually in the state of non-completion. So when it crashes, crew launches another worker instance to fulfill the task etc. There is no active decision to resend the task - the protocols used ensure that this is the case. We instead have to tear down the socket and task with it, which is what saisei(force = TRUE) now does.

With mirai::status() you may be able to see which is the problematic dispatcher URL as the 'assigned' column will be greater than the 'completed' column (assuming all your other tasks have completed).

multimeric commented 1 year ago

In your situation, I think the code appears to hang because crew re-launches the worker and the task when the worker crashes from an OOM issue. To see this, you could run squeue and see the new worker launch. To detect OOM and other issues, I suggest logging with slurm_log_output and slurm_log_error.

Yep, this is definitely what is happening, and I can see a new job submitted every 20 seconds. If I set slurm_log_error and tail -f the file, I get slurmstepd: error: Detected 1 oom_kill event in StepId=12949601.batch. Some of the step tasks have been OOM Killed.. So this is a nice start.

However, the error file gets truncated whenever a new worker is submitted, so a user has to actively watch the file as I have done because cating it won't necessarily show anything useful. Can we avoid truncating the error log at all to ensure this error is easy to catch? I guess this is happening each time the file is re-opened for writing.

Also, It would be really helpful if there was a user friendly way to watch this log file from the R terminal, or perhaps have a general controller log mechanism that the Slurm controller implements by reading the Slurm error log. As it stands, it's a multi-step process that requires a bit of Linux know-how to pull off, whereas I'm hoping for something that an R-only user can handle.

multimeric commented 1 year ago

Thanks @shikokuchuo. Will Targets benefit from this change automatically, or will it have to first check for this type of error first and then explicitly terminate the worker?

In terms of user friendliness, it would be nice to have an option on the crew end that's something like "if the worker dies N times, stop retrying it", and the user can customize N. If we could detect non-retryable errors like OOM and automatically stop retrying that would be even better, but I suspect that would be much harder to do.

wlandau commented 1 year ago

We can do a bit better in this case. mirai (current dev version) will now give you the option to terminate such tasks by using mirai::saisei(i, force = TRUE) where i is the index number of the dispatcher URL the task is at. The problematic task will be returned as an 'errorValue' 7 - object closed.

Without the change - the task is perpetually in the state of non-completion. So when it crashes, crew launches another worker instance to fulfill the task etc. There is no active decision to resend the task - the protocols used ensure that this is the case. We instead have to tear down the socket and task with it, which is what saisei(force = TRUE) now does.

Very encouraging, thanks for implementing that.

With mirai::status() you may be able to see which is the problematic dispatcher URL as the 'assigned' column will be greater than the 'completed' column (assuming all your other tasks have completed).

Unfortunately this does not seem feasible, since the assigned > complete backlog happens a lot even in normal use cases like https://github.com/wlandau/crew/blob/main/tests/throughput/test-backlog-tasks_max.R where launching another worker instance is the correct decision. That is why issues https://github.com/wlandau/crew/issues/75 and https://github.com/wlandau/crew/issues/79 originally came up.

However, the error file gets truncated whenever a new worker is submitted, so a user has to actively watch the file as I have done because cating it won't necessarily show anything useful. Can we avoid truncating the error log at all to ensure this error is easy to catch? I guess this is happening each time the file is re-opened for writing.

I recommend creating a new log file for each new worker. In SGE, if you set the log to be a directory (with a trailing slash) then different workers have different files. In SLURM, I believe there is a way to do this with text patterns.

Also, It would be really helpful if there was a user friendly way to watch this log file from the R terminal, or perhaps have a general controller log mechanism that the Slurm controller implements by reading the Slurm error log. As it stands, it's a multi-step process that requires a bit of Linux know-how to pull off, whereas I'm hoping for something that an R-only user can handle.

This is very dependent on the platform and gets outside of the things that crew can reasonably assume about the way each scheduler works internally. Best case scenario is that I implement something I think exactly replicates the log file path policy of SLURM and then hope it works (but will probably break). So I think it best to leave it out of scope.

Will Targets benefit from this change automatically, or will it have to first check for this type of error first and then explicitly terminate the worker?

If there is a path forward, it should automatically carry forward to any targets pipeline that uses it.

In terms of user friendliness, it would be nice to have an option on the crew end that's something like "if the worker dies N times, stop retrying it", and the user can customize N. If we could detect non-retryable errors like OOM and automatically stop retrying that would be even better, but I suspect that would be much harder to do.

It might be possible to do this at the level of a controller as a whole, if there is a path forward.

multimeric commented 1 year ago

This is very dependent on the platform and gets outside of the things that crew can reasonably assume about the way each scheduler works internally. Best case scenario is that I implement something I think exactly replicates the log file path policy of SLURM and then hope it works (but will probably break). So I think it best to leave it out of scope.

Hmm. This would be of reasonable value to my workplace (where we hope to use Targets + Slurm a decent amount). If I could find time to help, would you be interested in help with such a feature? I think ideally it would be a general logs() R6 method on the controller class, but if that's too large a change, I'm happy to just write a slurm_logs() method just for the Slurm controller. It's possible we can use sattach as a more robust alternative to reading the log file, but I'd have to look into it.

wlandau commented 1 year ago

I have mixed feelings about a built-in log reader, but we can talk more if you find a robust way to read them no matter what the user sets for the log path arguments (or if necessary, a robust way to check if logs can be read). The method would fit best in the SLURM launcher, which is what varies from plug-in to plug-in. The controller that contains it is generic.

wlandau / crew.cluster