wlandau closed this pull request 4 weeks ago.
All modified and coverable lines are covered by tests :white_check_mark:
Please upload report for BASE (main@f0d6087). Learn more about missing BASE report. Report is 10 commits behind head on main.
My comment in the code above was not entirely correct. When a worker succeeds (completes all its tasks) and exits normally, the next launch will reset to 16 GB of memory. @stemangiola, do you think this is reasonable? An alternative is to stay at the last memory level that worked, but that may reserve more resources than necessary, and ultimately I think users could just request more memory from the start if there are excessive retries.
Resetting is reasonable. In my case, I have many small inputs and some big ones, and it is hard to partition them properly.
So I would hope that if inputBIG goes to worker A and it fails (I guess worker A stays alive, or does it die if one task fails?), it gets launched to worker B immediately (with priority) with increased resources; when it succeeds, everything can reset, expecting small inputs. If worker B lives a bit longer and we have some inefficiency, I think that is reasonable. I usually set tasks_max = 5.
Even better would be if I could set tasks_max as a vector, e.g. c(100, 1, 1, 1), paired with memory_Gb = c(5, 20, 40, 200).
Thanks, that helps. And yes, stopped workers with unresolved tasks get launched immediately on the next call to launcher$scale(). In the code, they are called "backlogged workers": https://github.com/wlandau/crew/blob/da957aea7a58f147b9c9d424148b166a830e2d5e/R/crew_launcher.R#L874-L875
Thinking about it, from a user's perspective, it would be nice to have the retry visible to the user. For example, something like:
"Submitted batch job 19176672 - retried with these resources ..."
The latest few commits add a brief retry message (without specific resources, because those are case-by-case and potentially long). It is enabled when verbose = TRUE in the cluster options.
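A minimal sketch of enabling that message, assuming the SLURM plugin with the options_cluster argument of the controller and the verbose flag in crew_options_slurm() described above (the worker count is illustrative):

```r
library(crew.cluster)

# verbose = TRUE prints launcher messages, including the brief retry notice
# mentioned above when a crashed worker is relaunched.
controller <- crew_controller_slurm(
  workers = 1L,
  options_cluster = crew_options_slurm(verbose = TRUE)
)
```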
Summary
This PR implements retryable options for all crew.cluster plugins. Certain arguments of crew_options_slurm() etc. can now be vectors, where each successive element applies to the next worker retry, and the last element is used if the number of retries exceeds the length of the vector. You can control the number of allowable retries with the crashes_error argument of the controller. Memory, CPU, GPU, wall time, and the SLURM partition are all retryable now.

Example with SGE and the sge_memory_gigabytes_required field:
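The sketch below illustrates that kind of configuration (it is not the exact code from the PR description). It assumes crew_controller_sge() and crew_options_sge() from crew.cluster, and the worker count and memory values are illustrative:

```r
library(crew.cluster)

# The first launch of a worker requests 2 GB. Each retry of a crashed worker
# advances to the next element of the vector, and the last element (64 GB) is
# reused if the number of retries exceeds the length of the vector.
controller <- crew_controller_sge(
  workers = 2L,
  crashes_error = 4L, # number of allowable retries, per the summary above
  options_cluster = crew_options_sge(
    sge_memory_gigabytes_required = c(2, 16, 64)
  )
)
```

Per the discussion above, a worker that completes all its tasks and exits normally starts back at the first element on its next launch.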