wlandau / crew.cluster

crew launcher plugins for traditional high-performance computing clusters
https://wlandau.github.io/crew.cluster

Retryable options #49

Closed wlandau closed 4 weeks ago

wlandau commented 4 weeks ago


Summary

This PR implements retryable options for all crew.cluster plugins. Certain arguments of crew_options_slurm() and the other options functions can now be vectors: each successive element applies to the next retry of a crashed worker, and the last element applies to all retries beyond the length of the vector. You can control the number of allowable retries with the crashes_error argument of the controller.

Example with SGE and the sge_memory_gigabytes_required field:

```r
controller <- crew_controller_sge(
  workers = 16,
  # Try 16 GB of memory first, then retry with 32 GB if the worker crashes,
  # then 64 GB for all subsequent retries if the previous one failed.
  # The next time the worker exits normally after completing all its tasks,
  # memory resets to 16 GB.
  sge_memory_gigabytes_required = c(16, 32, 64)
)
```

Memory, CPU, GPU, wall time, and SLURM partition are all retryable now.
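
For SLURM, a sketch along the same lines might look like the following. The vector-valued behavior and the crashes_error argument come from this PR, but the specific slurm_* argument names below are assumptions based on the usual crew.cluster naming and may differ from the actual signatures of crew_options_slurm() / crew_controller_slurm().

```r
# Hedged sketch, not a verbatim example from this PR.
controller <- crew_controller_slurm(
  workers = 16,
  # Allow up to 5 consecutive crashes per worker before throwing an error.
  crashes_error = 5,
  # Escalate memory on each retry after a crash (assumed argument name).
  slurm_memory_gigabytes_required = c(16, 32, 64),
  # Move retries to a higher-memory partition after the first crash
  # ("short" and "highmem" are hypothetical partition names).
  slurm_partition = c("short", "highmem")
)
```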

codecov[bot] commented 4 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Please upload report for BASE (main@f0d6087). Learn more about missing BASE report. Report is 10 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main      #49   +/-   ##
========================================
  Coverage        ?   100.00%
========================================
  Files           ?        21
  Lines           ?      1243
  Branches        ?         0
========================================
  Hits            ?      1243
  Misses          ?         0
  Partials        ?         0
```

:umbrella: View full report in Codecov by Sentry.

wlandau commented 4 weeks ago

My comment in the code above was not entirely correct. When a worker succeeds (completes all its tasks) and exits normally, the next launch will reset to 16 GB of memory. @stemangiola, do you think this is reasonable? An alternative is to stay at the last memory level that worked, but that may request more resources than necessary, and ultimately I think users could ask for more memory from the start if retries become excessive.
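
To make the behavior concrete, here is a minimal standalone sketch of the selection rule described in the PR summary; it is not crew.cluster code, just an illustration. The value used for a launch depends on the number of consecutive crashes so far, the last element applies once the crash count runs past the vector, and a normal exit resets the count so the next launch starts at the first element again.

```r
# Minimal illustration, not crew.cluster internals.
# `values` is the user-supplied vector, e.g. c(16, 32, 64), and `crashes` is
# the number of consecutive crashes of the worker so far. A normal exit resets
# `crashes` to 0, so the next launch uses values[1] again.
resolve_retryable <- function(values, crashes) {
  values[min(crashes + 1L, length(values))]
}
resolve_retryable(c(16, 32, 64), crashes = 0L) # first launch: 16
resolve_retryable(c(16, 32, 64), crashes = 1L) # after one crash: 32
resolve_retryable(c(16, 32, 64), crashes = 5L) # beyond the vector: 64
```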

stemangiola commented 4 weeks ago

[@wlandau EDITED] Resetting is reasonable. In my case, I have many small inputs and some big ones, and it is hard to partition them properly.

So I would hope that if inputBIG goes to worker A and the task fails (I guess worker A stays alive, or does it die when a task fails?), the task gets launched on worker B immediately (with priority) with increased resources, and when it succeeds, everything resets in anticipation of small inputs. If worker B lives a bit longer and we have some inefficiency, I think that is reasonable. I usually set tasks_max = 5.

Even better if I could set tasks_max as a vector, e.g. tasks_max = c(100, 1, 1, 1), paired with memory_Gb = c(5, 20, 40, 200).

wlandau commented 4 weeks ago

Thanks, that helps. And yes, stopped workers with unresolved tasks get launched again immediately on the next call to launcher$scale(). In the code, they are called "backlogged" workers: https://github.com/wlandau/crew/blob/da957aea7a58f147b9c9d424148b166a830e2d5e/R/crew_launcher.R#L874-L875

stemangiola commented 4 weeks ago

Thinking about it, from a user perspective it would be nice to make the retry visible to the user. For example, something like:

"Submitted batch job 19176672 - retried with these resources ..."

wlandau commented 3 weeks ago

The latest few commits add a brief retry message (without the specific resources, because those are case-by-case and potentially long). It is enabled when verbose = TRUE in the cluster options.
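
A hedged sketch of what enabling the message could look like; only verbose = TRUE in the cluster options comes from the comment above, and wiring the options into the controller through options_cluster is an assumption about the current crew.cluster interface.

```r
# Hedged sketch: verbose cluster options so the retry notice gets printed.
# Only `verbose = TRUE` comes from the comment above; `options_cluster` and
# the other argument names are assumptions about the crew.cluster interface.
controller <- crew_controller_sge(
  workers = 16,
  options_cluster = crew_options_sge(
    verbose = TRUE, # print scheduler submission output, including the retry message
    sge_memory_gigabytes_required = c(16, 32, 64)
  )
)
```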