Closed: koefoeden closed this issue 2 months ago.
Exceeding the current resources available on the cluster
What kinds of resources exactly? Job quota, requested memory, something else?
I wonder if this is related to #1329.
If you are exceeding available resources, that sounds like a problem at the platform level, not the level of crew, crew.cluster, or targets.
I don't have a cluster that I can max out, but I did try to reproduce it locally, and the pipeline runs fine.
library(targets)
library(crew)

# _targets.R: a group of two local controllers; "a" is the default via
# tar_option_set(), and target b is routed to "b" through tar_resources_crew().
targets::tar_option_set(
  storage = "worker",
  retrieval = "worker",
  deployment = "worker",
  controller = crew_controller_group(
    crew_controller_local(workers = 20, name = "a"),
    crew_controller_local(workers = 20, name = "b")
  ),
  resources = targets::tar_resources(
    crew = targets::tar_resources_crew(controller = "a")
  )
)

list(
  tar_target(index, seq_len(100)),
  tar_target(a, {message("a"); Sys.sleep(5)}, pattern = map(index)),
  tar_target(
    b,
    {message("b"); a},
    pattern = map(a),
    resources = tar_resources(crew = tar_resources_crew(controller = "b"))
  )
)
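For completeness, a quick usage sketch, assuming the script above is saved as _targets.R in the working directory:

# Run the pipeline, then check per-target progress.
targets::tar_make()
targets::tar_progress()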
I don't have a cluster that I can max out, but I did try to reproduce it locally, and the pipeline runs fine.
Yes, it seems that it is a SLURM-specific issue, since I was also not able to reproduce it using the local controller.
What kinds of resources exactly? Job quota, requested memory, something else?
My initial reprex was with requested memory. I just tried with CPU cores, and the result is the same.
I wonder if this is related to #1329.
I can't seem to find this issue - can you provide me a link?
If you are exceeding available resources, that sounds like a problem at the platform level, not the level of crew, crew.cluster, or targets.
Wouldn't this be an issue of how crew.cluster handles the finite resources on a SLURM cluster? That is, it should be able to cope with tasks/targets sometimes being queued until more resources are available, still finish the tasks it has already started, and eventually allow the remaining tasks to start and finish.
Please see the updated, much simpler reprex. I removed the branching points, Sys.sleep calls, and target-to-target dependencies, and the issue still remains. I also deleted some incorrect comments and added clarifications.
My initial reprex was with requested memory.
I have a new package called autometric to prospectively log resource usage like memory. It is integrated with development crew: https://wlandau.github.io/crew/articles/logging.html
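For the curious, here is a minimal sketch of standalone autometric logging, separate from the crew integration described in the vignette above. The argument names are assumptions based on the package documentation, so treat this as illustrative and check the docs before relying on it:

library(autometric)
# Start background logging that appends resource metrics for this R
# process to a file, roughly once per second.
log_start(path = "usage.txt", seconds = 1)
Sys.sleep(5)                       # stand-in for real work
log_stop()                         # stop the logging thread
metrics <- log_read("usage.txt")   # read the collected metrics
head(metrics)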
Wouldn't this be an issue of how crew.cluster handles the finite resources on a SLURM cluster?
No. crew requests resources based on how many tasks need to be done and how many workers you allow with the workers argument. It cannot find out what a system is capable of in the general case because there are too many different systems to track. Even if it could, it is poor practice to max out the resources on a system, whether on a shared cluster or on a local machine that also needs to perform interactive tasks. Users are responsible for setting a sensible value for workers.
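Concretely, that means sizing workers (and each worker's resource request) so the controller never asks SLURM for more than the cluster can reasonably grant. A hypothetical sketch with crew_controller_slurm(); only name, workers, and seconds_idle are shown, and the right numbers depend entirely on your site:

library(crew.cluster)
# Cap simultaneous SLURM jobs at 5 and let idle workers exit so their
# allocations return to the shared pool.
controller_a <- crew_controller_slurm(
  name = "a",
  workers = 5,
  seconds_idle = 30
)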
Please see the updated, much simpler reprex.
This does help me understand what is happening. b_ctrl cannot launch a worker because a_ctrl is already running one, and a_ctrl does not relinquish its worker because it uses the defaults seconds_idle = Inf and tasks_max = Inf. Either seconds_idle = 10 or tasks_max = 1 should resolve the deadlock.
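Applied to the local reprex above, that fix would look something like the sketch below; the same arguments exist for the SLURM controller in crew.cluster.

library(crew)
# With seconds_idle, a worker shuts down after 10 idle seconds, freeing
# resources so the other controller can launch. tasks_max = 1 (one task
# per worker, then exit) would also break the deadlock.
controller <- crew_controller_group(
  crew_controller_local(workers = 20, name = "a", seconds_idle = 10),
  crew_controller_local(workers = 20, name = "b", seconds_idle = 10)
)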
I have a new package called autometric to prospectively log resource usage like memory. It is integrated with development crew: https://wlandau.github.io/crew/articles/logging.html
Cool! I'll make sure to check it out!
No. crew requests resources based on how many tasks need to be done and how many workers you allow with the workers argument. It cannot find out what a system is capable of in the general case because there are too many different systems to track. Even if it could, it is poor practice to max out the resources on a system, whether on a shared cluster or on a local machine that also needs to perform interactive tasks. Users are responsible for setting a sensible value for workers.
I agree that it is poor practice to max out the cluster. However, it is not unrealistic for this to happen once in a while, since it is a shared cluster that others might occasionally max out. But this is also a moot point if it can be fixed by setting the seconds_idle parameter - I'll report back!
Can confirm that seconds_idle = 10 fixed this particular issue - thanks!
Prework
Confirm that the bug is in the crew.cluster package itself and not a user error, known limitation, or issue from another package that crew.cluster depends on.
Description
Hi! I'm using the SLURM scheduler and have issues with the pipeline hanging at certain targets. At first, I thought it was associated with the new error = "trim" option in targets, but that does not seem to be necessary for the bug to occur; I think it only makes it more frequent and easier to trigger.
I have experimented a bit with the requirements necessary for the bug to occur, and it seems that the following very simple scenario provokes it: dispatching targets on different controllers whose combined requests exceed the total resources (CPU cores or RAM) on the cluster or allowed node. This happens even if the second target depends on the first and therefore should run after the initial target.
Reproducible example
Below, I have set up a reprex that exhausts the available resources given the --nodelist requirement.
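The exact reprex is not reproduced here, but the setup was roughly of the following shape: two SLURM controllers pinned to the same node, each requesting enough memory that both cannot run at once. This is only an illustrative sketch; the controller names, memory figures, and #SBATCH lines are placeholders, and the available crew_controller_slurm() arguments vary between crew.cluster versions.

library(targets)
library(crew.cluster)

targets::tar_option_set(
  controller = crew_controller_group(
    # Both controllers target the same node, so their combined memory
    # requests exceed what that node can provide at one time.
    crew_controller_slurm(
      name = "a_ctrl",
      workers = 1,
      script_lines = c("#SBATCH --nodelist=node01", "#SBATCH --mem=200G")
    ),
    crew_controller_slurm(
      name = "b_ctrl",
      workers = 1,
      script_lines = c("#SBATCH --nodelist=node01", "#SBATCH --mem=200G")
    )
  ),
  resources = targets::tar_resources(
    crew = targets::tar_resources_crew(controller = "a_ctrl")
  )
)

list(
  tar_target(x, 1),
  tar_target(
    y,
    x,
    resources = tar_resources(crew = tar_resources_crew(controller = "b_ctrl"))
  )
)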
Expected result
The pipeline should be able to finish the jobs that it starts, one at a time within the resource limits - eventually finishing the whole pipeline.
Diagnostic information
Output from the pipeline:
Output from squeue: this shows how the initial worker that fits within the resource requirements is created and launches a SLURM job, but is never appropriately shut down (even though it actually finishes its task). Shutting it down would free up resources and allow the other worker to launch and do its job.
Session info:
SHA-1 hash: f22fd617eb76191a1dff643ac7cd9c65b92f33d5
Essentially, it seems that there is some lack of communication between the controllers and the SLURM system about how to handle cases of transiently limited resources.