wlandau / crew.cluster

crew launcher plugins for traditional high-performance computing clusters
https://wlandau.github.io/crew.cluster

Slurm-pipeline occasionally hangs (likely when not enough resources are available) #45

Closed koefoeden closed 2 weeks ago

koefoeden commented 2 weeks ago

Prework

Description

Hi! I'm using the Slurm scheduler and have issues with the pipeline hanging at certain targets. At first I thought it was associated with the new error = "trim" option in targets, but that does not seem to be necessary for the bug to occur; it only makes the bug more frequent and easier to trigger.

I have experimented a bit with the conditions necessary for the bug to occur, and the following very simple scenario provokes it: dispatching targets on different controllers whose combined requests exceed the total resources (CPU cores or RAM) of the cluster/allowed node. This happens even if the second target depends on the first and therefore should run after the initial target.

Reproducible example

Below, I have set up a reprex that exhausts the available resources given the --nodelist requirement.

library(targets)
a_ctrl <- crew.cluster::crew_controller_slurm(
  name = "a",
  workers = 1,
  slurm_memory_gigabytes_required = 1,
  slurm_cpus_per_task = 100, # 128 cores on the allowed node
  script_lines = "#SBATCH --nodelist=esrumcmpn10fl" # artificially limit to this single empty node so colleagues are not blocked
)

b_ctrl <- crew.cluster::crew_controller_slurm(
  name = "b",
  workers = 1,
  slurm_memory_gigabytes_required = 1,
  slurm_cpus_per_task = 30, # 128 cores on the allowed node
  script_lines = "#SBATCH --nodelist=esrumcmpn10fl"
)

tar_option_set(controller = crew::crew_controller_group(a_ctrl, b_ctrl))

list(
    tar_target(name = a, 
               command = sessionInfo(), 
               resources = tar_resources(crew = tar_resources_crew(controller = "a"))),
    tar_target(name = b, 
               command = sessionInfo(), 
               resources = tar_resources(crew = tar_resources_crew(controller = "b")))
)

Expected result

The pipeline should be able to finish the jobs it starts, one at a time within the resource limits, and eventually complete the whole pipeline.

Diagnostic information

Output from the pipeline:

▶ dispatched target a
▶ dispatched target b
● completed target a [0.05 seconds, 38.972 kilobytes]
... hangs indefinitely

Output from squeue: this output shows that the initial worker, whose request fit within the resource limits, was created and launched a Slurm job, but was never shut down (even though it actually finished its task). Shutting it down would free up resources and allow the other worker to launch and do its job.

[screenshot of squeue output]

Session info:

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.10 (Ootpa)

Matrix products: default
BLAS/LAPACK: /maps/direct/software/openblas/0.3.24/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Copenhagen
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
 [1] lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4
 [5] purrr_1.0.2        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1
 [9] ggplot2_3.5.1      tidyverse_2.0.0    targets_1.8.0.9002

loaded via a namespace (and not attached):
 [1] crew.cluster_0.3.2.9005 utf8_1.2.4              generics_0.1.3
 [4] renv_1.0.10             xml2_1.3.6              stringi_1.8.4
 [7] hms_1.1.3               magrittr_2.0.3          grid_4.3.3
[10] timechange_0.3.0        autometric_0.0.5.9000   processx_3.8.4
[13] backports_1.5.0         secretbase_1.0.3        promises_1.3.0
[16] ps_1.8.0                fansi_1.0.6             scales_1.3.0
[19] crew_0.9.5.9012         codetools_0.2-19        cli_3.6.3
[22] rlang_1.1.4             munsell_0.5.1           withr_3.0.1
[25] yaml_2.3.10             tools_4.3.3             tzdb_0.4.0
[28] getip_0.1-4             nanonext_1.3.0          colorspace_2.1-1
[31] base64url_1.4           vctrs_0.6.5             R6_2.5.1
[34] lifecycle_1.0.4         pkgconfig_2.0.3         callr_3.7.6
[37] later_1.3.2             pillar_1.9.0            gtable_0.3.5
[40] Rcpp_1.0.13             data.table_1.16.0       glue_1.8.0
[43] xfun_0.48               tidyselect_1.2.1        knitr_1.48
[46] mirai_1.2.0             igraph_2.0.3            compiler_4.3.3

SHA-1 hash: f22fd617eb76191a1dff643ac7cd9c65b92f33d5

Essentially, it seems there is some lack of communication between the controllers and the Slurm system about how to handle cases of transiently limited resources.

wlandau commented 2 weeks ago

Exceeding the current resources available on the cluster

What kinds of resources exactly? Job quota, requested memory, something else?

I wonder if this is related to #1329.

If you are exceeding available resources, that sounds like a problem at the platform level, not the level of crew, crew.cluster, or targets.

wlandau commented 2 weeks ago

I don't have a cluster that I can max out, but I did try to reproduce this locally, and the pipeline runs fine.

library(targets)
library(crew)

targets::tar_option_set(
  storage = "worker",
  retrieval = "worker",
  deployment = "worker",
  controller = crew_controller_group(
    crew_controller_local(workers = 20, name = "a"),
    crew_controller_local(workers = 20, name = "b")
  ),
  resources = targets::tar_resources(
    crew = targets::tar_resources_crew(controller = "a")
  )
)

list(
  tar_target(index, seq_len(100)),
  tar_target(a, {message("a"); Sys.sleep(5)}, pattern = map(index)),
  tar_target(
    b,
    {message("b"); a},
    pattern = map(a), 
    resources = tar_resources(crew = tar_resources_crew(controller = "b"))
  )
)
koefoeden commented 2 weeks ago

I don't have a cluster that I can max out, but I did try to reproduce it locally, but the pipeline runs fine.

Yes, it seems to be a Slurm-specific issue, since I was also unable to reproduce it using the local controller.

What kinds of resources exactly? Job quota, requested memory, something else?

My initial reprex was with requested memory. Just tried with CPU-cores, and the result is the same.

I wonder if this is related to #1329.

I can't seem to find this issue - can you provide me a link?

If you are exceeding available resources, that sounds like a problem at the platform level, not the level of crew, crew.cluster, or targets.

Wouldn't this be an issue of how crew.cluster handles the finite resources on a Slurm cluster? That is, it should be able to cope with tasks/targets being queued until more resources are available, still finish the tasks it has started, and eventually allow the remaining tasks to start and finish.

koefoeden commented 2 weeks ago

Please see the updated, much simpler reprex. I removed the branching, the Sys.sleep() calls, and the target-to-target dependencies, and the issue still remains. I also deleted some incorrect comments and added clarifications.

wlandau commented 2 weeks ago

My initial reprex was with requested memory.

I have a new package called autometric to prospectively log resource usage like memory. It is integrated with development crew: https://wlandau.github.io/crew/articles/logging.html
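A hedged sketch of what enabling that logging might look like; the options_metrics argument and crew::crew_options_metrics() are taken from the linked vignette for the development versions, so treat the exact names and arguments as assumptions:

# Sketch only: assumes the development crew / crew.cluster API described in the
# logging vignette linked above. crew_options_metrics() asks each worker to
# sample its own resource usage (CPU, memory) via autometric at a fixed interval.
a_ctrl <- crew.cluster::crew_controller_slurm(
  name = "a",
  workers = 1,
  options_metrics = crew::crew_options_metrics(
    path = "/dev/stdout",  # assumed: metrics go to each worker's standard output (captured in the Slurm log)
    seconds_interval = 1   # sample resource usage once per second
  )
)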

Wouldn't this be an issue of how crew.cluster appropriately handles the finite resources on a slurm-cluster?

No. crew requests resources based on how many tasks need to be done and how many workers you allow with the workers argument. It cannot determine what a given system is capable of in the general case because there are too many different systems to track. Even if it could, it is poor practice to max out the resources on a system, whether on a shared cluster or on a local machine that also needs to perform interactive tasks. Users are responsible for setting a sensible value for workers.
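As a rough illustration of that budgeting (assuming the 128-core node from the reprex above), you can cap workers so the combined per-worker request never exceeds the node:

# Back-of-the-envelope sketch, assuming a 128-core node as in the reprex.
node_cores <- 128                        # assumption: capacity of the allowed node
cpus_per_task <- 30                      # per-worker request from the reprex
workers <- node_cores %/% cpus_per_task  # 4 workers, at most 120 cores in use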

Please see the updated, much more simple reprex.

This does help me understand what is happening. b_ctrl cannot launch a worker because a_ctrl is already running one, and a_ctrl does not relinquish its worker because it uses the defaults seconds_idle = Inf and tasks_max = Inf. Either seconds_idle = 10 or tasks_max = 1 should resolve the deadlock.
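For example, a minimal sketch of the reprex's first controller with the idle timeout applied (same argument values as in the reprex above):

a_ctrl <- crew.cluster::crew_controller_slurm(
  name = "a",
  workers = 1,
  seconds_idle = 10, # worker exits after 10 idle seconds, freeing its Slurm allocation
  slurm_memory_gigabytes_required = 1,
  slurm_cpus_per_task = 100,
  script_lines = "#SBATCH --nodelist=esrumcmpn10fl"
)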

koefoeden commented 2 weeks ago

I have a new package called autometric to prospectively log resource usage like memory. It is integrated with development crew: https://wlandau.github.io/crew/articles/logging.html

Cool! I'll make sure to check it out!

No. crew requests resources based on how many tasks need to be done and how many workers you allow with the workers argument. It cannot determine what a given system is capable of in the general case because there are too many different systems to track. Even if it could, it is poor practice to max out the resources on a system, whether on a shared cluster or on a local machine that also needs to perform interactive tasks. Users are responsible for setting a sensible value for workers.

I agree that it is poor practice to max out the cluster - however, it is not unrealistic for this to happen once in a while, because it is a shared cluster that others might occasionally max out. But this is also a moot point if it can be fixed by setting the seconds_idle parameter - I'll report back!

koefoeden commented 2 weeks ago

Can confirm that seconds_idle = 10 fixed this particular issue - thanks!