wlandau / crew.cluster

crew launcher plugins for traditional high-performance computing clusters
https://wlandau.github.io/crew.cluster

Issue running with Singularity on SLURM #33

Closed: drejom closed this issue 10 months ago

drejom commented 10 months ago

Description

After recently updating a number of packages, my SLURM-enabled targets pipeline has stopped running, failing with errors about seconds_timeout. I have a rather elaborate script to set up cluster operations, but I think I've narrowed the problem down to crew_controller_slurm(), so that is all I am posting here:

Reproducible example

# working on it

Apologies @wlandau, my initial example was not in fact reproducible. While I see if I can make a minimal example, does this {targets} error give any clues as to what's going on? It occurs with or without seconds_timeout set in crew.cluster::crew_controller_slurm().

Error:
! Error running targets::tar_make()
Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
Debugging guide: https://books.ropensci.org/targets/debugging.html
How to ask for help: https://books.ropensci.org/targets/help.html
Last error message:
    all(is.numeric(.)) && all(length(.) == 1L) && all(!anyNA(.)) && all(. >= 0) is not true on . = seconds_timeout
Last error traceback:
    tryCatch(withCallingHandlers({ NULL saveRDS(do.call(do.call, c(readRDS("...
    tryCatchList(expr, classes, parentenv, handlers)
    tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]), na...
    doTryCatch(return(expr), name, parentenv, handler)
    tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
    tryCatchOne(expr, names, parentenv, handlers[[1L]])
    doTryCatch(return(expr), name, parentenv, handler)
    withCallingHandlers({ NULL saveRDS(do.call(do.call, c(readRDS("/tmp/Rtmp...
    saveRDS(do.call(do.call, c(readRDS("/tmp/RtmpgNN5X8/callr-fun-136c8f4b59...
    do.call(do.call, c(readRDS("/tmp/RtmpgNN5X8/callr-fun-136c8f4b59c78e"), ...
    (function (what, args, quote = FALSE, envir = parent.frame()) { if (!is....
    (function (targets_function, targets_arguments, options, envir = NULL, s...
    tryCatch(out <- withCallingHandlers(targets::tar_callr_inner_try(targets...
    tryCatchList(expr, classes, parentenv, handlers)
    tryCatchOne(expr, names, parentenv, handlers[[1L]])
    doTryCatch(return(expr), name, parentenv, handler)
    withCallingHandlers(targets::tar_callr_inner_try(targets_function = targ...
    targets::tar_callr_inner_try(targets_function = targets_function, target...
    do.call(targets_function, targets_arguments)
    (function (pipeline, path_store, names_quosure, shortcut, reporter, seco...
    crew_init(pipeline = pipeline, meta = meta_init(path_store = path_store)...
    self$run_crew()
    self$iterate()
    if_any(queue$should_dequeue(), self$process_target(queue$dequeue()), sel...
    self$controller$wait(mode = "one", seconds_interval = interval, seconds_...
    if_any(identical(mode, "one"), private$.wait_one(controllers = control, ...
    private$.wait_one(controllers = control, seconds_interval = seconds_inte...
    crew_retry(fun = ~{ if (scale) { walk(controllers, ~.x$scale(throttle = ...
    crew_assert(seconds_timeout, is.numeric(.), length(.) == 1L, !anyNA(.), ...
    crew_error(message %|||% out)
    crew_stop(message = message, class = c("crew_error", "crew"))
    rlang::abort(message = message, class = class, call = emptyenv())
    signal_abort(cnd, .file)
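
For reference, the condition quoted in the error message can be evaluated on its own in base R. The sketch below only restates that condition; the helper name timeout_is_valid is made up for illustration, and this is not crew's actual code. It shows which kinds of values would trip the check, without establishing which one was passed in this pipeline.

```r
# Stand-alone restatement of the condition quoted in the error message above.
# `timeout_is_valid` is a hypothetical helper name, not part of crew.
timeout_is_valid <- function(x) {
  isTRUE(
    all(is.numeric(x)) &&
      all(length(x) == 1L) &&
      all(!anyNA(x)) &&
      all(x >= 0)
  )
}

timeout_is_valid(3)         # passes: a single non-negative number
timeout_is_valid(NULL)      # fails: NULL is not numeric
timeout_is_valid(NA_real_)  # fails: missing value
timeout_is_valid(-1)        # fails: negative
timeout_is_valid("3")       # fails: character, not numeric
timeout_is_valid(c(1, 2))   # fails: length is 2, not 1
```

Any value that is NULL, missing, negative, non-numeric, or not of length one would produce the same "is not true on . = seconds_timeout" message.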

Diagnostic information

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] crew.cluster_0.2.0 targets_1.4.1      lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.4        tidyr_1.3.0       
[10] tibble_3.2.1       ggplot2_3.4.4      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] utf8_1.2.4         generics_0.1.3     stringi_1.8.3      hms_1.1.3          digest_0.6.33      magrittr_2.0.3     grid_4.3.1         timechange_0.2.0   jsonlite_1.8.8    
[10] processx_3.8.3     backports_1.4.1    ps_1.7.5           fansi_1.0.6        scales_1.3.0       crew_0.8.0         codetools_0.2-19   cli_3.6.2          rlang_1.1.2       
[19] munsell_0.5.0      withr_2.5.2        yaml_2.3.8         parallel_4.3.1     tools_4.3.1        tzdb_0.4.0         getip_0.1-4        nanonext_0.11.0    colorspace_2.1-0  
[28] base64url_1.4      vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.4    pkgconfig_2.0.3    callr_3.7.3        pillar_1.9.0       gtable_0.3.4       glue_1.6.2        
[37] data.table_1.14.10 xfun_0.41          tidyselect_1.2.0   knitr_1.45         mirai_0.11.3       igraph_1.6.0       compiler_4.3.1    

drejom commented 10 months ago

Ok, so downgrading {targets} to 1.2.2 solved things for now and I can run my analysis.

However, I see a number of changes in 1.3.0 which I suspect account for the error.

I can run the targets-minimal pipeline without issue, but when I include the following to run it on SLURM, I get errors.

nodename <- Sys.info()["nodename"]

singularity_exec <- glue::glue("cd {here::here()} \\
/{base_dir}/easy-build/software/singularity/3.7.0/bin/singularity exec \\
--env R_LIBS_USER=~/R/bioc-3.17 \\
--env R_LIBS_SITE=/{base_dir}/singularity/shared_cache/rbioc/rlibs/bioc-3.17 \\
-B /{base_dir}/singularity,/ref_genomes,/scratch \\
/{base_dir}/singularity/shared_cache/rbioc/vscode-rbioc_3.17.sif \\")

slurm <- crew.cluster::crew_controller_slurm(
    host = nodename,
    script_lines = singularity_exec)

tar_option_set(
    controller = slurm,
    resources = tar_resources(
        crew = tar_resources_crew(seconds_timeout = 3)
        )
    )
targets::tar_make()
▶ dispatched target raw_data_file
▶ completed pipeline [6.776 seconds]
Error:
! Error running targets::tar_make()
Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
Debugging guide: https://books.ropensci.org/targets/debugging.html
How to ask for help: https://books.ropensci.org/targets/help.html
Last error message:
    target NA error: 'errorValue' int 5 | Timed out
Last error traceback:
    tryCatch(withCallingHandlers({ NULL saveRDS(do.call(do.call, c(readRDS("...
    tryCatchList(expr, classes, parentenv, handlers)
    tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]), na...
    doTryCatch(return(expr), name, parentenv, handler)
    tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
    tryCatchOne(expr, names, parentenv, handlers[[1L]])
    doTryCatch(return(expr), name, parentenv, handler)
    withCallingHandlers({ NULL saveRDS(do.call(do.call, c(readRDS("/tmp/Rtmp...
    saveRDS(do.call(do.call, c(readRDS("/tmp/RtmpgCy87w/callr-fun-15f2fb7d6d...
    do.call(do.call, c(readRDS("/tmp/RtmpgCy87w/callr-fun-15f2fb7d6d7032"), ...
    (function (what, args, quote = FALSE, envir = parent.frame()) { if (!is....
    (function (targets_function, targets_arguments, options, envir = NULL, s...
    tryCatch(out <- withCallingHandlers(targets::tar_callr_inner_try(targets...
    tryCatchList(expr, classes, parentenv, handlers)
    tryCatchOne(expr, names, parentenv, handlers[[1L]])
    doTryCatch(return(expr), name, parentenv, handler)
    withCallingHandlers(targets::tar_callr_inner_try(targets_function = targ...
    targets::tar_callr_inner_try(targets_function = targets_function, target...
    do.call(targets_function, targets_arguments)
    (function (pipeline, path_store, names_quosure, shortcut, reporter, seco...
    crew_init(pipeline = pipeline, meta = meta_init(path_store = path_store)...
    self$run_crew()
    self$iterate()
    self$conclude_worker_task()
    tar_assert_all_na(result$error, msg = paste("target", result$name, "erro...
    tar_throw_validate(msg %|||% default)
    tar_error(message = paste0(...), class = c("tar_condition_validate", "ta...
    rlang::abort(message = message, class = class, call = tar_empty_envir)
    signal_abort(cnd, .file)

If I remove the resources section from tar_option_set():

resources = tar_resources(
        crew = tar_resources_crew(seconds_timeout = 3)
        )

I get no error, but the pipeline never progresses beyond dispatching the first target:

targets::tar_make()
▶ dispatched target raw_data_file
/

Apologies if I'm missing something obvious, but are you able to provide any insight?