implement nextflow feature of rerun job with larger resources if not enough

wlandau / crew.cluster

crew launcher plugins for traditional high-performance computing clusters

https://wlandau.github.io/crew.cluster

Other

27 stars 9 forks source link

implement nextflow feature of rerun job with larger resources if not enough #48

Closed stemangiola closed 3 weeks ago

stemangiola commented 3 weeks ago

Prework

[x ] Read and agree to the Contributor Code of Conduct and contributing guidelines.
[x ] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
[x ] New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a GitHub discussion.
[x ] Format your code according to the tidyverse style guide.

Proposal

Nextflow has a very popular option that if your job fails for not-enough resources, it relaunches the job with more resources. Is it possible to have it in the target/crew echosystsem? Or the design does not allow it? It would be a game changer.

wlandau commented 3 weeks ago

This might be possible. Development crew can detect the number of consecutive times a worker exits without completing all its tasks. Resources are specified differently in each plugin, so a request for more would need to be handled differently for each one.

Which resources does Nextflow increase? And by how much each time?

stemangiola commented 3 weeks ago

This is a good post explaining it

https://lucacozzuto.medium.com/handling-failing-jobs-with-nextflow-24405b97b679

The way they do it is pretty elegant

process {
  queue = 'queu_name'
  memory='12G'
  container = 'my_container'
  withLabel: big_time_cpus {
    memory = '45G'
        time = '48h'
        cpus = 20
  }
  withLabel: retry_increasing_mem {
       errorStrategy = 'retry'
       memory = {3.GB * task.attempt}
       cpus = 1
       time = {6.h * task.attempt}
       maxRetries = { task.exitStatus == 140 ? 4 : 1 }
  } 
}

Although I would prefer (for my work) to specify something like (for three attempts)

mem_Gb = c(5, 20, 200) time_h = c(0.5, 2, 12)

As the size of my data is exponentially distributed (many small datasets, and few huge ones)

With the task_attempt formula, could be replicate as such

5 * (4 ^ (task.attempt - 1))

This would almost eliminate the need for resource tiering (although it's cool) I developed for HPCell

wlandau commented 3 weeks ago

This feature turned out to be much easier to implement than I originally thought. Added in https://github.com/wlandau/crew.cluster/pull/49