stemangiola closed this 3 weeks ago
This might be possible. The development version of crew can detect the number of consecutive times a worker exits without completing all its tasks. Resources are specified differently in each plugin, so a request for more resources would need to be handled differently for each one.
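To make the idea of counting consecutive failed exits concrete, here is a minimal sketch in base R. It is not crew's implementation: `update_crashes()` and the simulated `history` vector are hypothetical names used only for illustration, and the sketch assumes the counter resets whenever a worker finishes all its tasks.

```r
# Hypothetical helper, not part of crew: update the crash counter of one worker.
# `completed_all` is TRUE when the worker exited after finishing every task.
update_crashes <- function(crashes, completed_all) {
  if (completed_all) 0L else crashes + 1L
}

# Simulated exit history of one worker:
# two crashes, one clean exit, then another crash.
history <- c(FALSE, FALSE, TRUE, FALSE)

crashes <- 0L
trace <- integer(0)
for (ok in history) {
  crashes <- update_crashes(crashes, ok)
  trace <- c(trace, crashes)
}

print(trace)  # 1 2 0 1
```

A counter like this could then serve as the index into whatever resource specification a given plugin uses.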
Which resources does Nextflow increase? And by how much each time?
This is a good post explaining it:
https://lucacozzuto.medium.com/handling-failing-jobs-with-nextflow-24405b97b679
The way they do it is pretty elegant
process {
    queue = 'queu_name'
    memory = '12G'
    container = 'my_container'

    withLabel: big_time_cpus {
        memory = '45G'
        time = '48h'
        cpus = 20
    }

    withLabel: retry_increasing_mem {
        errorStrategy = 'retry'
        memory = { 3.GB * task.attempt }
        cpus = 1
        time = { 6.h * task.attempt }
        maxRetries = { task.exitStatus == 140 ? 4 : 1 }
    }
}
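As a rough illustration of what the `retry_increasing_mem` label in this config requests, the base R sketch below tabulates memory and time per attempt. It assumes the exit status is 140, so that `maxRetries` is 4 and there can be up to five attempts in total (the first try plus four retries). In this config only memory and time grow, both linearly with the attempt number, while `cpus` stays at 1.

```r
# Resources requested on each attempt under
#   memory = 3.GB * task.attempt
#   time   = 6.h  * task.attempt
# Assumption: first try plus up to 4 retries (maxRetries = 4 on exit status 140).
attempt <- 1:5

schedule <- data.frame(
  attempt = attempt,
  memory_gb = 3 * attempt,  # 3, 6, 9, 12, 15
  time_h = 6 * attempt,     # 6, 12, 18, 24, 30
  cpus = 1                  # fixed in this label
)

print(schedule)
```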
Although, for my work, I would prefer to specify something like this (for three attempts):
mem_Gb = c(5, 20, 200)
time_h = c(0.5, 2, 12)
This is because the size of my data is exponentially distributed (many small datasets and a few huge ones).
With the task.attempt formula, this could be roughly replicated as:
5 * (4 ^ (task.attempt - 1))
This would almost eliminate the need for the resource tiering (although it's cool) that I developed for HPCell.
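The two ways of specifying a schedule can be compared side by side. The base R sketch below is only an illustration: `resource_for_attempt()` is a hypothetical helper, not a crew or crew.cluster function, and reusing the last value for attempts beyond the end of the vector is an assumption. Note that the geometric formula matches the explicit vector on the first two attempts but gives 80 rather than 200 on the third.

```r
# Explicit per-attempt schedules, as proposed above.
mem_gb_explicit <- c(5, 20, 200)
time_h_explicit <- c(0.5, 2, 12)

# Hypothetical helper: value to request on a given attempt. Attempts beyond
# the end of the vector reuse the last (largest) value.
resource_for_attempt <- function(values, attempt) {
  values[min(attempt, length(values))]
}

# Geometric schedule from the task.attempt-style formula.
attempt <- 1:3
mem_gb_formula <- 5 * 4^(attempt - 1)  # 5, 20, 80

print(rbind(explicit = mem_gb_explicit, formula = mem_gb_formula))
print(resource_for_attempt(mem_gb_explicit, 4))  # 200
```

An explicit vector allows arbitrary jumps between attempts, whereas the formula needs only a base value and a growth factor.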
This feature turned out to be much easier to implement than I originally thought. Added in https://github.com/wlandau/crew.cluster/pull/49
Prework
Proposal
Nextflow has a very popular option: if your job fails because of insufficient resources, it relaunches the job with more resources. Is it possible to have this in the targets/crew ecosystem, or does the design not allow it? It would be a game changer.