Make batch jobs scale on failure #8

Open
jashapiro opened this issue Aug 20, 2020 · 2 comments

Comments

@jashapiro
Member

Looking at some of the code in nf-core for inspiration and best practices, I noticed that their config files often declare memory usage with code that looks something like this:

https://github.com/nf-core/rnaseq/blob/3b6df9bd104927298fcdf69e97cca7ff1f80527c/conf/base.config#L15

The idea is that if the first attempt fails because the task ran out of memory (OOM), the task is retried with a larger memory request.
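
As a minimal sketch of that pattern (not a copy of the linked line; the 8.GB base value and the retry count are placeholder assumptions), a process scope in the config might look like this:

```groovy
process {
    // Assumed base request. The closure is re-evaluated on every attempt,
    // so attempt 1 asks for 8 GB, attempt 2 for 16 GB, and so on.
    memory = { 8.GB * task.attempt }

    // Without a retry strategy the task never gets a second attempt.
    errorStrategy = 'retry'
    maxRetries = 2
}
```

Note that this version retries after any failure, not only after running out of memory.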

We can also set some error handling to make this specific to OOM errors and to limit the number of retries:

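  // Retry only on exit codes that usually mean the task was killed (e.g. 137 = 128 + SIGKILL); otherwise stop
  errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'terminate' }
  // Allow a single retry per task
  maxRetries = 1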

https://github.com/nf-core/rnaseq/blob/3b6df9bd104927298fcdf69e97cca7ff1f80527c/conf/base.config#L18-L19

We can also tag individual processes by their resource requirements, like this, which seems like a good idea:

  withLabel: low_memory {
    memory = { check_max( 16.GB * task.attempt, 'memory' ) }
  }
  withLabel: mid_memory {
    cpus = { check_max( 4, 'cpus' ) }
    memory = { check_max( 28.GB * task.attempt, 'memory' ) }
    time = { check_max( 8.h * task.attempt, 'time' ) }
  }
  withLabel: high_memory {
    cpus = { check_max( 10, 'cpus' ) }
    memory = { check_max( 70.GB * task.attempt, 'memory' ) }
    time = { check_max( 8.h * task.attempt, 'time' ) }
  }

https://github.com/nf-core/rnaseq/blob/3b6df9bd104927298fcdf69e97cca7ff1f80527c/conf/base.config#L23-L35
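
On the workflow side, a process opts in to one of these blocks with the label directive. A rough sketch in DSL2 syntax follows; the process name, input, and command are hypothetical placeholders, not anything from our workflows:

```groovy
// Hypothetical process; the name, input, and command are placeholders.
process align_reads {
    // Picks up cpus, memory, and time from the withLabel: mid_memory block
    label 'mid_memory'

    input:
    path reads

    output:
    path 'aligned.bam'

    script:
    """
    some_aligner --threads ${task.cpus} ${reads} > aligned.bam
    """
}
```

Keeping the numbers in the config file means a workflow only has to name a label, so resource settings can be changed in one place.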

check_max() is defined here: https://github.com/nf-core/rnaseq/blob/3b6df9bd104927298fcdf69e97cca7ff1f80527c/nextflow.config#L170-L172
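
The point of check_max() is that multiplying by task.attempt can ask for more than the machine or queue offers, so each request is capped. The following is a simplified sketch of that idea, not the linked code: it assumes caps named params.max_memory, params.max_cpus, and params.max_time (values here are placeholders), and a Nextflow version that allows function definitions in the config file.

```groovy
// Assumed caps; placeholder values.
params.max_memory = 128.GB
params.max_cpus   = 16
params.max_time   = 240.h

// Simplified sketch: return the requested value, capped at the configured maximum.
def check_max(obj, type) {
    if (type == 'memory') {
        def max = params.max_memory as nextflow.util.MemoryUnit
        return obj.compareTo(max) > 0 ? max : obj
    } else if (type == 'time') {
        def max = params.max_time as nextflow.util.Duration
        return obj.compareTo(max) > 0 ? max : obj
    } else if (type == 'cpus') {
        return Math.min(obj, params.max_cpus as int)
    }
    // Unknown type: pass the request through unchanged
    return obj
}
```

With these placeholder caps, a high_memory task on its second attempt would request 140 GB and be capped at 128 GB.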

@jashapiro
Member Author

An initial implementation of this appears in #17, but this should be moved to the config file and label implementation before closing.

@jashapiro
Member Author

This is now in the config file, but not all workflows use it. Standardizing this should happen before closing this issue!
