
Nextflow

Thomas edited this page May 10, 2024 · 22 revisions

Learning Nextflow

A complete 10 hour workshop on learning Nextflow has been digitised by Seqera Labs: the videos and the online resources.

The lab's full set of tutorials for Nextflow is available here, but these remain a work in progress. We held a workshop in Feb 2020 and kept the discussion and logs in the Slack channel #nextflow-workshop; take a look there for example scripts and relevant links.

Some tutorial information for Nextflow from a workshop at the Sanger is available here. An active Nextflow chatroom where you can ask questions is on Gitter.

Quick example

This example script shows how to launch an Rscript from Nextflow in parallel on the cluster:

#!/usr/bin/env nextflow
// DSL1 syntax (the `from`/`into` keywords), as used with the Nextflow 21.x versions in this guide
params.datasets = ['iris', 'mtcars']

process writeDataset {
    // Process directives take their value directly, without `=`
    executor 'pbspro'
    clusterOptions '-lselect=1:ncpus=1:mem=1Gb -l walltime=24:00:00 -V'
    tag "${dataset}"
    publishDir "$baseDir/data/", mode: 'copy', overwrite: false, pattern: "*.tsv"
    // Load R in the job's shell before the R script below is executed
    beforeScript 'module load R'

    input:
    each dataset from params.datasets

    output:
    file '*.tsv' into datasets_ch

    script:
    """
    #!/usr/bin/env Rscript
    data("${dataset}")
    write.table(${dataset}, file = "${dataset}.tsv", sep = "\t", col.names = TRUE, row.names = FALSE)
    """
}

nf-core

nf-core is a collection of standardised Nextflow pipelines for analyzing genomics data. These are intended to be extremely robust and reproducible such that they can be run in any computing environment. While this is a noble ideal, some additional steps are often needed to get these pipelines (or other Nextflow pipelines) working.

Here is an example script using the nf-core/cutandrun pipeline. The config file being referenced can be downloaded here: hpc_config5.txt

#!/bin/bash  

# Resources 
## [nf-core/cutandrun documentation](https://nf-co.re/cutandrun/dev/usage)
## [Neurogenomics Lab Wiki: Nextflow](https://github.com/neurogenomics/labwiki/wiki/Nextflow)

## Set up Java
export PATH=/rds/general/project/neurogenomics-lab/live/Tools/jdk-11.0.12/bin:$PATH
export JAVA_HOME=/rds/general/project/neurogenomics-lab/live/Tools/jdk-11.0.12

## Set up Nextflow
export PATH=/rds/general/project/neurogenomics-lab/live/Tools/nextflow-21.10.6.5660:$PATH 
export NXF_VER=21.10.6

## Pull docker container 
# https://hub.docker.com/r/nfcore/cutandrun
# Run to get a local copy of the image (cutandrun_dev.sif, the file passed to -with-singularity below).
# You only need to do this step once, unless you want to update the image. 
# singularity pull docker://nfcore/cutandrun:dev

# Load NF-Tower credentials as global variables
source ~/.nftower  

export project_id=phase_1_06_apr_2022
export repo_dir=$HOME/neurogenomics/Data/tip_seq
export outdir=$repo_dir/processed_data/$project_id
mkdir -p $outdir
mkdir -p /rds/general/user/$USER/ephemeral/tmp/


nextflow run nf-core/cutandrun \
 --input $repo_dir/raw_data/scTIP-seq/$project_id/design.csv \
 --genome GRCh38 \
 --outdir $outdir \
 -with-tower \
 -with-singularity $repo_dir/cutandrun_dev.sif \
 -c $repo_dir/hpc_config5.txt \
 -profile imperial \
 -r dev \
 --igg_control false

NXF_VER variable

Note that the NXF_VER environment variable is set inside this script. This lets you switch between older and newer versions of Nextflow using the same executable, so make sure it is set to the version required by the pipeline you're running.
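As a minimal sketch of checking NXF_VER before launching a pipeline, the snippet below compares it against a minimum version. The helper name version_at_least and the value of min_required are hypothetical (neither comes from Nextflow or from this wiki), and GNU sort with the -V option is assumed to be available.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Assumes GNU sort, whose -V option sorts version strings.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

export NXF_VER=21.10.6    # the version set in the script above
min_required=21.10.3      # hypothetical minimum, check your pipeline's docs

if version_at_least "$NXF_VER" "$min_required"; then
  echo "NXF_VER=$NXF_VER satisfies the minimum of $min_required"
else
  echo "NXF_VER=$NXF_VER is older than $min_required" >&2
fi
```

Placing a check like this just before the nextflow run command makes a version mismatch visible before any jobs are submitted.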

Nextflow on Imperial HPC

50-job limit

Imperial HPC limits the number of jobs you can run concurrently to 50. Therefore, you need to tell Nextflow not to submit more than 50 jobs at a time. This is already implemented in the imperial and imperial_mb profiles.

However, if you're not using these profiles, you'll need to add the following to your custom config. Here we specify 49 to give us some extra room in case we want to run one other independent job:

executor {
  $pbspro {
    queueSize = 49
  } 
}

Make sure you've ended previous jobs

Ending Nextflow runs early (intentionally or due to lost connection to HPC) does not necessarily mean that the jobs it submitted to HPC have been cancelled. This means that if you try to run Nextflow again after an interrupted attempt, you may end up having more than the max allowed job submissions to HPC at once (>50 jobs).

Therefore, you must first make sure your other jobs are deleted, using the steps below. Skipping these steps can cause subsequent Nextflow runs to become incredibly slow and eventually crash with the error Limit of 50 concurrent jobs reached, even when you have set queueSize = 49 in your custom config.

  1. View info on your current job submissions:
qstat
  2. (a) Gather the IDs of all your jobs:
oldjobs=$(qselect -u $USER)
  2. (b) Alternatively, you can be more selective and search for a substring in the qstat output ("nf-NFCORE" in this case). grep already drops the qstat header lines, so no rows need to be skipped afterwards:
oldjobs=$(qstat -t | grep "nf-NFCORE" | awk '{print $1}')
  3. Check which jobs you're selecting first:
echo $oldjobs
  4. Once you're sure you want to delete these, delete them:
echo $oldjobs | xargs qdel
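The check described in this section can also be done before each new run. The sketch below is one possible pre-flight test under stated assumptions: queue_has_room is a hypothetical helper name, and it simply counts the job IDs it receives on standard input (for example from qselect) and compares that count plus the planned queueSize with the limit.

```shell
# Hypothetical helper: read job IDs on stdin (one per line) and succeed only if
# the jobs already present plus the planned queueSize stay within the limit.
# Arguments: limit (e.g. 50), planned queueSize (e.g. 49).
queue_has_room() {
  local limit="$1" planned="$2" current
  current=$(grep -c . || true)   # number of non-empty lines, i.e. existing jobs
  [ $((current + planned)) -le "$limit" ]
}

# Example with no leftover jobs: 0 + 49 fits within 50.
if printf '' | queue_has_room 50 49; then
  echo "OK to launch Nextflow"
fi

# On the cluster the IDs would come from PBS instead, for example:
#   qselect -u "$USER" | queue_has_room 50 49 || echo "Delete old jobs first"
```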

Setup

There are two ways you can try to run Nextflow on HPC: (a) using pre-installed tools, or (b) creating a conda environment.

(a) neurogenomics/Tools

The Neurogenomics Lab keeps a number of pre-downloaded software tools on Imperial HPC that can be used as a shared resource. To use them, run the following export commands (see here for more details), or simply add them to your ~/.bashrc file so that they are loaded automatically the next time you log in.

  1. DO NOT module load nextflow, as this version of Nextflow is outdated and can cause conflicts.
  2. Load HPC's installation of gcc:
module load gcc/8.2.0
  3. Adjust the memory limits on Nextflow by setting the environment variable NXF_OPTS (it must be exported so that Nextflow can see it):
export NXF_OPTS='-Xms1g -Xmx4g'
  4. Load the updated version of Java (not the outdated version that is the default on HPC):
export PATH=/rds/general/project/neurogenomics-lab/live/Tools/jdk-11.0.12/bin:$PATH
export JAVA_HOME=/rds/general/project/neurogenomics-lab/live/Tools/jdk-11.0.12
  5. Load the updated version of Nextflow (not the outdated version that is the default on HPC):
export PATH=/rds/general/project/neurogenomics-lab/live/Tools/nextflow-21.04.3.5560:$PATH

(b) Conda environment

Alternatively, you can create a conda environment using this yaml file (see here for more details).

conda env create -f https://github.com/bschilder/scKirby/raw/main/inst/conda/nfcore.yml
conda activate nfcore

Configs and profiles

Configuration files ("configs") can be used to adjust Nextflow's own settings and parameters, as well as the resources and options for specific steps of your pipeline.

Nextflow can use remotely-stored configuration files (i.e. "profiles") designed to allow Nextflow to run on different institutional computing environments.

Imperial HPC profiles

Imperial has two institutional profiles, imperial and imperial_mb, both created by Combiz Khozoie. That said, these profiles may not work for all situations, so if you notice an issue using these on Imperial HPC please do submit a pull request with a suggested fix or reach out to Combiz.

Usage

You can pass profiles to Nextflow in several ways.

  1. -profile: Perhaps the best way to use profiles with nf-core pipelines is to pass imperial or imperial_mb (for users with access to the MEDBIO partition of Imperial HPC) to the -profile argument. For example:
nextflow run nf-core/atacseq <other_arguments> -profile imperial
  2. -c: A local version of the config file can be passed to -c:
nextflow run nf-core/atacseq <other_arguments> -c /path/to/your/custom.config
  3. Place the config file in $HOME/.nextflow, as a file named "config" where it will automatically be recognized and imported by Nextflow. That said, be careful using it this way, since you might forget about it and (depending on what you put in it) it could cause problems with other Nextflow pipelines!

Custom config

nf-core pipelines label each process according to its computational requirements: "process_low", "process_medium", "process_high", "process_long".

If you're already using an imperial or imperial_mb profile and you still get the following error:

qsub: 
   Job resource selection does not match any permitted configuration.
   Please review the job sizing guidance on:
   https://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/computing/

...it means there are some problems with how Nextflow is trying to scale up the resources requested for a given job (which uses the check_max function that comes with most nf-core pipelines). You can avoid this by adding the following to your custom config file and passing it to Nextflow.

NOTE: This approach isn't ideal since it uses more resources than necessary (rather than flexibly scaling), but can be used as a solution until we figure out a better one.

process {
    withLabel:process_low {
        cpus = 2
        memory = 12.GB
        time = 6.h
    }
    withLabel:process_medium {
        cpus = 32
        memory = 62.GB
        time = 72.h
    }
    withLabel:process_high {
        cpus = 32
        memory = 62.GB
        time = 72.h
    }
    withLabel:process_long {
        time = 72.h
    }
    // Change a specific process by name
    withName:TRIMGALORE {
        cpus = 32
        memory = 62.GB
        time = 72.h
    }
}

Note that if your pipeline used to work but now fails, the HPC queue classes may have changed without an update to the imperial or imperial_mb profile (or someone has introduced a bug when editing the profile). For example, as of 10/05/2024, the v1_throughput72 queue is limited to 1 node per job, 1-8 CPU cores per job, 1-100 GB of memory and a wall-clock time of 9-72 hours. If your config file contained the chunk

withLabel:process_low {
    queue  = 'v1_throughput72'
    cpus   = 2
    memory = 48.GB
    time   = 8.h
}

any process with the label process_low would fail at submission (with the above error) because the specified time does not meet the queue's minimum wall-clock time. In this scenario, changing the line time = 8.h to time = 9.h would resolve the error.
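To catch this kind of mismatch before submitting anything, the queue limits quoted above can be written as a small check. This is only a sketch: fits_throughput72 is a hypothetical helper, and the limits hard-coded in it are the ones stated above as of 10/05/2024, which may since have changed.

```shell
# Hypothetical pre-flight check against the v1_throughput72 limits quoted above
# (1-8 CPU cores, 1-100 GB of memory, 9-72 hours of wall-clock time).
# Arguments: cpus, memory in GB, time in hours (whole numbers).
fits_throughput72() {
  local cpus="$1" mem_gb="$2" hours="$3"
  [ "$cpus" -ge 1 ] && [ "$cpus" -le 8 ] &&
  [ "$mem_gb" -ge 1 ] && [ "$mem_gb" -le 100 ] &&
  [ "$hours" -ge 9 ] && [ "$hours" -le 72 ]
}

# The failing chunk above: cpus = 2, memory = 48.GB, time = 8.h
fits_throughput72 2 48 8 || echo "time = 8.h is below the queue's 9 hour minimum"

# The corrected chunk: time = 9.h
fits_throughput72 2 48 9 && echo "time = 9.h fits the queue"
```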

Order of Nextflow configs/profiles matters!

-c and -profile

There are many ways to pass configs and/or profiles to Nextflow. The way and the order in which you provide them can have different effects, so be aware.

If multiple profiles are specified (e.g. -profile singularity,imperial), Nextflow will give precedence to the first profile over the second (if they have overlapping parameters). So in this example, singularity would overwrite any overlapping parameters specified in imperial.

If you specify both -c <custom_config> and -profile <institutional_profile>, -c will always overwrite -profile (regardless of the order in which you specify -c and -profile arguments).

-C

Lastly, you can also supply uppercase -C, which overrides and ignores all other configs/profiles except the one supplied to -C:

nextflow -C custom.config run example.nf

Note that unlike the previous examples, we're placing the -C argument right after nextflow (as opposed to after run like lowercase -c). -C can only be used in this way.

Here's a 15-minute video about configs.

The Nextflow work directory

The Nextflow work directory contains intermediate files and logs for each pipeline process. These files are organised into hash-named subdirectories and are useful for debugging and when you want to resume a pipeline (e.g., following an error). However, this directory is not cleaned automatically when the pipeline successfully finishes and can, therefore, become extremely bloated. You should regularly delete this directory to avoid eating into the lab's storage quota!
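A quick way to see how much space a work directory is taking up is sketched below. The helper name work_dir_kb is hypothetical, and the default path work assumes Nextflow was launched from the current directory without a custom work directory.

```shell
# Hypothetical helper: print the size of a Nextflow work directory in kilobytes,
# or 0 if the directory does not exist. Defaults to ./work.
work_dir_kb() {
  local dir="${1:-work}"
  if [ -d "$dir" ]; then
    du -sk "$dir" | awk '{print $1}'
  else
    echo 0
  fi
}

echo "work directory currently uses $(work_dir_kb work) KB"
```

Nextflow also provides a nextflow clean subcommand for removing the work files of previous runs: nextflow clean -n lists what would be removed and nextflow clean -f actually deletes it. Run it from the directory where the pipeline was launched.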

Singularity

Using a Singularity container is the best way to get your Nextflow pipeline working reliably in different computing environments, as the container holds all of the necessary software (with the correct versions) to run the pipeline. Singularity can also run Docker images (Docker itself is another container application that isn't usually supported on HPCs).

All nf-core pipelines can be used with Singularity by specifying -profile singularity. Their container images can also be found on DockerHub.

Downloading the singularity image in advance

On Imperial HPC, Nextflow pipelines often don't work if you haven't downloaded the singularity image in advance. Instead, they may hang indefinitely when the image is being downloaded:

2020/12/14 18:19:52 info unpack layer: sha256:3d98cc5df0f94e423e4855ac17be437fe31d30a5827a
INFO:  Creating SIF file...

Therefore, it is usually better to download the Singularity image outside of the pipeline and save it in the directory set as cacheDir in the singularity scope of your custom config file.
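A sketch of pulling the image only when it is missing is shown below. It assumes the default file name that singularity pull gives to Docker images, <name>_<tag>.sif; sif_name is a hypothetical helper that reproduces that naming so the file can be checked first. The image reference is the one from the nf-core/cutandrun example earlier on this page.

```shell
# Hypothetical helper: derive the default file name that `singularity pull`
# writes for a Docker reference, e.g. docker://nfcore/cutandrun:dev -> cutandrun_dev.sif
sif_name() {
  local ref="${1#docker://}"   # nfcore/cutandrun:dev
  local name="${ref##*/}"      # cutandrun:dev
  local tag="latest"           # tag used when the reference has none
  case "$name" in
    *:*) tag="${name##*:}"; name="${name%%:*}" ;;
  esac
  printf '%s_%s.sif\n' "$name" "$tag"
}

image=docker://nfcore/cutandrun:dev
target="$(sif_name "$image")"
echo "singularity pull would write: $target"

# On a machine with Singularity installed, pull only if the file is missing:
#   [ -f "$target" ] || singularity pull "$image"
```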

Connection issues

If your connection to Imperial HPC is lost at some point while running a Nextflow pipeline, your pipeline will stop (though the jobs that were already submitted will keep running) because Nextflow runs on the login node by default (it only submits jobs to other nodes). You can think of the Nextflow software as a construction foreman, sending off workers to do different jobs. If the foreman disappears, no new jobs can be assigned and the whole pipeline stops.

This can be a problem when you have unstable internet, or HPC is having connection issues that day. Fortunately, there are several solutions to this.

(a) tmux / screen

tmux and screen are tools that come preinstalled on HPC and can be used to start "sessions". You can read more about how this works, but the main thing to know is that this means if you lose connection midway through an analysis, all processes will continue running without you. Once you do log back into HPC, you can resume this session and continue as if you never lost connection.

Here's an example using tmux (cheatsheet here):

# Start a new session called "rnaseq"
tmux new -s rnaseq
### Connection gets lost ###
### Reconnect ####
# Resume the session called "rnaseq"
tmux a -t rnaseq

NOTE: A drawback to this approach is that your screen will get cut off so that you can only see a little bit of the pipeline output at a time (and you can't scroll up to see old output). This means that not all messages your pipeline produces (which can be very useful for monitoring progress/warnings) will be visible to you at any given time. However, there are some workarounds to this described here. Nextflow also generates log files which you can view:

# See what hidden log files are available
ls -lah
cat .nextflow.log

However, it is possible to customise tmux screen settings so that mouse scrolling is enabled and more of the old output is displayed. These settings are stored in a .tmux.conf file that should be stored in your home directory. Here's an example of a configuration file that enables scrolling and increases the limit of output display:

set-option -g activity-action other
set-option -g assume-paste-time 1
set-option -g base-index 0
set-option -g bell-action any
set-option -g default-command ""
set-option -g default-shell "/bin/bash"
set-option -g destroy-unattached off
set-option -g detach-on-destroy on
set-option -g display-panes-active-colour red
set-option -g display-panes-colour blue
set-option -g display-panes-time 1000
set-option -g display-time 750
set-option -g history-limit 5000
set-option -g key-table "root"
set-option -g lock-after-time 0
set-option -g lock-command "lock -np"
set-option -g message-command-style fg=yellow,bg=black
set-option -g message-style fg=black,bg=yellow
set-option -g mouse on
set-option -g prefix C-b
set-option -g prefix2 None
set-option -g renumber-windows off
set-option -g repeat-time 500
set-option -g set-titles off
set-option -g set-titles-string "#S:#I:#W - \"#T\" #{session_alerts}"
set-option -g silence-action other
set-option -g status on
set-option -g status-interval 15
set-option -g status-justify left
set-option -g status-keys emacs
set-option -g status-left "[#S] "
set-option -g status-left-length 10
set-option -g status-left-style default
set-option -g status-position bottom
set-option -g status-right " \"#{=21:pane_title}\" %H:%M %d-%b-%y"
set-option -g status-right-length 40
set-option -g status-right-style default
set-option -g status-style fg=black,bg=green
set-option -g update-environment[0] "DISPLAY"
set-option -g update-environment[1] "SSH_ASKPASS"
set-option -g update-environment[2] "SSH_AUTH_SOCK"
set-option -g update-environment[3] "SSH_AGENT_PID"
set-option -g update-environment[4] "SSH_CONNECTION"
set-option -g update-environment[5] "WINDOWID"
set-option -g update-environment[6] "XAUTHORITY"
set-option -g visual-activity off
set-option -g visual-bell off
set-option -g visual-silence off
set-option -g word-separators " -_@"

This file can also be created with the command tmux show -g > ~/.tmux.conf, which prints the default tmux settings and saves them to a configuration file. After creating the file, modify the relevant settings. In this case, mouse scrolling is enabled with set-option -g mouse on (the default is off) and the history limit is increased with set-option -g history-limit 5000. For the new settings to take effect, either close all active tmux sessions and relaunch tmux, or run tmux source-file ~/.tmux.conf. There are many other ways to customise tmux and make it more user-friendly, and plenty of tutorials cover them (e.g. https://linuxhint.com/customize-tmux-configuration/, https://www.hamvocke.com/blog/a-guide-to-customizing-your-tmux-conf/).
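Since tmux keeps its default for any option that is not in ~/.tmux.conf, the file only needs the settings you actually change. The sketch below adds the two settings discussed here without duplicating them on repeated runs; add_tmux_setting is a hypothetical helper name.

```shell
# Hypothetical helper: append a setting to a tmux config file unless that exact
# line is already present. Arguments: config file path, setting line.
add_tmux_setting() {
  local conf="$1" line="$2"
  touch "$conf"
  grep -qxF -- "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
}

# Usage (left commented out so that nothing is written until you choose to run it):
#   add_tmux_setting ~/.tmux.conf 'set-option -g mouse on'
#   add_tmux_setting ~/.tmux.conf 'set-option -g history-limit 5000'
#   tmux source-file ~/.tmux.conf
```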

(b) qsub

Another solution is to create a .pbs script as a master job to run Nextflow. After you submit this to HPC, the master job will run and Nextflow will then take care of all other job submissions (regardless of your connection). During this time, you can still use qstat to check the status of all of your Nextflow processes.

NOTE: A drawback to this approach is that your pipeline can only run as long as the walltime you request (set to 30 minutes in the example below). Request longer walltime if you expect your Nextflow pipeline to take longer. Another drawback is that you can't see any of the output messages Nextflow produces, unless you check the log file (see tmux example above).

  1. Create a master job submission file (.pbs), to run the Nextflow script example.nf e.g.:
#PBS -l walltime=00:30:00
#PBS -l select=1:ncpus=1:mem=2gb

# PBS jobs start in your home directory, so move to the directory the job was submitted from
cd $PBS_O_WORKDIR

module load gcc/8.2.0
# Export NXF_OPTS so that Nextflow can see it
export NXF_OPTS='-Xms1g -Xmx4g'
export PATH=/rds/general/project/neurogenomics-lab/live/Tools/jdk-11.0.12/bin:$PATH
export JAVA_HOME=/rds/general/project/neurogenomics-lab/live/Tools/jdk-11.0.12
export PATH=/rds/general/project/neurogenomics-lab/live/Tools/nextflow-21.04.3.5560:$PATH

nextflow run example.nf
  2. Submit the master job:
qsub master_job.pbs

Nextflow on the cloud

Nextflow on Google Cloud Life Sciences Platform (GCP)

Follow the guide here.

Nextflow on Amazon Web Services (AWS)

This repo explains how to do it: Terraform template to create AWS resources to execute jobs using nextflow
