Align Paired-End Reads To Reference

The Nextflow script takes any number of paired-end reads in FASTQ format and outputs sorted BAM alignments generated with bowtie2 or bwa-mem2.

Dependencies (versions tested)

  • Nextflow (23.10.1)
  • Java (
  • Python (3.10)
  • bowtie2 (2.5.3)
  • bwa-mem2 (2.2.1)
  • SAMtools (1.19.2)

Conda Environment

Create environment using conda:
conda env create -f ./nextflow-pipelines/env/align.yml

Create environment using mamba (faster):
mamba env create -f ./nextflow-pipelines/env/align.yml

Activate the conda environment with one of the following:
mamba activate align
conda activate align
source activate align



# Activate conda environment
mamba activate align

# Variables

# Index genome
# bowtie2-build ${genome} ${genome}
# bwa-mem2 index ${genome}

# Run pipeline
nextflow run ~/nextflow-pipelines/src/ \
    --reads ${reads} \
    --suffix "_{1,2}.fq.gz" \
    --reference ${genome} \
    --outdir ${outdir} \
    --aligner "bowtie2" \
    --filter "-F 4" \
    --cpus ${cpus}

Parameters    Description
--reads       path to the input directory containing FASTQ files
--suffix      string denoting the suffix that follows the sample name, including the forward (read1) and reverse (read2) designation (e.g. for the read pair sample_1.fq.gz and sample_2.fq.gz, set --suffix "_{1,2}.fq.gz"; the output BAM file will be named sample.bam)
--reference   path to the reference FASTA file (e.g. reference genome)
--outdir      path to the output directory
--aligner     string denoting the aligner to use, either "bowtie2" or "bwa-mem2" (default: "bowtie2")
--bowtie2     string of additional arguments passed to bowtie2 (e.g. --bowtie2 "--sensitive --seed 123")
--bwamem2     string of additional arguments passed to bwa-mem2 (e.g. --bwamem2 "-M -k 19")
--filter      string of additional arguments passed to samtools to filter BAMs (e.g. to remove unmapped reads and secondary alignments: --filter "-F 260"; the default removes unmapped reads only: --filter "-F 4")
--indexbam    index BAMs using samtools index (optional)
--test        dry run that prints a tuple of the sample ID and the paths to the input paired reads
--cpus        integer denoting the number of CPUs (default: 16)
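
The values given to --filter are SAM flag bitmasks, so a combined value such as 260 is the sum of the individual flag bits. The arithmetic can be sketched in plain shell as follows. The variable names here are illustrative only and are not part of the pipeline.

```shell
# SAM flag bits as defined in the SAM specification
UNMAPPED=4       # 0x4: read is unmapped
SECONDARY=256    # 0x100: secondary alignment

# Combine the bits to get the value passed to samtools -F (exclude reads with any of these bits set)
filter_flag=$((UNMAPPED | SECONDARY))

# Print the resulting pipeline argument
echo "--filter \"-F ${filter_flag}\""
```

If SAMtools is installed, running samtools flags 260 should decode the same value back into flag names, which is a quick way to check a mask before launching a run.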


$ ls input/
SampleID_01_1.fq.gz SampleID_02_1.fq.gz
SampleID_01_2.fq.gz SampleID_02_2.fq.gz


$ ls output/
SampleID_01.bam SampleID_02.bam
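
The listings above show how each read pair collapses to a single BAM named after the sample. A minimal sketch of that naming rule in plain shell is given below, assuming the suffix pattern "_{1,2}.fq.gz". The sample_name function is a hypothetical helper for illustration and is not how the pipeline itself derives sample IDs.

```shell
# Hypothetical helper (not part of the pipeline): derive the sample name
# from a FASTQ path, assuming the suffix pattern "_{1,2}.fq.gz".
sample_name() {
    name=$(basename "$1")
    name=${name%_1.fq.gz}   # strip the forward-read suffix, if present
    name=${name%_2.fq.gz}   # strip the reverse-read suffix, if present
    printf '%s\n' "$name"
}

# Show which BAM name each input file would map to
for fq in input/SampleID_01_1.fq.gz input/SampleID_01_2.fq.gz input/SampleID_02_1.fq.gz; do
    printf '%s -> %s.bam\n' "$fq" "$(sample_name "$fq")"
done
```

Both files of a pair map to the same name, which is why two FASTQ files per sample produce one BAM. To see the pairing the pipeline actually detects, the --test option described above prints the sample ID and read paths without running the alignment.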