This repository hosts a pipeline built with Nextflow for whole-genome sequencing (WGS) analysis and genetic variant calling, optimized for Illumina sequencing data from bacterial genomes. It is designed to offer an automated, reproducible, and scalable solution for processing large-scale genomic data in clinical microbiology research.
The pipeline includes the following steps:
- Quality Control: Raw reads are assessed with FastQC. Low-quality bases and adapter sequences are then removed with fastp, and the trimmed reads are checked again with FastQC and summarised with MultiQC.
At this point, the two modes available in the pipeline differ in the reference genome they use. Variant calling can be performed against a de novo assembled reference strain or against an already available reference genome.
- The pipeline includes a script to download the reads from the DB using an accession list (Acc_List.txt):
bash ./workflow/bin/download_reads.sh
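As a rough illustration of what an accession-list download can look like, the sketch below reads one accession per line and, by default, only prints the command it would run. It is not the content of download_reads.sh. The use of fasterq-dump from the SRA Toolkit, the `reads` output directory and the `DRY_RUN` variable are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the repository's download_reads.sh.
# Assumes the accessions are SRA run IDs and that fasterq-dump (SRA Toolkit) is used.

download_reads() {
    local acc_list="$1" outdir="$2" acc
    # "|| [ -n "$acc" ]" keeps the last line even without a trailing newline
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc="${acc%$'\r'}"                       # tolerate Windows line endings
        [ -z "$acc" ] && continue                # skip blank lines
        case "$acc" in \#*) continue ;; esac     # skip comment lines
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (default): only show what would be executed
            echo "fasterq-dump --split-files --outdir $outdir $acc"
        else
            fasterq-dump --split-files --outdir "$outdir" "$acc"
        fi
    done < "$acc_list"
}

# Example: print the commands for every accession in Acc_List.txt, if present
if [ -f Acc_List.txt ]; then
    download_reads Acc_List.txt reads
fi
```

Setting `DRY_RUN=0` would run the downloads instead of printing them, provided the SRA Toolkit is installed.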
Once the reference genome has been provided, the pipeline follows the same steps in both modes:
- Alignment: Alignment against the selected reference genome with BWA-MEM and samtools.
- Quality control: Alignment quality control using QUAST.
- Aggregation of quality reports: MultiQC
- Variant calling and filtering:
  - Variant Identification: Detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) using PicardTools, GATK and/or FreeBayes.
  - Variant Filtering: Application of quality filters to obtain high-confidence variant calls (see Parameters).
- Genetic variant annotation: Using SnpEff, a toolbox for annotating and predicting the functional effects of genetic variants on genes and proteins.
- Post-Alignment Analysis:
The prerequisites to run the pipeline are:
- Install Nextflow
- Install Docker or Singularity for container support
- Ensure Java 8 or higher is installed
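The prerequisites above can be checked before a first run. The following is a minimal sketch that only tests whether the tools are on `PATH`; it does not verify versions, and the function names `have` and `check_prerequisites` are introduced here for illustration.

```shell
#!/usr/bin/env bash
# Minimal sketch: report which prerequisite tools are available on PATH.

have() {
    command -v "$1" >/dev/null 2>&1
}

check_prerequisites() {
    local missing=0 tool
    for tool in nextflow java; do
        if have "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Only one container engine is needed
    if have docker || have singularity; then
        echo "found:   container engine"
    else
        echo "missing: docker or singularity"
        missing=1
    fi
    return "$missing"
}

check_prerequisites || echo "Install the missing tools before running the pipeline."
```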
Clone the Repository:
# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/WGS-Analysis-VariantCalling.git
# Move into it
cd WGS-Analysis-VariantCalling
# Create the conda environment from the provided file
conda env create -n bacteriano -f enviromentWGS.yaml
conda activate bacteriano
Run the pipeline using the following command, adjusting the parameters as needed:
DE NOVO
Important
The name of the paired-end reads of the reference sample must be labelled as 1 (e.g. AB1_1.fastq.gz / AB1_2.fastq.gz)
nextflow run main.nf --mode novo --input "/path/to/data/*_{1,2}.fastq.gz" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>
REFERENCE GENOME
nextflow run main.nf --mode reference --input "/path/to/data/*_{1,2}.fastq.gz" --personal_ref "/path/to/bacterial_genome.fasta" -profile <docker/singularity/conda>
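Because both commands select reads with the `*_{1,2}.fastq.gz` pattern, a sample whose `_2` file is missing or misnamed will not form a pair. The sketch below is one way to check a data directory beforehand. The function name `check_pairs` is hypothetical and the check is not part of the pipeline.

```shell
#!/usr/bin/env bash
# Minimal sketch: check that every *_1.fastq.gz has a matching *_2.fastq.gz.

check_pairs() {
    local dir="$1" r1 r2 status=0
    for r1 in "$dir"/*_1.fastq.gz; do
        # If the glob matched nothing, the literal pattern is returned
        [ -e "$r1" ] || { echo "no *_1.fastq.gz files in $dir"; return 1; }
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ -e "$r2" ]; then
            echo "ok:      $(basename "${r1%_1.fastq.gz}")"
        else
            echo "missing: $r2"
            status=1
        fi
    done
    return "$status"
}

# Example (placeholder path): check_pairs /path/to/data
```

The function returns a non-zero status when any mate is missing, so it can guard the `nextflow run` call in a wrapper script.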
--mode: Analysis mode (novo/reference/conda). novo uses a de novo assembled reference; reference uses an already available reference genome.
--input: Path to input FASTQ paired-end files generated by Illumina sequencing (file format: .fastq.gz).
--outdir: Directory where the results will be stored (default: out).
-profile: Specifies the execution profile (docker, singularity or local).
--genome_name_db (only for --mode novo): Name of the organism, used to name the database in SnpEff.
--personal_ref (only for --mode reference): Path to the bacterial reference genome FASTA file.
-w: Path to the temporary work directory where files will be stored (default: ./work).
--cut_front: move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise. Default: 15
--cut_tail: move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise. Default: 20
--cut_mean_quality: the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1-36. Default: 20
--length_required: reads shorter than length_required will be discarded. Default: 50.
--qual_snp: One or more expressions used with INFO fields to quality filter SNPs. Default: "QUAL < 50.0 || MQ < 25.0 || DP < 30".
--qual_indel: One or more expressions used with INFO fields to quality filter INDELs. Default: "QUAL < 200.0 || MQ < 25.0 || DP < 30".
Note
QUAL: A confidence measure of the variant; MQ: Mapping quality; DP: Filtered reads that support each of the reported alleles (depth). More info here.
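To make the logic of the default expressions concrete: a variant is flagged as soon as any one of the conditions is true. The sketch below applies the default SNP expression to two made-up VCF records using awk. It is an illustration of the expression only, not the pipeline's filtering step, and it assumes MQ and DP are present in the INFO column; a missing value is treated as 0 and therefore fails.

```shell
#!/usr/bin/env bash
# Minimal sketch: apply the default SNP expression
#   QUAL < 50.0 || MQ < 25.0 || DP < 30
# to VCF records. A record is labelled FAIL when any condition is true.

label_snps() {
    awk -F'\t' '
        /^#/ { next }                        # skip VCF header lines
        {
            mq = ""; dp = ""
            n = split($8, info, ";")         # column 8 is INFO
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^MQ=/) mq = substr(info[i], 4)
                if (info[i] ~ /^DP=/) dp = substr(info[i], 4)
            }
            # column 6 is QUAL; "+ 0" forces numeric comparison
            fail = ($6 + 0 < 50.0 || mq + 0 < 25.0 || dp + 0 < 30)
            print $1 ":" $2, (fail ? "FAIL" : "PASS")
        }
    ' "$@"
}

# Two made-up records: the first meets every threshold, the second has DP below 30.
printf 'chr1\t100\t.\tA\tG\t812.5\t.\tDP=64;MQ=60.0\nchr1\t250\t.\tC\tT\t95.1\t.\tDP=12;MQ=58.2\n' | label_snps
# prints:
#   chr1:100 PASS
#   chr1:250 FAIL
```

The INDEL default differs only in the QUAL threshold (200.0 instead of 50.0).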