Nextflow pipeline for whole-genome sequencing (WGS) analysis and variant calling in bacterial genomes using Illumina data, supporting de novo assembly and reference-based analysis.

WGS-Analysis-VariantCalling

Introduction

This repository hosts a pipeline built with Nextflow for whole-genome sequencing (WGS) analysis and genetic variant calling, optimized for Illumina sequencing data from bacterial genomes. It is designed to offer an automated, reproducible, and scalable solution for processing large-scale genomic data in clinical microbiology research.

Current pipeline of the project

Pipeline summary:

The pipeline includes the following steps:

  1. Quality Control: Raw read quality is assessed with FastQC. Low-quality bases and adapter sequences are then removed with fastp, after which FastQC is run again and MultiQC summarises the reports for the input data.

At this point, the two modes available in the pipeline differ in the reference genome they use. You can perform the variant calling against a de novo assembled reference strain or against an already available reference genome.

  • De-novo

    • Assembly: After quality control as previously described, de novo assembly using SPAdes.
    • Quality assembly assessment: Structural quality metrics of the assembly using QUAST and evaluation of biological completeness with BUSCO.
    • Annotation: Genome annotation using Prokka and Bakta.
  • Reference genome

    The pipeline includes a script to download the reads from a database using an accession list file (Acc_List.txt):

    bash ./workflow/bin/download_reads.sh
    

Once the reference genome has been provided, the pipeline follows the same steps for both modes:

  1. Alignment: Alignment against the selected reference genome with BWA-MEM and samtools.

  2. Quality control: Alignment quality control using QUAST.

  3. Aggregation of quality reports: MultiQC

  4. Variant calling and filtering:

    • Variant Identification: Detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) using PicardTools, GATK and/or FreeBayes.

    • Variant Filtering: Application of quality filters to obtain high-confidence variant calls (see Parameters).

    • Genetic variant annotation: Using SnpEff, a toolbox for annotating and predicting the functional effects of genetic variants on genes and proteins.

  5. Post-Alignment Analysis:

    • Mass screening of contigs for antimicrobial resistance or virulence genes using ABRIcate.

    • Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.

Installation

The prerequisites to run the pipeline are:

Clone the Repository:

# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/WGS-Analysis-VariantCalling.git

# Move into it
cd WGS-Analysis-VariantCalling

Local (conda)

conda env create -n bacteriano -f enviromentWGS.yaml
conda activate bacteriano

How to use it?

Run the pipeline using the following command, adjusting the parameters as needed:

DE NOVO

Important

The paired-end read files of the reference sample must be labelled as 1 (e.g. AB1_1.fastq.gz / AB1_2.fastq.gz).

nextflow run main.nf --mode novo --input "/path/to/data/*_{1,2}.fastq.gz" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>
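
The --input pattern expects each sample as a pair of files ending in _1.fastq.gz and _2.fastq.gz, so it can be useful to check the read directory before launching a run. The sketch below is not part of the pipeline: check_pairs is a hypothetical helper, and it assumes all reads sit in a single directory with exactly that naming.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline).
# Reports every *_1.fastq.gz file that has no matching *_2.fastq.gz mate.
# Returns 0 when all samples are paired, 1 otherwise.
check_pairs() {
  local dir="$1" status=0 r1 r2
  for r1 in "$dir"/*_1.fastq.gz; do
    # An unmatched glob stays literal, so a missing file means no reads at all.
    [ -e "$r1" ] || { echo "no *_1.fastq.gz files in $dir" >&2; return 1; }
    # Derive the expected mate name by swapping the _1 suffix for _2.
    r2="${r1%_1.fastq.gz}_2.fastq.gz"
    if [ ! -e "$r2" ]; then
      echo "missing mate for $(basename "$r1")"
      status=1
    fi
  done
  return "$status"
}
```

A possible use is to run check_pairs /path/to/data and start nextflow only if it returns 0.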

REFERENCE GENOME

nextflow run main.nf --mode reference --input "/path/to/data/*_{1,2}.fastq.gz" --personal_ref "/path/to/bacterial_genome.fasta" -profile <docker/singularity/conda>

Parameters

--mode: Analysis mode to run (novo/reference/conda).

--input: Path to input FASTQ paired-end files generated by Illumina sequencing (file format: .fastq.gz).

--outdir: Directory where the results will be stored (default: out).

-profile: Specifies the execution profile (docker, singularity or local).

--genome_name_db (only for --mode novo): Name of the organism, used to name the database in SnpEff.

--personal_ref (only for --mode reference): Path to the bacterial reference genome FASTA file.

Optional parameters

-w: Path to the temporary work directory where files will be stored (default: ./work).

Trimming

--cut_front: move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise. Default: 15

--cut_tail: move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise. Default: 20

--cut_mean_quality: the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36. Default: 20

--length_required: reads shorter than length_required will be discarded. Default: 50.

Filter

--qual_snp: One or more expressions used with INFO fields to quality filter SNPs. Default: "QUAL < 50.0 || MQ < 25.0 || DP < 30".

--qual_indel: One or more expressions used with INFO fields to quality filter INDELs. Default: "QUAL < 200.0 || MQ < 25.0 || DP < 30".

Note

QUAL: A confidence measure of the variant; MQ: Mapping quality; DP: Filtered reads that support each of the reported alleles (depth). More info here.
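
To make the default SNP expression concrete, the following sketch applies the same three thresholds to VCF records with awk. It only illustrates the logic. It assumes that records matching the expression are the ones flagged as failing, that MQ and DP are read from the INFO column, and that a missing value counts as 0. flag_snps and the LowQual label are hypothetical names, and the pipeline's own filtering step may behave differently.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of "QUAL < 50.0 || MQ < 25.0 || DP < 30".
# Reads VCF text on stdin and writes it back with the FILTER column (7)
# set to LowQual for records matching the expression, PASS otherwise.
flag_snps() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { print; next }            # keep header lines unchanged
    {
      mq = ""; dp = ""
      n = split($8, info, ";")      # INFO column holds KEY=VALUE pairs
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^MQ=/) mq = substr(info[i], 4)
        if (info[i] ~ /^DP=/) dp = substr(info[i], 4)
      }
      # A missing MQ or DP evaluates to 0 here, so the record is flagged.
      fail = ($6 + 0 < 50.0 || mq + 0 < 25.0 || dp + 0 < 30)
      $7 = fail ? "LowQual" : "PASS"
      print
    }'
}
```

It can be tried on an uncompressed VCF with flag_snps < calls.vcf, where calls.vcf is a placeholder file name.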

