WGS-Analysis-VariantCalling

Introduction

This repository hosts a pipeline built with Nextflow for whole-genome sequencing (WGS) analysis and genetic variant calling, optimized for Illumina sequencing data from bacterial genomes. It is designed to offer an automated, reproducible, and scalable solution for processing large-scale genomic data in clinical microbiology research.

Pipeline summary:

The pipeline includes the following steps:

Quality Control: Raw read quality is assessed with FastQC. Low-quality bases and adapter sequences are then removed with fastp, and the trimmed reads are checked again with FastQC and summarised with MultiQC.

From this point on, the two modes of the pipeline differ in where the reference genome comes from. Variant calling can be performed against a de novo assembled reference strain or against an already available reference genome.

De-novo

- Assembly: After the quality control described above, the reads are assembled de novo with SPAdes.
- Assembly quality assessment: Structural quality metrics of the assembly are computed with QUAST, and biological completeness is evaluated with BUSCO.
- Annotation: Genome annotation with Prokka and Bakta.

Reference genome

The pipeline includes a script that downloads the reads from the database using an accession list file, Acc_List.txt:
```
bash ./workflow/bin/download_reads.sh
```
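
The README does not describe the layout of Acc_List.txt. As a rough sketch, assuming the file holds one SRA/ENA/DDBJ run accession per line (an assumption, not something the repository states), the list could be sanity-checked before the download starts. The function name `check_acc_list` is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# ASSUMPTION: Acc_List.txt holds one run accession per line, i.e. an
# SRR, ERR or DRR prefix followed by digits.

# Prints every line that does not look like a run accession.
# Succeeds only if no such line exists.
check_acc_list() {
    # Drop Windows line endings, surrounding whitespace and blank lines,
    # then print whatever does not match the assumed accession pattern.
    ! tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
        | grep -Ev '^[SED]RR[0-9]+$'
}

# Usage: check_acc_list Acc_List.txt && bash ./workflow/bin/download_reads.sh
```

If the accession list follows a different convention, only the pattern passed to `grep` needs to change.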

Once the reference genome has been provided, the pipeline follows the same steps in both modes:

Alignment: Alignment against the selected reference genome with BWA-MEM and samtools.
Quality control: Alignment quality control using QUAST.
Aggregation of quality reports: MultiQC
Variant calling and filtering:
- Variant Identification: Detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) using PicardTools, GATK and/or FreeBayes.
- Variant Filtering: Application of quality filters to obtain high-confidence variant calls (see Parameters).
- Genetic variant annotation: Using SnpEff, a toolbox for annotating and predicting the functional effects of genetic variants on genes and proteins.
Post-Alignment Analysis:
- Mass screening of contigs for antimicrobial resistance or virulence genes using ABRIcate.
- Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.

Installation

The prerequisites to run the pipeline are:

- Install Nextflow
- Install Docker or Singularity for container support
- Ensure Java 8 or higher is installed
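
The prerequisites listed above can be checked before the first run with a short shell sketch. The helper names `have` and `check_prereqs` are illustrative and not part of this repository, and the check only tests whether each command is on `PATH`, not which version is installed.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight check; `have` and `check_prereqs` are hypothetical
# helper names, not part of the pipeline.

# Succeeds if the given command is available on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Reports each missing prerequisite on stderr; fails if any is missing.
check_prereqs() {
    local rc=0
    have nextflow || { echo "nextflow not found" >&2; rc=1; }
    have java     || { echo "java not found" >&2; rc=1; }
    # Either container engine is enough.
    have docker || have singularity || {
        echo "neither docker nor singularity found" >&2
        rc=1
    }
    return "$rc"
}

# Usage: check_prereqs && echo "all prerequisites found"
```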

Clone the Repository:

# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/WGS-Analysis-VariantCalling.git

# Move into it
cd WGS-Analysis-VariantCalling

Local (conda)

conda env create -n bacteriano -f enviromentWGS.yaml
conda activate bacteriano

How to use it?

Run the pipeline with one of the following commands, adjusting the parameters as needed:

DE NOVO

Important

The paired-end read files of the reference sample must be labelled 1 (e.g. AB1_1.fastq.gz / AB1_2.fastq.gz).

nextflow run main.nf --mode novo --input "/path/to/data/*_{1,2}.fastq.gz" --genome_name_db "Acinetobacter_baumannii_clinical" -profile <docker/singularity/conda>

REFERENCE GENOME

nextflow run main.nf --mode reference --input "/path/to/data/*_{1,2}.fastq.gz" --personal_ref "/path/to/bacterial_genome.fasta" -profile <docker/singularity/conda>
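
Both commands select reads with the glob `*_{1,2}.fastq.gz`, so a sample is only picked up as a pair if both files exist. The sketch below is a hypothetical helper (the name `check_pairs` is not part of the pipeline) that reports forward read files without a reverse mate before a run is launched.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: verifies that every
# forward read file (*_1.fastq.gz) in a directory has a reverse mate
# (*_2.fastq.gz), following the naming used by the --input glob.
check_pairs() {
    local dir="$1" missing=0 r1 r2
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every forward file had a mate.
    [ "$missing" -eq 0 ]
}

# Usage: check_pairs /path/to/data && echo "all samples are paired"
```

The check runs in one direction only: a reverse file without a forward file is not reported.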

Parameters

--mode: Analysis mode to run (novo/reference/conda).

--input: Path to input FASTQ paired-end files generated by Illumina sequencing (file format: .fastq.gz).

--outdir: Directory where the results will be stored (default: out).

-profile: Specifies the execution profile (docker, singularity or local).

--genome_name_db (only for --mode novo): Name of the organism, used to name the database in SnpEff.

--personal_ref (only for --mode reference): Path to the bacterial reference genome FASTA file.
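
Because `--genome_name_db` and `--personal_ref` each apply to one mode only, a wrapper script could check the combination before calling Nextflow. This is a minimal sketch under that reading of the parameter list. `check_mode_args` is a hypothetical name, and the pipeline itself may validate these options differently or not at all.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper check, not part of the pipeline: confirms that the
# option documented for each mode is present before launching.
# Arguments: mode, --genome_name_db value, --personal_ref value.
check_mode_args() {
    local mode="$1" genome_name_db="$2" personal_ref="$3"
    case "$mode" in
        novo)
            [ -n "$genome_name_db" ] || {
                echo "--mode novo needs --genome_name_db" >&2
                return 1
            }
            ;;
        reference)
            [ -f "$personal_ref" ] || {
                echo "--mode reference needs an existing --personal_ref FASTA" >&2
                return 1
            }
            ;;
    esac
    # Any other mode value is passed through unchecked.
    return 0
}

# Usage: check_mode_args reference "" /path/to/bacterial_genome.fasta
```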

Optional parameters

-w: Path to the temporary work directory where files will be stored (default: ./work).

Trimming

--cut_front: Move a sliding window from the front (5') to the tail; drop the bases in the window if their mean quality is below the threshold, otherwise stop. Default: 15.

--cut_tail: Move a sliding window from the tail (3') to the front; drop the bases in the window if their mean quality is below the threshold, otherwise stop. Default: 20.

--cut_mean_quality: Mean quality requirement shared by cut_front, cut_tail and cut_sliding. Range: 1-36. Default: 20.

--length_required: Reads shorter than length_required are discarded. Default: 50.

Filter

--qual_snp: One or more expressions used with INFO fields to quality filter SNPs. Default: "QUAL < 50.0 || MQ < 25.0 || DP < 30".

--qual_indel: One or more expressions used with INFO fields to quality filter INDELs. Default: "QUAL < 200.0 || MQ < 25.0 || DP < 30".

Note

QUAL: confidence measure of the variant call; MQ: mapping quality; DP: depth, i.e. the filtered reads that support each of the reported alleles.
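
To make the default SNP expression concrete, the sketch below re-implements it with awk for a single record. It is illustrative only: it assumes, as in GATK VariantFiltration, that a record matching the expression is the one that gets filtered, and the function name `snp_default_filter` is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: mirrors the default --qual_snp expression
# "QUAL < 50.0 || MQ < 25.0 || DP < 30" for one record.
# ASSUMPTION: a record that matches the expression is filtered out.
# Arguments: QUAL, MQ, DP. Prints FILTERED or PASS.
snp_default_filter() {
    awk -v qual="$1" -v mq="$2" -v dp="$3" 'BEGIN {
        # Adding 0 forces a numeric comparison.
        if (qual + 0 < 50.0 || mq + 0 < 25.0 || dp + 0 < 30) {
            print "FILTERED"
        } else {
            print "PASS"
        }
    }'
}

snp_default_filter 60 40 35   # prints PASS
snp_default_filter 60 40 29   # prints FILTERED (DP below 30)
```

The INDEL default differs only in the QUAL threshold (200.0 instead of 50.0).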

