Skip to content

This is the repository of the pipeline META-DIFF, which detects sequences in differential abundances between two conditions, and annotates them taxonomicaly and functionaly.

Notifications You must be signed in to change notification settings

Louis-MG/META-DIFF

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

META-DIFF 🦠

This is the repository of the pipeline META-DIFF, which detects differentially abundant sequences between two conditions, and annotates them taxonomically and functionally.

Motivation

Metagenomics becomes increasingly important in building our knowledge about microbes. Links between microbiome perturbations and diseases are regularly uncovered. Using kmer-based methods, this pipeline allows its users to quickly find microbial DNA sequences in differential abundances between two conditions (e.g. healthy and not healthy), and annotates them taxonomically and functionally. The pipeline also builds several machine-learning models, optimizes hyper-parameters to get the best one, and then calculates the contribution of each feature used.

❗ Important note: the pipeline can predict genes and annotate functions for prokaryotes only. ❗

Workflow

The workflow is described by the following figure :

Schematic of the META-DIFF pipeline

wiki ! 📚

Output

The output is divided in several key files:

  • kraken case/control taxonomic assignment output and report
  • case/control summary of the taxonomic assignment, with taxa (all levels) ordered by number of base assigned
Unclassified    512640685
Bacteroides (taxid 816) 31310507
Prevotella copri (taxid 165179) 15327700
Phocaeicola plebeius (taxid 310297)     14988657
Eubacteriales (taxid 186802)    14963744
Enterobacteriaceae (taxid 543)  12643762
Bacteroidales (taxid 171549)    11334934
Klebsiella (taxid 570)  10345553
Faecalibacterium prausnitzii (taxid 853)        10234613
Bacteria (taxid 2)      10146249
  • case/control functional annotation in the form of a barplot, table and a heatmap of pathways detected:

Heatmap

  • Machine-learning models and their performance (as well as feature selection):

Heatmap

  • table of unitigs to functions by condition. Each unitig is linked to the genes it contains and their function, KO number.
Gene ID Translated Gene seq Unitig ID Unitig seq Gene function KO CLade
Gene1 ARDENE Unitig1 ACGTCGCT Glucose transferase K00001 Bacteroides
Gene1 WPH Unitig2 ACGTCGCT Protease K00004 P. plebeius
Gene2 IFPSY Unitig1 GTCGATCATG Oxydase K00761 E. coli

Requirements

Check the wiki ! Ain't much, but it's honest work. Memory (RAM) needed will depend on the size of your alignment database. Disk space required mostly depends on the size of your dataset and databases. The number of kmers for 3To of CRC fasta files reached hundreds of millions, which is about 500G of fasta files for the first step. Other steps will use less disk. The database of MicrobeAnnotator is about 690G.

Installation

Clone the repository:

git clone https://github.com/Louis-MG/META-DIFF.git

Get your functional database ready by following instructions at MicrobeAnnotator. Don't worry, it's just a few lines that take a while. Recommandation is the full database, and the pipeline uses the Diamond search. Copy the path to the MicrobeAnnotator_DB in the snakemake/config.yaml file:

microbeannotator_db_path: "/path/to/MicrobeAnnotator_DB/"

Get a Kraken2 DB ready by checking the instructions at Kraken2. Copy the path to the snakemake/config.yaml file:

kraken_database_path: "/path/to/krakenDB/db_name"

Usage

Build a file-of-files yourself or using the provided script:

bash kmdiff_fof_prep.sh --help

This script generates the file of file (fof.txt) for kmdiff. Its arguments are:
	--cases -c <PATH> path to the directory of cases samples.
	--controls -C <PATH> path to the directory of control samples.
	--output -o <PATH> path to where the fof should be.
	-R1 <STRING> -R2 <STRING> strings to determine forward and reverse reads.
	--help -h displays this help message and exits.

EX: bash kmdiff_fof_prep.txt --cases /path/to/cases/ --controls /path/to/controls/ --output /path/to/output/ -R1 _R1 -R1 _R2

Output is :
	- a fof.txt named after the output parameter. The file is tab separated, format:
		control1: /path/to/control1_read1.fastq ; /path/to/control1_read2.fastq
		control2: /path/to/control2_read1.fastq ; /path/to/control2_read2.fastq
		case1: /path/to/case1_read1.fastq ; /path/to/case1_read2.fastq
		case2: /path/to/case2_read1.fastq ; /path/to/case2_read2.fastq

Add the last paths to ./snakemake/config.yaml:

# path to this repo, "META-DIFF/" included:
src_path: "/path/to/META-DIFF/"
#kraken db path:
kraken_database_path: "/path/to/db"
#microbeannotator db path:
microbeannotator_db_path: "/path/to/db"
# where your results will be
project_path: "/path/to/your/project/"
# path to your file of file:
fof: "/path/to/your/fof.txt"

Finally, ssssstart the pipeline 🐍:

snakemake --cores X --use-conda

Coming next

  • addition of a quarto report with all the numbers and figures
  • addition of a script to automatically set up the databases

Issues

If you have any issues, let me know in the Issues space, with an informative title and description.

Citations

Coming soon ! 🎓

About

This is the repository of the pipeline META-DIFF, which detects sequences in differential abundances between two conditions, and annotates them taxonomicaly and functionaly.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published