Pipeline

Louis-Mael Gueguen edited this page Mar 1, 2023 · 8 revisions

Overview

General view of the pipeline, decomposed into 3 steps:

For a more detailed explanation of the different steps, refer to the Report section.

Lighter

Lighter is a read correction tool based on kmer spectra. It randomly samples reads, from which it collects solid kmers (those that pass a threshold test). It then computes a Bloom filter corresponding to the solid kmers and corrects untrusted kmers into trusted ones.
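The idea of solid and untrusted kmers can be illustrated with a minimal sketch. It is not Lighter's implementation: it uses exact counts and a plain Python set where Lighter relies on sampling and Bloom filters, it skips the correction step, and the names `solid_kmers`, `untrusted_positions` and `min_count` are hypothetical.

```python
from collections import Counter

def kmers(read, k):
    """Yield every substring of length k of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def solid_kmers(reads, k, min_count):
    """Return the kmers seen at least min_count times across the reads.

    Assumption: an exact count threshold stands in for Lighter's own test.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, n in counts.items() if n >= min_count}

def untrusted_positions(read, k, solid):
    """Return the start positions of the kmers of a read that are not solid."""
    return [i for i, km in enumerate(kmers(read, k)) if km not in solid]

reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
solid = solid_kmers(reads, k=4, min_count=2)
bad = untrusted_positions("ACGTACGA", 4, solid)
```

Here the last kmer of the third read (`ACGA`) occurs only once, so it is the only one flagged as untrusted; a corrector would then look for a substitution that turns it into a solid kmer.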

Bcalm2

Bcalm2 is a tool designed to build De Bruijn graphs (DBG). In our pipeline, Bcalm2 first builds one DBG per sample, using all kmers, to obtain the unitigs (non-ambiguous paths in the DBG). It then builds a global DBG with a frequency threshold that ignores the rarest kmers, and extracts its unitigs. This step aims to keep the kmers that are rare in individual files but present in most of them (and thus unlikely to be sequencing errors), while ignoring the kmers that are rare in the whole dataset (most probably sequencing errors).
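The effect of thresholding at the global level rather than per sample can be sketched as follows. This is only an illustration under simplifying assumptions: kmers are counted exactly, reverse complements are ignored, no graph is built, and `global_kmers` and `min_total` are hypothetical names, not Bcalm2 options.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count every kmer of every sequence of one sample."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def global_kmers(samples, k, min_total):
    """Keep the kmers whose count summed over all samples reaches min_total."""
    total = Counter()
    for seqs in samples.values():
        total.update(count_kmers(seqs, k))
    return {km for km, n in total.items() if n >= min_total}

# Hypothetical toy dataset: each sample maps to its per-sample sequences.
samples = {"s1": ["ACGTA"], "s2": ["ACGTA"], "s3": ["ACGTT"]}
kept = global_kmers(samples, k=4, min_total=2)
```

`CGTA` is seen only once in each of `s1` and `s2`, so a per-sample threshold of 2 would discard it, yet it survives the global threshold. `CGTT` is seen once in the whole dataset and is dropped.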

REINDEER

REINDEER is a tool designed to index large datasets and retrieve abundances of sequences. In the pipeline, its role is to index the unitigs obtained for each sample. Then, it retrieves the abundances of the unitigs found at the global level in each sample. The output is a matrix whose rows are the unitigs from the global level and whose columns are the samples. Each cell holds the abundance of a unitig in a sample.
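The shape of that matrix can be sketched in a few lines. This does not use REINDEER: per-sample kmer counts are given as plain dictionaries, and summarising a unitig by the median count of its kmers is an assumption made for the example. The names `unitig_abundance` and `abundance_matrix` are hypothetical.

```python
from statistics import median

def unitig_abundance(unitig, kmer_counts, k):
    """Summarise a unitig by the median count of its kmers (assumed summary)."""
    counts = [kmer_counts.get(unitig[i:i + k], 0)
              for i in range(len(unitig) - k + 1)]
    return median(counts)

def abundance_matrix(unitigs, sample_counts, k):
    """Return (sample names, rows), with one row per unitig and one column per sample."""
    samples = sorted(sample_counts)
    rows = [[unitig_abundance(u, sample_counts[s], k) for s in samples]
            for u in unitigs]
    return samples, rows

# Hypothetical inputs: global unitigs and per-sample kmer counts.
unitigs = ["ACGTA", "CGTT"]
sample_counts = {
    "s1": {"ACGT": 4, "CGTA": 6},
    "s2": {"ACGT": 2, "CGTA": 2, "CGTT": 3},
}
samples, rows = abundance_matrix(unitigs, sample_counts, k=4)
```

A unitig that is absent from a sample simply gets an abundance of 0 in the corresponding cell, as for `CGTT` in `s1`.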

Modified DBGWAS

DBGWAS has been modified to take the REINDEER output matrix as input, and to use a standard format to represent a DBG (nodes and edges files). The modified DBGWAS applies a linear model to the matrix and finds the unitigs significantly associated with the phenotype of interest. The output is an interactive visualisation of the results, with the corrected p-values of the statistical test, a subgraph of the unitigs and their immediate neighbours, and annotations.
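To give an intuition of a per-unitig association test on such a matrix, here is a minimal sketch using only the standard library. It is not the model used by DBGWAS: each row is fitted with a simple least-squares regression, an exact permutation test is swapped in for the parametric test, and the Bonferroni correction is an assumption, since the correction method is not specified here. All function names are hypothetical.

```python
from itertools import permutations

def slope(x, y):
    """Least-squares slope of y regressed on x (0.0 when x is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

def permutation_p_value(x, y):
    """Share of phenotype orderings whose slope is at least as extreme as observed."""
    observed = abs(slope(x, y))
    perms = list(permutations(y))  # exhaustive, so only usable for tiny examples
    extreme = sum(1 for p in perms if abs(slope(x, p)) >= observed - 1e-12)
    return extreme / len(perms)

def associate(matrix, phenotype):
    """Return one Bonferroni-corrected p-value per row (unitig) of the matrix."""
    m = len(matrix)
    return [min(1.0, permutation_p_value(row, phenotype) * m) for row in matrix]

# Hypothetical data: 2 unitigs (rows) x 6 samples (columns), binary phenotype.
matrix = [[0, 0, 0, 5, 5, 5],
          [3, 3, 3, 3, 3, 3]]
phenotype = [0, 0, 0, 1, 1, 1]
corrected = associate(matrix, phenotype)
```

The first unitig, present only in the samples carrying the phenotype, gets the smaller corrected p-value, while the second, with the same abundance everywhere, gets 1.0. With so few samples no p-value can be small, which is why real datasets need many more.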
