bit metagenomics workflow

This is a Snakemake workflow for short-read metagenomics data. It takes an assembly-based approach, processing reads through to merged taxonomy and KO coverage tables, as well as recovering and characterizing MAGs. For all workflows available with bit, see here.



Overview

This workflow takes an assembly-based approach to ultimately recover taxonomic and KO coverage tables across all samples, and it also attempts to recover and characterize bacterial/archaeal MAGs. It currently only assembles individual samples, and does not perform any co-assembly.

  • fastqc/multiqc for read-quality assessment and summarization
  • bbmap/bbduk for read quality-filtering and trimming
  • megahit for assembly
  • prodigal for gene prediction
  • bowtie2 for read mapping
  • KOFamScan for functional annotation
  • CAT with the NCBI nr database for contig and gene taxonomic assignment
  • metabat2 for binning of contigs
  • checkm2 for estimating quality of bacterial/archaeal bins
  • GTDB-tk for assigning taxonomy to bacterial/archaeal MAGs ("MAGs" defined by cutoffs in config.yaml)
  • KEGG-decoder for functionally summarizing recovered MAGs

All required databases will be set up by the workflow the first time they are used, if they don't already exist. They can require up to ~400 GB of space during initial setup, and take up around 300 GB afterwards.


Usage

bit should be installed via conda as described here.
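
For reference, here is a minimal sketch of a typical conda-based setup (the environment name and channel setup here are assumptions; follow the linked instructions for the authoritative steps):

# create and activate a dedicated environment holding bit
# (assumes bit is distributed through bioconda)
conda create -n bit -c conda-forge -c bioconda bit
conda activate bit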

Retrieving the workflow

bit-get-workflow metagenomics

Creating the input file and modifying the config.yaml

Before running the workflow, you first need to make a single-column file holding the unique portion of the filename for each input sample, one per line. Some variables also need to be set in the config.yaml.

The primary things that need to be set are designated in the first block of the config.yaml. These include: the unique sample ID file mentioned just above; where the input reads are located; their expected filename suffix; and where the reference databases are (or should be put, if this is the first time running the workflow).
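
As a sketch of what that can look like (the sample IDs, read filenames, and exact config variable names here are hypothetical; check the comments in your config.yaml for the real ones):

# unique-sample-IDs.txt – one unique sample ID per line, e.g. for reads
# named Sample-1_R1.fastq.gz / Sample-1_R2.fastq.gz:
Sample-1
Sample-2

# corresponding hypothetical entries in the first block of config.yaml:
sample_info_file: "unique-sample-IDs.txt"
raw_reads_dir: "../raw-reads/"
raw_R1_suffix: "_R1.fastq.gz"
raw_R2_suffix: "_R2.fastq.gz"
REF_DBS_ROOT_DIR: "/path/to/reference-dbs/"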

There are many other options/settings that can be changed if desired, all described in the config.yaml.

Running the workflow

After variables are set in the config.yaml, here's an example of how it could be run (note that it should still be run inside the bit conda environment):

snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 4 -p
  • --use-conda – this specifies to use the conda environments included in the workflow
  • --conda-prefix – this allows us to point to where the needed conda environments should be stored. Including this means if we use the workflow on a different dataset somewhere else in the future, it will re-use the same conda environments rather than make new ones. The value listed here, ${CONDA_PREFIX}/envs, is the default location for conda environments (the variable ${CONDA_PREFIX} will be expanded to the appropriate location on whichever system it is run on).
  • -j – this lets us set how many jobs Snakemake should run concurrently (keep in mind that many of the thread and cpu parameters set in the config.yaml file will be multiplied by this)
  • -p – specifies to print out each command being run to the screen

See snakemake -h for more options and details.
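
When first getting set up, it can also be helpful to do a dry run, which prints what would be executed without actually running anything (the -n flag is Snakemake's standard dry-run option):

snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 4 -p -n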


Primary outputs

A primary output directory called "workflow-outputs" is produced, with the following sub-directories and contents:

  • fastqc-outputs/
    • multiqc summaries of fastqc reports for input and filtered reads
  • filtered-reads/
    • quality-filtered reads
  • assemblies/
    • assemblies in fasta format
    • a tab-delimited table with assembly summary metrics ("assembly-summaries.tsv")
  • predicted-genes/
    • predicted genes in fasta, faa, and gff format
  • read-mapping/
    • bam files from mapping reads to their respective assemblies, as well as the mapping summary info, and the depth/coverage info used with metabat for binning
  • annotations-and-taxonomy/
    • contig-level coverages and taxonomy files, and gene-level coverages, KO annotations, and taxonomy files
  • bins/
    • bins recovered in fasta format
    • an overview summary table ("bins-overview.tsv"), including summary metrics and checkm2 quality estimates
  • MAGs/
    • bins that surpass the quality thresholds set in config.yaml (>= 90% estimated completion; <= 10% estimated redundancy/contamination) in fasta format
    • an overview summary table ("MAGs-overview.tsv"), including GTDB-assigned taxonomy
    • KO annotations per MAG ("MAG-level-KO-annotations.tsv")
    • MAG-level KO summaries via KEGGDecoder ("MAG-KEGG-Decoder-out.html" and "MAG-KEGG-Decoder-out.tsv")
  • combined-outputs/
    • combined contig-level taxonomic coverages across samples
    • combined gene-level KO coverages across samples
      • for each of those, there is also a version normalized to coverage per million ("CPM"); this works like a percentage, except the values are scaled to sum to 1,000,000 instead of 100 (see the sketch after this list)
        • these CPM tables might be useful for some visualizations, but that type of conversion is typically not suitable for any statistics or hierarchical clustering/ordination
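
For illustration, here is a minimal sketch of that CPM conversion done in shell with awk (the filename and column layout are assumptions; it presumes a tab-delimited table with a header line and coverage values in column 2):

# sum the coverage column, skipping the header line
total=$(awk -F'\t' 'NR>1 {sum += $2} END {print sum}' coverages.tsv)

# rescale each coverage so the new CPM column sums to 1,000,000
awk -F'\t' -v OFS='\t' -v total="$total" \
    'NR==1 {print $0, "CPM"; next} {print $0, ($2 / total) * 1000000}' \
    coverages.tsv > coverages-with-CPM.tsv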

Please feel free to post an issue or send an email with any questions about the outputs produced!

Other outputs

The workflow-outputs directory will also hold a "processing-overview.tsv", which contains some basic info about the major steps in the processing.

In the directory the workflow was executed from, a logs/ directory will hold log files for the majority of steps performed, and a benchmarks/ directory will hold time and resource-utilization info (as described here) for most steps.


Version info

Note that the workflows are versioned independently of the bit package. When you pull one with bit-get-workflow, the directory name will include the version, and it is also listed at the top of the Snakefile.
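
For example (the directory name here is hypothetical; yours will reflect whichever version you pulled):

# the version is part of the pulled directory name
ls -d metagenomics-wf-v*/
# and it is also noted near the top of the Snakefile
head metagenomics-wf-v*/Snakefile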

All versions of programs used can be found in their corresponding conda yaml file in the envs/ directory.