Merge pull request #12 from metagenlab/sep_af
Update for conda envs and Readme
idfarbanecha authored Mar 11, 2022
2 parents 4177aba + e2644c8 commit f4c10cc
Showing 63 changed files with 8,296 additions and 605 deletions.
138 changes: 77 additions & 61 deletions README.md
@@ -4,55 +4,24 @@
[![container](https://quay.io/repository/biocontainers/mess/status)](https://quay.io/repository/biocontainers/mess)

The Metagenomic Sequence Simulator (MeSS) is a snakemake workflow used for simulating metagenomic mock communities.
## Installation
```bash
git clone https://github.com/metagenlab/MeSS.git
conda env create -f Mess/messenv.yml
```
### Or

# Installation
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mess/README.html)

In order to quickly get going with MeSS I recommend using the conda package manager and specifically mamba (a fast alternative to conda):
```bash
conda install -c conda-forge mamba
conda install -n base -c conda-forge mamba
mamba create -c bioconda -n mess mess
```
## Required files
### Input table examples
MeSS takes the same input as Assembly_finder, with an additional column for either coverage values, read percentages or
relative abundances.
#### Read percentage
Below is an example of an input table in which the user sets, for each entry, its share of the total metagenomic reads:

TaxonomyInput | nb_genomes | PercentReads
--- | --- | ---
1813735 | 1 | 0.3
114185 | 1 | 0.4
ATCC_13985 | 3 | 0.3

If the PercentReads column is not present, MeSS generates an even distribution within superkingdoms.
For the input table shown above, this gives each of the five genomes a read percentage of 20%
(as all entries belong to the same superkingdom: bacteria).
#### Coverage values
The user also has the option to set coverage values, instead of percentages of the total metagenomic reads, for each entry.

TaxonomyInput | nb_genomes | Coverage
--- | --- | ---
1813735 | 1 | 20
114185 | 1 | 30
ATCC_13985 | 3 | 20

In this case, all 3 assemblies found for ATCC_13985 will have the same coverage value of 20
#### Relative proportions
Alternatively, the user can specify relative proportions between assemblies. Given the total number of reads to
be present in the metagenome, scripts will calculate coverage and read numbers respecting the relative proportions.

TaxonomyInput | nb_genomes | RelativeProp
--- | --- | ---
1813735 | 1 | 0.3
114185 | 1 | 0.4
ATCC_13985 | 3 | 0.3
# Quick start
To run mess, you simply have to provide a config.yml file with a list of parameters:
```bash
mess run -f config.yml -c 10
```
Examples of config.yml files are provided in data/hmp_templates (parameters are explained below).

For ATCC_13985, the 3 genomes will have a RelativeProp value of 0.1.
### Config file example
# Config file example
```yaml
#MeSS parameters
input_table_path: input_table.tsv
@@ -93,31 +62,82 @@ representative_assemblies: False
exclude_from_metagenomes: True
Genbank_assemblies: True
Refseq_assemblies: True
Rank_to_filter_by: 'None'
Rank_to_filter_by: False
```
#### Mess parameters
# Required files
## Input table examples
MeSS takes the same input as Assembly_finder, with an additional column for either coverage values, read percentages or
relative abundances.
### Read percentage
Below is an example of an input table in which the user sets, for each entry, its share of the total metagenomic reads:

Taxonomy | NbGenomes | PercentReads
--- | --- | ---
1813735 | 1 | 0.3
114185 | 1 | 0.4
ATCC_13985 | 3 | 0.3

If the PercentReads column is not present, MeSS generates an even distribution within superkingdoms.
For the input table shown above, this gives each of the five genomes a read percentage of 20%
(as all entries belong to the same superkingdom: bacteria).
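
The even-distribution fallback can be sketched in a few lines of Python. This is only an illustration, not code taken from MeSS: it assumes, as the 20% figure suggests, that reads are shared equally among all genomes of a superkingdom. The function name `even_read_fractions` and the hard-coded superkingdom labels are hypothetical.

```python
from collections import defaultdict


def even_read_fractions(entries):
    """Give every genome an equal share of its superkingdom's reads.

    entries: list of (taxonomy, nb_genomes, superkingdom) tuples.
    Returns {taxonomy: per-genome fraction of that superkingdom's reads}.
    (Hypothetical helper; assumed behaviour, not the MeSS implementation.)
    """
    genomes_per_kingdom = defaultdict(int)
    for _, nb_genomes, kingdom in entries:
        genomes_per_kingdom[kingdom] += nb_genomes
    return {tax: 1 / genomes_per_kingdom[kingdom] for tax, _, kingdom in entries}


# Same entries as the table above; the superkingdom labels are filled in by hand.
table = [
    ("1813735", 1, "Bacteria"),
    ("114185", 1, "Bacteria"),
    ("ATCC_13985", 3, "Bacteria"),
]
fractions = even_read_fractions(table)
print(fractions)
```

With five bacterial genomes in total, every genome receives one fifth of the reads, which matches the 20% stated above.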
### Coverage values
The user also has the option to set coverage values, instead of percentages of the total metagenomic reads, for each entry.

Taxonomy | NbGenomes | Coverage
--- | --- | ---
1813735 | 1 | 20
114185 | 1 | 30
ATCC_13985 | 3 | 20

In this case, all 3 assemblies found for ATCC_13985 will have the same coverage value of 20.
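
To give a feel for what a coverage value means in terms of read numbers, here is a minimal sketch based on the usual relation coverage = reads × read length / genome length. Whether MeSS computes read numbers exactly this way is an assumption. The function name `reads_for_coverage`, the 5 Mb genome length and the 150 bp read length are illustrative values, not taken from the MeSS documentation.

```python
def reads_for_coverage(coverage, genome_length, read_length, paired=False):
    """Estimate how many reads (or read pairs) reach a target coverage.

    Uses coverage = n_reads * bases_per_read / genome_length.
    (Hypothetical helper; assumed formula, not the MeSS implementation.)
    """
    # A read pair contributes two reads' worth of bases.
    bases_per_unit = read_length * (2 if paired else 1)
    return round(coverage * genome_length / bases_per_unit)


# Illustrative numbers: a 5 Mb genome sequenced at 20x with 150 bp reads.
single_end_reads = reads_for_coverage(20, 5_000_000, 150)
read_pairs = reads_for_coverage(20, 5_000_000, 150, paired=True)
print(single_end_reads, read_pairs)
```

Because each of the 3 ATCC_13985 assemblies gets the same coverage of 20, their read numbers differ only if their genome lengths differ.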
### Relative proportions
Alternatively, the user can specify relative proportions between assemblies. Given the total number of reads to
be present in the metagenome, scripts will calculate coverage and read numbers respecting the relative proportions.

Taxonomy | NbGenomes | RelativeProp
--- | --- | ---
1813735 | 1 | 0.3
114185 | 1 | 0.4
ATCC_13985 | 3 | 0.3

For ATCC_13985, each of the 3 genomes will have a RelativeProp value of 0.1.
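
The split of an entry's proportion across its genomes can be written out as a short sketch. It assumes the proportion is divided equally among the genomes of an entry (as in the ATCC_13985 example) and then scaled by the total number of reads. The function name `reads_from_proportions` and the total of 1,000,000 reads are hypothetical.

```python
def reads_from_proportions(rows, total_reads):
    """Turn per-entry relative proportions into per-genome values.

    rows: list of (taxonomy, nb_genomes, relative_prop) tuples.
    Returns {taxonomy: (per-genome proportion, per-genome read count)}.
    (Hypothetical helper; assumed behaviour, not the MeSS implementation.)
    """
    result = {}
    for taxonomy, nb_genomes, relative_prop in rows:
        per_genome_prop = relative_prop / nb_genomes
        result[taxonomy] = (per_genome_prop, round(per_genome_prop * total_reads))
    return result


rows = [("1813735", 1, 0.3), ("114185", 1, 0.4), ("ATCC_13985", 3, 0.3)]
per_genome = reads_from_proportions(rows, total_reads=1_000_000)
for taxonomy, (prop, reads) in per_genome.items():
    print(taxonomy, round(prop, 3), reads)
```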
### Read counts
Finally, the user can define the number of reads to simulate for each entry, as shown below:

Taxonomy | NbGenomes | Reads
--- | --- | ---
1813735 | 1 | 10000
114185 | 1 | 10000
ATCC_13985 | 3 | 30000

For ATCC_13985, 10000 reads will be simulated for each of the 3 genomes.
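
As a quick check of a read-count table before launching a run, the per-genome numbers can be derived with the standard library. The sketch below assumes a tab-separated file (the config example points to `input_table.tsv`) with the column names shown above; the helper `reads_per_genome` is hypothetical and not part of MeSS.

```python
import csv
import io

# In-memory copy of the table above, written as tab-separated text.
TABLE = (
    "Taxonomy\tNbGenomes\tReads\n"
    "1813735\t1\t10000\n"
    "114185\t1\t10000\n"
    "ATCC_13985\t3\t30000\n"
)


def reads_per_genome(tsv_text):
    """Divide each entry's read count equally among its genomes.

    (Hypothetical helper; assumes an equal split, as in the ATCC_13985 example.)
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Taxonomy"]: int(row["Reads"]) // int(row["NbGenomes"])
        for row in reader
    }


per_genome_reads = reads_per_genome(TABLE)
print(per_genome_reads)
```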
## Mess parameters
The path to the input table can be set by the input_table_path parameter in the config file as shown above.
MeSS offers the possibility to generate multiple mock communities using the same set of assembly files in
the same directory.
For this, the user has to set up one configuration file per mock community and change the community_name accordingly.
#### Replicates parameters
### Replicates parameters
The user has the option to create a set of replicates for one community. Each replicate read number can be drawn
from a normal distribution with a standard deviation set in the sd_read_num parameter.
#### Random seeds
### Random seeds
The MeSS workflow uses random seeds for read generation and read shuffling. To ensure reproducible results, one can
give the seed parameter a fixed number.
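
The effect of a fixed seed on replicate read numbers can be illustrated with Python's standard `random` module. This is a sketch under assumptions: it treats `sd_read_num` as an absolute standard deviation around `total_reads`, which may differ from how MeSS interprets it, and the function name `replicate_read_numbers` and the numeric values are hypothetical.

```python
import random


def replicate_read_numbers(total_reads, sd_read_num, n_replicates, seed):
    """Draw one read number per replicate from a normal distribution.

    (Hypothetical helper; sd_read_num is assumed to be an absolute
    standard deviation, which is not confirmed by the MeSS documentation.)
    """
    rng = random.Random(seed)  # a private generator makes the draws reproducible
    return [
        max(0, round(rng.gauss(total_reads, sd_read_num)))
        for _ in range(n_replicates)
    ]


first_run = replicate_read_numbers(1_000_000, 50_000, n_replicates=2, seed=1)
second_run = replicate_read_numbers(1_000_000, 50_000, n_replicates=2, seed=1)
print(first_run, first_run == second_run)
```

Two calls with the same seed return identical read numbers, while changing the seed produces a different set of replicates.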
#### Sequencing run params
### Sequencing run params
MeSS offers the possibility to select art_illumina or pbsim2 to simulate Illumina and long reads respectively.
In addition, read pairing and the total amount of reads can be set using the read_status and total_reads parameters.
#### Illumina (art params)
### Illumina (art params)
MeSS uses [art_illumina](https://academic.oup.com/bioinformatics/article/28/4/593/213322) to generate Illumina reads,
and the user can change parameters like read and fragment length under the art_illumina params section, as shown in the
YAML file above.
#### Long reads (pbsim2 params)
### Long reads (pbsim2 params)
For long read simulation, [pbsim2](https://github.com/yukiteruono/pbsim2) was integrated in the pipeline.
pbsim2 randomly samples reads from a reference sequence following a gamma distribution, and errors are introduced following
FIC-HMM models for different chemistries.
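
To illustrate what sampling read lengths from a gamma distribution looks like, here is a small sketch using the standard library. It does not reproduce pbsim2's own code or options: the mapping from mean and standard deviation to shape and scale is the textbook one, and the function name `sample_read_lengths` and the numeric values are illustrative assumptions.

```python
import random


def sample_read_lengths(mean_length, sd_length, n_reads, seed=None):
    """Sample read lengths from a gamma distribution with a given mean and sd.

    For a gamma distribution, mean = shape * scale and
    variance = shape * scale**2, hence the two lines below.
    (Hypothetical helper; not pbsim2's implementation.)
    """
    rng = random.Random(seed)
    shape = (mean_length / sd_length) ** 2
    scale = sd_length ** 2 / mean_length
    # Round to whole bases and keep every read at least 1 bp long.
    return [max(1, round(rng.gammavariate(shape, scale))) for _ in range(n_reads)]


# Illustrative values only: mean 9000 bp, standard deviation 7000 bp.
lengths = sample_read_lengths(9000, 7000, n_reads=10_000, seed=42)
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```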
@@ -128,12 +148,12 @@ As for a Nanopore sequencing run using a R9.4 flowcell, the user can set the val
For more details check [pbsim2's documentation](https://github.com/yukiteruono/pbsim2/blob/master/README.md)
#### Assembly download
### Assembly download
MeSS uses [Assembly_finder](https://github.com/metagenlab/assembly_finder) to download genomes, and requires the user
to have an NCBI account. For more details on Assembly_finder parameters check its documentation.
to have an NCBI account. For more details on Assembly_finder parameters check its [documentation](https://github.com/metagenlab/assembly_finder/blob/master/README.md).
## Running MeSS
### Snakemake command
# Running MeSS
## Snakemake command
Here is an example command to run MeSS on the previously described config and input table.
```bash
snakemake --snakefile path/to/MeSS/Snakefile --configfile config.yml \
@@ -147,13 +167,9 @@ Thus, for big genomes it is recommended to lower this parameter.

**parallel_cat** controls the number of genomes to be concatenated in parallel.
For big genomes and computers with low memory, lowering this parameter lowers memory usage.
### Using the wrapper
```bash
mess run -f config.yml -p /path/to/conda/envs/ -c 10
```
Runs the Mess workflow using 10 cores.
## MeSS outputs
### Directory structure

# MeSS outputs
## Directory structure
After running MeSS for two replicates of the same metagenome with single end reads,
your working directory should look like this:
```
2 changes: 1 addition & 1 deletion config.yml
@@ -51,7 +51,7 @@ Genbank_assemblies: True
Refseq_assemblies: True

##Parameters for the filtering function
Rank_to_filter_by: 'None'
Rank_to_filter_by: False
#None: Assemblies are ranked by their assembly status (complete or not)
#and Refseq category (reference, representative ...)
#If you want to filter by species, set this parameter to 'species'. The filtering function will list all unique species