Merge pull request #12 from metagenlab/sep_af

Update for conda envs and Readme
metagenlab · Mar 11, 2022 · f4c10cc · f4c10cc
2 parents 4177aba + e2644c8
commit f4c10cc
Show file tree

Hide file tree

Showing 63 changed files with 8,296 additions and 605 deletions.
diff --git a/README.md b/README.md
@@ -4,55 +4,24 @@
 [![container](https://quay.io/repository/biocontainers/mess/status)](https://quay.io/repository/biocontainers/mess)
 
 The Metagenomic Sequence Simulator (MeSS) is a snakemake workflow used for simulating metagenomic mock communities.
-## Installation
-```bash
-git clone https://github.com/metagenlab/MeSS.git
-conda env create -f Mess/messenv.yml
-```
-### Or
+
+# Installation
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mess/README.html)
+
+In order to quickly get going with MeSS I recommend using the conda package manager and specifically mamba (a fast alternative to conda):
 ```bash
-conda install -c conda-forge mamba
+conda install -n base -c conda-forge mamba
 mamba create -c bioconda -n mess mess
 ```
-## Required files
-### Input table examples
-MeSS takes the same input as Assembly_finder, with an additional column for either coverage values, read percentages or
-relative abundances.
-#### Read percentage
-Below is an example of input table where the user can set, for each entry, read percentages of the total metagenomic reads
-
-TaxonomyInput | nb_genomes | PercentReads
---- | --- | ---
-1813735 | 1 | 0.3
-114185 | 1 |  0.4
-ATCC_13985 | 3 |  0.3
-
-If the percent read column is not present, MeSS will generate an even distribution within superkingdoms.
-In the input table shown above, if no PercentReads is present, each entry will have a read percentage of 20%
-(as all entries belong to the same superkingdom: bacteria)
-#### Coverage values
-The user has also the option to set coverage values instead of %reads of the total metagenomic reads for each entry.
-
-TaxonomyInput | nb_genomes | Coverage
---- | --- | ---
-1813735 | 1 | 20
-114185 | 1 |  30
-ATCC_13985 | 3 |  20
 
-In this case, all 3 assemblies found for ATCC_13985 will have the same coverage value of 20
-#### Relative proportions 
-Alternatively, the user can specify relative proportions between assemblies. Given the total number of reads to
-be present in the metagenome, scripts will calculate coverage and read numbers respecting the relative proportions.
-
-TaxonomyInput | nb_genomes | RelativeProp
---- | --- | ---
-1813735 | 1 | 0.3
-114185 | 1 |  0.4
-ATCC_13985 | 3 |  0.3
+# Quick start
+To run mess, you simply have to provide a config.yml file with a list of parameters:
+```bash
+mess run -f config.yml -c 10
+```
+Examples of config.yml files are provided in data/hmp_templates (parameters are explained below).
 
-For ATCC_13985, the 3 genomes will have a RelativeProp value of 0.1.
-### Config file example
+# Config file example
 ```yaml
 #MeSS parameters
 input_table_path: input_table.tsv
@@ -93,31 +62,82 @@ representative_assemblies: False
 exclude_from_metagenomes: True
 Genbank_assemblies: True
 Refseq_assemblies: True
-Rank_to_filter_by: 'None'
+Rank_to_filter_by: False
 ```
-#### Mess parameters
+# Required files
+## Input table examples
+MeSS takes the same input as Assembly_finder, with an additional column for either coverage values, read percentages or
+relative abundances.
+### Read percentage
+Below is an example of input table where the user can set, for each entry, read percentages of the total metagenomic reads
+
+Taxonomy | NbGenomes | PercentReads
+--- | --- | ---
+1813735 | 1 | 0.3
+114185 | 1 |  0.4
+ATCC_13985 | 3 |  0.3
+
+If the percent read column is not present, MeSS will generate an even distribution within superkingdoms.
+In the input table shown above, if no PercentReads is present, each entry will have a read percentage of 20%
+(as all entries belong to the same superkingdom: bacteria)
+### Coverage values
+The user has also the option to set coverage values instead of %reads of the total metagenomic reads for each entry.
+
+Taxonomy | NbGenomes | Coverage
+--- | --- | ---
+1813735 | 1 | 20
+114185 | 1 |  30
+ATCC_13985 | 3 |  20
+
+In this case, all 3 assemblies found for ATCC_13985 will have the same coverage value of 20
+### Relative proportions 
+Alternatively, the user can specify relative proportions between assemblies. Given the total number of reads to
+be present in the metagenome, scripts will calculate coverage and read numbers respecting the relative proportions.
+
+Taxonomy | NbGenomes | RelativeProp
+--- | --- | ---
+1813735 | 1 | 0.3
+114185 | 1 |  0.4
+ATCC_13985 | 3 |  0.3
+
+For ATCC_13985, the 3 genomes will have a RelativeProp value of 0.1.
+
+### Read counts
+
+Finally, the user can define the raw reads to simulate per genome as shown below: 
+
+ Taxonomy | NbGenomes | Reads
+--- | --- | ---
+1813735 | 1 | 10000
+114185 | 1 |  10000
+ATCC_13985 | 3 |  30000
+
+For ATCC_13985, 10000 reads will be simulated for each genome
+
+
+## Mess parameters
 The path to the input table can be set by the input_table_path parameter in the config file as shown above.
 
 MeSS offers the possibility to generate multiple mock communities using the same set of assembly files in 
 the same directory.
 For this, the user has to set up one configuration file per mock community and change the community_name accordingly. 
-#### Replicates parameters
+### Replicates parameters
 The user has the option to te create a set of replicates for one community. Each replicate read number can be drawn 
 from a normal distribution with a standard deviation set in the sd_read_num parameter. 
-#### Random seeds
+### Random seeds
 The MeSS workflow uses random seeds for read generation and read shuffling. To ensure reproducible results, one can
 give the seed parameter a fixed number.
 
-#### Sequencing run params
+### Sequencing run params
 MeSS offers the possibility to select art_illumina or pbsim2 to simulate illumina and long reads respecitvely.
 In addition, read pairing and the total amount of reads can be set using the read_status and total_reads parameters.
 
-#### Illumina (art params)
+### Illumina (art params)
 MeSS uses [art_illumina](https://academic.oup.com/bioinformatics/article/28/4/593/213322) to generate illumina reads, 
 and the user can change parameters like read and fragment length under the art_illumina params section as shown the 
 yaml file above. 
 
-#### Long reads (pbsim2 params)
+### Long reads (pbsim2 params)
 For long read simulation, [pbsim2](https://github.com/yukiteruono/pbsim2) was integrated in the pipeline.
 pbsim2 randomly samples reads from a reference sequence following a gamma distribution, and errors are introduced following 
 FIC-HMM models for different chemistries.
@@ -128,12 +148,12 @@ As for a Nanopore sequencing run using a R9.4 flowcell, the user can set the val
 
 For more details check [pbsim2's documentation](https://github.com/yukiteruono/pbsim2/blob/master/README.md)
 
-#### Assembly download
+### Assembly download
 MeSS uses [Assembly_finder](https://github.com/metagenlab/assembly_finder) to download genomes, and requires the user 
-to have an NCBI account. For more details on Assembly_finder parameters check its documentation.
+to have an NCBI account. For more details on Assembly_finder parameters check its [documentation](https://github.com/metagenlab/assembly_finder/blob/master/README.md).
 
-## Running MeSS
-### Snakemake command
+# Running MeSS
+## Snakemake command
 Here is an example command to run MeSS on the previously described config and input table.
 ```bash
 snakemake --snakefile path/to/MeSS/Snakefile --configfile config.yml \
@@ -147,13 +167,9 @@ Thus, for big genomes it is recommended to lower this parameter.
 
 **parallel_cat** controlls the number of genomes to be concatenated in parallel. 
 For big genomes and computers with low memory, lowering this parameter lowers memory usage.
-### Using the wrapper
-```bash
-mess run -f config.yml -p /path/to/conda/envs/ -c 10
-```
-Runs the Mess workflow using 10 cores.
-## MeSS outputs
-### Directory structure
+
+# MeSS outputs
+## Directory structure
 After running MeSS for two replicates of the same metagenome with single end reads, 
 your working directory should look like this:
 ```

diff --git a/config.yml b/config.yml
@@ -51,7 +51,7 @@ Genbank_assemblies: True
 Refseq_assemblies: True
 
 ##Parameters for the filtering function
-Rank_to_filter_by: 'None'
+Rank_to_filter_by: False
 #None: Assemblies are ranked by their assembly status (complete or not)
 #and Refseq category (reference, representative ...)
 #If you want to filter by species, set this parameter to 'species'. The filtering function will list all unique species