Fix duplicated fasta paths #24

Merged
merged 5 commits into from
Sep 4, 2024
91 changes: 64 additions & 27 deletions README.md
@@ -2,6 +2,7 @@

[![](https://img.shields.io/static/v1?label=CLI&message=Snaketool&color=blueviolet)](https://github.com/beardymcjohnface/Snaketool)
[![license](https://img.shields.io/github/license/metagenlab/mess.svg)](https://github.com/metagenlab/MeSS/blob/main/LICENSE)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mess/README.html)
[![version](https://img.shields.io/conda/vn/bioconda/mess?color=blue)](http://bioconda.github.io/recipes/mess/README.html)
[![downloads](https://img.shields.io/conda/dn/bioconda/mess.svg)](https://anaconda.org/bioconda/mess)

@@ -14,7 +15,7 @@

The Metagenomic Sequence Simulator (MeSS) is a [Snakemake](https://github.com/snakemake/snakemake) pipeline, implemented using [Snaketool](https://github.com/beardymcjohnface/Snaketool), for simulating Illumina, Oxford Nanopore (ONT) and Pacific Biosciences (PacBio) shotgun metagenomic samples.

## :memo: Overview
## :mag: Overview

MeSS takes NCBI taxa or local genome assemblies as input and generates either long (PacBio or ONT) or short (Illumina) reads. In addition to reads, MeSS optionally generates BAM alignment files and taxonomic + sequence abundances in [CAMI format](https://github.com/bioboxes/rfc/blob/master/data-format/profiling.mkd).

@@ -25,7 +26,7 @@ input["samples.tsv
or
samples/*.tsv"] --> taxons

subgraph genome_download["genome download"]
subgraph genome_download["`**genome download**`"]
dlchoice{download ?}
taxons["taxons or
accessions"] --> dlchoice
@@ -35,7 +36,7 @@ assembly_finder --> fasta
end

input --> distchoice
subgraph community_design["community design"]
subgraph community_design["`**community design**`"]
distchoice{draw distribution ?}
distchoice -->|yes| dist["distribution
(lognormal, even)"]
@@ -58,9 +59,7 @@ simulator --> bam
simulator --> fastq
simulator --> CAMI-profile

%% colors
style genome_download color:black
style community_design color:black
%% subgraph color fills
classDef red fill:#faeaea,color:#fff,stroke:#333;
classDef blue fill:#eaecfa,color:#fff,stroke:#333;
class genome_download blue
@@ -71,47 +70,85 @@ class community_design red
More details can be found in the [documentation](https://metagenlab.github.io/MeSS/)

## :zap: Quick start
### Installation

#### Mamba

[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mess/README.html)

### :gear: Installation
Mamba
```sh
mamba create -n mess mess
```

#### Docker

Docker
```sh
docker pull ghcr.io/metagenlab/mess:latest
```

#### From source

From source
```sh
git clone https://github.com/metagenlab/MeSS.git
pip install -e MeSS
```

### Usage
### :page_facing_up: Usage
#### :arrow_right: Input
Let's simulate two metagenomic samples with the following taxa and read counts in `samples.tsv`:
| sample | taxon | reads |
| --- | --- | --- |
| sample1 | 487 | 174840 |
| sample1 | 727 | 90679 |
| sample1 | 729 | 13129 |
| sample2 | 28132 | 147863 |
| sample2 | 199 | 147545 |
| sample2 | 729 | 131300 |
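
The table above is rendered as markdown, while `mess run -i samples.tsv` expects a file on disk. A minimal sketch for writing those same rows to `samples.tsv` is shown here; it assumes the file is plain tab-separated text with the header `sample`, `taxon`, `reads`, which matches the `sep="\t"` used when MeSS reads sample tables but is otherwise not confirmed by this README.

```python
import csv

# Rows copied from the table above: (sample, taxon, reads).
rows = [
    ("sample1", 487, 174840),
    ("sample1", 727, 90679),
    ("sample1", 729, 13129),
    ("sample2", 28132, 147863),
    ("sample2", 199, 147545),
    ("sample2", 729, 131300),
]

# Assumption: a tab-delimited file with a single header row.
with open("samples.tsv", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "taxon", "reads"])
    writer.writerows(rows)
```
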

#### :rocket: Command
Let's run MeSS (using apptainer as the software deployment method)!
```sh
mess run -i samples.tsv --sdm apptainer
```
#### :card_index_dividers: Outputs

```sh
📦mess_out
┣ 📂assembly_finder
┃ ┣ 📂download
┃ ┃ ┣ 📂GCF_000144405.1
┃ ┃ ┃ ┗ 📜GCF_000144405.1_ASM14440v1_genomic.fna.gz
┃ ┃ ┣ 📂GCF_001298465.1
┃ ┃ ┃ ┗ 📜GCF_001298465.1_ASM129846v1_genomic.fna.gz
┃ ┃ ┣ 📂GCF_016127215.1
┃ ┃ ┃ ┗ 📜GCF_016127215.1_ASM1612721v1_genomic.fna.gz
┃ ┃ ┣ 📂GCF_020736045.1
┃ ┃ ┃ ┗ 📜GCF_020736045.1_ASM2073604v1_genomic.fna.gz
┃ ┃ ┣ 📂GCF_022869645.1
┃ ┃ ┃ ┗ 📜GCF_022869645.1_ASM2286964v1_genomic.fna.gz
┃ ┃ ┗ 📜.snakemake_timestamp
┣ 📂fastq
┃ ┣ 📜sample1_R1.fq.gz
┃ ┣ 📜sample1_R2.fq.gz
┃ ┣ 📜sample2_R1.fq.gz
┃ ┗ 📜sample2_R2.fq.gz
┣ 📜config.yaml
┣ 📜coverages.tsv
┗ 📜mess.log
```

Outputs are described in more detail [here](https://metagenlab.github.io/MeSS/guide/output/)

#### Download and simulate
#### :bar_chart: Resource usage
Resource usage measured over three runs with one CPU (within a [nextflow](https://github.com/nextflow-io/nextflow) process):

Using the following file [minimal_test.tsv](https://github.com/metagenlab/MeSS/blob/main/mess/test_data/minimal_test.tsv)
| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| ------- | --------- | --------- | -------- | --------- | ---- | ----------------------- | -------- | -------- | ------ | -------- | --------- | ------ | ------ |
| 1 | fe/03c2bc | 62286 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:41:15.820 | 1m 50s | 1m 50s | 111.5% | 1.8 GB | 9 GB | 3.5 GB | 2.4 GB |
| 1 | ff/0d03b1 | 73355 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:55:12.903 | 1m 52s | 1m 52s | 112.6% | 1.7 GB | 8.8 GB | 3.5 GB | 2.4 GB |
| 1 | 07/d352bf | 83576 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:57:30.600 | 1m 50s | 1m 50s | 113.2% | 1.7 GB | 8.9 GB | 3.5 GB | 2.4 GB |

```sh
mess run -i minimal_test.tsv
```
> On average, using `samples.tsv`, MeSS runs in under 2 min and uses around 1.8 GB of physical RAM

#### Simulate from local fasta
> [!NOTE]
> Resource usage was measured excluding dependency deployment time (conda env creation or container pulling)

Download the [fasta directory](https://github.com/metagenlab/MeSS/tree/main/mess/test_data/fastas) and [table](https://github.com/metagenlab/MeSS/blob/main/mess/test_data/simulate_test.tsv)
More details on resource usage in the [documentation](https://metagenlab.github.io/MeSS/benchmarks/resource-usage/)

```sh
mess simulate -i simulate_test.tsv --fasta fasta
```

## :sos: Help

17 changes: 11 additions & 6 deletions mess/workflow/rules/preflight/functions.smk
@@ -69,16 +69,21 @@ fasta_cache = {}

 def fasta_input(wildcards):
     table = checkpoints.calculate_genome_coverages.get(**wildcards).output[0]
-    if table not in fasta_cache:
-        df = pd.read_csv(table, sep="\t", index_col="fasta")
-        fasta_cache[table] = df
-    df = fasta_cache[table]
-    return df.loc[wildcards.fasta]["path"]
+
+    df = pd.read_csv(table, sep="\t", index_col="fasta")
+    try:
+        return df.loc[wildcards.fasta]["path"].drop_duplicates()
+    except AttributeError:
+        return df.loc[wildcards.fasta]["path"]
+    # some samples use the same genome path, drop duplicates to avoid duplicate paths when processing fasta


 def list_fastas(wildcards):
     table = checkpoints.calculate_genome_coverages.get(**wildcards).output[0]
-    df = pd.read_csv(table, sep="\t")
+    if table not in fasta_cache:
+        df = pd.read_csv(table, sep="\t")
+        fasta_cache[table] = df
+    df = fasta_cache[table]
     fastas = list(set(df["fasta"]))
     return expand(os.path.join(dir.out.processing, "{fasta}.fasta"), fasta=fastas)
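
The `try`/`except AttributeError` in `fasta_input` relies on how pandas label lookup behaves: when an index label is repeated, `df.loc[label]["path"]` is a Series (which has `drop_duplicates()`), and when the label is unique it is a plain string (which does not). The sketch below illustrates that behaviour on a made-up table; the labels, paths and the `lookup_path` helper are hypothetical and are not part of MeSS.

```python
import pandas as pd

# Hypothetical coverage table: "genomeA" is shared by two samples.
df = pd.DataFrame(
    {
        "fasta": ["genomeA", "genomeA", "genomeB"],
        "path": [
            "fastas/genomeA.fna",
            "fastas/genomeA.fna",
            "fastas/genomeB.fna",
        ],
    }
).set_index("fasta")


def lookup_path(table, fasta):
    """Return the path(s) for one fasta label, without repeats."""
    paths = table.loc[fasta]["path"]
    try:
        # Repeated label: `paths` is a Series, so duplicates can be dropped.
        return paths.drop_duplicates()
    except AttributeError:
        # Unique label: `paths` is a plain string without drop_duplicates().
        return paths


shared = lookup_path(df, "genomeA")
unique = lookup_path(df, "genomeB")
print(list(shared))
print(unique)
```

Note that the two branches return different types (a one-element Series versus a string), which mirrors the shape of the change in this PR.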

2 changes: 2 additions & 0 deletions mess/workflow/scripts/samples.py
@@ -11,6 +11,8 @@
dfs = []
for file in files:
df = pd.read_csv(file, sep="\t")
df.columns = df.columns.str.replace(" ", "")
df = df.map(lambda x: x.replace(" ", "") if isinstance(x, str) else x)
dfs.append(df)
try:
samples = list(set(df["sample"]))
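
The two lines added to `samples.py` remove spaces from column names and from every string cell, so a header such as `taxon ` still matches `taxon`. A small sketch of the effect on a made-up sample sheet follows; the input text and the `strip_spaces` helper are illustrative assumptions, and the `applymap` fallback is only there because `DataFrame.map` was introduced in pandas 2.1.

```python
import io

import pandas as pd

# Hypothetical sample sheet with a stray space in a header and in a value.
raw = "sample\ttaxon \treads\nsample 1\t487\t174840\n"
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Remove spaces from the column names.
df.columns = df.columns.str.replace(" ", "")


def strip_spaces(x):
    """Remove spaces from strings; leave other values untouched."""
    return x.replace(" ", "") if isinstance(x, str) else x


# DataFrame.map is the pandas >= 2.1 name; older releases call it applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
df = elementwise(strip_spaces)

print(list(df.columns))
print(df.loc[0, "sample"])
```

This removes every space, including ones inside a value, which is what `replace(" ", "")` does in the diff; trimming only leading and trailing whitespace would be a different design.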