Fix hamronization fargene input #411

Merged (8 commits, Aug 20, 2024)
CHANGELOG.md (7 changes: 4 additions & 3 deletions)
@@ -7,7 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

### `Breaking change`

[#391](https://github.com/nf-core/funcscan/pull/391) Made all "database" parameter names consistent, skip hmmsearch by default. (by @jasmezz)
- [#391](https://github.com/nf-core/funcscan/pull/391) Made all "database" parameter names consistent, skip hmmsearch by default. (by @jasmezz)

| Old parameter | New parameter |
| ------------------------------------------------ | --------------------------------------- |
@@ -27,6 +27,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
| `amp_skip_hmmsearch` | `amp_run_hmmsearch` |
| `bgc_skip_hmmsearch` | `bgc_run_hmmsearch` |

- [#343](https://github.com/nf-core/funcscan/pull/343) Standardized the resulting workflow summary tables to always start with 'sample_id\tcontig_id\t..'. Reformatted the output of `hamronization/summarize` module. (by @darcy220606)
- [#411](https://github.com/nf-core/funcscan/pull/411) Optimised hAMRonization input: only high-quality hits from fARGene output are reported. (by @jasmezz, @jfy133)

### `Added`

- [#322](https://github.com/nf-core/funcscan/pull/322) Updated all modules: introduce environment.yml files. (by @jasmezz)
@@ -44,7 +47,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

### `Fixed`

- [#348](https://github.com/nf-core/funcscan/pull/348) Updated samplesheet for pipeline tests to 'samplesheet_reduced.csv' with smaller datasets to reduce resource consumption. Updated prodigal module to fix pigz issue. Removed `tests/` from `.gitignore`. (by @darcy220606)
- [#362](https://github.com/nf-core/funcscan/pull/362) Save annotations from bakta in subdirectories per sample. (by @jasmezz)
- [#363](https://github.com/nf-core/funcscan/pull/363) Removed warning from DeepBGC usage docs. (by @jasmezz)
@@ -53,7 +55,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- [#376](https://github.com/nf-core/funcscan/pull/376) Fixed an occasional RGI process failure when certain files not produced. (❤️ to @amizeranschi for reporting, fix by @amizeranschi & @jfy133)
- [#386](https://github.com/nf-core/funcscan/pull/386) Updated DeepBGC module to fix output file names, separate annotation step for all BGC tools, add warning if no BGCs found, fix MultiQC reporting of annotation workflow. (by @jfy133, @jasmezz)
- [#392](https://github.com/nf-core/funcscan/pull/392) & [#397](https://github.com/nf-core/funcscan/pull/397) Fixed a docker/singularity-only error appearing when running with conda. (❤️ to @ewissel for reporting, fix by @jfy133 & @jasmezz)
- [#394](https://github.com/nf-core/funcscan/pull/394) Fixed BGC input channel: pre-annotated input is picked up correctly now. (by @jfy133, @jasmezz)
- [#391](https://github.com/nf-core/funcscan/pull/391) Skip hmmsearch by default so the pipeline does not crash if the user provides no HMM files, updated docs. (by @jasmezz)
- [#397](https://github.com/nf-core/funcscan/pull/397) Removed deprecated AMPcombi module, fixed variable name in BGC workflow, updated minor parts in docs (usage, parameter schema). (by @jasmezz)
- [#402](https://github.com/nf-core/funcscan/pull/402) Fixed BGC length calculation for antiSMASH hits by comBGC. (by @jasmezz)
conf/modules.config (4 changes: 2 additions & 2 deletions)
@@ -279,13 +279,13 @@ process {
path: { "${params.outdir}/arg/fargene/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
pattern: "*/{predictedGenes,retrievedFragments}/*"
pattern: "*/{hmmsearchresults,predictedGenes,retrievedFragments}/*"
],
[
path: { "${params.outdir}/arg/fargene/${meta.id}/" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
pattern: "*/{hmmsearchresults,tmpdir}/*",
pattern: "*/{tmpdir}/*",
enabled: params.arg_fargene_savetmpfiles
]
]
docs/output.md (2 changes: 1 addition & 1 deletion)
@@ -327,7 +327,7 @@ Output Summaries:
- `fargene/`
- `fargene_analysis.log`: logging output that fARGene produced during its run
- `<sample_name>/`:
- `hmmsearchresults/`: output from intermediate hmmsearch step (only if `--arg_fargene_savetmpfiles` supplied)
- `hmmsearchresults/`: output from intermediate hmmsearch step
- `predictedGenes/`:
- `*-filtered.fasta`: nucleotide sequences of predicted ARGs
- `*-filtered-peptides.fasta`: amino acid sequences of predicted ARGs
docs/usage.md (103 changes: 65 additions & 38 deletions)
@@ -44,6 +44,31 @@

```
work # Directory containing temporary files required for the run
# Other nextflow hidden files, eg. history of pipeline runs and old logs
```

If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file.

Pipeline settings can be provided in a `yaml` or `json` file via `-params-file <file>`.

:::warning
Do not use `-c <file>` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
:::

The above pipeline run specified with a params file in yaml format:

```bash
nextflow run nf-core/funcscan -profile docker -params-file params.yaml
```

with `params.yaml` containing:

```yaml
input: './samplesheet.csv'
outdir: './results/'
genome: 'GRCh37'
<...>
```

You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).

## Samplesheet input

nf-core/funcscan takes FASTA files as input, typically contigs or whole genome sequences. To supply these to the pipeline, you will need to create a samplesheet with information about the samples you would like to analyse. Use the `--input` parameter to specify its location.
Expand Down Expand Up @@ -95,13 +120,15 @@ The implementation of some tools in the pipeline may have some particular behavi

MMseqs2 is currently the only taxonomic classification tool used in the pipeline to assign a taxonomic lineage to the input contigs. The database used to assign the taxonomic lineage can either be:

- a custom based database created by the user using `mmseqs createdb` externally and beforehand. If this flag is assigned, this database takes precedence over the default database in `--mmseqs_db_id`.
- A custom database created by the user with `mmseqs createdb`, externally and beforehand. If this flag is assigned, this database takes precedence over the default database in `--mmseqs_db_id`.

```bash
--taxa_classification_mmseqs_db 'path/to/mmsesqs_custom_database/dir'
--taxa_classification_mmseqs_db '<path>/<to>/<mmsesqs_custom_database>/<directory>'
```

- an MMseqs2 ready database. These databases were compiled by the developers of MMseqs2 and can be called using their labels. All available options can be found [here](https://github.com/soedinglab/MMseqs2/wiki#downloading-databases). Only use those databases that have taxonomy files available (i.e., Taxonomy == Yes). By default mmseqs2 in the pipeline uses '[Kalamari](https://github.com/lskatz/Kalamari)', and runs an aminoacid based alignment. However, if the user requires a more comprehensive taxonomic classification, we recommend the use of [GTDB](https://gtdb.ecogenomic.org/), but for that please remember to increase the memory, CPU threads and time required for the process `MMSEQS_TAXONOMY`.
The contents of the directory should have files such as `<dbname>.version` and `<dbname>.taxonomy` in the top level.

- An MMseqs2 ready database. These databases were compiled by the developers of MMseqs2 and can be called using their labels. All available options can be found [here](https://github.com/soedinglab/MMseqs2/wiki#downloading-databases). Only use databases that have taxonomy files available (i.e., Taxonomy == Yes). By default, the pipeline uses '[Kalamari](https://github.com/lskatz/Kalamari)' and runs an amino acid-based alignment. However, if you require a more comprehensive taxonomic classification, we recommend [GTDB](https://gtdb.ecogenomic.org/); in that case, remember to increase the memory, CPU threads and time for the process `MMSEQS_TAXONOMY`.

```bash
--taxa_classification_mmseqs_db_id 'Kalamari'
```

@@ -146,9 +173,11 @@ tar xvzf db.tar.gz
And then passed to the pipeline with:

```bash
--annotation_bakta_db /<path>/<to>/db/
--annotation_bakta_db /<path>/<to>/<db>/
```

The contents of the directory should have files such as `*.dmnd` in the top level.
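Alternatively, a local copy can be fetched with Bakta's own downloader (a sketch; assumes the `bakta` conda package is installed, and note that the database lands in a `db/` or `db-light/` subfolder of the chosen output path):

```bash
# Download the full Bakta database to a local directory
bakta_db download --output /<path>/<to> --type full
```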

:::info
The flag `--save_db` saves the pipeline-downloaded databases in your results directory. You can then move these to a central cache directory of your choice for re-use in the future.
:::
@@ -174,9 +203,11 @@ Ensure to wrap this path in double quotes if using an asterisk, to ensure Nextflow
For AMPcombi, nf-core/funcscan will by default download the most recent version of the [DRAMP](http://dramp.cpu-bioinfor.org/) database as a reference database for aligning the AMP hits in the AMP workflow. However, the user can also supply their own custom AMP database by following the guidelines in [AMPcombi](https://github.com/Darcy220606/AMPcombi). This can then be passed to the pipeline with:

```bash
--amp_ampcombi_db '/<path>/<to>/<amp_ref_database>
--amp_ampcombi_db '/<path>/<to>/<ampcombi_database>
```

The contents of the directory should have files such as `*.dmnd` and `*.fasta` in the top level.
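The `*.dmnd` file is a DIAMOND index of the reference FASTA. If you assemble a custom database yourself, it could be generated along these lines (a sketch; file names are placeholders and assume DIAMOND is installed):

```bash
# Build a DIAMOND index next to the custom reference FASTA
diamond makedb --in amp_ref_database.fasta --db amp_ref_database
```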

:::warning
The pipeline will automatically run Pyrodigal instead of Prodigal if the parameters `--run_annotation_tool prodigal --run_amp_screening` are both provided. This is due to an incompatibility issue of Prodigal's output `.gbk` file with multiple downstream tools.
:::
@@ -210,21 +241,28 @@

```bash
conda activate abricate

## Download the bacmet2 database
abricate-get_db --db bacmet2 ## the logging will tell you where the database is downloaded to, e.g. /home/<user>/bin/miniconda3/envs/abricate/db/bacmet2/sequences

## Run nextflow
nextflow run nf-core/funcscan -r <version> -profile docker --input samplesheet.csv --outdir <outdir> --run_arg_screening --arg_abricate_db /home/<user>/bin/miniconda3/envs/abricate/db/ --arg_abricate_db_id bacmet2
```
The resulting directory and database name can be passed to the pipeline as follows:

```bash
--arg_abricate_db /<path>/<to>/<abricate>/db/ --arg_abricate_db_id bacmet2
```

The top level of the directory should contain a subdirectory named after the database (e.g. `bacmet2/`).

### AMRFinderPlus

AMRFinderPlus relies on NCBI's curated Reference Gene Database and curated collection of Hidden Markov Models.

nf-core/funcscan will download this database for you, unless the path to a local version is given with:

```bash
--arg_amrfinderplus_db '/<path>/<to>/<amrfinderplus_db>/'
--arg_amrfinderplus_db '/<path>/<to>/<amrfinderplus_db>/latest'
```

You must pass the `latest` directory to the pipeline; the top level of this directory should include files such as `*.nbd`, `*.nhr`, `versions.txt` etc.
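The versioned layout, including the `latest` symlink, is what `amrfinder_update` produces, so once the tool is installed (see the steps below) the download itself boils down to (a sketch; the target directory is a placeholder):

```bash
# Download or update the local AMRFinderPlus database; this creates versioned
# subdirectories plus a 'latest' symlink inside the target directory
amrfinder_update -d /<path>/<to>/<amrfinderplus_db>
```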

To obtain a local version of the database:

1. Install AMRFinderPlus from [bioconda](https://bioconda.github.io/recipes/ncbi-amrfinderplus/README.html?highlight=amrfinderplus). To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release (check version in file `<installation>/<path>/funcscan/modules/nf-core/amrfinderplus/run/environment.yml`).
Expand Down Expand Up @@ -284,6 +322,8 @@ You can then supply the path to resulting database directory with:
--arg_deeparg_db '/<path>/<to>/<deeparg>/<db>/'
```

The contents of the directory should include directories such as `database` and `model`, and files such as `deeparg.gz`, in the top level.

Note that if you supply your own database that is not downloaded by the pipeline, make sure to also supply `--arg_deeparg_db_version` along
with the version number so hAMRonization will correctly display the database version in the summary report.
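For example (a sketch; the version number shown is illustrative):

```bash
--arg_deeparg_db '/<path>/<to>/<deeparg>/<db>/' --arg_deeparg_db_version 2
```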

@@ -304,6 +344,8 @@ You can then supply the path to the resulting database directory with:

```bash
--arg_rgi_db '/<path>/<to>/<card>/'
```

The contents of the directory should include files such as `card.json`, `aro_index.tsv`, `snps.txt` etc. in the top level.
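If you prefer to fetch CARD yourself rather than let the pipeline download it, the steps look roughly like this (a sketch; URL and archive layout as documented on the CARD website at the time of writing):

```bash
# Fetch the latest CARD release and unpack card.json into a local directory
wget https://card.mcmaster.ca/latest/data -O card_data.tar.bz2
mkdir -p card
tar -xjf card_data.tar.bz2 -C card ./card.json
```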

:::info
The flag `--save_db` saves the pipeline-downloaded databases in your results directory. You can then move these to a central cache directory of your choice for re-use in the future.
:::
@@ -324,24 +366,34 @@ To supply the database directories to the pipeline:

```bash
--bgc_antismash_db '/<path>/<to>/<antismash>/<db>/'
--bgc_antismash_installdir '/<path>/<to>/<antismash>/<dir>/'
--bgc_antismash_installdir '/<path>/<to>/<antismash>/<dir>/antismash'
```

Note that the names of the supplied folders must differ from each other (e.g. `antismash_db` and `antismash_dir`). If they are not provided, the databases will be auto-downloaded upon each BGC screening run of the pipeline.
The contents of the database directory should include directories such as `as-js/`, `clusterblast/`, `clustercompare/` etc. in the top level.
The contents of the installation directory should include directories such as `common/` and `config/`, and files such as `custom_typing.py` and `custom_typing.pyi`, in the top level.


Note that the names of the two required folders must differ from each other (i.e., the `--bgc_antismash_db` directory must not be called `antismash`).
If they are not provided, the databases will be auto-downloaded upon each BGC screening run of the pipeline.
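If you want to pre-download the databases yourself instead, antiSMASH ships a helper script for this (a sketch; assumes a local antiSMASH installation providing `download-antismash-databases`):

```bash
# Fetch all antiSMASH databases into a directory of your choice
download-antismash-databases --database-dir /<path>/<to>/<antismash>/<db>
```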

:::info
If installing with conda, the installation directory will be `lib/python3.10/site-packages/antismash` from the base directory of your conda install or conda environment directory.
The flag `--save_db` saves the pipeline-downloaded databases in your results directory. You can then move these to a central cache directory of your choice for re-use in the future.
:::

### DeepBGC

DeepBGC relies on trained models and Pfams to run its analysis. nf-core/funcscan will download these databases for you. If the flag `--save_db` is set, the downloaded files will be stored in the output directory under `databases/deepbgc/`.

Alternatively, if you already downloaded the database locally with `deepbgc download`, you can indicate the path to the database folder with `--bgc_deepbgc_db <path>/<to>/<deepbgc_db>/`. The folder has to contain the subfolders as in the database folder downloaded by `deepbgc download`:
Alternatively, if you already downloaded the database locally with `deepbgc download`, you can indicate the path to the database folder with:

```bash
--bgc_deepbgc_db <path>/<to>/<deepbgc_db>/
```

The contents of the database directory should include directories such as `common/` and `0.1.0/` in the top level:

```console
deepbgc_db/
├── common
└── 0.1.0
    ├── classifier
    │   └── myClassifiers*.pkl
    └── detector
        └── myDetectors*.pkl
```


## Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
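For example:

```bash
nextflow pull nf-core/funcscan
```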
modules.json (12 changes: 6 additions & 6 deletions)
@@ -55,6 +55,11 @@
"git_sha": "4e5f4687318f24ba944a13609d3ea6ebd890737d",
"installed_by": ["modules"]
},
"argnorm": {
"branch": "master",
"git_sha": "e4fc46af5ec30070e6aef780aba14f89a28caa88",
"installed_by": ["modules"]
},
"bakta/bakta": {
"branch": "master",
"git_sha": "9d0f89b445e1f5b2fb30476f4be9a8b519c07846",
Expand Down Expand Up @@ -87,7 +92,7 @@
},
"fargene": {
"branch": "master",
"git_sha": "a7231cbccb86535529e33859e05d19ac93f3ea04",
"git_sha": "9cf6f5e4ad9cc11a670a94d56021f1c4f9a91ec1",
"installed_by": ["modules"]
},
"gecco/run": {
Expand Down Expand Up @@ -205,11 +210,6 @@
"git_sha": "4e5f4687318f24ba944a13609d3ea6ebd890737d",
"installed_by": ["modules"],
"patch": "modules/nf-core/untar/untar.diff"
},
"argnorm": {
"branch": "master",
"git_sha": "e4fc46af5ec30070e6aef780aba14f89a28caa88",
"installed_by": ["modules"]
}
}
},