From 16c8ab6a2de03e7c61068a336772cb6a1e7f5dc5 Mon Sep 17 00:00:00 2001
From: "James A. Fellows Yates" 
Date: Wed, 14 Aug 2024 09:41:01 +0200
Subject: [PATCH 1/5] Improve documentation for all download help

---
 docs/usage.md | 154 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 112 insertions(+), 42 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index f58afa9a..06d31b51 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -44,6 +44,31 @@ work # Directory containing temporary files required for the run
 # Other nextflow hidden files, eg. history of pipeline runs and old logs
 ```

+If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file.
+
+Pipeline settings can be provided in a `yaml` or `json` file via `-params-file `.
+
+:::warning
+Do not use `-c ` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
+:::
+
+The above pipeline run specified with a params file in yaml format:
+
+```bash
+nextflow run nf-core/funcscan -profile docker -params-file params.yaml
+```
+
+with `params.yaml` containing:
+
+```yaml
+input: './samplesheet.csv'
+outdir: './results/'
+genome: 'GRCh37'
+<...>
+```
+
+You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).
+
 ## Samplesheet input

 nf-core/funcscan takes FASTA files as input, typically contigs or whole genome sequences. To supply these to the pipeline, you will need to create a samplesheet with information about the samples you would like to analyse. Use this parameter to specify its location.
@@ -135,7 +160,19 @@ As a reference, we will describe below where and how you can obtain databases an

 nf-core/funcscan offers multiple tools for annotating input sequences. Bakta is a new tool touted as a bacteria-only successor to the well-established Prokka.

-To supply the preferred Bakta database (and not have the pipeline download it for every new run), use the flag `--annotation_bakta_db`. The full or light Bakta database must be downloaded from the Bakta Zenodo archive, the link of which can be found on the [Bakta GitHub repository](https://github.com/oschwengers/bakta#database-download).
+To supply the preferred Bakta database (and not have the pipeline download it for every new run), use the flag `--annotation_bakta_db`.
+The full or light Bakta database must be downloaded from the Bakta Zenodo archive.
+
+You can do this by installing via conda and using the dedicated command
+
+```bash
+conda create -n bakta -c bioconda bakta
+conda activate bakta
+
+bakta_db download --output
+```
+
+Alternatively, you can manually download the filesvia the links of which can be found on the [Bakta GitHub repository](https://github.com/oschwengers/bakta#database-download).

 Once downloaded this must be untarred:

@@ -157,7 +194,8 @@ The flag `--save_db` saves the pipeline-downloaded databases in your results dir

 nf-core/funcscan allows screening of sequences for functional genes associated with various natural product types via Hidden Markov Models (HMMs) using hmmsearch.

-This requires supplying a list of HMM files ending in `.hmm`, that have models for the particular molecule(s) or BGCs you are interested in. You can download these files from places such as [PFAM](https://www.ebi.ac.uk/interpro/download/Pfam/) for antimicrobial peptides (AMP), or the antiSMASH GitHub repository for [biosynthetic gene cluster](https://github.com/antismash/antismash/tree/master/antismash/detection/hmm_detection/data) related HMMs, or create them yourself.
+This requires supplying a list of HMM files ending in `.hmm`, that have models for the particular molecule(s) or BGCs you are interested in.
+You can download these files from places such as [PFAM](https://www.ebi.ac.uk/interpro/download/Pfam/) for antimicrobial peptides (AMP), or the antiSMASH GitHub repository for [biosynthetic gene cluster](https://github.com/antismash/antismash/tree/master/antismash/detection/hmm_detection/data) related HMMs, or create them yourself.

 You should place all HMMs in a directory, supply them to the AMP or BGC workflow and switch hmmsearch on:

@@ -171,14 +209,27 @@ Ensure to wrap this path in double quotes if using an asterisk, to ensure Nextfl

 ### AMPcombi

-For AMPcombi, nf-core/funcscan will by default download the most recent version of the [DRAMP](http://dramp.cpu-bioinfor.org/) database as a reference database for aligning the AMP hits in the AMP workflow. However, the user can also supply their own custom AMP database by following the guidelines in [AMPcombi](https://github.com/Darcy220606/AMPcombi). This can then be passed to the pipeline with:
+For AMPcombi, nf-core/funcscan will by default download the most recent version of the [DRAMP](http://dramp.cpu-bioinfor.org/) database as a reference database, and modify the files for aligning the AMP hits in the AMP workflow.
+
+nf-core/funcscan currently provides a python3 helper script to do these steps.
+
+```bash
+mkdir -p ampcombi/amp_ref_database
+cd ampcombi/
+wget https://github.com/nf-core/funcscan/raw//bin/ampcombi_download.py
+python3 ampcombi_download.py
+```
+
+However, the user can also supply their own custom AMP database by following the guidelines in [AMPcombi](https://github.com/Darcy220606/AMPcombi).
+This can then be passed to the pipeline with:

 ```bash
 --amp_ampcombi_db '///'
 ```

 :::warning
-The pipeline will automatically run Pyrodigal instead of Prodigal if the parameters `--run_annotation_tool prodigal --run_amp_screening` are both provided. This is due to an incompatibility issue of Prodigal's output `.gbk` file with multiple downstream tools.
+The pipeline will automatically run Pyrodigal instead of Prodigal if the parameters `--run_annotation_tool prodigal --run_amp_screening` are both provided.
+This is due to an incompatibility issue of Prodigal's output `.gbk` file with multiple downstream tools.
 :::

 ### Abricate

@@ -227,8 +278,16 @@ nf-core/funcscan will download this database for you, unless the path to a local

 To obtain a local version of the database:

-1. Install AMRFinderPlus from [bioconda](https://bioconda.github.io/recipes/ncbi-amrfinderplus/README.html?highlight=amrfinderplus). To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release (check version in file `//funcscan/modules/nf-core/amrfinderplus/run/environment.yml`).
-2. Run `amrfinder --update`, which will download the latest version of the AMRFinderPlus database to the default location (location of the AMRFinderPlus binaries/data). It creates a directory in the format YYYY-MM-DD.version (e.g., `//data/2024-01-31.1/`).
+1. Install AMRFinderPlus from [bioconda](https://bioconda.github.io/recipes/ncbi-amrfinderplus/README.html?highlight=amrfinderplus).
+   To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release (check version in file `//funcscan/modules/nf-core/amrfinderplus/run/environment.yml`).
+
+   ```bash
+   conda create -n amrfinderplus -c bioconda ncbi-amrfinderplus=3.12.8
+   conda activate amrfinderplus
+   ```
+
+2. Run `amrfinder --update`, which will download the latest version of the AMRFinderPlus database to the default location (location of the AMRFinderPlus binaries/data).
+   It creates a directory in the format YYYY-MM-DD.version (e.g., `//data/2024-01-31.1/`).
 AMR related files in the database folder

@@ -269,6 +328,12 @@ nf-core/funcscan can download this database for you, however it is very slow and
 You can either:

 1. Install DeepARG from [bioconda](https://bioconda.github.io/recipes/deeparg/README.html?highlight=deeparg)
+
+   ```bash
+   conda create -n deeparg -c bioconda deeparg
+   conda activate deeparg
+   ```
+
 2. Run `deeparg download_data -o ////`

 Or download the files directly from

@@ -284,19 +349,29 @@ You can then supply the path to resulting database directory with:
 --arg_deeparg_db '/////'
 ```

-Note that if you supply your own database that is not downloaded by the pipeline, make sure to also supply `--arg_deeparg_db_version` along
-with the version number so hAMRonization will correctly display the database version in the summary report.
+Note that if you supply your own database that is not downloaded by the pipeline, make sure to also supply `--arg_deeparg_db_version` along with the version number so hAMRonization will correctly display the database version in the summary report.

 :::info
-The flag `--save_db` saves the pipeline-downloaded databases in your results directory. You can then move these to a central cache directory of your choice for re-use in the future.
+The flag `--save_db` saves the pipeline-downloaded databases in your results directory.
+You can then move these to a central cache directory of your choice for re-use in the future.
 :::

 ### RGI

-RGI requires the database CARD which can be downloaded by nf-core/funcscan or supplied by the user manually. To download and supply the database yourself, do:
+RGI requires the database CARD which can be downloaded by nf-core/funcscan or supplied by the user manually.
+To download and supply the database yourself, do:

 1. Download [CARD](https://card.mcmaster.ca/latest/data)
+
+   ```bash
+   wget https://card.mcmaster.ca/latest/data
+   ```
+
-2. Extract the archive.
+2. Extract the (`.tar.bz2`) archive.
+
+   ```bash
+   tar -xjvf data
+   ```

 You can then supply the path to resulting database directory with:

@@ -305,7 +380,8 @@ You can then supply the path to resulting database directory with:
 ```

 :::info
-The flag `--save_db` saves the pipeline-downloaded databases in your results directory. You can then move these to a central cache directory of your choice for re-use in the future.
+The flag `--save_db` saves the pipeline-downloaded databases in your results directory.
+You can then move these to a central cache directory of your choice for re-use in the future.
 :::

 ### antiSMASH

@@ -318,8 +394,14 @@ The same applies for the antiSMASH installation directory, which is also a requi

 To supply the database directories to the pipeline:

-1. Install antiSMASH from [bioconda](https://bioconda.github.io/recipes/antismash-lite/README.html)
-2. Run `download-antismash-databases`
+1. Install antiSMASH from [bioconda](https://bioconda.github.io/recipes/antismash-lite/README.html) (To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release - check version in file `//funcscan/modules/nf-core/amrfinderplus/run/environment.yml`)
+
+   ```bash
+   conda create -n antismash-lite -c bioconda antismash-lite
+   conda activate antismash-lite
+   ```
+
+2. Run the following command `download-antismash-databases`. Use `--database-dir` to specify a new location.
 3. You can then supply the paths to the resulting databases and the whole installation directory with:

 ```bash
 --bgc_antismash_db '////'
@@ -327,7 +409,8 @@ To supply the database directories to the pipeline:
 --bgc_antismash_installdir '/////'
 ```

-Note that the names of the supplied folders must differ from each other (e.g. `antismash_db` and `antismash_dir`). If they are not provided, the databases will be auto-downloaded upon each BGC screening run of the pipeline.
+Note that the names of the supplied folders must differ from each other (e.g. `antismash_db` and `antismash_dir`).
+If they are not provided, the databases will be auto-downloaded upon each BGC screening run of the pipeline.

 :::info
 The flag `--save_db` saves the pipeline-downloaded databases in your results directory. You can then move these to a central cache directory of your choice for re-use in the future.
 :::

@@ -339,9 +422,21 @@ If installing with conda, the installation directory will be `lib/python3.10/sit

 ### DeepBGC

-DeepBGC relies on trained models and Pfams to run its analysis. nf-core/funcscan will download these databases for you. If the flag `--save_db` is set, the downloaded files will be stored in the output directory under `databases/deepbgc/`.
+DeepBGC relies on trained models and Pfams to run its analysis.
+nf-core/funcscan will download these databases for you. If the flag `--save_db` is set, the downloaded files will be stored in the output directory under `databases/deepbgc/`.
+
+Alternatively, you can download the database locally with:
+
+```bash
+conda create -n deepbgc -c bioconda deepbgc
+conda activate deepbgc
+export DEEPBGC_DOWNLOADS_DIR=
+deepbgc download
+
+````

-Alternatively, if you already downloaded the database locally with `deepbgc download`, you can indicate the path to the database folder with `--bgc_deepbgc_db ///`. The folder has to contain the subfolders as in the database folder downloaded by `deepbgc download`:
+You can then indicate the path to the database folder in the pipeline with `--bgc_deepbgc_db ///`.
+The folder has to contain the subfolders as in the database folder downloaded by `deepbgc download`:

 ```console
 deepbgc_db/
@@ -354,31 +449,6 @@ deepbgc_db/
 └── myDetectors*.pkl
 ```

-If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file.
-
-Pipeline settings can be provided in a `yaml` or `json` file via `-params-file `.
-
-:::warning
-Do not use `-c ` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
-:::
-
-The above pipeline run specified with a params file in yaml format:
-
-```bash
-nextflow run nf-core/funcscan -profile docker -params-file params.yaml
-```
-
-with `params.yaml` containing:
-
-```yaml
-input: './samplesheet.csv'
-outdir: './results/'
-genome: 'GRCh37'
-<...>
-```
-
-You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).
-
 ## Updating the pipeline

 When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:

From 2aa130f29b9e468d6fede6d56d9bc1145291c413 Mon Sep 17 00:00:00 2001
From: "James A. Fellows Yates" 
Date: Wed, 14 Aug 2024 09:42:37 +0200
Subject: [PATCH 2/5] Update CHANGELOG.md

---
 CHANGELOG.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4a9cf09b..45d1fd17 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -60,6 +60,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [#406](https://github.com/nf-core/funcscan/pull/406) Fixed prediction tools not being executed if annotation workflow skipped. (by @jasmezz)
 - [#407](https://github.com/nf-core/funcscan/pull/407) Fixed comBGC bug when parsing multiple antiSMASH files. (by @jasmezz)
 - [#409](https://github.com/nf-core/funcscan/pull/409) Fixed argNorm overwriting its output for DeepARG. (by @jasmezz, @jfy133)
+- [#412](https://github.com/nf-core/funcscan/pull/412) Improve all pre-run database download documentation (by @jfy133)

 ### `Dependencies`

From d8f7a5cb89a0817e9340377859746690ae456817 Mon Sep 17 00:00:00 2001
From: nf-core-bot 
Date: Wed, 14 Aug 2024 07:43:44 +0000
Subject: [PATCH 3/5] [automated] Fix linting with Prettier

---
 docs/usage.md | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index 06d31b51..9df8a15f 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -281,10 +281,10 @@ To obtain a local version of the database:
 1. Install AMRFinderPlus from [bioconda](https://bioconda.github.io/recipes/ncbi-amrfinderplus/README.html?highlight=amrfinderplus).
    To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release (check version in file `//funcscan/modules/nf-core/amrfinderplus/run/environment.yml`).

-   ```bash
-   conda create -n amrfinderplus -c bioconda ncbi-amrfinderplus=3.12.8
-   conda activate amrfinderplus
-   ```
+```bash
+conda create -n amrfinderplus -c bioconda ncbi-amrfinderplus=3.12.8
+conda activate amrfinderplus
+```

 2. Run `amrfinder --update`, which will download the latest version of the AMRFinderPlus database to the default location (location of the AMRFinderPlus binaries/data).
    It creates a directory in the format YYYY-MM-DD.version (e.g., `//data/2024-01-31.1/`).

@@ -328,11 +328,11 @@ nf-core/funcscan can download this database for you, however it is very slow and
 You can either:

 1. Install DeepARG from [bioconda](https://bioconda.github.io/recipes/deeparg/README.html?highlight=deeparg)
-
-   ```bash
-   conda create -n deeparg -c bioconda deeparg
-   conda activate deeparg
-   ```
+
+```bash
+conda create -n deeparg -c bioconda deeparg
+conda activate deeparg
+```

 2. Run `deeparg download_data -o ////`

 Or download the files directly from

@@ -363,15 +363,15 @@ To download and supply the database yourself, do:

 1. Download [CARD](https://card.mcmaster.ca/latest/data)

-   ```bash
-   wget https://card.mcmaster.ca/latest/data
-   ```
+```bash
+wget https://card.mcmaster.ca/latest/data
+```

 2. Extract the (`.tar.bz2`) archive.

-   ```bash
-   tar -xjvf data
-   ```
+```bash
+tar -xjvf data
+```

 You can then supply the path to resulting database directory with:

@@ -396,10 +396,10 @@ To supply the database directories to the pipeline:

 1. Install antiSMASH from [bioconda](https://bioconda.github.io/recipes/antismash-lite/README.html) (To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release - check version in file `//funcscan/modules/nf-core/amrfinderplus/run/environment.yml`)

-   ```bash
-   conda create -n antismash-lite -c bioconda antismash-lite
-   conda activate antismash-lite
-   ```
+```bash
+conda create -n antismash-lite -c bioconda antismash-lite
+conda activate antismash-lite
+```

 2. Run the following command `download-antismash-databases`. Use `--database-dir` to specify a new location.
 3. You can then supply the paths to the resulting databases and the whole installation directory with:

@@ -433,7 +433,7 @@ conda activate deepbgc
 export DEEPBGC_DOWNLOADS_DIR=
 deepbgc download

-````
+```

 You can then indicate the path to the database folder in the pipeline with `--bgc_deepbgc_db ///`.
 The folder has to contain the subfolders as in the database folder downloaded by `deepbgc download`:

From 49b5519d0286a9148787ccacaaecd7fc8dc97d3d Mon Sep 17 00:00:00 2001
From: "James A. Fellows Yates" 
Date: Wed, 14 Aug 2024 09:59:32 +0200
Subject: [PATCH 4/5] Add basic mmseqs

---
 docs/usage.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/docs/usage.md b/docs/usage.md
index 9df8a15f..472a328a 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -356,6 +356,25 @@ The flag `--save_db` saves the pipeline-downloaded databases in your results dir
 You can then move these to a central cache directory of your choice for re-use in the future.
 :::

+### MMSeqs2
+
+To download MMSeqs2 databases for taxonomic classification, you can install `mmseqs` via conda:
+
+```bash
+conda create -n mmseqs2 -c bioconda mmseqs2
+conda activate mmseqs2
+```
+
+Then, to download the database of your choice:
+
+```bash
+mmseqs databases tmp/
+```
+
+:::info
+You may want to specify a different location for `tmp/`, we just borrowed here from the official `mmseqs` [documentation](https://github.com/soedinglab/mmseqs2/wiki#downloading-databases)
+:::
+
 ### RGI

 RGI requires the database CARD which can be downloaded by nf-core/funcscan or supplied by the user manually.

From 011fb56330ea0a765d27eb4eb553652a08f2003e Mon Sep 17 00:00:00 2001
From: Jasmin Frangenberg <73216762+jasmezz@users.noreply.github.com>
Date: Wed, 21 Aug 2024 12:10:14 +0000
Subject: [PATCH 5/5] Apply suggestions from code review

---
 CHANGELOG.md  |  2 +-
 docs/usage.md | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c1d6a356..6c4756b9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -61,7 +61,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [#406](https://github.com/nf-core/funcscan/pull/406) Fixed prediction tools not being executed if annotation workflow skipped. (by @jasmezz)
 - [#407](https://github.com/nf-core/funcscan/pull/407) Fixed comBGC bug when parsing multiple antiSMASH files. (by @jasmezz)
 - [#409](https://github.com/nf-core/funcscan/pull/409) Fixed argNorm overwriting its output for DeepARG. (by @jasmezz, @jfy133)
-- [#412](https://github.com/nf-core/funcscan/pull/412) Improve all pre-run database download documentation (by @jfy133)
+- [#412](https://github.com/nf-core/funcscan/pull/412) Improve all pre-run database download documentation. (by @jfy133)

 ### `Dependencies`

diff --git a/docs/usage.md b/docs/usage.md
index b1707c3a..6c3c1088 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -171,10 +171,10 @@ You can do this by installing via conda and using the dedicated command
 conda create -n bakta -c bioconda bakta
 conda activate bakta

-bakta_db download --output
+bakta_db download --output --type
 ```

-Alternatively, you can manually download the filesvia the links of which can be found on the [Bakta GitHub repository](https://github.com/oschwengers/bakta#database-download).
+Alternatively, you can manually download the files via the links which can be found on the [Bakta GitHub repository](https://github.com/oschwengers/bakta#database-download).

@@ -388,7 +388,7 @@ mmseqs databases tmp/
 ```

 :::info
-You may want to specify a different location for `tmp/`, we just borrowed here from the official `mmseqs` [documentation](https://github.com/soedinglab/mmseqs2/wiki#downloading-databases)
+You may want to specify a different location for `tmp/`, we just borrowed here from the official `mmseqs` [documentation](https://github.com/soedinglab/mmseqs2/wiki#downloading-databases).
 :::

 ### RGI

@@ -431,14 +431,14 @@ The same applies for the antiSMASH installation directory, which is also a requi

 To supply the database directories to the pipeline:

-1. Install antiSMASH from [bioconda](https://bioconda.github.io/recipes/antismash-lite/README.html) (To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release - check version in file `//funcscan/modules/nf-core/amrfinderplus/run/environment.yml`)
+1. Install antiSMASH from [bioconda](https://bioconda.github.io/recipes/antismash-lite/README.html). To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release (check version in file `//funcscan/modules/nf-core/antismash/antismashlite/environment.yml`).
 ```bash
 conda create -n antismash-lite -c bioconda antismash-lite
 conda activate antismash-lite
 ```

-2. Run the following command `download-antismash-databases`. Use `--database-dir` to specify a new location.
+2. Run the command `download-antismash-databases`. Use `--database-dir` to specify a new location.
 3. You can then supply the paths to the resulting databases and the whole installation directory with:

 ```bash