Database preparation docs improvements #412

Merged · 6 commits · Aug 21, 2024

1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -61,6 +61,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- [#406](https://github.com/nf-core/funcscan/pull/406) Fixed prediction tools not being executed if annotation workflow skipped. (by @jasmezz)
- [#407](https://github.com/nf-core/funcscan/pull/407) Fixed comBGC bug when parsing multiple antiSMASH files. (by @jasmezz)
- [#409](https://github.com/nf-core/funcscan/pull/409) Fixed argNorm overwriting its output for DeepARG. (by @jasmezz, @jfy133)
- [#412](https://github.com/nf-core/funcscan/pull/412) Improve all pre-run database download documentation. (by @jfy133)

### `Dependencies`

117 changes: 100 additions & 17 deletions docs/usage.md
@@ -162,7 +162,19 @@ As a reference, we will describe below where and how you can obtain databases an

nf-core/funcscan offers multiple tools for annotating input sequences. Bakta is a new tool touted as a bacteria-only successor to the well-established Prokka.

To supply the preferred Bakta database (and not have the pipeline download it for every new run), use the flag `--annotation_bakta_db`.
The full or light Bakta database must be downloaded from the Bakta Zenodo archive.

You can do this by installing Bakta via conda and using its dedicated database download command:

```bash
conda create -n bakta -c bioconda bakta
conda activate bakta

bakta_db download --output <LOCATION_TO_STORE> --type <full|light>
```

Alternatively, you can manually download the files via the links which can be found on the [Bakta GitHub repository](https://github.com/oschwengers/bakta#database-download).

Once downloaded, this must be untarred:
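
For example, for the full database (a minimal sketch; the archive name `db.tar.gz` follows the Bakta Zenodo naming, with `db-light.tar.gz` for the light version):

```bash
# extract the downloaded Bakta database archive
tar -xzf db.tar.gz
```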

@@ -186,7 +198,8 @@ The flag `--save_db` saves the pipeline-downloaded databases in your results dir

nf-core/funcscan allows screening of sequences for functional genes associated with various natural product types via Hidden Markov Models (HMMs) using hmmsearch.

This requires supplying a list of HMM files ending in `.hmm` that contain models for the particular molecule(s) or BGCs you are interested in.
You can download these files from places such as [PFAM](https://www.ebi.ac.uk/interpro/download/Pfam/) for antimicrobial peptides (AMP), or the antiSMASH GitHub repository for [biosynthetic gene cluster](https://github.com/antismash/antismash/tree/master/antismash/detection/hmm_detection/data) related HMMs, or create them yourself.

You should place all HMMs in a directory, supply them to the AMP or BGC workflow and switch hmmsearch on:
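
For the AMP workflow this could look as follows (a sketch; the parameter names `--amp_run_hmmsearch` and `--amp_hmmsearch_models` are assumed here, so double-check them against the pipeline's parameter documentation):

```bash
--amp_run_hmmsearch \
--amp_hmmsearch_models "/<path>/<to>/<models>/*.hmm"
```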

@@ -200,7 +213,19 @@ Ensure to wrap this path in double quotes if using an asterisk, to ensure Nextfl

### AMPcombi

For AMPcombi, nf-core/funcscan will by default download the most recent version of the [DRAMP](http://dramp.cpu-bioinfor.org/) database as a reference database and modify the files for aligning the AMP hits in the AMP workflow.

nf-core/funcscan currently provides a Python 3 helper script that performs these steps:

```bash
mkdir -p ampcombi/amp_ref_database
cd ampcombi/
wget https://github.com/nf-core/funcscan/raw/<PIPELINE_VERSION>/bin/ampcombi_download.py
python3 ampcombi_download.py
```

However, the user can also supply their own custom AMP database by following the guidelines in [AMPcombi](https://github.com/Darcy220606/AMPcombi).
This can then be passed to the pipeline with:

```bash
--amp_ampcombi_db '/<path>/<to>/<ampcombi_database>'
```
@@ -209,7 +234,8 @@ For AMPcombi, nf-core/funcscan will by default download the most recent version
The contents of the directory should have files such as `*.dmnd` and `*.fasta` in the top level.

:::warning
The pipeline will automatically run Pyrodigal instead of Prodigal if the parameters `--run_annotation_tool prodigal --run_amp_screening` are both provided.
This is due to an incompatibility issue of Prodigal's output `.gbk` file with multiple downstream tools.
:::

### Abricate
@@ -265,8 +291,16 @@ You must give the `latest` directory to the pipeline, and the contents of the di

To obtain a local version of the database:

1. Install AMRFinderPlus from [bioconda](https://bioconda.github.io/recipes/ncbi-amrfinderplus/README.html?highlight=amrfinderplus).
To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release (check version in file `<installation>/<path>/funcscan/modules/nf-core/amrfinderplus/run/environment.yml`).

```bash
conda create -n amrfinderplus -c bioconda ncbi-amrfinderplus=3.12.8
conda activate amrfinderplus
```

2. Run `amrfinder --update`, which will download the latest version of the AMRFinderPlus database to the default location (location of the AMRFinderPlus binaries/data).
It creates a directory in the format YYYY-MM-DD.version (e.g., `<installation>/<path>/data/2024-01-31.1/`).
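
For example (a minimal sketch; the dated directory name is illustrative):

```bash
amrfinder --update
# the database now sits in a dated directory, e.g.
# <installation>/<path>/data/2024-01-31.1/
```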

<details markdown="1">
<summary>AMR related files in the database folder</summary>
Expand Down Expand Up @@ -307,6 +341,12 @@ nf-core/funcscan can download this database for you, however it is very slow and
You can either:

1. Install DeepARG from [bioconda](https://bioconda.github.io/recipes/deeparg/README.html?highlight=deeparg)

```bash
conda create -n deeparg -c bioconda deeparg
conda activate deeparg
```

2. Run `deeparg download_data -o /<path>/<to>/<database_location>/`

Or download the files directly from
@@ -322,21 +362,51 @@ You can then supply the path to the resulting database directory with:
```bash
--arg_deeparg_db '/<path>/<to>/<deeparg>/<db>/'
```

The contents of the directory should include directories such as `database`, `model`, and files such as `deeparg.gz` etc. in the top level.

Note that if you supply your own database that is not downloaded by the pipeline, make sure to also supply `--arg_deeparg_db_version` along with the version number so hAMRonization will correctly display the database version in the summary report.
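
For example (the version number here is purely illustrative):

```bash
--arg_deeparg_db '/<path>/<to>/<deeparg>/<db>/' \
--arg_deeparg_db_version 2
```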

:::info
The flag `--save_db` saves the pipeline-downloaded databases in your results directory.
You can then move these to a central cache directory of your choice for re-use in the future.
:::

### MMSeqs2

To download MMSeqs2 databases for taxonomic classification, you can install `mmseqs` via conda:

```bash
conda create -n mmseqs2 -c bioconda mmseqs2
conda activate mmseqs2
```

Then, to download the database of your choice:

```bash
mmseqs databases <DATABASE_NAME> <LOCATION_TO_STORE> tmp/
```

:::info
You may want to specify a different location for `tmp/`; the path here is simply borrowed from the official `mmseqs` [documentation](https://github.com/soedinglab/mmseqs2/wiki#downloading-databases).
:::
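
For example, to download the Kalamari database (one of the names listed by `mmseqs databases`; the output path is illustrative):

```bash
mmseqs databases Kalamari /<path>/<to>/<mmseqs_db>/kalamari tmp/
```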

### RGI

RGI requires the database CARD which can be downloaded by nf-core/funcscan or supplied by the user manually.
To download and supply the database yourself, do:

1. Download [CARD](https://card.mcmaster.ca/latest/data)

```bash
wget https://card.mcmaster.ca/latest/data
```

2. Extract the `.tar.bz2` archive.

```bash
tar -xjvf data
```

You can then supply the path to the resulting database directory with:

@@ -347,7 +417,8 @@ You can then supply the path to the resulting database directory with:
The contents of the directory should include files such as `card.json`, `aro_index.tsv`, `snps.txt` etc. in the top level.

:::info
The flag `--save_db` saves the pipeline-downloaded databases in your results directory.
You can then move these to a central cache directory of your choice for re-use in the future.
:::

### antiSMASH
@@ -360,15 +431,22 @@ The same applies for the antiSMASH installation directory, which is also a requi

To supply the database directories to the pipeline:

1. Install antiSMASH from [bioconda](https://bioconda.github.io/recipes/antismash-lite/README.html).
To ensure database compatibility, please use the same version as is used in your nf-core/funcscan release (check version in file `<pipeline_installation>/<path>/funcscan/modules/nf-core/antismash/antismashlite/environment.yml`).

```bash
conda create -n antismash-lite -c bioconda antismash-lite
conda activate antismash-lite
```

2. Run the command `download-antismash-databases`. Use `--database-dir` to specify a new location.
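
A minimal sketch (the target path is illustrative):

```bash
download-antismash-databases --database-dir /<path>/<to>/<antismash>/<db>/
```
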
3. You can then supply the paths to the resulting databases and the whole installation directory with:

```bash
--bgc_antismash_db '/<path>/<to>/<antismash>/<db>/'
--bgc_antismash_installdir '/<path>/<to>/<antismash>/<dir>/antismash'
```

Note that the names of the supplied folders must differ from each other (e.g. `antismash_db` and `antismash_dir`).
The contents of the database directory should include directories such as `as-js/`, `clusterblast/`, `clustercompare/` etc. in the top level.
The contents of the installation directory should include directories such as `common/`, `config/` and files such as `custom_typing.py`, `custom_typing.pyi` etc. in the top level.

@@ -385,15 +463,20 @@ The flag `--save_db` saves the pipeline-downloaded databases in your results dir

### DeepBGC

DeepBGC relies on trained models and Pfams to run its analysis.
nf-core/funcscan will download these databases for you.
If the flag `--save_db` is set, the downloaded files will be stored in the output directory under `databases/deepbgc/`.

Alternatively, you can download the database locally with:

```bash
conda create -n deepbgc -c bioconda deepbgc
conda activate deepbgc
export DEEPBGC_DOWNLOADS_DIR=<PREFERRED_CACHE_DIRECTORY>
deepbgc download
```

You can then indicate the path to the database folder in the pipeline with `--bgc_deepbgc_db <path>/<to>/<deepbgc_db>/`.
The contents of the database directory should include directories such as `common`, `0.1.0` in the top level:

```console
deepbgc_db/