If you will be running the workflow on the Uppmax HPC clusters, here are some helpful tips.
All resources required to run this workflow are automatically downloaded as needed when the workflow is executed. However, some of these (often large) files are already installed in a central location on the system at `/sw/data`. This means that you can make use of them for your workflow runs, saving time and reducing overall disk usage on the system.
First create a resource directory inside the base directory of the cloned git repository:

```bash
mkdir resources
```
Version 4.5.1 of the eggNOG database is installed in a central location on Uppmax. However, this database was built with eggnog-mapper v1.0.3 and will not work with the v2.0.1 that comes with this workflow. More on this below, but first symlink the necessary files into your `resources/` directory:
```bash
mkdir resources/eggnog-mapper
ln -s /sw/data/eggNOG/4.5.1/eggnog.db resources/eggnog-mapper/eggnog.db
ln -s /sw/data/eggNOG/4.5.1/eggnog_proteins.dmnd resources/eggnog-mapper/eggnog_proteins.dmnd
head -1 /sw/data/eggNOG/eggNOG-4.5.1-install-README.md > resources/eggnog-mapper/eggnog.version
touch resources/eggnog-mapper/download.log
```
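A symlink whose target later disappears fails only at run time, so it can be worth confirming that the links resolve before starting a run. A minimal sketch, assuming only standard shell tools; the `check_links` helper is not part of the workflow, and the paths in the commented usage line are the ones linked above:

```shell
# Hypothetical helper: report whether each path resolves to an existing
# file (a symlink with a missing target counts as missing).
check_links() {
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "OK: $f"
        else
            echo "MISSING: $f"
        fi
    done
}
# On Uppmax, after the ln -s commands above:
# check_links resources/eggnog-mapper/eggnog.db resources/eggnog-mapper/eggnog_proteins.dmnd
```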
To change the version of eggnog-mapper that the workflow uses, edit the conda environment file at `workflow/envs/annotation.yaml` so that it reads:
```yaml
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=2.7.15
  - prodigal=2.6.3
  - pfam_scan=1.6
  - eggnog-mapper=1.0.3 # <-- make sure this version is 1.0.3
  - infernal=1.1.2
  - trnascan-se=2.0.5
```
To use the Pfam database from the central location, create a `pfam` sub-directory under `resources` and link the necessary files by running:
```bash
mkdir resources/pfam
ln -s /sw/data/Pfam/31.0/Pfam-A.hmm* resources/pfam/
cat /sw/data/Pfam/31.0/Pfam.version > resources/pfam/Pfam-A.version
```
This installs the necessary files for release 31.0. Check the directories under `/sw/data/Pfam/` to see available releases.
For kraken there are a number of databases installed under `/sw/data/Kraken2`. Snapshots of the standard, nt, rdp, silva and greengenes indices are installed on a monthly basis. To use the latest version of the standard index, do the following:
- Create a sub-directory and link the index files:

  ```bash
  mkdir -p resources/kraken/standard
  ln -s /sw/data/Kraken2/latest/*.k2d resources/kraken/standard/
  ```
- From a reproducibility perspective it's essential to keep track of when the index was created, so generate a version file inside your kraken directory by running:

  ```bash
  file /sw/data/Kraken2/latest | egrep -o "[0-9]{8}\-[0-9]{6}" > resources/kraken/standard/kraken.version
  ```
- Modify your config file so that it contains:

  ```yaml
  classification:
    kraken: True
  kraken:
    # generate the standard kraken database?
    standard_db: True
  ```
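The egrep pattern used when generating the version file pulls a `YYYYMMDD-HHMMSS` timestamp out of the symlink target that `file` prints. A self-contained demo with a made-up target string (the real output depends on the current snapshot; `grep -E` is equivalent to `egrep`):

```shell
# Sample `file` output for the latest symlink; the timestamp is invented
# for illustration.
sample="/sw/data/Kraken2/latest: symbolic link to 20230101-120000"
# -o prints only the matched timestamp, not the whole line.
echo "$sample" | grep -Eo "[0-9]{8}-[0-9]{6}"   # → 20230101-120000
```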
Non-standard databases
Using any of the other non-standard databases from the central location is also a simple process, e.g. for the latest SILVA index:
```bash
mkdir -p resources/kraken/silva
ln -s /sw/data/Kraken2/latest_silva/*.k2d resources/kraken/silva/
file /sw/data/Kraken2/latest_silva | egrep -o "[0-9]{8}\-[0-9]{6}" > resources/kraken/silva/kraken.version
```
Then update your config file with:
```yaml
kraken:
  standard_db: False
  custom: "resources/kraken/silva"
```
There are several centrifuge indices on Uppmax located at `/sw/data/Centrifuge-indices/20180720/`, but keep in mind that they are from 2018. The available indices are: p_compressed, p_compressed+h+v, p+h+v, and p+v (see the centrifuge manual for information on what these contain). To use the p_compressed index, run:
```bash
mkdir -p resources/centrifuge/p_compressed/
ln -s /sw/data/Centrifuge-indices/20180720/p_compressed.*.cf resources/centrifuge/p_compressed/
```
Then update your config file to contain:
```yaml
classification:
  centrifuge: True
centrifuge:
  custom: "resources/centrifuge/p_compressed/p_compressed"
```
To use the centrally installed Genome Taxonomy Database (GTDB) release on Uppmax, do:
```bash
mkdir -p resources/gtdb
ln -s /sw/data/GTDB/R04-RS89/rackham/release89/* resources/gtdb/
```
Then make sure your config file contains:
```yaml
binning:
  gtdbtk: True
```
Uppmax provides monthly snapshots of the nr non-redundant protein database. While the formatted database file cannot be used directly with the nbis-meta workflow, you can save time by making use of the already downloaded fasta file. To use the latest snapshot of nr for taxonomic annotation of contigs, do:
```bash
mkdir resources/nr
ln -s /sw/data/diamond_databases/Blast/latest/download/nr.gz resources/nr/nr.fasta.gz
```
Then update the `taxonomy` section in your config file to use the nr database:
```yaml
taxonomy:
  database: "nr"
```
The UniRef90 database is clustered at 90% sequence identity, and Uppmax provides downloaded fasta files that can be used directly with the workflow. To use the latest snapshot of UniRef90 for taxonomic annotation of contigs, do:
```bash
mkdir resources/uniref90
ln -s /sw/data/diamond_databases/UniRef90/latest/download/uniref90.fasta.gz resources/uniref90/uniref90.fasta.gz
```
Then update the `taxonomy` section in your config file to use the uniref90 database:

```yaml
taxonomy:
  database: "uniref90"
```
The workflow comes with the SLURM snakemake profile pre-installed. All you have to do is modify the `config/cluster.yaml` file and insert your cluster account ID:
```yaml
__default__:
  account: staff # <-- exchange staff with your SLURM account id
```
Then you can run the workflow with `--profile slurm` from the root of the git repo, e.g.:

```bash
snakemake --profile slurm -j 100 --configfile myconfig.yaml
```
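Before submitting jobs it can help to preview what snakemake would run. Snakemake's `-n` flag requests a dry run that prints the planned jobs without executing anything; the sketch below only builds and prints the command so it is safe to try anywhere (`myconfig.yaml` is the example config name from above — on Uppmax you would run the command itself):

```shell
# Dry-run variant of the command above; -n makes snakemake list the
# planned jobs without submitting anything to SLURM.
cmd="snakemake --profile slurm -j 100 -n --configfile myconfig.yaml"
echo "$cmd"
```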