# Preparing the database

John Sundh edited this page Nov 25, 2020 · 10 revisions

## Downloading a protein database

```
usage: contigtax download [-h] [-d DLDIR] [--tmpdir TMPDIR] [-t TAXDIR] [-f]
                          [--skip_check]
                          {uniref100,uniref90,uniref50,nr,taxonomy,idmap}
```

`contigtax` can download the [UniRef](https://www.uniprot.org/help/uniref) clustered databases (`uniref50`, `uniref90` and `uniref100`) as well as the NCBI non-redundant `nr` database. To download the `uniref100` database, run:

    contigtax download uniref100

This downloads the file `uniref100.fasta.gz` to the directory `uniref100/`.

To download the required taxonomy files, run:

    contigtax download taxonomy


If you want to use the `nr` database, you have to download the protein id to taxonomy id map file separately:

    contigtax download idmap


### Positional arguments
`uniref{n}`: Download fastafile of uniref{n} database where n is 50, 90 or 100.

`nr`: Download fastafile for the nr database.

`taxonomy`: Download taxonomic info files. This downloads the NCBI `taxdump.tar.gz` file and creates the ete3 sqlite database.

`idmap`: Download the nr prot.accession2taxid.gz file.
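
For orientation, the NCBI `prot.accession2taxid.gz` file is a gzipped, tab-separated table with a header line and the columns `accession`, `accession.version`, `taxid` and `gi`. The sketch below reads such a file into a dictionary. The function name `read_accession2taxid` is made up for illustration and is not part of contigtax.

```python
import csv
import gzip


def read_accession2taxid(path):
    """Map accession.version -> taxid from a prot.accession2taxid(.gz) file.

    Hypothetical helper for illustration; not a contigtax function.
    """
    # Transparently handle both gzipped and plain-text files
    opener = gzip.open if str(path).endswith(".gz") else open
    mapping = {}
    with opener(path, "rt", newline="") as handle:
        # The first line of the file is the column header
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            mapping[row["accession.version"]] = int(row["taxid"])
    return mapping
```

The full NCBI file is very large, so loading it into memory like this is only practical for small subsets.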

### Optional arguments
`-d DLDIR`: By default, the databases are downloaded into a subdirectory of your current working directory. Use the `-d` flag to specify another directory to store the files.

`--tmpdir TMPDIR`: Set a temporary directory to download files to. When download is finished the files will be moved to `DLDIR`. On a cluster this can be used to specify a temporary destination on a local drive.

`-t TAXDIR`: Directory to store NCBI taxonomy files and the ete3 sqlite database. This defaults to `taxonomy/` in your current directory.

`-f`: Force download of files.

`--skip_check`: After files are downloaded, their integrity is checked. Use the `--skip_check` flag to disable this check.
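
The `--tmpdir` and integrity-check behaviour described above follows a common pattern: stage the file in a temporary location, verify it, then move it to its final directory. The sketch below illustrates that pattern only. How contigtax actually checks integrity is not described on this page, so the MD5 comparison is an assumption, and `stage_and_move` and the `fetch` callable are hypothetical names.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_and_move(fetch, filename, dldir, tmpdir=None, expected_md5=None):
    """Fetch into a temporary directory, optionally verify, then move to dldir.

    `fetch` is a hypothetical callable that writes the file to the given path.
    The MD5 check is an assumed stand-in for contigtax's integrity check.
    """
    dldir = Path(dldir)
    dldir.mkdir(parents=True, exist_ok=True)
    # Stage the download on (possibly node-local) temporary storage
    with tempfile.TemporaryDirectory(dir=tmpdir) as staging:
        staged = Path(staging) / filename
        fetch(staged)
        # Passing expected_md5=None skips the check
        if expected_md5 is not None and md5sum(staged) != expected_md5:
            raise ValueError(f"checksum mismatch for {filename}")
        final = dldir / filename
        shutil.move(str(staged), str(final))
    return final
```

Because the file is only moved after the check passes, a corrupt download never ends up in the target directory.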

## Format the fastafile
Before creating the diamond database the fasta file has to be reformatted. For UniRef files this entails extracting protein id to taxonomy id mappings from the fasta records. In addition, diamond has a hardcoded protein id length limit of 14 characters so the reformatting stage also includes a check for protein ids longer than this limit. If such ids are found they are remapped to shorter ids.
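
To make the two reformatting tasks concrete, here is a minimal sketch of both. UniRef fasta headers carry the taxonomy id in a `TaxID=` field, which can be pulled out with a regular expression. The id shortening shown here (replacing long ids with `id1`, `id2`, ...) is a made-up scheme for illustration, and the function names are hypothetical. The remapping contigtax actually performs may differ.

```python
import re

# UniRef headers look like:
# >UniRef100_Q6GZX4 Putative transcription factor 001R n=1 Tax=... TaxID=654924 RepID=001R_FRG3G
TAXID_RE = re.compile(r"\bTaxID=(\d+)")


def parse_uniref_header(header):
    """Return (sequence id, taxid or None) from a UniRef FASTA header line."""
    seq_id = header.lstrip(">").split(None, 1)[0]
    match = TAXID_RE.search(header)
    taxid = int(match.group(1)) if match else None
    return seq_id, taxid


def shorten_id(seq_id, seen, maxidlen=14):
    """Replace ids longer than maxidlen with a short, unique placeholder.

    `seen` maps original long ids to their replacements so that the same
    input always gets the same short id. The "id<N>" naming is illustrative.
    """
    if len(seq_id) <= maxidlen:
        return seq_id
    if seq_id not in seen:
        seen[seq_id] = f"id{len(seen) + 1}"
    return seen[seq_id]
```

Keeping the `seen` dictionary around also gives you a lookup table from the shortened ids back to the original ones.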

Continuing with the uniref100 database as an example, you can format it using:

    contigtax format uniref100/uniref100.fasta.gz uniref100/uniref100.reformat.fasta.gz


### Optional arguments

    usage: contigtax format [-h] [-f] [--forceidmap] [-m TAXIDMAP] [--maxidlen MAXIDLEN] [--tmpdir TMPDIR] fastafile reformatted


`-f`: Force overwrite of an existing reformatted fasta file.

`--forceidmap`: Force overwrite of existing idmap file

`-m TAXIDMAP`: Specify filename for protein id to taxonomy id mapping file. Defaults to `prot.accession2taxid.gz` in the same directory as the reformatted fasta file.

`--maxidlen MAXIDLEN`: Maximum allowed length of protein ids. Defaults to 14, which is the limit set by diamond when building a database with taxonomy information.

`--tmpdir TMPDIR`: Set a temporary directory to write files to. On a cluster this can be used to specify a temporary destination on a local drive.

## Build the diamond database

    usage: contigtax build [-h] [-d DBFILE] [-p THREADS] fastafile taxonmap taxonnodes

To build a diamond database from the reformatted fastafile, including taxonomic information for each sequence, run:

    contigtax build uniref100/uniref100.reformat.fasta.gz uniref100/prot.accession2taxid.gz taxonomy/nodes.dmp


### Positional arguments
`fastafile`: Path to reformatted fastafile (from `contigtax format` step).

`taxonmap`: Path to protein id to taxonomy id map file.

`taxonnodes`: Path to `nodes.dmp` file (from `contigtax download taxonomy` step).

### Optional arguments
`-d DBFILE`: Name of diamond database. This defaults to `diamond.dmnd` in the same directory as the input fastafile.

`-p THREADS`: Number of threads for `diamond makedb`. Defaults to 1.
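
The description of `-p` suggests that `contigtax build` wraps `diamond makedb`. As a rough sketch of what an equivalent call could look like, the function below assembles the argument list using diamond's `--in`, `--db`, `--taxonmap`, `--taxonnodes` and `--threads` options. The exact arguments contigtax passes are an assumption, and `makedb_command` is a hypothetical helper, not part of contigtax.

```python
from pathlib import Path


def makedb_command(fastafile, taxonmap, taxonnodes, dbfile=None, threads=1):
    """Assemble a `diamond makedb` argument list (nothing is executed).

    Hypothetical helper; the real contigtax invocation may differ.
    """
    if dbfile is None:
        # Mirror the documented default: diamond.dmnd next to the input fasta file
        dbfile = Path(fastafile).parent / "diamond.dmnd"
    return [
        "diamond", "makedb",
        "--in", str(fastafile),
        "--db", str(dbfile),
        "--taxonmap", str(taxonmap),
        "--taxonnodes", str(taxonnodes),
        "--threads", str(threads),
    ]
```

If diamond is installed, the returned list can be passed to `subprocess.run(cmd, check=True)` to run the build directly.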