
Taxonomic classification with deep neural networks

Andrew Tritt edited this page Oct 14, 2021 · 2 revisions

The deep-taxon repository provides command-line utilities for training deep neural networks to predict microbial taxonomy.

Preparing data

Training neural networks requires iterating over a training dataset, processing each sample multiple times. If your data does not fit into memory, this requires reading the same data from disk multiple times--potentially thousands of times--throughout the course of training. For this reason, having data stored efficiently for reading is imperative.

To this end, deep-taxon uses HDF5 files organized using the HDMF framework. The deep-taxon package contains all the code necessary for converting data and reading it during training. This code assumes you are converting data from the Genome Taxonomy Database (GTDB) and that genomes have been downloaded from the NCBI genomes FTP site with the directory structure preserved.

Downloading data

Before converting data, you will need to download several files.

The first set of data you need consists of two files from GTDB:

  • ar122.tree or bac120.tree - this is the tree file. It contains the species tree built by GTDB.
  • ar122_metadata.tar.gz or bac120_metadata.tar.gz - this is the metadata file. It contains the NCBI Assembly accessions and GTDB taxonomy for each species.

The ar122 prefix is for archaea data, and the bac120 prefix is for bacteria data.

Finally, you will need to download the actual genome sequences from NCBI, one for every accession listed in the GTDB metadata file. They can be found in the all/GCA and all/GCF directories. You can download these files using the deep-taxon ncbi-fetch command. The following is an example of how you can do that using the GTDB metadata file.

$ mkdir ncbi_genomes
$ deep-taxon ncbi-fetch --metadata ar122_metadata.csv ncbi_genomes 

The --metadata flag tells ncbi-fetch that the first argument passed in is a GTDB metadata file and to download all genomes found in the metadata file. To speed things up, you can run this with multiple processes using the -p flag. This will just launch multiple processes that all call rsync in parallel. For additional options, deep-taxon ncbi-fetch -h will display a full usage statement.
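Because the conversion code expects the NCBI directory structure to be preserved, it can help to know how that structure is laid out. On the NCBI genomes site, the nine digits of an assembly accession are split into three directory levels beneath all/GCA or all/GCF. The following Python sketch derives that parent directory from an accession. The function name `ncbi_parent_dir` is hypothetical and is not part of deep-taxon. The last directory level on the NCBI site also includes the assembly name, which cannot be derived from the accession alone, so it is left out here.

```python
import re

# Matches assembly accessions such as GCF_000001405.40 or GCA_000001405.29
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.\d+$")


def ncbi_parent_dir(accession: str) -> str:
    """Return the directory under which NCBI stores an assembly.

    Hypothetical helper for illustration; not part of deep-taxon.
    """
    match = _ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an NCBI assembly accession: {accession!r}")
    prefix, digits = match.groups()
    # The nine digits are split into three groups of three
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join(["all", prefix, *parts])


if __name__ == "__main__":
    print(ncbi_parent_dir("GCF_000001405.40"))  # all/GCF/000/001/405
```

A helper like this can be used to spot-check that files under ncbi_genomes ended up where the conversion step is likely to look for them.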

Converting data to an HDMF file

Once you have downloaded all the data, you can convert the genome sequences and necessary taxonomic metadata into an HDMF file for training. To do this, use the deep-taxon prepare-data command.

$ deep-taxon prepare-data --rep --genomic ncbi_genomes ar122_metadata.csv ar122.tree ar122.input.h5

The arguments, in order, are:

  • --rep: GTDB has done the hard work of identifying the best species representatives for species that have multiple strains sequenced. This tells deep-taxon prepare-data to only convert GTDB representatives. Alternatively, you can convert non-representative genomes using the --nonrep flag. This is handy when building a file with testing data.
  • --genomic: This tells deep-taxon prepare-data to convert whole genome data. NCBI Assembly includes coding sequences and proteins in addition to whole genomes. They can be converted with the --cds and --protein flags, respectively.
  • ncbi_genomes: This is the first positional argument. This is the directory that NCBI genomes are stored in.
  • ar122_metadata.csv: This is the second positional argument. This is the GTDB metadata file.
  • ar122.tree: This is the third positional argument. This is the GTDB tree file.
  • ar122.input.h5: This is the fourth positional argument. This is the path to the file to write to.
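When preparing several files (for example, a representative training file and a non-representative testing file), it may be convenient to assemble the command programmatically. The following is a minimal Python sketch that uses only the flags and positional arguments described above. The helper name `build_prepare_data_cmd` and its parameter names are hypothetical, and the sketch assumes that --rep and --nonrep are alternatives, as are --genomic, --cds, and --protein.

```python
import shlex


def build_prepare_data_cmd(genomes_dir, metadata, tree, output,
                           representatives=True, seq_type="genomic"):
    """Assemble a deep-taxon prepare-data command as an argument list.

    Hypothetical helper; only the flags described in the text are used.
    """
    if seq_type not in ("genomic", "cds", "protein"):
        raise ValueError(f"unknown sequence type: {seq_type!r}")
    cmd = ["deep-taxon", "prepare-data"]
    # Choose between GTDB representatives and non-representatives
    cmd.append("--rep" if representatives else "--nonrep")
    # Choose which kind of sequence data to convert
    cmd.append(f"--{seq_type}")
    # Positional arguments: genome directory, metadata, tree, output file
    cmd.extend([genomes_dir, metadata, tree, output])
    return cmd


if __name__ == "__main__":
    cmd = build_prepare_data_cmd("ncbi_genomes", "ar122_metadata.csv",
                                 "ar122.tree", "ar122.input.h5")
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to run the conversion. With the arguments shown, the printed command matches the example invocation above.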

A useful option is the -A/--accessions flag. You can pass in a list of accessions to convert (i.e., a file with one accession per line). Use this for making small test datasets. Additional arguments are available with deep-taxon prepare-data -h.
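One way to produce such an accession file is to take the first few entries from the metadata table. The Python sketch below does this under several assumptions that should be checked against your own files: that the metadata file is comma-delimited with a header row, that the accession column is named `accession`, and that deep-taxon accepts accessions written exactly as they appear in that column. The helper name `write_accession_list` is hypothetical and is not part of deep-taxon.

```python
import csv


def write_accession_list(metadata_path, out_path, limit, column="accession"):
    """Write the first `limit` accessions from a metadata table, one per line.

    Hypothetical helper; the column name is an assumption about the file.
    Returns the number of accessions written.
    """
    written = 0
    with open(metadata_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.DictReader(src):
            if written >= limit:
                break
            dst.write(row[column] + "\n")
            written += 1
    return written


if __name__ == "__main__":
    count = write_accession_list("ar122_metadata.csv", "small_test.txt", 10)
    print(f"wrote {count} accessions")
```

The resulting file (here the hypothetical small_test.txt) could then be supplied through the -A/--accessions flag.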
