Taxonomic classification with deep neural networks
The deep-taxon repository provides command-line utilities for training deep neural networks to predict microbial taxonomy.
Training neural networks requires iterating over a training dataset, processing each sample multiple times. If your data does not fit into memory, the same data must be read from disk repeatedly, potentially thousands of times, over the course of training. For this reason, storing the data in a form that can be read efficiently is imperative.
To this end, deep-taxon uses HDF5 files organized using the HDMF framework. The deep-taxon package contains all the code necessary for converting data and reading it during training. This code assumes you are converting data from the Genome Taxonomy Database (GTDB) and that genomes have been downloaded from the NCBI genomes FTP site with the directory structure preserved.
Before you can convert anything, you will need to download the source data.
The first set of data you need is two files from GTDB:
- ar122.tree or bac120.tree: this is the tree file. It contains the species tree built by GTDB.
- ar122_metadata.tar.gz or bac120_metadata.tar.gz: this is the metadata file. It contains the NCBI Assembly accessions and GTDB taxonomy for each species.
The ar122 prefix is for archaea data, and the bac120 prefix is for bacteria data.
Finally, you will need to download the actual genome sequences from NCBI, one for every accession listed in the GTDB metadata file. They can be found in the all/GCA and all/GCF directories. You can download these files using the deep-taxon ncbi-fetch command. The following is an example of how to do that using the GTDB metadata file.
$ mkdir ncbi_genomes
$ deep-taxon ncbi-fetch --metadata ar122_metadata.csv ncbi_genomes
The --metadata flag tells ncbi-fetch that the first argument passed in is a GTDB metadata file and that it should download all genomes found in that file. To speed things up, you can run this with multiple processes using the -p flag, which launches multiple processes that call rsync in parallel. For additional options, deep-taxon ncbi-fetch -h will display a full usage statement.
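To make the all/GCA and all/GCF layout and the idea of parallel downloads more concrete, here is a minimal Python sketch. It is not taken from deep-taxon and does not necessarily reflect how ncbi-fetch works internally. It assumes accessions of the form GCA_000001405.15 and the usual NCBI convention of splitting the nine digits into three directory levels; the function names accession_parent_dir and split_round_robin are hypothetical. The final genome directory also includes the assembly name, which cannot be derived from the accession alone, so only the parent directory is computed.

```python
import re

# Assumed accession shape: GCA_ or GCF_, nine digits, a dot, and a version number.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def accession_parent_dir(accession):
    """Return the directory under which an assembly's folder is expected to live."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"unrecognized accession: {accession!r}")
    prefix, digits, _version = match.groups()
    # Split the nine digits into three groups of three, one per directory level.
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join(["all", prefix] + parts)


def split_round_robin(items, n_procs):
    """Divide items into n_procs lists so each worker process gets a similar share."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return [items[i::n_procs] for i in range(n_procs)]


if __name__ == "__main__":
    accessions = ["GCA_000001405.15", "GCF_000005845.2", "GCA_000007185.1"]
    for chunk in split_round_robin(accessions, 2):
        print([accession_parent_dir(acc) for acc in chunk])
```

Each chunk could then be handed to a separate process that calls rsync on its own set of directories, which is the general pattern the -p flag describes.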
Once you have downloaded all the data, you can convert the genome sequences and the necessary taxonomic metadata into an HDMF file for training. To do this, use the deep-taxon prepare-data command.
$ deep-taxon prepare-data --rep --genomic ncbi_genomes ar122_metadata.csv ar122.tree ar122.input.h5
The arguments, in order, are:
- --rep: GTDB has done the hard work of identifying the best species representatives for species that have multiple strains sequenced. This tells deep-taxon prepare-data to convert only GTDB representatives. Alternatively, you can convert non-representative genomes using the --nonrep flag. This is handy when building a file with testing data.
- --genomic: This tells deep-taxon prepare-data to convert whole genome data. NCBI Assembly includes coding sequences and proteins in addition to whole genomes. They can be converted with the --cds and --protein flags, respectively.
- ncbi_genomes: This is the first positional argument. It is the directory that the NCBI genomes are stored in.
- ar122_metadata.csv: This is the second positional argument. It is the GTDB metadata file.
- ar122.tree: This is the third positional argument. It is the GTDB tree file.
- ar122.input.h5: This is the fourth positional argument. It is the path of the file to write to.
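To illustrate what selecting representatives versus non-representatives means for the metadata, here is a small Python sketch. It is not deep-taxon's implementation. It assumes the metadata is a delimited text file with an accession column and a gtdb_representative column whose value is t for representatives; those column names and values are assumptions to check against your copy of the file, and the accessions in the demo are made up.

```python
import csv
import io


def split_by_representative(handle, flag_column="gtdb_representative",
                            id_column="accession", delimiter=","):
    """Split accessions into (representatives, non_representatives).

    The column names and the "t" marker are assumptions about the metadata layout.
    """
    reps, nonreps = [], []
    for row in csv.DictReader(handle, delimiter=delimiter):
        is_rep = row[flag_column].strip().lower() in {"t", "true"}
        (reps if is_rep else nonreps).append(row[id_column])
    return reps, nonreps


if __name__ == "__main__":
    # Made-up rows standing in for a real metadata file.
    demo = io.StringIO(
        "accession,gtdb_representative\n"
        "RS_GCF_000000001.1,t\n"
        "GB_GCA_000000002.1,f\n"
        "RS_GCF_000000003.1,t\n"
    )
    reps, nonreps = split_by_representative(demo)
    print("representatives:", reps)
    print("non-representatives:", nonreps)
```

Under these assumptions, the first list corresponds to what --rep would convert (for example, for training) and the second to what --nonrep would convert (for example, for testing).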
A useful option is the -A/--accessions flag. You can pass in a list of accessions to convert (i.e. a file with one accession per line). Use this for making small test datasets. Additional arguments are available with deep-taxon prepare-data -h.
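One way to produce such an accession file is sketched below in Python. This is only an illustration: the helper name write_accession_sample is hypothetical, the accessions in the demo are made up, and whether deep-taxon expects any particular accession prefix or formatting is not covered here. A fixed seed keeps the sample reproducible.

```python
import io
import random


def write_accession_sample(accessions, handle, n, seed=0):
    """Write a reproducible random sample of accessions, one per line.

    Returns the list of accessions that were written, in sorted order.
    """
    pool = list(accessions)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(pool, min(n, len(pool))))
    for accession in chosen:
        handle.write(f"{accession}\n")
    return chosen


if __name__ == "__main__":
    # Made-up accessions; in practice these would come from the metadata file.
    all_accessions = [f"GCA_{i:09d}.1" for i in range(1, 21)]
    out = io.StringIO()  # replace with open("small_test.txt", "w") to write a file
    write_accession_sample(all_accessions, out, n=5, seed=42)
    print(out.getvalue(), end="")
```

The resulting file can then be passed to the -A/--accessions flag to build a small test dataset.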