To build and install (into /usr/local by default), run
cmake ./
make install

These are four simple utilities that perform the following manipulation and visualization tasks on GenBank taxonomic information.
gid-taxid: converts a list of GenBank IDs and associated counts into a list of triplets: GenBank ID, taxonomy ID, count.
It requires access to the (quite large) gid-to-taxid mapping files maintained by GenBank (ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz).
The output is a tab-separated list of gid, taxid, count triplets;
e.g. the input line 160338813 160
is output as 160338813 436308 160
Try running it as
$gid-taxid tests/data/test.gid path/to/gi_taxid_nucl.dmp
The result should match tests/data/test.taxid
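
The lookup described above can be pictured as a join between the input counts and the mapping file. The Python sketch below only illustrates that data flow; it is not the tool's actual implementation. The function name `attach_taxids` is hypothetical, and the sketch assumes whitespace-separated fields, that the first two fields of each mapping line are gid and taxid, that the mapping fits in memory, and that gids missing from the mapping are silently dropped.

```python
def attach_taxids(gid_count_lines, mapping_lines):
    """Turn 'gid count' lines into 'gid<TAB>taxid<TAB>count' lines (illustrative)."""
    # Build a gid -> taxid lookup from the first two fields of each mapping line.
    gid_to_taxid = {}
    for line in mapping_lines:
        fields = line.split()
        if len(fields) >= 2:
            gid_to_taxid[fields[0]] = fields[1]

    triplets = []
    for line in gid_count_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed input lines
        gid, count = fields[0], fields[1]
        taxid = gid_to_taxid.get(gid)
        if taxid is None:
            continue  # assumption: unknown gids are dropped
        triplets.append(f"{gid}\t{taxid}\t{count}")
    return triplets


if __name__ == "__main__":
    # The example line from this README.
    for row in attach_taxids(["160338813 160"], ["160338813\t436308"]):
        print(row)
```

For the real mapping file, holding the whole table in a dictionary may be too memory-hungry; streaming through a sorted file would be one alternative.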
taxonomy-reader: converts the output of gid-taxid (i.e. gid, taxid, count triplets) into a fully expanded 22-level taxonomy based on the NCBI classification.
The program requires access to the nodes.dmp and names.dmp files, which define the taxonomic hierarchy and match taxids to scientific names.
Both files are distributed in ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.
Try running
$taxonomy-reader path/to/names.dmp path/to/nodes.dmp
and entering 160338813 436308 160
on standard input.
The output should be
160 436308 root Archaea n n n Thaumarchaeota n n n n n Nitrosopumilales n n Nitrosopumilaceae n n n Nitrosopumilus n Nitrosopumilus maritimus n
Typical usage involves piping the output file generated by gid-taxid to taxonomy-reader, e.g.
cat test/data/test.taxid | $taxonomy-reader path/to/names.dmp path/to/nodes.dmp > test/data/test.taxonomy
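
To make the expansion step concrete, here is a minimal Python sketch of how a taxid could be walked up the hierarchy and laid out in fixed rank columns, with n standing in for missing ranks as in the sample output above. It is an illustration under stated assumptions, not taxonomy-reader's actual code: the function names are hypothetical, `RANKS` is a shortened stand-in for the 22 levels the tool uses, and the toy taxids in the usage example are made up. It relies on the NCBI dump layout, where fields are separated by tab, pipe, tab; nodes.dmp starts with taxid, parent taxid, rank; and names.dmp holds taxid, name, unique name, name class.

```python
# Assumed, shortened rank list; the real tool expands to 22 levels.
RANKS = ["superkingdom", "phylum", "order", "family", "genus", "species"]


def parse_dmp(lines):
    """Split NCBI .dmp lines into lists of fields."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the trailing field terminator
        rows.append(line.split("\t|\t"))
    return rows


def load_taxonomy(nodes_lines, names_lines):
    """Return (parent, rank, name) dictionaries keyed by taxid."""
    parent, rank, name = {}, {}, {}
    for fields in parse_dmp(nodes_lines):
        parent[fields[0]] = fields[1]
        rank[fields[0]] = fields[2]
    for fields in parse_dmp(names_lines):
        if fields[3] == "scientific name":
            name[fields[0]] = fields[1]
    return parent, rank, name


def expand(taxid, parent, rank, name, ranks=RANKS):
    """Walk from taxid to the root and fill one column per rank ('n' if absent)."""
    by_rank = {}
    node = taxid
    while node in parent:
        by_rank.setdefault(rank[node], name.get(node, "n"))
        if parent[node] == node:  # in nodes.dmp the root is its own parent
            break
        node = parent[node]
    return [by_rank.get(r, "n") for r in ranks]


if __name__ == "__main__":
    # Toy data with made-up taxids (only taxid 1 = root mirrors the real dump).
    nodes = ["1\t|\t1\t|\tno rank\t|", "10\t|\t1\t|\tsuperkingdom\t|",
             "20\t|\t10\t|\tgenus\t|", "30\t|\t20\t|\tspecies\t|"]
    names = ["1\t|\troot\t|\t\t|\tscientific name\t|",
             "10\t|\tArchaea\t|\t\t|\tscientific name\t|",
             "20\t|\tNitrosopumilus\t|\t\t|\tscientific name\t|",
             "30\t|\tNitrosopumilus maritimus\t|\t\t|\tscientific name\t|"]
    print(" ".join(expand("30", *load_taxonomy(nodes, names))))
```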
## Convert taxonomic rankings into a tree and a text summary ##
taxonomy2tree
takes the output of taxonomy-reader
and converts it to 2 outputs: a Newick tree file representing the
hierarchical taxonomy and a summary file.
Try running
$taxonomy2tree test/data/test.taxonomy 0 test/data/test.tree test/data/test_summary.txt 0
The output tree uses the standard Newick format, with "branch lengths" giving the number of samples assigned to the given taxonomic group.
((((((Nitrosopumilus maritimus:162)Nitrosopumilus:162)Nitrosopumilaceae:162)Nitrosopumilales:162,uncultured crenarchaeote 74A4:...
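
As an illustration of this layout, the following Python sketch builds a nested tree from (count, lineage) rows and prints it as Newick, with the accumulated count in the branch-length position. It is a sketch under assumptions rather than taxonomy2tree's actual algorithm: `build_tree` and `to_newick` are hypothetical names, lineages are assumed to list names from just below the root down to the leaf, n entries are assumed to be skipped, and names are written unquoted to mirror the sample above.

```python
def build_tree(rows):
    """rows: iterable of (count, lineage); lineage lists names below the root."""
    tree = {"count": 0, "children": {}}
    for count, lineage in rows:
        node = tree
        node["count"] += count
        for name in lineage:
            if name == "n":
                continue  # assumption: missing ranks do not create nodes
            node = node["children"].setdefault(
                name, {"count": 0, "children": {}})
            node["count"] += count
    return tree


def to_newick(name, node):
    """Serialise a node as '(children)name:count' (no trailing semicolon)."""
    label = f"{name}:{node['count']}"
    if not node["children"]:
        return label
    inner = ",".join(to_newick(child_name, child)
                     for child_name, child in sorted(node["children"].items()))
    return f"({inner}){label}"


if __name__ == "__main__":
    rows = [
        (162, ["Archaea", "n", "Nitrosopumilus", "Nitrosopumilus maritimus"]),
        (5, ["Bacteria"]),
    ]
    print(to_newick("root", build_tree(rows)) + ";")
```

Each internal node carries the sum of the counts beneath it, which is consistent with the repeated 162 along the single chain in the sample output.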
The output summary file is simply a tab-separated list of rank, name, and count:
root root 10401
superkingdom Archaea 295
superkingdom Bacteria 9469
superkingdom Eukaryota 553
superkingdom Viruses 16
kingdom Fungi 100
kingdom Metazoa 231
kingdom Viridiplantae 110
subkingdom Dikarya 97
...
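
A summary of this shape can be produced by summing counts per (rank, name) pair. The Python sketch below is only one plausible way to do it, not the tool's implementation: the name `summarize` is hypothetical, and the ordering (by rank level, then alphabetically by name) is an assumption inferred from the sample above.

```python
from collections import Counter


def summarize(rows, rank_order):
    """rows: iterable of (count, [(rank, name), ...]); returns tab-separated lines."""
    totals = Counter()
    for count, lineage in rows:
        for rank, name in lineage:
            totals[(rank, name)] += count
    # Ranks not listed in rank_order sort after all known ranks.
    position = {rank: i for i, rank in enumerate(rank_order)}
    keys = sorted(totals,
                  key=lambda k: (position.get(k[0], len(position)), k[1]))
    return [f"{rank}\t{name}\t{totals[(rank, name)]}" for rank, name in keys]


if __name__ == "__main__":
    rows = [
        (3, [("root", "root"), ("superkingdom", "Bacteria")]),
        (2, [("root", "root"), ("superkingdom", "Archaea")]),
    ]
    for line in summarize(rows, ["root", "superkingdom"]):
        print(line)
```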
tree2ps
takes the Newick tree output of taxonomy2tree
and converts it to a PostScript rendering subject to a variety of
conditions.
The program arguments are as follows:
1. Newick tree file
2. The file to write PostScript to
3. Maximum taxonomic depth -- only show leaves this many or fewer steps away from the root. Use 0 or a negative number to show all levels.
4. Font size (in points)
5. Maximum number of leaves -- display the tree up to the depth level (see argument 3) that has this many or fewer leaves.
6. Count duplicate tax ids -- this is used for coloring the tree; if set to 0, only count the number of leaves below each node, ignoring the counts associated with the leaves themselves.
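
The interaction between the maximum depth and the maximum number of leaves can be sketched as follows. This is one reading of the argument descriptions above, not a transcription of tree2ps: the function names are hypothetical, the tree is represented as nested dictionaries mapping a node name to its children, and treating a leaf limit of 0 as "no limit" is an assumption.

```python
def leaves_when_cut(tree, depth):
    """Number of leaves shown if the tree is cut `depth` steps below its root."""
    if depth == 0 or not tree:
        return 1
    return sum(leaves_when_cut(child, depth - 1) for child in tree.values())


def height(tree):
    """Number of steps from the root to its deepest leaf."""
    if not tree:
        return 0
    return 1 + max(height(child) for child in tree.values())


def choose_depth(tree, max_depth, max_leaves):
    """Deepest cut that respects both limits (0 or negative = no limit, assumed)."""
    depth = height(tree)
    if max_depth > 0:
        depth = min(depth, max_depth)
    # Back off one level at a time until the leaf limit is met.
    while depth > 0 and max_leaves > 0 and leaves_when_cut(tree, depth) > max_leaves:
        depth -= 1
    return depth


if __name__ == "__main__":
    # Children of the root: A has two leaves, B has one child with two leaves.
    tree = {"A": {"A1": {}, "A2": {}}, "B": {"B1": {"B1a": {}, "B1b": {}}}}
    print(choose_depth(tree, 0, 3))
```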
Try
$tree2ps test/data/test.tree test/data/tree1.ps 5 8 0 1
$tree2ps test/data/test.tree test/data/tree2.ps 0 8 256 1
$tree2ps test/data/test.tree test/data/tree3.ps 0 8 256 0