4. Extracting taxonomic information from MitoFish and NCBI

shenjean edited this page Mar 16, 2022 · 16 revisions

Update:

  • The commands below were used to create the initial reference dataset.
  • However, RESCRIPt retrieves taxonomy information for specific ranks more efficiently and accurately.
  • The RESCRIPt commands are in Section 8.

Old commands:

From complete + partial mtDNA sequence file

  • Use the R taxonomizr package to map accession numbers to NCBI taxonomy IDs, and then to taxonomic names. The R code below is not included in the repository; save it as taxonomizr.R.

Note: Records that the software is unable to map are marked as "NA" in the output. Every classification marked "NA" is manually reviewed and corrected where necessary. Some fish species genuinely have an unassigned class, order, family, genus, or species; in those cases the "NA" annotation is kept.

library(taxonomizr)
# Prepare the taxonomy database. This will only have to be done once. 
# Note it requires a lot of hard drive space, bandwidth and time to process all the data from NCBI
prepareDatabase('accessionTaxa.sql')
# Read the accession list (one accession per line, no header row)
input=read.delim("complete.partial.accession.sorted",header=FALSE)
acc=as.vector(input[,1])
# First, get list of taxa IDs
taxIDs=accessionToTaxa(acc,'accessionTaxa.sql',version='base')
# Then, get taxonomic names
taxa=getTaxonomy(taxIDs,'accessionTaxa.sql')
# Write output to table
write.table(taxa,'complete.partial.taxa.tsv',sep="\t")
  • Run the R code:
Rscript taxonomizr.R
  • Contents of complete.partial.taxa.tsv:
"superkingdom"  "phylum"        "class" "order" "family"        "genus" "species"
"   8255"       "Eukaryota"     "Chordata"      "Actinopteri"   "Pleuronectiformes"     "Paralichthyidae"       "Paralichthys"  "Paralichthys olivaceus"
"   8255"       "Eukaryota"     "Chordata"      "Actinopteri"   "Pleuronectiformes"     "Paralichthyidae"       "Paralichthys"  "Paralichthys olivaceus"
"   8255"       "Eukaryota"     "Chordata"      "Actinopteri"   "Pleuronectiformes"     "Paralichthyidae"       "Paralichthys"  "Paralichthys olivaceus"
  • Remove quotes and fix header of complete.partial.taxa.tsv. The first column containing NCBI taxonomy IDs will be named taxid:
grep -v superkingdom complete.partial.taxa.tsv | tr -d '"' >complete.partial.noheader.taxtable
# Create new header
printf 'taxid\tSuperkingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\n' >tax.header
cat tax.header complete.partial.noheader.taxtable >complete.partial.taxtable
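The quoted row names in complete.partial.taxa.tsv are left-padded with spaces (for example "   8255"), and tr removes only the quotes. If that padding survives into the taxid column, a filter along these lines could strip it. This is a sketch, and trim_taxid is a hypothetical helper name that is not part of the original workflow.

```shell
# Hypothetical helper (not in the original workflow):
# strip leading whitespace from each line read on stdin.
trim_taxid() {
  sed 's/^[[:space:]]*//'
}

# Example use, assuming the file produced by the step above exists:
# trim_taxid < complete.partial.noheader.taxtable > complete.partial.trimmed.taxtable
```

Only leading whitespace is removed, so the tab separators between columns are left untouched.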
  • Contents of complete.partial.taxtable:
taxid    Superkingdom    Phylum  Class   Order   Family  Genus   Species
135755  Eukaryota       Chordata        Actinopteri     Centrarchiformes        Percichthyidae  Gadopsis        Gadopsis marmoratus
1581706 Eukaryota       Chordata        Actinopteri     Gobiiformes     Gobiidae        Periophthalmus  Periophthalmus minutus
36177   Eukaryota       Chordata        Actinopteri     Acipenseriformes        Acipenseridae   Acipenser       Acipenser oxyrinchus
8240    Eukaryota       Chordata        Actinopteri     Scombriformes   Scombridae      Thunnus Thunnus maccoyii
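
The manual review of "NA" classifications described in the note above can start from a list of the affected rows. A minimal sketch, assuming the tab-separated layout of complete.partial.taxtable shown here; list_na_rows is a hypothetical helper name:

```shell
# Hypothetical helper: print the header line plus every row in which
# at least one rank column (columns 2 onward) is exactly "NA".
list_na_rows() {
  awk -F '\t' '
    NR == 1 { print; next }                 # always keep the header
    {
      for (i = 2; i <= NF; i++)
        if ($i == "NA") { print; next }     # print the row once, then move on
    }' "$1"
}

# Example use, assuming complete.partial.taxtable exists:
# list_na_rows complete.partial.taxtable > complete.partial.na_rows.tsv
```

The comparison is an exact match on the whole field, so a species name that merely contains the letters "NA" is not reported.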

From complete mitogenome records

  • Extract the NCBI taxonomy IDs (the sixth "|"-delimited field of each FASTA header):
grep ">" mitofish/*.fa mitofish/duplicates/*.fa | awk -F "|" '{print $6}' >complete.full.taxIDs
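The awk step assumes the taxonomy ID is always the sixth "|"-delimited field of the header. A quick check that every extracted value is a bare integer might look like the sketch below; check_taxids is a hypothetical helper name.

```shell
# Hypothetical helper: report lines that are not a plain integer.
# Prints "line_number:content" for each offending line and returns
# non-zero if any were found.
check_taxids() {
  if grep -n -v -E '^[0-9]+$' "$1"; then
    return 1
  fi
  return 0
}

# Example use, assuming complete.full.taxIDs exists:
# check_taxids complete.full.taxIDs || echo "some headers lack a numeric taxID in field 6"
```
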
  • Use the R taxonomizr package to map NCBI taxonomy IDs to taxonomic names. See below for R code (not included in repository):
library(taxonomizr)
# Mapping taxonomy IDs to taxonomic names
taxID=read.delim("complete.full.taxIDs",header=FALSE)
taxIDs=taxID[,1]
taxa=getTaxonomy(taxIDs,'accessionTaxa.sql')
write.table(taxa,"complete.full.taxa.tsv",sep="\t")
  • Contents of complete.full.taxa.tsv:
"superkingdom"  "phylum"        "class" "order" "family"        "genus" "species"
"   8038"       "Eukaryota"     "Chordata"      "Actinopteri"   "Salmoniformes" "Salmonidae"    "Salvelinus"    "Salvelinus fontinalis"
"   8036"       "Eukaryota"     "Chordata"      "Actinopteri"   "Salmoniformes" "Salmonidae"    "Salvelinus"    "Salvelinus alpinus"
"  79736"       "Eukaryota"     "Chordata"      "Chondrichthyes"        "Carcharhiniformes"     "Triakidae"     "Mustelus"      "Mustelus manazo"
" 386614"       "Eukaryota"     "Chordata"      "Chondrichthyes"        "Rajiformes"    "Rajidae"       "Amblyraja"     "Amblyraja radiata"
  • Remove quotes and fix header of complete.full.taxa.tsv. The first column containing NCBI taxonomy IDs will be named taxid:
grep -v superkingdom complete.full.taxa.tsv | tr -d '"' >complete.full.noheader.taxtable
cat tax.header complete.full.noheader.taxtable >complete.full.taxtable
  • Contents of complete.full.taxtable:
taxid    Superkingdom    Phylum  Class   Order   Family  Genus   Species
8038    Eukaryota       Chordata        Actinopteri     Salmoniformes   Salmonidae      Salvelinus      Salvelinus fontinalis
8036    Eukaryota       Chordata        Actinopteri     Salmoniformes   Salmonidae      Salvelinus      Salvelinus alpinus
79736   Eukaryota       Chordata        Chondrichthyes  Carcharhiniformes       Triakidae       Mustelus        Mustelus manazo
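
Both tables should have exactly eight tab-separated columns on every line (taxid plus seven ranks). A small sanity check, sketched under that assumption; check_columns is a hypothetical helper name:

```shell
# Hypothetical helper: report any line whose tab-separated field count
# differs from the expected number (default 8) and return non-zero if
# at least one such line exists.
check_columns() {
  awk -F '\t' -v want="${2:-8}" '
    NF != want { printf "line %d has %d columns\n", NR, NF; bad = 1 }
    END { exit bad + 0 }' "$1"
}

# Example use, assuming both tables exist:
# check_columns complete.partial.taxtable && check_columns complete.full.taxtable
```

A mismatch usually points to a row that lost its taxid or gained a stray tab, which is worth inspecting before the table is used downstream.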