Skip to content

8. Updating reference databases

shenjean edited this page Aug 8, 2024 · 62 revisions

In a new folder, download updated version of MitoFish database according to workflow outlined in Section 1.

Get accession numbers

  • Get accession number of complete mitogenomes

The list of complete mitogenomes has not been updated since Dec 2021, so they don't have to be processed unless there are new genomes

mkdir mitogenomes
cd mitogenomes
wget http://mitofish.aori.u-tokyo.ac.jp/files/mitogenomes.zip
unzip mitogenomes.zip
ls *.fa | cut -d "_" -f1-2 | grep -v complete | sort | uniq >../complete.new.accession
  • Get accession number of complete+partial mitogenomes
cd ..
wget http://mitofish.aori.u-tokyo.ac.jp/files/complete_partial_mitogenomes.zip
unzip complete_partial_mitogenomes.zip
grep ">" mito-all | cut -d "|" -f2 | sort | uniq >complete.partial.new.accession

Combine and sort accession numbers - skip if there are no new complete mitogenomes

cat complete.new.accession complete.partial.new.accession | sort | uniq >mitofish.new.accession

Get accession numbers of current dataset

cut -f1 mitofish.old.tsv | grep -v "Accession" | sort | uniq >mitofish.old.accession

Get differences between new and old accession numbers

  • Get list of newly added accession numbers using comm
comm -23 mitofish.new.accession mitofish.old.accession >new.accession
  • Get list of accession numbers missing in the new MitoFish version. These records may either just not be present in the MitoFish database, or have been removed by the submitter due to redundancies etc.
comm -13 mitofish.new.accession mitofish.old.accession >missing.accession
  • Check if line numbers add up
wc -l mitofish.new.accession mitofish.old.accession new.accession missing.accession

Get NCBI gene and taxonomic information from new records

  • Get gene definitions and sequences using NCBI's efetch Unix tool bundled within Entrez Direct

To install the EDirect software, wget ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh and execute

for i in `cat new.accession`
do
echo $i

# Get gene definition
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>new.definition.tsv

# Get Taxonomy ID
echo "#$i," >>new.taxid.tsv
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value | grep taxon | sed "s/taxon://" >>new.taxid.tsv

done

# Sort output by accession number
sort -k 1 new.definition.tsv | uniq | tr "#" "_" >new.definition.sorted.tsv
cat new.taxid.tsv | tr -d "\n" | tr "#" "\n" | tr "," "\t" | sort -k 1 | uniq | grep -v ^$ >new.taxid.sorted.tsv

Note: efetch works rather quickly and has the advantage of being able to retrieve unverified NCBI records. However, it does not output class information and uniform number of columns for taxonomy

  • Run efetch to retrieve gene definitions for missing.accession

This is to check whether missing accessions are unverified/deleted records or just not included in MitoFish

for i in `cat missing.accession`
do
echo $i
# Get gene definition
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>missing.definition.tsv

Look through missing.definition.tsv to check for UNVERIFIED records. Then, check whether line numbers of missing.accession and missing.definition.tsv match. If they match, that means these "missing" records are not removed from NCBI.

wc -l missing.accession missing.definition.tsv

Scenario #1: If missing.accession and missing.definition.tsv have the same line numbers (see example output below), this means that these "missing" records are not removed from NCBI. They are simply not present in the MitoFish database

  4678 missing.accession
  4678 missing.definition.tsv
  9356 total

Scenario #2: If missing.definition.tsv contains less number of lines than missing.accession, this means that some records have indeed been removed from NCBI. Using the commands below, we can extract these "real" missing accession numbers to remove them from the latest mitohelper release.

# Get accession numbers from missing.definition.tsv
cut -f1 missing.definition.tsv >notmissing.acc
grep -v -f notmissing.acc missing.accession >realmissing.accession
  • We use RESCRIPt to retrieve sequence and taxonomy information:

Nowadays, RESCRIPt is not perfectly synced with the most current NCBI data, resulting in records with no taxonomic information. Batches with missing data leads to plugin errors where RESCRIPt is unable to complete partial downloads. Because of this, records with missing data have to be sequentially removed from the accession list and RESCRIPt will have to be re-run multiple times with increasingly smaller batch sizes (specified in the split -l command below).

# Split accession number into smaller files of 1,000 lines each
# RESCRIPt requires a header for input accession ids 
echo "id" >id
split -l 1000 new.accession
for i in `ls xa*`
do
cat id $i >rescript.acc
qiime rescript get-ncbi-data --m-accession-ids-file rescript.acc --o-sequences "$i".seq.qza --o-taxonomy "$i".tax.qza --p-n-jobs 5
qiime tools export --input-path "$i".tax.qza --output-path "$i".taxout
qiime tools export --input-path "$i".seq.qza --output-path "$i".seqout
done    
  • Format sequences of new records from RESCRIPt output
cat *seqout/dna-sequences.fasta | tr "\n" "\t" | tr ">" "\n" | sort -k 1 | uniq | grep -v ^$ >new.seq.sorted.tsv
  • Convert taxonomy to mitohelper's TSV format
cat *taxout/taxonomy.tsv | grep -v "Feature ID" | sed "s/k__Metazoa/Eukaryota/" \
| sed "s/; [p/c/o/f/g]__/;/g" | tr ";" "\t" \
| awk -F "\t" '{OFS=";"}{print $1,$2,$3,$4,$5,$6,$7,$7,$8}' \
| sed "s/; s__/ /" | tr ";" "\t" | sort -k 1 | uniq >new.tax.sorted.tsv
  • Use efetch to retrieve sequences of records that cannot be downloaded by RESCRIPt. These sequences can be added to new.seq.sorted.tsv. Taxonomy data can be downloaded, manually processed, and added to new.tax.sorted.tsv.
for i in `cat rescript.missing.acc`
do
echo $i
# Get sequence data
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_sequence >>rescript.missing.seq.tsv

# Get taxonomy data - this is not formatted like the mitohelper's datasets and should be manually formatted in Excel or other software
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_taxonomy INSDSeq_organism >>rescript.missing.tax.tsv
done

Format retrieved sequence data to match Mitohelper's format - replace DNA bases in lower case to upper case

cat rescript.missing.seq.tsv | tr '[:lower:]' '[:upper:]' >>new.seq.sorted.tsv

Another way to retrieve the gene definitions is to follow the instructions in Section 2 and Section 3. This is efficient if you have a large number of gene definitions to retrieve.

Update reference database

cat new.tax.sorted.tsv | tr -d '\r' >new.taxonomy.sorted.tsv
  • Combine new gene definition list, taxonomy information, and sequences into a tab-separated table according to the reference file format. Also, remove carriage return ^M characters. Filter out UNVERIFIED records.
paste -d "\t" new.definition.sorted.tsv new.taxid.sorted.tsv new.taxonomy.sorted.tsv new.seq.sorted.tsv \
| awk -F "\t" '{OFS="\t"}{print $1,$2,$4,$6,$7,$8,$9,$10,$11,$12,$16,$13,$14}'| tr -d '\r' \
| grep -v UNVERIFIED | sort | uniq >new.records.tsv
  • Append table with new records to previous reference file
cat mitofish.old.tsv new.records.tsv >mitofish.add.tsv
  • Filter out records that were removed, if present
grep -v -f realmissing.accession mitofish.add.tsv >mitofish."$month".xxx.tsv