5. Extracting sequences and creating merged tables

shenjean edited this page Jan 21, 2021 · 16 revisions

From complete_partial_mitogenomes.fa

  • Extract accession numbers and sequences and combine them into a tab-separated table (the header line is added after sorting):
cat complete_partial_mitogenomes.fa | awk -F "|" '{OFS="%"}{print $1,$2}' | sed "s/%$//" | sed "s/>.*%/>/" | sed "s/>.*$/&#/"  | tr -d "\n" | tr ">" "\n" | tr "#" "\t" | grep -v ^$ >complete.partial.seqtable
  • Format of complete.partial.seqtable:
LM993800        TATTCCGAACAAACTAGGCGGAGTACTGGCCCTTCTATTCTCTATTCTAGTCCTAATACTGGTACCAGTCCTC
  • Sort complete.partial.seqtable by first column (accession number):
cat complete.partial.seqtable | sort -k 1 >complete.partial.seqtable.sorted
# Add header line
echo -e Accession'\t'Sequence >complete.seq.header
cat complete.seq.header complete.partial.seqtable.sorted >complete.partial.seq.tsv
  • Merge accession number, gene description, taxonomic information and sequence into a tab-separated table complete.partial.ref.tsv:
paste -d "\t" complete.partial.gene.tsv complete.partial.taxtable complete.partial.seq.tsv | awk -F "\t" '{OFS="\t"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$12}' >complete.partial.ref.tsv
  • Format of complete.partial.ref.tsv with sequence truncated for readability:
Accession       Gene definition taxid    Superkingdom    Phylum  Class   Order   Family  Genus   Species Sequence
AB000667        Paralichthys olivaceus mitochondrial Cyt-b gene for cytochrome b, partial cds   8255    Eukaryota       Chordata        Actinopteri     Pleuronectiformes       Paralichthyidae       Paralichthys    Paralichthys olivaceus  CCTCCACATCGGCCGAGGTCTATACTACGGCTCTTTTCTGTATAAAGAAACATGAAATGTTGGCGTCATCCTGCTGCTTCTCGTAATGATGACCGCCTTTGTTGGTTACGTCCTTCCCTGAGGACAAATATCATTCTGGGGTGCCACTGTCATCACCAACCTACTCTCAGCCGTACCTTATGTCGGTAACACCCTAGTACAATGGATCTGAGGCGGATTTTCTGTAGATAATGCCACACTCACCCGGTTCTTTGCATTCCACTTCC
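
To see what each stage of the extraction one-liner does, here is a sketch that runs the same steps on a two-record toy FASTA. The toy file names and the `>gb|ACCESSION|description` header layout are assumptions for illustration only; check the real headers in complete_partial_mitogenomes.fa before relying on the field numbers.

```shell
# Toy input (invented); the ">gb|ACCESSION|description" layout is an assumption.
cat > toy.fa <<'EOF'
>gb|LM993800|Toy species one
TATTCC
GAACAA
>gb|AB000667|Toy species two
CCTCCA
EOF

# Same steps as the one-liner, one stage per line:
awk -F "|" '{OFS="%"}{print $1,$2}' toy.fa |  # keep fields 1-2; sequence lines gain a trailing "%"
  sed "s/%$//" |      # drop the trailing "%" on sequence lines
  sed "s/>.*%/>/" |   # ">gb%LM993800" becomes ">LM993800"
  sed "s/>.*$/&#/" |  # mark the end of each header line with "#"
  tr -d "\n" |        # join everything into a single line
  tr ">" "\n" |       # start a new line at each record
  tr "#" "\t" |       # turn the header marker into a tab
  grep -v '^$' > toy.seqtable  # drop the empty first line

cat toy.seqtable
```

Each record ends up on one line: the accession, a tab, then the sequence with its line breaks removed.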

From complete full-length mitogenomes

  • Extract accession numbers and sequences and combine into a tab-separated table with header:
cat mitogenomes/*.fa | sed "s/>.*$/&#/" | tr -d "\n" | tr ">" "\n" | grep -v ^$ | awk -F "|" '{OFS="#"}{print $4,$7}' | sed "s/\.[0-9][0-9]*#.*#/#/" | tr "#" "\t" >complete.full.seqtable
cat complete.seq.header complete.full.seqtable >complete.seq.tsv
  • Merge accession number, gene description, taxonomic information and sequence into a tab-separated table complete.ref.tsv:
paste -d "\t" complete.full.gene.tsv complete.full.taxtable complete.seq.tsv | awk -F "\t" '{OFS="\t"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$12}' >complete.ref.tsv
  • Format of complete.ref.tsv with sequence truncated to improve readability:
Accession       Gene definition taxid    Superkingdom    Phylum  Class   Order   Family  Genus  Species  Sequence
NC_000860       Salvelinus fontinalis, complete mitogenome      8038    Eukaryota       Chordata        Actinopteri     Salmoniformes   Salmonidae      Salvelinus      Salvelinus fontinalis   GCTGGCGTAGCTTAATTAAAGCATAACACTGAAGCTGTTAAGATGGACCCTAAAAAGTCCCGCAGGCACAAAGGCTTGGTCCTGACTTTACTATCAGCTTTAACTGAACTTACACATGCAAGTCTCCGCACTCCTGTGAGGATGCCCTTAATCCCCTGCCCGGGGACGAGGAGCCGGCATCAGGCGCGCCCAGGCAGCCCAAGACGCCTTGCTAAGCCACACCCCCAAGGAAACTCAGCAGTGA
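
paste joins the three tables row by row, so the merged table is only correct if all three list the accessions in the same order. A minimal check, sketched on invented toy tables, is to compare column 1 (accession from the gene table) with column 11 (accession from the sequence table) before column 11 is dropped. The toy file names and contents are hypothetical, and the taxonomy table is assumed to have eight columns plus a header line.

```shell
# Toy stand-ins for the gene, taxonomy and sequence tables (contents invented).
printf 'Accession\tGene definition\nNC_1\tToy mitogenome A\nNC_2\tToy mitogenome B\n' > toy.gene.tsv
printf 'taxid\tSuperkingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\n' > toy.taxtable
printf '1\tEukaryota\tChordata\tC1\tO1\tF1\tG1\tG1 s1\n' >> toy.taxtable
printf '2\tEukaryota\tChordata\tC2\tO2\tF2\tG2\tG2 s2\n' >> toy.taxtable
printf 'Accession\tSequence\nNC_1\tACGT\nNC_2\tGGCC\n' > toy.seq.tsv

# Report every row where the two accession columns disagree, then a total.
paste -d "\t" toy.gene.tsv toy.taxtable toy.seq.tsv |
  awk -F "\t" '$1 != $11 {bad++; print "row " NR ": " $1 " vs " $11}
               END {print bad+0 " mismatched rows"}' > toy.check

cat toy.check
```

If any rows are reported, the inputs are not in the same order and should be re-sorted on the accession column before merging.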

Merge complete mitogenomes and complete+partial mitogenomes

head -1 complete.partial.ref.tsv >ref.header
cat complete.partial.ref.tsv complete.ref.tsv | grep -v "^Accession" >mitofish.ref
cat ref.header mitofish.ref >mitofish.ref.tsv
rm mitofish.ref
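
An alternative sketch for concatenating tables while keeping a single header line, without a temporary file: awk can skip the first line of every input file except the first. The toy file names and contents below are invented for illustration, and both tables are assumed to share the same header.

```shell
# Two toy tables with identical headers (contents invented).
printf 'Accession\tSequence\nAB000667\tCCTC\n' > toy.partial.tsv
printf 'Accession\tSequence\nNC_000860\tGCTG\n' > toy.full.tsv

# FNR restarts at 1 for each input file, NR keeps counting across files,
# so "FNR == 1 && NR != 1" is true only for the header of later files.
awk 'FNR == 1 && NR != 1 {next} {print}' toy.partial.tsv toy.full.tsv > toy.merged.tsv

cat toy.merged.tsv
```

Because this approach depends on line position rather than on matching the header text, it does not depend on how the header is spelled or capitalized.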