-
Notifications
You must be signed in to change notification settings - Fork 0
variousperlscriptsenglish
This tutorial shows sample Perl scripts for retrieving taxonomic information, multiple genome analysis, reconstructing gene tree, calculating statistics for a genome or a gene. We use GenBank files (*.gbk) in the directory 'g-language-1.8.13/share/genomes/'. The functions used in this tutorial are available at http://kobesearch.cpan.org/htdocs/Bio-Glite/Bio/Glite.html
The following Perl script loads the GenBank-formatted file of E.coli genome, and prints taxonomic information.
use G;
my $gb = load("g-language-1.8.13/share/genomes/ecoli.gbk");
foreach(sort keys %{$gb->{TAXONOMY}}){ print "$_: ",$gb->{TAXONOMY}->{$_},"\n"; }
Executing this script prints the following output.
__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
G-language Genome Analysis Environment v.1.8.13
http://www.g-language.org/
Please cite:
Arakawa K. et al. (2003) Bioinformatics.
Arakawa K. et al. (2006) Journal of Pestice Science.
Arakawa K. et al. (2008) Genes, Genomes and Genomics.
License: GNU General Public License
Copyright (C) 2001-2013 G-language Project
Institute for Advanced Biosciences, Keio University, JAPAN
__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
Accession Number: NC_000913
Definition: Escherichia coli str. K-12 substr. MG1655, complete genome.
Length of Sequence : 4641652
A Content : 1142742 (24.62%)
T Content : 1141382 (24.59%)
G Content : 1177437 (25.37%)
C Content : 1180091 (25.42%)
Others : 0 (0.00%)
AT Content : 49.21%
GC Content : 50.79%
Number of genes : 4497 (CDSs: 4321, tRNAs: 89, rRNAs: 22)
0: Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia.
1: Bacteria
2: Proteobacteria
3: Gammaproteobacteria
4: Enterobacteriales
5: Enterobacteriaceae
6: Escherichia.
all: Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia.
class: Gammaproteobacteria
domain: Bacteria
family: Enterobacteriaceae
genus: Escherichia
order: Enterobacteriales
phylum: Proteobacteria
species: coli
Thus, $gb->{TAXONOMY}->{0} contains 'all' taxonomic information, and taxonomic ranks 1, 2, 3, 4, and 5 correspond with domain, phylum, class, order, and family, respectively.
This taxonomic information is simply checked by typing:
$ grep -A 2 ORGANISM g-language-1.8.13/share/genomes/ecoli.gbk
ORGANISM Escherichia coli str. K-12 substr. MG1655
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
Enterobacteriaceae; Escherichia.
The following script loads all GenBank files (*.gbk) in the directory ($dir = 'g-language-1.8.13/share/genomes/'), and prints DEFINITION of these genomes.
use G;
my $dir = 'g-language-1.8.13/share/genomes/';
opendir(DIR, $dir) || die "directory open error:$!";
foreach my $file (sort readdir(DIR)){
next if($file !~ /\.gbk$/);
my $gb = load("$dir/$file","no msg");
print "$gb->{DEFINITION}\n";
}
closedir(DIR);
Executing this script prints the following output.
Borrelia burgdorferi B31, complete genome. Bacillus subtilis subsp. subtilis str. 168, complete genome. Buchnera aphidicola str. APS (Acyrthosiphon pisum), complete genome. Synechococcus sp. WH 8102, complete genome. Escherichia coli str. K-12 substr. MG1655, complete genome. Mycoplasma genitalium G37, complete genome. Plasmid F, complete sequence. Pyrococcus furiosus DSM 3638, complete genome.
The DEFINITION is quickly checked by typing:
$ grep DEFINITION g-language-1.8.13/share/genomes/*.gbk
Dinucleotide relative abundance (genomic signature) is a ratio of observed to expected dinucleotide frequencies. It can be calculated by:
&signature($gb);
Executing the script should generate a vector of start and end positions, G+C content, and the relative abundance values of 16 dinucleotides for a whole genome region, as shown in http://www.pnas.org/content/96/16/9184/F1.large.jpg.
Relative synonymous codon usage (RSCU) is a ratio of observed to expected codon frequencies, and is equal to 1 in the absence of any codon usage bias. It can be calculated by:
&codon_compiler($gb, -output=>'stdout', -data=>'RSCU');
Executing the script should generate a vector of the RSCU values of all codons (omitting start and stop codons) for a collection of all protein-coding genes in each genome. The RSCU values for Borrelia burgdorferi genome are shown in Table 1 at http://nar.oxfordjournals.org/cgi/content/full/27/7/1642
$gb->rRNA() returns the array of all feature IDs of rRNAs, and can specify rRNA species (e.g. '16S', '23S', '5S', 'SSU', 'LSU'). For exmaple, $gb->rRNA('16S') returns a list of feature IDs of 16S rRNAs, sorted by the nucleotide sequence lengths.
The following script outputs a fasta-formatted file of nucleotide sequences of the longest 16S rRNA genes for multiple bacterial strains.
use G;
my $rRNA = '16S';
open(OUT,">gene.fasta");
my $dir = 'g-language-1.8.13/share/genomes/';
opendir(DIR, $dir) || die "directory open error:$!";
foreach my $file (sort readdir(DIR)){
next if($file !~ /\.gbk$/);
my $gb = load("$dir/$file","no msg");
next unless($gb->rRNA($rRNA));
print OUT ">$gb->{ORGANISM}\n", $gb->get_geneseq($gb->rRNA($rRNA)),"\n";
}
closedir(DIR);
close(OUT);
To check the number of sequences in the fasta-formatted output file (gene.fasta), type:
$ grep '>' gene.fasta | wc -l 7
Phylogenetic tree can be reconstructed by adding the following two lines: the first line invokes multiple alignment, and the second line calculates Neighbor joining (NJ) tree.
system(qq!
clustalw -align -infile=gene.fasta -outfile=gene.aln -outorder=input
clustalw -tree -infile=gene.aln
!);
Executing this script generates the following three output files: the gene.aln file (multiple alignment), gene.dnd file (guide tree), and gene.ph file (NJ tree). The following phylogram was drawn using the gene.ph file as an input file of TreeView (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html).
The following script calculates various genome statistics such as rRNA gene number, tRNA gene number, protein-coding sequence (cds) number, genome size, GC content [100x(G+C)/(A+C+G+T)], and GC skew index (gcsi), which is a measure of the degree of GC skew [(G-C)/(G+C)]
(http://www.ncbi.nlm.nih.gov/pubmed/19461976).
use G;
my @key = qw(rRNA tRNA cds size GC gcsi);
my %stat;
open(OUT,">stat.txt");
print OUT join("\t", @key), "\n";
my $dir = 'g-language-1.8.13/share/genomes/';
opendir(DIR, $dir) || die "directory open error:$!";
foreach my $file (sort readdir(DIR)){
next if($file !~ /\.gbk$/);
my $gb = load("$dir/$file","no msg");
$stat{rRNA} = scalar $gb->rRNA(); # rRNA gene number
$stat{tRNA} = scalar $gb->tRNA(); # tRNA gene number
$stat{cds} = scalar $gb->cds(); # protein-coding sequence (cds) number
$stat{size} = length($gb->{SEQ}); # genome size
my $acgt = $gb->{SEQ} =~ tr/acgt/acgt/; # A+C+G+T
my $gc = $gb->{SEQ} =~ tr/gcGC/gcGC/; # G+C
$stat{GC} = sprintf "%.1f", 100 * $gc / $acgt; # GC content [100x(G+C)/(A+C+G+T)]
$stat{gcsi} = (&gcsi($gb))[0]; # GC skew index (gcsi)
my @tmp;
foreach (@key){ push(@tmp, $stat{$_}); }
print OUT join("\t", @tmp), "\n";
}
closedir(DIR);
close(OUT);
Executing this script generates the following tab-delimited output file ('stat.txt').
rRNA tRNA cds size GC gcsi
5 18 851 910724 28.6 0.44353725960019
30 86 4176 4215606 43.5 0.214855185905019
3 32 564 640681 26.3 0.126305531560014
6 44 2519 2434428 59.4 0.0837418081678612
22 89 4321 4639675 50.8 0.0966615833014818
3 36 476 580076 31.7 0.127587425631069
0 0 108 99159 48.2 0.20973363827319
4 46 2125 1908256 40.8 0.0203799266306591
Scatter plots and correlation coefficients between all pairs of these values can be computed by adding the following R commands (http://www.bioconductor.org/mogr/chapter-code/TwoColorPre.R).
my $rcmd = new Rcmd;
my @result = $rcmd->exec(
qq!
dat = read.delim("stat.txt")
postscript("Rplot.ps", pointsize=10, width=10, height=50)
upPan <- function(...) points(..., col="darkblue")
lowPan <- function(x, y, ...) text(mean(par("usr")[1:2]), mean(par("usr")[3:4]),signif(cor(x, y),2),cex=2)
pairs(dat, pch=20, lower.panel=lowPan, upper.panel=upPan)
dev.off()
!
);
Executing this script generates the following 'Rplot.ps' file.
There are a strong positive correlation between rRNA gene number and tRNA gene number (see rRNA-tRNA plot, r=0.9), as reported previouly (http://nar.oxfordjournals.org/cgi/content/full/33/4/1141/FIG4), and between the number of protein-coding sequences and genome size (see cds-size plot, r=1), as reported previouly (http://www.pnas.org/content/101/9/3160/F2.expansion.html).
The following script prints various statistics for protein-coding sequences (cds) in Mycoplasma genitalium genome such as start position (start), GC content at 3rd codon positions (gcc3), GC skew at 3rd codon positions (gcs3), the codon adaptation index (cai) (http://www.ncbi.nlm.nih.gov/pubmed/3547335), and the weighted sum of relative entropy (Ew) as a measure of the degree of synonymous codon evenness (http://www.ncbi.nlm.nih.gov/pubmed/15194186).
perl>use G;
my $gb = load("g-language-1.8.13/share/genomes/mgen.gbk");
foreach my $cds ($gb->cds()){
&bui($gb, -output=>'n', -id=>$cds, -position=>3); # base usage indices (gcc3, gcs3) at 3rd codon posotions
}
&cai($gb, -output=>'n', -w_output=>'n');
&Ew($gb, -output=>'n');
my @key = qw(start gcc3 gcs3 cai Ew);
open(OUT,">stat.txt");
print OUT join("\t", @key), "\n";
foreach my $cds ($gb->cds()){
my @tmp;
foreach (@key){ push(@tmp, $gb->{$cds}->{$_}); }
print OUT join("\t", @tmp), "\n";
}
close(OUT);
Executing this script generates the following tab-delimited output file ('stat.txt').
$ head stat.txt start gcc3 gcs3 cai Ew 686 0.1583 -0.1333 0.7351 0.6295 1828 0.1780 -0.1636 0.7443 0.6675 2845 0.2296 -0.1409 0.7257 0.7239 4812 0.1868 -0.0769 0.7164 0.7150 7294 0.1827 +0.0526 0.7566 0.6604 8551 0.1818 +0.0526 0.7391 0.6084 9156 0.1304 -0.3939 0.7644 0.5625 9923 0.1519 +0.0746 0.7635 0.6090 11251 0.1533 -0.1500 0.7432 0.6235
Scatter plots and correlation coefficients between all pairs of these values can be computed by adding the R commands described above (Example 4). Executing this script generates the following 'Rplot.ps' file.
M. genitalium has a very distinctive variation in GC content within the genome (see start-gcc3 plot), and GC content is correlated with codon usage bias (see gcc3-cai and gcc3-Ew plots), as reported previously (Figure 2 in http://www.horizonpress.com/cimb/v/v3/v3n403.pdf). In Borrelia burgdorferi ("g-language-1.8.13/share/genomes/bbur.gbk"), GC skew (gcs3) is correlated with codon usage bias.
-
G-language Maps
- Institute for Advanced Biosciences
- E-Cell Simulation Environment
- E.coli multi-omics database
- Database of bacterial replication terminus
Kazuharu Arakawa, Ph.D.
G-language Project Leader Associate Professor
Institute for Advanced Biosciences Keio University
997-0017 Japan Tel/Fax: +81-235-29-0800 gaou@sfc.keio.ac.jp