variousperlscriptsenglish

Introduction

This tutorial shows sample Perl scripts for retrieving taxonomic information, multiple genome analysis, reconstructing gene tree, calculating statistics for a genome or a gene. We use GenBank files (*.gbk) in the directory 'g-language-1.8.13/share/genomes/'. The functions used in this tutorial are available at http://kobesearch.cpan.org/htdocs/Bio-Glite/Bio/Glite.html

Example 1- Retrieving taxonomic information

The following Perl script loads the GenBank-formatted file of E.coli genome, and prints taxonomic information.

use G;
my $gb = load("g-language-1.8.13/share/genomes/ecoli.gbk");
foreach(sort keys %{$gb->{TAXONOMY}}){ print "$_: ",$gb->{TAXONOMY}->{$_},"\n"; }

Executing this script prints the following output.

  
             __/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
  
                   G-language  Genome Analysis Environment v.1.8.13
  
  
                             http://www.g-language.org/
  
              Please cite:
                 Arakawa K. et al. (2003) Bioinformatics.
                 Arakawa K. et al. (2006) Journal of Pestice Science.
                 Arakawa K. et al. (2008) Genes, Genomes and Genomics.
  
              License: GNU General Public License
              Copyright (C) 2001-2013 G-language Project
              Institute for Advanced Biosciences, Keio University, JAPAN
  
             __/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
  
  
  
  Accession Number: NC_000913
  
  Definition: Escherichia coli str. K-12 substr. MG1655, complete genome.
  
  Length of Sequence :   4641652
           A Content :   1142742 (24.62%)
           T Content :   1141382 (24.59%)
           G Content :   1177437 (25.37%)
           C Content :   1180091 (25.42%)
              Others :         0 (0.00%)
          AT Content :    49.21%
          GC Content :    50.79%
  
      Number of genes :    4497 (CDSs: 4321, tRNAs: 89, rRNAs: 22)
  
  0: Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia.
  1: Bacteria
  2: Proteobacteria
  3: Gammaproteobacteria
  4: Enterobacteriales
  5: Enterobacteriaceae
  6: Escherichia.
  all: Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia.
  class: Gammaproteobacteria
  domain: Bacteria
  family: Enterobacteriaceae
  genus: Escherichia
  order: Enterobacteriales
  phylum: Proteobacteria
  species: coli

Thus, $gb->{TAXONOMY}->{0} contains 'all' taxonomic information, and taxonomic ranks 1, 2, 3, 4, and 5 correspond with domain, phylum, class, order, and family, respectively.

This taxonomic information is simply checked by typing:

  $ grep -A 2 ORGANISM g-language-1.8.13/share/genomes/ecoli.gbk
    ORGANISM  Escherichia coli str. K-12 substr. MG1655
              Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
              Enterobacteriaceae; Escherichia.

Example 2- Multiple genome analysis

The following script loads all GenBank files (*.gbk) in the directory ($dir = 'g-language-1.8.13/share/genomes/'), and prints DEFINITION of these genomes.

use G;
my $dir = 'g-language-1.8.13/share/genomes/';
opendir(DIR, $dir) || die "directory open error:$!";
foreach my $file (sort readdir(DIR)){
    next if($file !~ /\.gbk$/);
    my $gb = load("$dir/$file","no msg");
    print "$gb->{DEFINITION}\n";
}
closedir(DIR);

Executing this script prints the following output.

Borrelia burgdorferi B31, complete genome. Bacillus subtilis subsp. subtilis str. 168, complete genome. Buchnera aphidicola str. APS (Acyrthosiphon pisum), complete genome. Synechococcus sp. WH 8102, complete genome. Escherichia coli str. K-12 substr. MG1655, complete genome. Mycoplasma genitalium G37, complete genome. Plasmid F, complete sequence. Pyrococcus furiosus DSM 3638, complete genome.

The DEFINITION is quickly checked by typing:

$ grep DEFINITION g-language-1.8.13/share/genomes/*.gbk

Dinucleotide relative abundance (genomic signature) is a ratio of observed to expected dinucleotide frequencies. It can be calculated by:

    &signature($gb);

Executing the script should generate a vector of start and end positions, G+C content, and the relative abundance values of 16 dinucleotides for a whole genome region, as shown in http://www.pnas.org/content/96/16/9184/F1.large.jpg.

Relative synonymous codon usage (RSCU) is a ratio of observed to expected codon frequencies, and is equal to 1 in the absence of any codon usage bias. It can be calculated by:

    &codon_compiler($gb, -output=>'stdout', -data=>'RSCU');

Executing the script should generate a vector of the RSCU values of all codons (omitting start and stop codons) for a collection of all protein-coding genes in each genome. The RSCU values for Borrelia burgdorferi genome are shown in Table 1 at http://nar.oxfordjournals.org/cgi/content/full/27/7/1642

Example 3- Retrieving rRNA sequences and phylogenetic tree reconstruction

$gb->rRNA() returns the array of all feature IDs of rRNAs, and can specify rRNA species (e.g. '16S', '23S', '5S', 'SSU', 'LSU'). For exmaple, $gb->rRNA('16S') returns a list of feature IDs of 16S rRNAs, sorted by the nucleotide sequence lengths.

The following script outputs a fasta-formatted file of nucleotide sequences of the longest 16S rRNA genes for multiple bacterial strains.

use G;

my $rRNA = '16S';

open(OUT,">gene.fasta");

my $dir = 'g-language-1.8.13/share/genomes/';
opendir(DIR, $dir) || die "directory open error:$!";
foreach my $file (sort readdir(DIR)){
    next if($file !~ /\.gbk$/);
    my $gb = load("$dir/$file","no msg");
    next unless($gb->rRNA($rRNA));
    print OUT ">$gb->{ORGANISM}\n", $gb->get_geneseq($gb->rRNA($rRNA)),"\n";
}
closedir(DIR);

close(OUT);

To check the number of sequences in the fasta-formatted output file (gene.fasta), type:

$ grep '>' gene.fasta | wc -l 7

Phylogenetic tree can be reconstructed by adding the following two lines: the first line invokes multiple alignment, and the second line calculates Neighbor joining (NJ) tree.

system(qq!
       clustalw -align -infile=gene.fasta -outfile=gene.aln -outorder=input
       clustalw -tree -infile=gene.aln
       !);

Executing this script generates the following three output files: the gene.aln file (multiple alignment), gene.dnd file (guide tree), and gene.ph file (NJ tree). The following phylogram was drawn using the gene.ph file as an input file of TreeView (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html).

hs_nj_rrs

Example 4- Calculating genome statistics and their correlation analyses

The following script calculates various genome statistics such as rRNA gene number, tRNA gene number, protein-coding sequence (cds) number, genome size, GC content [100x(G+C)/(A+C+G+T)], and GC skew index (gcsi), which is a measure of the degree of GC skew [(G-C)/(G+C)]
(http://www.ncbi.nlm.nih.gov/pubmed/19461976).

use G;

my @key = qw(rRNA tRNA cds size GC gcsi);
my %stat;

open(OUT,">stat.txt");
print OUT join("\t", @key), "\n";

my $dir = 'g-language-1.8.13/share/genomes/';
opendir(DIR, $dir) || die "directory open error:$!";
foreach my $file (sort readdir(DIR)){
    next if($file !~ /\.gbk$/);
    my $gb = load("$dir/$file","no msg");

    $stat{rRNA} = scalar $gb->rRNA(); #	rRNA gene number
    $stat{tRNA} = scalar $gb->tRNA(); #	tRNA gene number
    $stat{cds} = scalar $gb->cds(); # protein-coding sequence (cds) number
    $stat{size} = length($gb->{SEQ}); #	genome size

    my $acgt = $gb->{SEQ} =~ tr/acgt/acgt/; # A+C+G+T
    my $gc = $gb->{SEQ} =~ tr/gcGC/gcGC/; # G+C
    $stat{GC} = sprintf "%.1f", 100 * $gc / $acgt; # GC	content [100x(G+C)/(A+C+G+T)]

    $stat{gcsi} = (&gcsi($gb))[0]; # GC	skew index (gcsi)

    my @tmp;
    foreach (@key){ push(@tmp, $stat{$_}); }
    print OUT join("\t", @tmp), "\n";
}
closedir(DIR);

close(OUT);

Executing this script generates the following tab-delimited output file ('stat.txt').

  rRNA    tRNA    cds     size    GC      gcsi
  5       18      851     910724  28.6    0.44353725960019
  30      86      4176    4215606 43.5    0.214855185905019
  3       32      564     640681  26.3    0.126305531560014
  6       44      2519    2434428 59.4    0.0837418081678612
  22      89      4321    4639675 50.8    0.0966615833014818
  3       36      476     580076  31.7    0.127587425631069
  0       0       108     99159   48.2    0.20973363827319
  4       46      2125    1908256 40.8    0.0203799266306591

Scatter plots and correlation coefficients between all pairs of these values can be computed by adding the following R commands (http://www.bioconductor.org/mogr/chapter-code/TwoColorPre.R).

my $rcmd = new Rcmd;
my @result = $rcmd->exec(
                         qq!
                         dat = read.delim("stat.txt")
                         postscript("Rplot.ps", pointsize=10, width=10, height=50)
                         upPan <- function(...) points(..., col="darkblue")
                         lowPan <- function(x, y, ...) text(mean(par("usr")[1:2]), mean(par("usr")[3:4]),signif(cor(x, y),2),cex=2)
                         pairs(dat, pch=20, lower.panel=lowPan, upper.panel=upPan)
                         dev.off()
                         !
                         );

Executing this script generates the following 'Rplot.ps' file.

genomestatrplot

There are a strong positive correlation between rRNA gene number and tRNA gene number (see rRNA-tRNA plot, r=0.9), as reported previouly (http://nar.oxfordjournals.org/cgi/content/full/33/4/1141/FIG4), and between the number of protein-coding sequences and genome size (see cds-size plot, r=1), as reported previouly (http://www.pnas.org/content/101/9/3160/F2.expansion.html).

Example 5- Calculating gene statistics

The following script prints various statistics for protein-coding sequences (cds) in Mycoplasma genitalium genome such as start position (start), GC content at 3rd codon positions (gcc3), GC skew at 3rd codon positions (gcs3), the codon adaptation index (cai) (http://www.ncbi.nlm.nih.gov/pubmed/3547335), and the weighted sum of relative entropy (Ew) as a measure of the degree of synonymous codon evenness (http://www.ncbi.nlm.nih.gov/pubmed/15194186).

 perl>use G;

my $gb = load("g-language-1.8.13/share/genomes/mgen.gbk");

foreach my $cds ($gb->cds()){
    &bui($gb, -output=>'n', -id=>$cds, -position=>3); # base usage indices (gcc3, gcs3) at 3rd codon posotions
}
&cai($gb, -output=>'n', -w_output=>'n');
&Ew($gb, -output=>'n');

my @key = qw(start gcc3 gcs3 cai Ew);
open(OUT,">stat.txt");
print OUT join("\t", @key), "\n";
foreach my $cds ($gb->cds()){
    my @tmp;
    foreach (@key){ push(@tmp, $gb->{$cds}->{$_}); }
    print OUT join("\t", @tmp), "\n";
}
close(OUT);

Executing this script generates the following tab-delimited output file ('stat.txt').

$ head stat.txt start gcc3 gcs3 cai Ew 686 0.1583 -0.1333 0.7351 0.6295 1828 0.1780 -0.1636 0.7443 0.6675 2845 0.2296 -0.1409 0.7257 0.7239 4812 0.1868 -0.0769 0.7164 0.7150 7294 0.1827 +0.0526 0.7566 0.6604 8551 0.1818 +0.0526 0.7391 0.6084 9156 0.1304 -0.3939 0.7644 0.5625 9923 0.1519 +0.0746 0.7635 0.6090 11251 0.1533 -0.1500 0.7432 0.6235

Scatter plots and correlation coefficients between all pairs of these values can be computed by adding the R commands described above (Example 4). Executing this script generates the following 'Rplot.ps' file.

cdsstatrplot_mgen

M. genitalium has a very distinctive variation in GC content within the genome (see start-gcc3 plot), and GC content is correlated with codon usage bias (see gcc3-cai and gcc3-Ew plots), as reported previously (Figure 2 in http://www.horizonpress.com/cimb/v/v3/v3n403.pdf). In Borrelia burgdorferi ("g-language-1.8.13/share/genomes/bbur.gbk"), GC skew (gcs3) is correlated with codon usage bias.

Related Projects

Contact

Kazuharu Arakawa, Ph.D.

G-language Project Leader Associate Professor

Institute for Advanced Biosciences Keio University

997-0017 Japan Tel/Fax: +81-235-29-0800 gaou@sfc.keio.ac.jp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly