
tutorialcodonusageenglish

gaou edited this page Nov 10, 2020 · 13 revisions

Introduction

The sequence of nucleobases determines the sequence of amino acids that make up a protein. Each amino acid is encoded by a codon, a triplet of the four nucleobases (A, T/U, G, C). Three bases can be arranged in 64 ways, but only twenty types of amino acid exist, so several codons can correspond to one amino acid. Codons that encode the same amino acid are called synonymous codons. The usage of synonymous codons is known to be biased, and the bias is specific to each species.
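The figure of 64 follows from 4 x 4 x 4 arrangements. As a minimal sketch in plain Perl (it does not use the G-language System, and the variable names are chosen here for illustration), the triplets can be enumerated like this:

```perl
use strict;
use warnings;

# Enumerate every ordered triplet of the four DNA bases.
my @bases = qw(A C G T);
my @codons;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            push @codons, "$first$second$third";
        }
    }
}

# 4 * 4 * 4 = 64 triplets in total.
printf "%d possible codons\n", scalar @codons;
```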

Step 0 - Starting up the G-language System

Here, I will explain how to use the G-language System for beginners. The biological background described above and basic UNIX operating skills are assumed.

Before we start to analyze codon usage, let’s start the G-language System. With the G-language System, genome analysis is very simple.

For example, to analyze the data file “bsub” under the current directory (the complete genome of Bacillus subtilis in GenBank format, bundled with the G-language System), you only need the following two lines to get ready.

  use G; 
  $gb = new G("bsub"); 

Save the two lines above as a Perl script named “test.pl” and execute it as follows.

  perl test.pl  [ENTER] 

Did you get the following output?

             __/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
                
                   G-language  Genome Analysis Environment v.1.6.10
  
  
                             http://www.g-language.org/
  
              Please cite: 
                 Arakawa et al. (2003) Bioinformatics.
                 Arakawa et al. (2006) Journal of Pestice Science.
  
              License: GNU General Public License
              Copyright (C) 2001-2007 G-language Project
              Institute for Advanced Biosciences, Keio University, JAPAN 
  
             __/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
  
  
  Accession Number: AL009126 
  
    Length of Sequence :   4214814 
           A Content :   1187757 (28.18%) 
           T Content :   1192867 (28.30%) 
           G Content :    915023 (21.71%) 
           C Content :    919167 (21.81%) 
              Others :         0 (0.00%) 
          AT Content :    56.48% 
          GC Content :    43.52% 

By printing the accession number and the base content statistics, the G-language System indicates that it has successfully read the data file (“bsub”).
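The percentages in this output can be checked by hand. The following is a small sketch in plain Perl, independent of the G-language System, that recomputes them from the four base counts printed above. The helper name percent is an assumption made for this example.

```perl
use strict;
use warnings;

# Base counts copied from the output above.
my %count = (A => 1187757, T => 1192867, G => 915023, C => 919167);

# The sequence length is the sum of the four counts (there are no other characters).
my $length = 0;
$length += $_ for values %count;

# Share of the sequence length, formatted as a percentage with two decimals.
sub percent { return sprintf "%.2f", 100 * $_[0] / $length }

printf "%s Content : %s%%\n", $_, percent($count{$_}) for qw(A T G C);
printf "AT Content : %s%%\n", percent($count{A} + $count{T});
printf "GC Content : %s%%\n", percent($count{G} + $count{C});
```

The computed length should equal the reported 4214814, and the percentages should agree with the values shown in the output.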

Now, we will explain the script.

  • use G; imports the module G.
  • $gb = new G("bsub"); loads the file “bsub” under the current directory, and stores the annotation and the base sequence in the variable “$gb”.

Exercise 0:

Load the complete genome data for other bacteria (such as “ecoli”, “cyano”, and “mgen”).

[Hint] Rewrite the line $gb = new G("bsub"); in the above script.

Step 1 - Examples of standard function usage (1): Analyzing codon usage across all genes

The G-language System provides several functions for genome analysis. Now, let’s analyze codon usage for B. subtilis using one of the standard functions, “codon_usage()”. Add a line that calls the codon_usage() function to the Perl script you created in Step 0.

  use G; 
  $gb = new G("bsub"); 
  codon_usage($gb); 

If you execute the script, the codon usage should be shown on the display as a codon table.

The table gives the frequency of synonymous codons in the B. subtilis genome. For each amino acid, the frequencies of its synonymous codons sum to one. For example, for phenylalanine (code: F), the fraction of TTC is 0.315 compared to 0.685 for TTT, so of its two synonymous codons TTT is used far more often. Isoleucine (code: I) has three synonymous codons and mostly uses ATT. As you can see, each amino acid is biased towards a particular codon.
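To make the calculation concrete, here is a minimal sketch in plain Perl of how such fractions can be obtained from a coding sequence. It is not the implementation of codon_usage(); the function name synonymous_fractions, the toy sequence, and the use of “*” for stop codons are assumptions made for this example. The table is the standard genetic code.

```perl
use strict;
use warnings;

# Standard genetic code, with codons ordered t, c, a, g at each position.
my @bases = qw(t c a g);
my $amino = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_to_aa;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_to_aa{"$b1$b2$b3"} = substr($amino, $i++, 1);
        }
    }
}

# Return {amino acid}{codon} = fraction among that amino acid's synonymous codons.
sub synonymous_fractions {
    my ($cds) = @_;
    my (%count, %total);
    for my $codon (lc($cds) =~ /(...)/g) {
        my $aa = $codon_to_aa{$codon} or next;   # skip triplets with ambiguous bases
        $count{$aa}{$codon}++;
        $total{$aa}++;
    }
    my %fraction;
    for my $aa (keys %count) {
        $fraction{$aa}{$_} = $count{$aa}{$_} / $total{$aa} for keys %{ $count{$aa} };
    }
    return \%fraction;
}

# Hypothetical toy sequence: three TTT, one TTC, then a stop codon.
my $fraction = synonymous_fractions('tttttttttttctaa');
printf "F ttt %.3f\n", $fraction->{F}{ttt};
printf "F ttc %.3f\n", $fraction->{F}{ttc};
```

Because each count is divided by the total for its own amino acid, the fractions of one amino acid always sum to one, which is the property described above.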

Exercise 1:

Compare the synonymous codon usage bias of each amino acid between Mycoplasma genitalium (“mgen”) and B. subtilis.

[Hint] Rewrite the line $gb = new G("bsub"); in the above script.

Step 2 - Examples of standard function usage (2): Analyzing codon usage for each gene

The standard functions of the G-language System accept several options.

The function “codon_usage()” has the following options.

Option      Description
-CDSid      Specifies the ID of the CDS used in the codon usage calculation. By default, all genes are used.
-output     Specifies the output destination. “stdout” prints to the display; “f” writes to a file. The default is “stdout”.
-filename   Specifies the output file name. The default is “codon_usage.csv”.

As the following example shows, an option is written by putting “-” in front of the option name and tying it to its value with “=>”.
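This “-name => value” style is ordinary Perl: the option pairs arrive in the function as a list that can be read into a hash. The sketch below uses a hypothetical function, describe_request, written only for this illustration. It is not part of the G-language System and does not show how codon_usage() handles its options internally; only the option names and defaults are taken from the table above.

```perl
use strict;
use warnings;

# Hypothetical function illustrating the "-name => value" calling style.
sub describe_request {
    my ($genome, %opt) = @_;
    # Fill in defaults for options the caller left out.
    my $cds      = exists $opt{'-CDSid'}    ? $opt{'-CDSid'}    : 'all genes';
    my $output   = exists $opt{'-output'}   ? $opt{'-output'}   : 'stdout';
    my $filename = exists $opt{'-filename'} ? $opt{'-filename'} : 'codon_usage.csv';
    return "genome=$genome cds=$cds output=$output filename=$filename";
}

print describe_request('bsub', -CDSid => 'CDS113'), "\n";
print describe_request('bsub', -output => 'f', -filename => 'tufA.csv'), "\n";
```

Because the options are named, they can be given in any order and any of them can be omitted.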

In Step 1, we calculated codon usage across the whole genome. Now let’s calculate codon usage for a single gene.

Please rewrite the script from Step 1 as follows.

  use G; 
  $gb = new G("bsub"); 
  codon_usage($gb, -CDSid=>'CDS113'); 

“CDS113” corresponds to the gene tufA, which encodes elongation factor Tu (EF-Tu). If you execute the above script, the codon usage for the tufA gene should be shown on the display as in the following example.

  / -> taa -> 1   1.000 
  A -> gca -> 2   0.074 
  A -> gcc -> 2   0.074 
  A -> gcg -> 3   0.111 
  A -> gct -> 20   0.741 
  C -> tgc -> 2   1.000 
  D -> gac -> 15   0.600 
  D -> gat -> 10   0.400 
  E -> gaa -> 32   0.762 
  E -> gag -> 10   0.238 
  F -> ttc -> 13   1.000 
  G -> gga -> 9   0.243 
  G -> ggc -> 5   0.135 
  G -> ggt -> 23   0.622 
  H -> cac -> 7   0.583 
  H -> cat -> 5   0.417 
  I -> atc -> 19   0.760 
  I -> att -> 6   0.240 
  K -> aaa -> 19   0.864 
  K -> aag -> 3   0.136 
  L -> cta -> 1   0.043 
  L -> ctt -> 20   0.870 
  L -> tta -> 2   0.087 
  M -> atg -> 14   1.000 
  N -> aac -> 8   0.889 
  N -> aat -> 1   0.111 
  P -> cca -> 14   0.824 
  P -> cct -> 3   0.176 
  Q -> caa -> 7   0.875 
  Q -> cag -> 1   0.125 
  R -> cgc -> 6   0.300 
  R -> cgt -> 14   0.700 
  S -> agc -> 2   0.118 
  S -> tca -> 3   0.176 
  S -> tcc -> 1   0.059 
  S -> tct -> 11   0.647 
  T -> aca -> 13   0.394 
  T -> act -> 20   0.606 
  V -> gta -> 13   0.342 
  V -> gtc -> 1   0.026 
  V -> gtt -> 24   0.632 
  W -> tgg -> 1   1.000 
  Y -> tac -> 9   0.818 
  Y -> tat -> 2   0.182 
  total -> 397 

From the left, each line shows the one-letter code of the amino acid -> the codon -> the number of occurrences of the codon -> its fraction among the synonymous codons. This output shows that the synonymous codon bias of the tufA gene differs greatly from the pattern of the genome as a whole. For example, of its two synonymous codons, phenylalanine (code: F) uses only TTC. Of its four synonymous codons, alanine (code: A) predominantly uses GCT (0.741).
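As a quick check of how the last column relates to the counts, the following plain Perl sketch parses the four alanine lines from the output above and recomputes each fraction as the codon count divided by the total for alanine. The regular expression assumes the line layout shown in this tutorial.

```perl
use strict;
use warnings;

# Lines for alanine copied from the output above.
my @lines = (
    'A -> gca -> 2   0.074',
    'A -> gcc -> 2   0.074',
    'A -> gcg -> 3   0.111',
    'A -> gct -> 20   0.741',
);

# Split each line into amino acid, codon, count and reported fraction.
my (%count, %reported);
my $total = 0;
for my $line (@lines) {
    my ($aa, $codon, $n, $frac) = $line =~ /^(\S+) -> (\w{3}) -> (\d+)\s+([\d.]+)$/;
    $count{$codon}    = $n;
    $reported{$codon} = $frac;
    $total += $n;
}

# Each fraction should equal the codon count divided by the amino acid total.
for my $codon (sort keys %count) {
    printf "%s %2d/%d = %.3f (reported %s)\n",
        $codon, $count{$codon}, $total, $count{$codon} / $total, $reported{$codon};
}
```

Alanine occurs 27 times in tufA, so GCT accounts for 20/27 of the alanine codons, which rounds to the reported 0.741.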

Exercise 2:

Please calculate the codon usage for the gene dnaA, which encodes the DnaA protein involved in DNA replication.

[Hint] Rewrite the line “codon_usage($gb, -CDSid=>'CDS113');” in the above script.

Step 3 - Access to genome data

The data read into the G-language System at startup is stored inside “$gb”. Now, we will briefly explain how to access the genome data inside “$gb”. For further details, use the perldoc command to view the manual,

  perldoc G [ENTER]

and refer to the perldoc documentation of G.pm. Below is part of the data file (“bsub”) that we have used.

  LOCUS AL009126 4214814 bp circular BCT 10-MAY-1999
  DEFINITION  Bacillus subtilis complete genome. 
  ...skip...
  FEATURES             Location/Qualifiers 
     source          1..4214814 
                     /organism="Bacillus subtilis" 
                     /db_xref="taxon:1423" 
     gene            410..1750 
                     /gene="dnaA" 
     CDS             410..1750 
                     /gene="dnaA" 
                     /function="initiation of chromosome replication (DNA 
                     synthesis)" 
                     /note="alternate gene name: dnaH, dnaJ, dnaK" 
                     /codon_start=1 
                     /transl_table=11 
                     /db_xref="PID:e1181934" 
                     /db_xref="PID:g2632268" 
                     /translation="MENILDLWNQALAQIEKKLSKPSFETWMKSTKAHSLQGDTLTIT 
                     APNEFARDWLESRYLHLIADTIYELTGEELSIKFVIPQNQDVEDFMPKPQVKKAVKED 
                     TSDFPQNMLNPKYTFDTFVIGSGNRFAHAASLAVAEAPAKAYNPLFIYGGVGLGKTHL 
                     MHAIGHYVIDHNPSAKVVYLSSEKFTNEFINSIRDNKAVDFRNRYRNVDVLLIDDIQF 
                     LAGKEQTQEEFFHTFNTLHEESKQIVISSDRPPKEIPTLEDRLRSRFEWGLITDITPP 
                     DLETRIAILRKKAKAEGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLINKDIN 
                     ADLAAEALKDIIPSSKPKVITIKEIQRVVGQQFNIKLEDFKAKKRTKSVAFPRQIAMY 
                     LSREMTDSSLPKIGEEFGGRDHTTVIHAHEKISKLLADDEQLQQHVKEIKEQLK" 
  ...skip...
  BASE COUNT  1187757 a 919167 c 915023 g1192867 t 
  ORIGIN       
        1 atctttttcg gcttttttta gtatccacag aggttatcga caacattttc acattaccaa 
  ...skip...
  4214761 ttacggaaaa aagacaaatt caaacaattt gcccctaaaa tcacgcatgt ggat 
  // 

$gb has the following structures.

LOCUS, HEADER, FEATURE1, FEATURE2, ... , FEATURE8444, CDS1, CDS2, ... , CDS4100, SEQ

For example, the information for each CDS is stored in structures named CDS1, CDS2, ..., CDS4100 (“CDS” followed by a number). The information inside each structure is accessed hierarchically, as follows.

  $gb->{CDS480}->{gene}

The entire base sequence from ORIGIN onward is stored in

  $gb->{SEQ}
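The access pattern is that of a Perl hash reference whose entries are themselves hashes. The sketch below builds a hypothetical stand-in for $gb by hand, without loading the G module, to show how the arrows chain. It assumes that CDS1 is the dnaA record shown in the excerpt, and it stores only the first 20 bases of the sequence. A real $gb holds far more fields.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $gb: a hash reference whose entries are themselves hashes.
# The values are taken from the dnaA record in the excerpt above.
my $gb = {
    CDS1 => { gene => 'dnaA', start => 410, end => 1750 },
    SEQ  => 'atctttttcggcttttttta',   # first 20 bases of ORIGIN only
};

print "gene:  $gb->{CDS1}->{gene}\n";
print "range: $gb->{CDS1}->{start}..$gb->{CDS1}->{end}\n";
print "length: ", $gb->{CDS1}->{end} - $gb->{CDS1}->{start} + 1, " bp\n";
print "first bases: ", substr($gb->{SEQ}, 0, 10), "\n";
```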

Exercise 3:

To the script written in Step 0, add the following line, which outputs the start and end positions of “CDS1”, and execute the script.

  print "$gb->{CDS1}->{start}..$gb->{CDS1}->{end}"; 

Confirm that the output matches the data file above.

Step 4 - For further analysis

The G-language System provides many gene analysis functions as standard, and each of them can perform extensive analysis on its own. In actual research, however, more complex analyses will be needed, and these require a certain level of programming skill. Perl beginners should refer to the texts for the “beginners session” and the “newcomers study session”.

The G-language System not only supplies functions for genome analysis, but also provides a platform for easily handling genome databases. The platform consists of methods that can be called on the G-language System instance $gb. They cover a broad range of tasks, such as processing each gene, processing the regions around start and stop codons, and handling introns and exons. For further explanation, please refer to the perldoc documentation of G.pm.

$gb->cds() returns a list of the names of all CDS objects stored in $gb. For example, the script from Exercise 2, which analyzes codon usage for the dnaA gene, should give the same result when rewritten with $gb->cds() as follows.

  use G; 
  $gb = new G("bsub"); 
   
  foreach $cds ($gb->cds()){ 
    if($gb->{$cds}->{gene} =~ /dnaA/){ 
        codon_usage($gb, -CDSid=>$cds); 
    } 
  } 

Now, let me explain the script. (1) The foreach line has the following structure.

  foreach $variable (@array){ 
       #some process here.
   } 

The elements of the array are assigned to the variable one at a time, starting from the first, and the body is executed once for each element. Therefore, “foreach $cds ($gb->cds()){” assigns each element of the list (the name of a CDS object) to the variable $cds in order and processes it. This is the basic way to process genes one by one in the G-language System.

(2) Perl supports a powerful notation called regular expressions. A common use is to ask whether a regular expression matches a given string. This is written “variable =~ /regular expression/”.

  if($gb->{$cds}->{gene} =~ /dnaA/){ 
        codon_usage($gb, -CDSid=>$cds); 
   } 

In the script above, if the gene name ($gb->{$cds}->{gene}) matches “dnaA”, the codon usage of that CDS is calculated.
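One detail worth knowing is that /dnaA/ matches any gene name that contains “dnaA”, not only the name that is exactly “dnaA”. The plain Perl sketch below illustrates the difference with a hypothetical stand-in for $gb. The entries, including the invented gene name “dnaAX”, are made up for this example, and the list of CDS names is built by hand instead of calling $gb->cds().

```perl
use strict;
use warnings;

# Hypothetical stand-in for $gb with made-up CDS entries, for illustration only.
my $gb = {
    CDS1 => { gene => 'dnaA' },
    CDS2 => { gene => 'dnaN' },
    CDS3 => { gene => 'dnaAX' },   # invented name that merely contains "dnaA"
};
my @cds_names = sort grep { /^CDS\d+$/ } keys %$gb;

my (@loose, @exact);
foreach my $cds (@cds_names) {
    # Unanchored pattern: true for any gene name containing "dnaA".
    push @loose, $cds if $gb->{$cds}->{gene} =~ /dnaA/;
    # Anchored pattern: true only when the whole name is "dnaA".
    push @exact, $cds if $gb->{$cds}->{gene} =~ /^dnaA$/;
}

print "loose: @loose\n";
print "exact: @exact\n";
```

Anchoring the pattern with ^ and $ restricts the match to the whole gene name, which may be preferable when several gene names share a prefix.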
