problem_2b
[Practice 2-2]
Modify the script from the previous practices to estimate the amino acid sequences for all three possible reading frames of the target coding region.
[Practice 2-3]
First, make a program that generates the complementary strand from the given DNA sequence. Then, search for coding regions in the complementary strand and estimate all three possible amino acid sequences. Combine these with the three possible sequences from the template DNA obtained in Practice 2-2, and choose the most suitable coding region among the six possible reading frames.
The previous program translated codons into an amino acid chain starting from the first letter of the DNA sequence. But, thinking carefully, what would happen if translation began not from the first position, but from the second position, or from the third position?
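Sequence: C A T G C T G A C
Frame 1 (from base 1): CAT GCT GAC -> His Ala Asp
Frame 2 (from base 2): ATG CTG AC. -> Met Leu Thr
Frame 3 (from base 3): TGC TGA C.. -> Cys Stop
(A dot marks a base that lies beyond the end of the excerpt.)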
In the above figure, the second possible reading frame starts from a Methionine, possibly indicating the start of a coding region. Likewise, the third reading frame contains a stop codon, possibly indicating an end of a coding region. The nucleotide sequence obtained in Practice 0 is an arbitrary region of the Mycoplasma genome, and therefore at this moment we are not certain from which base the translation should start. So all three possible reading frames should be tested.
There are two ways to do this.
1. Create sequences that lack the first base or the first two bases of the given DNA sequence, and run the program created in Practice 2-1 on each of them.
2. Estimate the three possible patterns within a single program, using only the one given DNA sequence.
Users can choose either 1 or 2, but we will demonstrate 2, since such a program is reusable for any type of DNA sequence.
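For comparison, the first approach can be sketched in a few lines. This is only an illustration, assuming the short example sequence from the figure above; the variable names `@shifted` and `@codons` are hypothetical and not part of the practice scripts.

```perl
use strict;
use warnings;

my $seq = 'catgctgac';   # example sequence from the figure (assumption)

# substr($seq, $n) returns the sequence without its first $n bases
my @shifted = map { substr($seq, $_) } 0 .. 2;

for my $s (@shifted) {
    # split into complete codons; an incomplete trailing codon is dropped
    my @codons = $s =~ /(...)/g;
    print "$s -> @codons\n";
}
# prints:
# catgctgac -> cat gct gac
# atgctgac -> atg ctg
# tgctgac -> tgc tga
```

Each shifted sequence could then be passed to the translation program from Practice 2-1.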
An easy way to solve this problem is to modify the previously made subroutine that translates codons into amino acids. The main point is to shift the starting position in the target DNA sequence by one or two bases, so that users can search for suitable coding regions among the three possible reading frames.
To be more precise, add another "for" statement that loops from 0 to 2, where each value represents one reading frame. It is convenient to store the result for each reading frame in an array.
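The loop described above might look like the following sketch. It is not the script from Practice 2-1: the subroutine name `translate_frame` and the hash `%codon_table` are hypothetical, and the table assumes the standard genetic code with one-letter amino acid symbols ("*" for a stop codon).

```perl
use strict;
use warnings;

# Codon table for the standard genetic code (an assumption; the table used
# in Practice 2-1 is not shown here). Codons are enumerated in t, c, a, g order.
my @bases = qw(t c a g);
my $aas = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# Hypothetical helper: translate $seq starting at offset $frame (0, 1 or 2).
sub translate_frame {
    my ($seq, $frame) = @_;
    $seq = lc $seq;
    my $protein = '';
    for (my $pos = $frame; $pos + 3 <= length($seq); $pos += 3) {
        my $codon = substr($seq, $pos, 3);
        # unknown codons (e.g. containing "n") are shown as X
        $protein .= exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
    }
    return $protein;
}

my $seq = 'catgctgac';
my @frames;                      # one translation per reading frame
for my $frame (0 .. 2) {
    $frames[$frame] = translate_frame($seq, $frame);
}
print "frame $_: $frames[$_]\n" for 0 .. 2;
# prints:
# frame 0: HAD
# frame 1: ML
# frame 2: C*
```

Note that this sketch drops an incomplete codon at the end of the sequence, so the second frame ends with "ML" rather than showing the partial third codon.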
In Practice 2-2, you considered three possible reading frames, but there is one more possibility to consider: the complementary strand.
The complementary strand is the strand that pairs with the template strand in a DNA duplex.
5' G T A C G A C T G 3'
3' C A T G C T G A C 5'
Each strand of the DNA double helix has a direction, running from the 5' end to the 3' end. Because of this constraint, a gene is always read in the direction from the 5' end to the 3' end.
The double helix consists of two strands running in opposite directions, and the strands are complementary to each other: adenine pairs with thymine, and guanine with cytosine.
To obtain the complementary strand from the template DNA sequence, users can simply apply these pairing rules:
- Reverse the template DNA sequence
- Substitute each nucleotide with its complementary base
Putting all possibilities together, there are six possible reading frames: three from the template strand and three from the complementary strand.
Now, let's make a new subroutine in Perl to generate the complementary strand from the template strand. As an implementation, it is convenient to pass $seq as an argument and obtain the complementary sequence as the return value. The generation consists of reversing the given sequence and substituting the nucleotides, A to T and G to C, and vice versa.
The main flow of the program is the same as in the previous practices.
sub complemental {    # no "()" prototype, so the sequence can be passed as an argument
    my $nuc = shift;
    my $complement = '';
    ??????;    # Reverse the sequence
    ??????;    # Substitute nucleotides
    return $complement;
}
Perl has a handy function for reversing a given string, reverse(). For example, my $rev = reverse("foobar") sets $rev to "raboof" (reverse() reverses a string only in scalar context, as in this assignment). So use reverse() to reverse the template sequence.
For the substitution, use the tr/// operator, which replaces characters one for one.
$nuc =~ tr/atgc/tacg/;
This sample code substitutes "a" with "t", "t" with "a", "g" with "c", and "c" with "g" in $nuc. Since Perl is well suited to text processing, a deeper understanding of the language will help you move on to more advanced genomics studies.
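Putting reverse() and tr/// together, one possible way to fill in the subroutine is sketched below. It is a minimal version, not necessarily the intended answer; handling of upper-case letters is an addition that the practice does not require.

```perl
use strict;
use warnings;

# Return the complementary strand of a DNA sequence, written 5' to 3'.
sub complemental {
    my $nuc = shift;
    # assignment to a scalar puts reverse() in scalar context,
    # so the string itself is reversed
    my $complement = reverse $nuc;
    # swap each base for its partner (upper case added as a convenience)
    $complement =~ tr/atgcATGC/tacgTACG/;
    return $complement;
}

print complemental('catgctgac'), "\n";   # prints gtcagcatg
```

Any character other than a, t, g, or c (for example "n") is left unchanged by tr/// in this sketch.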
All possible reading frames are now ready. How, then, can users estimate the most suitable coding region among the six?
Obviously, a coding region needs to be confirmed experimentally, but some estimation can be made computationally.
The following basic rules can be used to search for a suitable reading frame. These rules are far from perfect, but they are a suitable starting point in most cases.
- A coding region begins with "atg", which codes for methionine (M or Met).
- In bacterial genomes, a coding region sometimes begins with "gtg", which normally codes for valine (V or Val); when it serves as a start codon, however, it is translated as methionine.
- The stop codons are "taa", "tga", and "tag", so a coding region should end with one of these codons.
- If more than one candidate satisfies the above conditions, select the longest possible sequence.
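These rules can be sketched as a small search over the six reading frames. The code below is only an illustration under the rules as stated above: the subroutine names `reverse_complement`, `longest_orf_in_frame`, and `longest_orf`, the frame labels such as "+2", and the test sequence are all hypothetical.

```perl
use strict;
use warnings;

my %is_start = map { $_ => 1 } qw(atg gtg);       # rules 1 and 2
my %is_stop  = map { $_ => 1 } qw(taa tga tag);   # rule 3

# Hypothetical helper: complementary strand, written 5' to 3'.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/atgc/tacg/;
    return $rc;
}

# Longest region in one frame that runs from a start codon through a stop codon.
sub longest_orf_in_frame {
    my ($seq, $frame) = @_;
    my ($best, $start) = ('', undef);
    for (my $pos = $frame; $pos + 3 <= length($seq); $pos += 3) {
        my $codon = substr($seq, $pos, 3);
        if (!defined $start) {
            $start = $pos if $is_start{$codon};
        }
        elsif ($is_stop{$codon}) {
            my $orf = substr($seq, $start, $pos + 3 - $start);
            $best = $orf if length($orf) > length($best);
            $start = undef;   # look for the next start codon
        }
    }
    return $best;
}

# Scan all six frames and keep the longest candidate (rule 4).
sub longest_orf {
    my ($seq) = @_;
    $seq = lc $seq;
    my %strand = ('+' => $seq, '-' => reverse_complement($seq));
    my ($best, $where) = ('', 'none');
    for my $s (sort keys %strand) {
        for my $frame (0 .. 2) {
            my $orf = longest_orf_in_frame($strand{$s}, $frame);
            ($best, $where) = ($orf, "$s$frame") if length($orf) > length($best);
        }
    }
    return ($best, $where);
}

my ($orf, $where) = longest_orf('ccatgaaacccgggtaagg');
print "$where: $orf\n";   # prints +2: atgaaacccgggtaa
```

Here "+2" means the template strand read from an offset of two bases. A candidate without a stop codon before the end of the sequence is ignored in this sketch, which may be too strict for a fragment cut from the middle of a genome.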
Bioinformatics is a field of biology. The application of computers and informatics to biology is not merely a method, but also defines a research philosophy. On the other hand, writing programs for bioinformatics research is rather routine work.
Therefore, in order to concentrate on the biological research subject and to avoid spending too much time on programming, it is critical to keep programs simple, readable, and shareable, that is, generic rather than disposable.
Try to be non-redundant, and use subroutines effectively to reuse code. I recommend reading technical books on Perl, such as Effective Perl (http://www.amazon.com/Effective-Perl-Programming-Idiomatic-Development/dp/0321496949/), which is one of the best books for getting to the next level.
Thank you for going through all of these practices, and I wish you the best in your research.
Kazuharu Arakawa, Ph.D.
G-language Project Leader Associate Professor
Institute for Advanced Biosciences Keio University
997-0017 Japan Tel/Fax: +81-235-29-0800 gaou@sfc.keio.ac.jp