Compute codon autocorrelation (aka codon covariation, codon reuse, codon pair usage) from fasta sequences.
This code was used in the analyses of the manuscript 'Elucidation of codon usage signatures across the domains of life' (Novoa et al., Mol Biol Evol 2019), available here: https://doi.org/10.1093/molbev/msz124
Codon autocorrelation reflects the non-random distribution of consecutive codon occurrences throughout a transcript.
Previous studies in yeast have shown that once a particular codon has been used, subsequent occurrences of the same amino acid in the same transcript are not random (Cannarozzi, et al. 2010), a phenomenon termed ‘codon autocorrelation’ or ‘codon covariation’. Mechanistically, it was argued that tRNA recycling is the driving force behind the observed biased distribution of synonymous codons along a sequence, i.e., codons that reuse the same tRNA would be favored as a means to increase the speed of translation (Cannarozzi, et al. 2010).
A subsequent study re-examined this question, and compared the autocorrelation between codons encoding the same amino acids to those encoding different ones (Hussmann and Press 2014). Intriguingly, this second study found that covariation between codons encoding different amino acids was as strong as covariation between codons encoding the same amino acid, concluding that there was insufficient evidence to claim that tRNA recycling is the force driving codon autocorrelation.
Despite the uncertain cause of codon covariation, both studies show that the probability of observing a specific codon is dependent on previous codon occurrences.
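To make the idea of "previous codon occurrences" concrete, the following minimal sketch counts, within a single coding sequence, which codon was used the last time the same amino acid appeared. It is an illustration only, not the code in this repository: the function name `count_successive_synonymous_pairs` and the toy sequence are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumption), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_successive_synonymous_pairs(cds):
    """Count (previous codon, current codon) pairs for successive
    occurrences of the same amino acid in one coding sequence.

    Hypothetical helper for illustration; not part of the repository.
    """
    cds = cds.upper()
    last_codon = {}  # amino acid -> codon used at its previous occurrence
    pairs = Counter()
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        codon = cds[i:i + 3]
        aa = CODON_TABLE.get(codon)
        if aa is None or aa == "*":
            continue  # skip stop codons and codons with ambiguous bases
        if aa in last_codon:
            pairs[(last_codon[aa], codon)] += 1
        last_codon[aa] = codon
    return pairs


# Toy sequence: Ala Lys Ala Lys Ala Lys
pairs = count_successive_synonymous_pairs("GCTAAAGCTAAGGCCAAG")
for (previous, current), n in sorted(pairs.items()):
    print(previous, current, n)
```

If codon choice were independent of earlier choices, the frequency of each pair would be close to the product of the individual codon frequencies; autocorrelation shows up as a departure from that expectation.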
- Scripts to compute and analyze codon autocorrelation from fasta sequences
- Output: 3 types of measurements:
  - SDEVS: number of standard deviations from expected, as defined in Cannarozzi et al., Cell 2010 (https://www.cell.com/abstract/S0092-8674(10)00189-3) --> output file: *.sdevs.txt
  - EFFECT MATRIX: normalized sdevs, for each amino acid --> output files: *effect_matrix.txt
  - RSCPU: relative synonymous codon pair usage, as defined in Novoa et al., Mol Biol Evol 2019. (https://doi.org/10.1093/molbev/msz124) --> output files *rscpu_way1.txt and *rscpu_way2.txt
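As a rough illustration of what "number of standard deviations from expected" can mean, the sketch below turns a table of observed codon pair counts for one amino acid into z-scores. It assumes a simple binomial model in which the expected pair frequency is the product of the marginal frequencies of the first and second codon. This is a generic approximation and is not claimed to reproduce the exact formulas of Cannarozzi et al. or of the scripts in this repository; the function name `sdevs_from_counts` and the example counts are hypothetical.

```python
import math


def sdevs_from_counts(pair_counts):
    """Return z-scores for codon pairs of ONE amino acid.

    pair_counts maps (previous codon, current codon) -> observed count.
    Assumption: binomial model, expected frequency = product of marginals.
    """
    total = sum(pair_counts.values())
    first = {}   # marginal counts of the previous codon
    second = {}  # marginal counts of the current codon
    for (prev_codon, cur_codon), n in pair_counts.items():
        first[prev_codon] = first.get(prev_codon, 0) + n
        second[cur_codon] = second.get(cur_codon, 0) + n

    z_scores = {}
    for prev_codon in first:
        for cur_codon in second:
            p = (first[prev_codon] / total) * (second[cur_codon] / total)
            expected = total * p
            sd = math.sqrt(total * p * (1 - p))
            observed = pair_counts.get((prev_codon, cur_codon), 0)
            z_scores[(prev_codon, cur_codon)] = (
                (observed - expected) / sd if sd > 0 else 0.0
            )
    return z_scores


# Made-up counts for lysine in which the same codon tends to be reused.
lys_counts = {
    ("AAA", "AAA"): 40, ("AAA", "AAG"): 10,
    ("AAG", "AAA"): 10, ("AAG", "AAG"): 40,
}
z = sdevs_from_counts(lys_counts)
for pair, value in sorted(z.items()):
    print(pair, round(value, 2))
```

In this toy example, pairs that reuse the same codon get positive values and pairs that switch codons get negative ones, which is the pattern a positive autocorrelation would produce.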
To go directly from a fasta file to relative codon pair usage, run:
fromFasta2CodonAutocorrelationFAST.sh <FILE.fasta>
Example:
fromFasta2CodonAutocorrelationFAST.sh Saccharomyces_cerevisiae.CDS.fasta
Note that this mode also counts consecutive codons (i.e., dicodons). The slower version excludes them, to avoid the biases that naturally exist in dicodon frequencies. Results will be placed in the 'results' folder.
The slower version excludes consecutive codons (i.e., dicodons) from the analysis. It also computes codon autocorrelation between codons that encode different amino acids. For these two reasons, it is much slower than the fast version. In our hands, excluding consecutive codons did not significantly change the results in the few species we checked; however, this may not be the case for all species. Overall, the fast version (i.e., fromFasta2CodonAutocorrelationFAST.sh) is recommended if only codon autocorrelation among codons that encode the same amino acid is needed.
Usage:
fromFasta2CodonAutocorrelation.sh <FILE.fasta>
Example:
fromFasta2CodonAutocorrelation.sh Saccharomyces_cerevisiae.CDS.fasta
Results will be placed in the 'results' folder.
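The difference between counting and excluding consecutive codons can be sketched as follows. This is one possible interpretation, written for illustration only: how the repository scripts actually treat adjacent occurrences is not documented here, and the function `count_pairs`, its `skip_adjacent` flag, and the toy input are hypothetical. The input is given as parallel lists of codons and amino acids to keep the example short.

```python
from collections import Counter


def count_pairs(codons, amino_acids, skip_adjacent=False):
    """Count (previous codon, current codon) pairs per amino acid.

    With skip_adjacent=True, a pair is not counted when the two
    occurrences sit at neighboring codon positions (a dicodon).
    Assumption: the skipped occurrence still becomes the "previous"
    codon for the next occurrence of that amino acid.
    """
    last_seen = {}  # amino acid -> (position, codon) of previous occurrence
    pairs = Counter()
    for pos, (codon, aa) in enumerate(zip(codons, amino_acids)):
        if aa in last_seen:
            prev_pos, prev_codon = last_seen[aa]
            is_dicodon = pos - prev_pos == 1
            if not (skip_adjacent and is_dicodon):
                pairs[(prev_codon, codon)] += 1
        last_seen[aa] = (pos, codon)
    return pairs


# Toy input: Lys Lys Ala Lys
codons = ["AAA", "AAG", "GCT", "AAG"]
amino_acids = ["K", "K", "A", "K"]

with_dicodons = count_pairs(codons, amino_acids)
without_dicodons = count_pairs(codons, amino_acids, skip_adjacent=True)
print(dict(with_dicodons))
print(dict(without_dicodons))
```

Here the pair (AAA, AAG) comes from two neighboring lysine codons, so it is counted only when consecutive codons are included.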
You can download CDS fasta sequences, for example, from the EMBL CDS database: ftp://ftp.ebi.ac.uk/
codon_autocorrelation_multiple_sequences.py <FILE.fasta>
This will generate two files: .total_codons and .paired_codons
parse_codon_autocorrelation_output.sh <FILE.paired_codons> <FILE.total_codons>
This will generate a merged file: .merged
fix_merged_file_codon_autocorrelation_counts.sh <FILE.merged>
This will generate a corrected version of the merged file: .FIXED
parse_codon_autocorrelation_merged.R --save < FILE.FIXED
- Code can be modified to allow for the calculation of codon autocorrelation among non-synonymous codons (ongoing work)
- Script that computes all VS all codons needs further tuning in terms of speed (ongoing work)
If you find this code useful, please cite: Novoa EM, Jungreis I, Jaillon O, Kellis M. Elucidation of codon usage signatures across the domains of life. Mol Biol Evol 2019. https://doi.org/10.1093/molbev/msz124
If you have any doubts/questions/concerns, please contact: evamaria.novoa@gmail.com. Thanks!