Pad non-multiples-of-three nucleotide sequences for protein translation and deal with codon degeneracy elegantly #659

olgabot · 2019-03-26T15:57:38Z

If the nucleotide sequence is not a multiple of three, add Ns to the end and still translate. In some cases, because of the third wobble base, there can still be a valid translation, e.g. ACN can still be translated to Threonine (T).

     E  T  X
    *  D  X
   V  R  L
NNGTGAGACTANN
NNCACTCTGATNN
   H  S  *
  T  L  S
 X  S  V
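
A minimal Python sketch of the padding idea, assuming the standard genetic code and treating any non-ACGT base as N. The helper names (`translate_codon`, `translate_padded`, `six_frame`) are illustrative only and are not sourmash's API. A codon containing N resolves to an amino acid only when every completion agrees; otherwise it becomes X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate_codon(codon):
    """Return the amino acid if all completions of ambiguous bases agree, else 'X'."""
    options = [b if b in BASES else BASES for b in codon]
    residues = {CODON_TABLE["".join(c)] for c in product(*options)}
    return residues.pop() if len(residues) == 1 else "X"


def translate_padded(seq):
    """Pad with N up to a multiple of three, then translate codon by codon."""
    seq = seq.upper()
    seq += "N" * (-len(seq) % 3)
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, len(seq), 3))


def six_frame(seq):
    """Three forward frames followed by three reverse-complement frames."""
    forward = seq.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate_padded(s[f:]) for s in (forward, reverse) for f in range(3)]
```

For the unpadded core of the example above, `six_frame("GTGAGACTA")` gives VRL, *DX, ETX on the forward strand and *SH, SLT, VSX on the reverse strand, which matches the diagram (the diagram prints the reverse frames right to left).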

Also, I wish there was a nice way to deal with the Xs in the sequence. Would it be at all reasonable to hash all potential amino acids given the first base(s)? E.g., for GAGACTANN --> ETX, could one hash all of the following amino acids:

GAGACTAU{U,C,A}: ETI
GAGACTAUG: ETM
GAGACTAC{U,C,A,G}: ETT
GAGACTAA{C,U}: ETN
GAGACTAA{A,G}: ETK
GAGACTAG{U,C}: ETS
GAGACTAG{A,G}: ETR
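
The expansion proposed above can be sketched in Python as follows. This assumes the standard genetic code and DNA letters (T rather than U); `candidate_peptides` is a hypothetical helper, not something that exists in sourmash. Each codon contributes the set of amino acids reachable by filling in its ambiguous bases, and the candidates are the cross product over codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def candidate_peptides(seq):
    """List every peptide consistent with seq after N-padding to whole codons."""
    seq = seq.upper()
    seq += "N" * (-len(seq) % 3)
    per_codon = []
    for i in range(0, len(seq), 3):
        # Any base outside ACGT is treated as fully ambiguous.
        options = [b if b in BASES else BASES for b in seq[i:i + 3]]
        residues = {CODON_TABLE["".join(c)] for c in product(*options)}
        per_codon.append(sorted(residues))
    return ["".join(p) for p in product(*per_codon)]
```

`candidate_peptides("GAGACTANN")` returns seven peptides (ETI, ETK, ETM, ETN, ETR, ETS, ETT). Note that the count grows multiplicatively with the number of ambiguous codons, so this only stays small when Ns are confined to the final codon.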

bluegenes · 2021-06-09T17:21:06Z

To handle this, I think we could modify the Rust to_aa code for the case where the final codon chunk has length 2, which should only happen at the end of the sequence.

full function:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L330-L346

In tracking this down, I realized we already have code in translate_codon to handle any chunk length! For len==2, we add N to the end and still try to translate. The codon table (line 88) provides codons with an N at the 3rd position where 3rd-base wobble allows translation to the same AA.
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L288-L307

So I think the only change we need here is to remove the following code from to_aa, since chunks of any size (1,2,3) can be handled within translate_codon:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L334-L336
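
To make the proposal concrete, here is a rough Python sketch of the chunking logic as described in this comment. It is not the Rust code: `translate_codon_sketch`, `to_aa_sketch`, and the `min_chunk_len` parameter are hypothetical stand-ins, and the table is the standard genetic code extended with N-in-third-position entries wherever all four completions agree. With `min_chunk_len=3` short trailing chunks are dropped; lowering it lets them through to the codon translator.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Add wobble entries such as ACN -> T when the third base does not matter.
for first, second in product(BASES, repeat=2):
    residues = {CODON_TABLE[first + second + third] for third in BASES}
    if len(residues) == 1:
        CODON_TABLE[first + second + "N"] = residues.pop()


def translate_codon_sketch(chunk):
    """Translate a chunk of length 1, 2, or 3; unknown codons become 'X'."""
    if len(chunk) == 1:
        return "X"  # one base is never informative
    if len(chunk) == 2:
        chunk += "N"  # pad and rely on the wobble entries
    return CODON_TABLE.get(chunk, "X")


def to_aa_sketch(seq, min_chunk_len=3):
    """Translate seq in chunks of three, dropping a trailing chunk shorter than min_chunk_len."""
    residues = []
    for i in range(0, len(seq), 3):
        chunk = seq[i:i + 3].upper()
        if len(chunk) < min_chunk_len:
            break
        residues.append(translate_codon_sketch(chunk))
    return "".join(residues)
```

With this sketch, `to_aa_sketch("GAGACTAC")` drops the trailing `AC`, while `to_aa_sketch("GAGACTAC", min_chunk_len=2)` translates it to T via the `ACN` entry.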

I think this seems like a manageable rust PR for me, heh - will link when I have something. Please let me know if you see any issues with my thinking!

cc @luizirber, though he's already answered my q's on slack :)

edit: we're testing translate_codon here:
https://github.com/dib-lab/sourmash/blob/6b5806cf528583b864e1969739f65508c980ebd3/tests/test_minhash.py#L206-L214
which makes use of https://github.com/dib-lab/sourmash/blob/6b5806cf528583b864e1969739f65508c980ebd3/src/sourmash/minhash.py#L81-L87

...but I don't see a test for the higher-level to_aa function.

bluegenes · 2021-06-09T19:07:58Z

Follow-up questions:

1. In to_aa, do we want to continue ignoring chunk size 1 (only modify for chunk size 2), since we will never get an informative aa (always X)?
   https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L289-L291
2. Stop codons: what is our intended/desired behavior for all alphabets?
   - Current:
     - represented as * in protein AA, e.g. https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L106-L107
     - No translation provided in dayhoff or hp tables, so they are returned as X; see: https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L315-L327
       Note: I'm not actually sure what happens to * during _hash_murmur - I assume nothing special, so a k-mer with X would produce a different hash than a k-mer with * at the same position.

Stop codon options:

- keep as X for dayhoff/hp
- add (b'*', b'*'), to dayhoff/hp tables to keep as *
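
A small Python sketch of the two options, for illustration only. The groupings below are the commonly used Dayhoff and hydrophobic-polar classes, and I am assuming they match the tables in encodings.rs; `build_table`, `encode`, and `keep_stop` are hypothetical names.

```python
# Assumed groupings: Dayhoff classes a-f and hydrophobic (h) / polar (p).
DAYHOFF_GROUPS = {"a": "C", "b": "AGPST", "c": "DENQ", "d": "HKR", "e": "ILMV", "f": "FWY"}
HP_GROUPS = {"h": "AFGILMPVWY", "p": "CDEHKNQRST"}


def build_table(groups, keep_stop=False):
    """Map each amino acid to its class; optionally pass '*' through unchanged."""
    table = {aa: code for code, members in groups.items() for aa in members}
    if keep_stop:
        table["*"] = "*"
    return table


def encode(protein, table):
    """Re-encode a protein string; anything missing from the table becomes 'X'."""
    return "".join(table.get(aa, "X") for aa in protein)
```

Under the first option, `ET*` and `ETX` both encode to `cbX` in dayhoff, so a stop codon and an unknown residue become indistinguishable before hashing. Under the second, `ET*` encodes to `cb*` and stays distinct.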

Larger change / probably separate issue: do we want to actually make use of * by not translating frames with * (or stopping translation at *, etc.)?

bluegenes · 2021-06-09T19:33:34Z

> Also I wish there was a nice way to deal with the Xs in the sequence... would it be at all reasonable to hash all potential amino acids given the first base(s)? e.g. for GAGACTANN --> ETX could one hash all of the following amino acids:
>
> GAGACTAU{U,C,A}: ETI
> GAGACTAUG: ETM
> GAGACTAC{U,C,A,G}: ETT
> GAGACTAA{C,U}: ETN
> GAGACTAA{A,G}: ETK
> GAGACTAG{U,C}: ETS
> GAGACTAG{A,G}: ETR

Thinking through this second part -- it seems to me that this should only happen a few times per sequence (at the end of each frame), unless the sequence is full of N's. My first thought was that we don't want to do anything that would introduce unnecessary noise. I'd be especially worried if we were generating a lot of these -- would not want to run the risk of creating a k-mer that does actually exist elsewhere, thus potentially overestimating the count for it or potentially affecting ANI.

But a couple of things make me think it might be ok -- it's not a huge amount of additional noise, given that we're already doing 6-frame translation. I'm not even sure how to evaluate k-mer counts or ANI of translated sequences as is -- I currently always compare 6-frame translations to protein ORFs (either reference or prodigal), rather than to other 6-frame translations (though I would be curious to see testing of the latter). And if we're doing that, then having all the potential k-mers might be ok and/or helpful.
tag @ctb for thoughts.

@olgabot - I assume this issue was most pressing when translating short reads directly, as that's likely the shortest sequence we would deal with. Do you have an idea of how much of an issue this was there? Since it would introduce complexity, I think we would want a pretty solid case for the utility...

olgabot changed the title ~~Pad non-multiples-of-three nucleotide sequences for protein translation~~ Pad non-multiples-of-three nucleotide sequences for protein translation and deal with codon degeneracy elegantly Mar 26, 2019

olgabot mentioned this issue Apr 7, 2019

with --protein, translate whole sequence and then emit k-mers #664

Closed

luizirber added enhancement idea labels May 1, 2019

ctb mentioned this issue May 25, 2020

new behavior for protein k-mer size calculations - gathering the issues together. #999

Closed

ctb mentioned this issue May 15, 2021

summary: further improvements to protein handling in sourmash #1525

Open

bluegenes mentioned this issue Jun 10, 2021

[EXP] use codon degeneracy to properly translate edge amino acids if possible #1579

Open

bluegenes mentioned this issue Dec 18, 2024

MRG: add skipmers; switch to reading frame approach for translation, skipmers #3395

Merged
