Pad non-multiples-of-three nucleotide sequences for protein translation and deal with codon degeneracy elegantly #659

Open
olgabot opened this issue Mar 26, 2019 · 3 comments

Comments

@olgabot
Collaborator

olgabot commented Mar 26, 2019

If the nucleotide sequence is not a multiple of three, add Ns to the end and still translate. In some cases, because of the third wobble base, there can still be a valid translation, e.g. ACN can still be translated to Threonine (T).

     E  T  X
    *  D  X
   V  R  L
NNGTGAGACTANN
NNCACTCTGATNN
   H  S  *
  T  L  S
 X  S  V     
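The padding idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not sourmash's implementation: `translate_codon` and `translate_frame` here are hypothetical helpers, the table is the standard genetic code, and an ambiguous codon is resolved only when every possible completion yields the same amino acid.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_codon(codon):
    """Translate one codon, padding short chunks with N (hypothetical helper)."""
    codon = codon.upper().replace("U", "T").ljust(3, "N")
    # Each N may stand for any base; every other character stands for itself.
    options = [BASES if base == "N" else base for base in codon]
    amino_acids = {CODON_TABLE.get("".join(c), "X") for c in product(*options)}
    # Wobble case: all completions agree (e.g. ACN -> T). Otherwise emit X.
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate_frame(seq, offset=0):
    """Translate one forward frame; a short final chunk is padded, not dropped."""
    seq = seq[offset:]
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, len(seq), 3))
```

With this sketch, `translate_frame("GAGAC")` gives `ET` because the trailing `AC` is padded to `ACN`, while `translate_frame("GAGACTA")` gives `ETX` because `ANN` is ambiguous.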

Also I wish there was a nice way to deal with the Xs in the sequence... would it be at all reasonable to hash all potential amino acids given the first base(s)? E.g., for GAGACTANN --> ETX, could one hash all of the following amino acid sequences:

  • GAGACTAU{U,C,A}: ETI
  • GAGACTAUG: ETM
  • GAGACTAC{U,C,A,G}: ETT
  • GAGACTAA{C,U}: ETN
  • GAGACTAA{A,G}: ETK
  • GAGACTAG{U,C}: ETS
  • GAGACTAG{A,G}: ETR
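The enumeration in the list above can be generated rather than written by hand. The following is a minimal sketch, assuming the standard genetic code and assuming only the final codon is ambiguous; `possible_translations` is a hypothetical name, not an existing sourmash function.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def possible_translations(seq):
    """Return every protein obtained by resolving the Ns in the last codon."""
    seq = seq.upper().replace("U", "T")
    if not seq:
        return set()
    # Pad with N up to the next multiple of three.
    seq = seq.ljust(-(-len(seq) // 3) * 3, "N")
    body, tail = seq[:-3], seq[-3:]
    # Codons before the tail are translated directly (X if unrecognized).
    prefix = "".join(CODON_TABLE.get(body[i:i + 3], "X") for i in range(0, len(body), 3))
    # Expand each N in the tail to all four bases and collect the amino acids.
    options = [BASES if base == "N" else base for base in tail]
    tail_aas = {CODON_TABLE.get("".join(c), "X") for c in product(*options)}
    return {prefix + aa for aa in tail_aas}
```

For `GAGACTANN` this returns the seven sequences ETI, ETM, ETT, ETN, ETK, ETS, and ETR. Note that for other leading bases the set can include a stop (`*`), so a real implementation would need to decide whether to hash those.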
@olgabot olgabot changed the title from "Pad non-multiples-of-three nucleotide sequences for protein translation" to "Pad non-multiples-of-three nucleotide sequences for protein translation and deal with codon degeneracy elegantly" Mar 26, 2019
@bluegenes
Contributor

bluegenes commented Jun 9, 2021

To handle this, I think we could modify the to_aa rust code for when the aa chunk length is 2, which should only happen at the end of the sequence.

full function:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L330-L346

In tracking this down, I realized we already have code in translate_codon to handle any chunk length! For len==2, we add N to the end and still try to translate. The codon table (line 88) provides codons with an N at the 3rd position wherever 3rd-base wobble allows translation to the same AA.
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L288-L307

So I think the only change we need here is to remove the following code from to_aa, since chunks of any size (1,2,3) can be handled within translate_codon:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/encodings.rs#L334-L336
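To make the expected behavior change concrete, here is a rough Python model of the chunking logic. It is not the Rust code linked above: it assumes the lines proposed for removal skip chunks shorter than three bases, and it assumes a codon translator that pads short chunks with N and returns X when the result is ambiguous. The `drop_short_chunks` flag and both function bodies are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_codon(chunk):
    """Assumed behavior: pad chunks of length 1 or 2 with N, then translate."""
    codon = chunk.upper().ljust(3, "N")
    options = [BASES if base == "N" else base for base in codon]
    amino_acids = {CODON_TABLE.get("".join(c), "X") for c in product(*options)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def to_aa(seq, drop_short_chunks):
    """Model of to_aa with (True) and without (False) the short-chunk check."""
    chunks = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if drop_short_chunks:
        # Assumed current behavior: a trailing chunk of 1 or 2 bases is skipped.
        chunks = [chunk for chunk in chunks if len(chunk) == 3]
    return "".join(translate_codon(chunk) for chunk in chunks)
```

Under these assumptions, `GAGAC` goes from `E` to `ET` once the check is removed. One side effect worth checking in the real code: a trailing chunk of length 1 (e.g. `GAGA`) would now add a trailing `X`, giving `EX` instead of `E`.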

I think this seems like a manageable rust PR for me, heh - will link when I have something. Please let me know if you see any issues with my thinking!

cc @luizirber, though he's already answered my q's on slack :)

edit: we're testing translate_codon here:
https://github.com/dib-lab/sourmash/blob/6b5806cf528583b864e1969739f65508c980ebd3/tests/test_minhash.py#L206-L214
which makes use of https://github.com/dib-lab/sourmash/blob/6b5806cf528583b864e1969739f65508c980ebd3/src/sourmash/minhash.py#L81-L87

...but I don't see a test for the higher-level to_aa function.

@bluegenes
Contributor

Follow-up questions:

Stop codon options:

  • keep as X for dayhoff/hp
  • add (b'*', b'*'), to dayhoff/hp tables to keep as *

Larger change / probably a separate issue: do we want to actually make use of * by not translating frames that contain * (or by stopping translation at *, etc.)?
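The two stop codon options, plus the larger change, can be compared with a small sketch. The Dayhoff grouping below is the usual six-class one and is written out here as an assumption rather than copied from the sourmash tables; `to_dayhoff`, `keep_stop`, and `segments_between_stops` are hypothetical names.

```python
# Assumed six-class Dayhoff grouping: each amino acid maps to one letter a-f.
DAYHOFF_GROUPS = {
    "C": "a",
    "AGPST": "b",
    "DENQ": "c",
    "HKR": "d",
    "ILMV": "e",
    "FWY": "f",
}
DAYHOFF = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def to_dayhoff(protein, keep_stop=False):
    """Encode a protein string; unknown residues (and, by default, *) become X."""
    table = dict(DAYHOFF)
    if keep_stop:
        # Option 2: add a ('*', '*') entry so stops survive the encoding.
        table["*"] = "*"
    return "".join(table.get(aa, "X") for aa in protein)


def segments_between_stops(protein):
    """The larger change: split a translated frame at stops, dropping empties."""
    return [segment for segment in protein.split("*") if segment]
```

For example, `to_dayhoff("ET*")` gives `cbX` under the first option and `cb*` with `keep_stop=True`, while `segments_between_stops("ET*VRL")` yields `["ET", "VRL"]`.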

@bluegenes
Contributor

Also I wish there was a nice way to deal with the Xs in the sequence... would it be at all reasonable to hash all potential amino acids given the first base(s)? E.g., for GAGACTANN --> ETX, could one hash all of the following amino acid sequences:
GAGACTAU{U,C,A}: ETI
GAGACTAUG: ETM
GAGACTAC{U,C,A,G}: ETT
GAGACTAA{C,U}: ETN
GAGACTAA{A,G}: ETK
GAGACTAG{U,C}: ETS
GAGACTAG{A,G}: ETR

Thinking through this second part -- it seems to me that this should only happen a few times per sequence (at the end of each frame), unless the sequence is full of Ns. My first thought was that we don't want to do anything that would introduce unnecessary noise. I'd be especially worried if we were generating a lot of these: we would not want to risk creating a k-mer that actually exists elsewhere, which could inflate its count or skew ANI estimates.

But a couple of things make me think it might be ok. It's not a huge amount of additional noise, given that we're already doing 6-frame translation. I'm not even sure how to evaluate k-mer counts or ANI of translated sequences as is: I currently always compare 6-frame translations to protein ORFs (either reference or prodigal), rather than to other 6-frame translations (though I would be curious to see testing of the latter). And if we're doing that, then having all the potential k-mers might be ok and/or helpful.
tag @ctb for thoughts.
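To put a rough number on the noise concern, one could count, per frame, how many protein k-mers are unambiguous versus how many alternative final k-mers the expansion would add. This is a back-of-the-envelope sketch, assuming the standard genetic code, an input containing only A/C/G/T, and that alternatives translating to a stop are not hashed; `tail_kmer_counts` is a hypothetical helper.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def tail_kmer_counts(seq, ksize):
    """Return (unambiguous k-mers, alternative final k-mers) for one frame."""
    n_full = len(seq) // 3          # complete codons in this frame
    leftover = len(seq) % 3         # 0, 1, or 2 trailing bases
    unambiguous = max(n_full - ksize + 1, 0)
    # No partial codon, or too few residues to form a k-mer ending in it.
    if leftover == 0 or n_full + 1 < ksize:
        return unambiguous, 0
    tail = seq[n_full * 3:].upper().ljust(3, "N")
    options = [BASES if base == "N" else base for base in tail]
    amino_acids = {CODON_TABLE.get("".join(c), "X") for c in product(*options)}
    # Each distinct non-stop amino acid yields one alternative final k-mer.
    return unambiguous, len(amino_acids - {"*"})
```

The expansion adds at most one k-mer position per frame, with up to 13 alternatives when a single base is left over, so the relative noise shrinks as the sequence grows and matters most for short reads.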

@olgabot - I assume this issue was most pressing when translating short reads directly, as that's likely the shortest sequence we would deal with. Do you have an idea of how much of an issue this was there? Since it would introduce complexity, I think we would want a pretty solid case for the utility...
