Pad non-multiples-of-three nucleotide sequences for protein translation and deal with codon degeneracy elegantly #659
Comments
To handle this, I think we could modify the full function:

In tracking this down, I realized we already have code in

So I think the only change we need here is to remove the following code from

I think this seems like a manageable Rust PR for me, heh. Will link when I have something. Please let me know if you see any issues with my thinking!

cc @luizirber, though he's already answered my questions on Slack :)

edit: we're testing ...but I don't see a test for the higher-level
Follow-up questions:
Stop codon options:
Larger change (probably a separate issue): do we want to actually make use of
Thinking through this second part: it seems to me that this should only happen a few times per sequence (at the end of each frame), unless the sequence is full of Ns.

My first thought was that we don't want to do anything that would introduce unnecessary noise. I'd be especially worried if we were generating a lot of these: I would not want to run the risk of creating a k-mer that does actually exist elsewhere, thus potentially overestimating its count or affecting ANI.

But a couple of things make me think it might be OK. It's not a huge amount of additional noise, given that we're already doing 6-frame translation. I'm also not sure how to evaluate k-mer counts or ANI of translated sequences as is: I currently always compare 6-frame translations to protein ORFs (either reference or Prodigal), rather than to other 6-frame translations (though I would be curious to see testing of the latter). And if we're doing that, then having all the potential k-mers might be OK and/or helpful.

@olgabot, I assume this issue was most pressing when translating short reads directly, as that's likely the shortest sequence we would deal with. Do you have an idea of how much of an issue this was there? Since it would introduce complexity, I think we would want a pretty solid case for the utility...
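To put a rough number on "a few times per sequence", here is a small sketch. It assumes every frame's trailing partial codon gets padded with Ns, that `ksize` is counted in amino acids, and that only the forward and reverse-complement strands are translated; the function name is made up for illustration and is not part of sourmash. Because the padded residue is always the last one in its frame, it can only appear in that frame's final k-mer.

```python
def extra_kmers_from_padding(seq_len, ksize):
    """Count protein k-mers, across six frames, that contain a padded codon.

    Hypothetical helper: assumes each frame's trailing partial codon is
    padded with Ns and translated, and that ksize is in amino acids.
    """
    extra = 0
    for offset in range(3):
        # Bases left over after the last complete codon in this frame.
        leftover = (seq_len - offset) % 3
        # Residues in the frame once the partial codon is padded out.
        n_residues = (seq_len - offset) // 3 + (1 if leftover else 0)
        # The padded residue is last, so only the final k-mer contains it.
        if leftover and n_residues >= ksize:
            extra += 1
    # The reverse complement has the same length, hence the same count.
    return 2 * extra


print(extra_kmers_from_padding(100, 7))
```

The three forward offsets always give three different remainders modulo three, so exactly two of the three forward frames end in a partial codon. Under these assumptions that is at most four padded k-mers per sequence, whatever its length.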
If the nucleotide sequence is not a multiple of three, add Ns to the end and still translate. In some cases, because of the third wobble base, there can still be a valid translation, e.g. `ACN` can still be translated to threonine (`T`).

Also, I wish there was a nice way to deal with the `X`s in the sequence... would it be at all reasonable to hash all potential amino acids given the first base(s)? E.g. for `GAGACTANN` --> `ETX`, could one hash all of the following amino acids:

- `GAGACTAU{U,C,A}`: `ETI`
- `GAGACTAUG`: `ETM`
- `GAGACTAC{U,C,A,G}`: `ETT`
- `GAGACTAA{C,U}`: `ETN`
- `GAGACTAA{A,G}`: `ETK`
- `GAGACTAG{U,C}`: `ETS`
- `GAGACTAG{A,G}`: `ETR`
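The padding and expansion ideas above could be sketched roughly as follows. This is an illustration only, not sourmash's implementation: the function names are hypothetical, the table is the standard genetic code (NCBI translation table 1), and only `A`, `C`, `G`, `T`/`U` and `N` are handled (other IUPAC ambiguity codes would raise a `KeyError`).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons ordered TCAG x TCAG x TCAG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def pad_to_codons(seq):
    """Uppercase, map U to T, and append Ns up to a multiple of three."""
    seq = seq.upper().replace("U", "T")
    return seq + "N" * (-len(seq) % 3)


def possible_amino_acids(codon):
    """Set of amino acids a codon could encode, where N means any base."""
    choices = [BASES if base == "N" else base for base in codon]
    return {CODON_TABLE["".join(c)] for c in product(*choices)}


def translate(seq):
    """Translate after padding; X marks codons that remain ambiguous."""
    padded = pad_to_codons(seq)
    residues = []
    for i in range(0, len(padded), 3):
        aas = possible_amino_acids(padded[i:i + 3])
        # A degenerate codon such as ACN still has one translation.
        residues.append(next(iter(aas)) if len(aas) == 1 else "X")
    return "".join(residues)


def expand(seq):
    """All protein strings consistent with the padded sequence."""
    padded = pad_to_codons(seq)
    per_codon = [
        sorted(possible_amino_acids(padded[i:i + 3]))
        for i in range(0, len(padded), 3)
    ]
    return ["".join(p) for p in product(*per_codon)]


print(translate("GAGAC"))    # padded to GAGACN
print(translate("GAGACTA"))  # padded to GAGACTANN
print(expand("GAGACTA"))
```

With these assumptions, `GAGAC` translates to `ET`, `GAGACTA` translates to `ETX`, and `expand` returns the seven candidates `ETI`, `ETK`, `ETM`, `ETN`, `ETR`, `ETS` and `ETT`, each of which could then be hashed. Note that the expansion grows multiplicatively with the number of ambiguous codons, and that for other prefixes it can include stop codons (`*`).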