ORF extraction occasionally extracts "wrong" ORFs #64
Comments
An example of a bad ORF: ENSMUST00000159108_1:19063335-19064882
On Ensembl, the entry for this transcript says "No protein".
The "No protein" part is okay (the ORF type would just be "noncoding"). However, the extracted ORF extends into the intron of that transcript (in the attached image, the top track is the annotation and the bottom track is the bad ORF), and there is nothing in the de novo assembly there either. There is, however, a start codon right at the end of the 3' exon in that image. I believe this is a corner case that is not handled correctly.
Upon further inspection, the problem is not exactly with the coordinate conversion itself. Consider a block structure in which transcript coordinate 9 is the last base of one block (genomic position 19) and transcript coordinate 10 is the first base of the next block, on the far side of an intron.
So, if we ask for the genomic coordinate of transcript coordinate 10, the correct answer is the first position of that next block, and that is what the conversion returns. However (following Ensembl conventions), stop codons are not included in ORFs. So, if a relevant (forward strand) stop codon begins at transcript coordinate 10, we really want the genomic position which comes after transcript coordinate 9 (so, 19+1=20). N.B. This genomic coordinate need not actually be part of the transcript. The problem is essentially the same for start codons on the reverse strand, since the last base of the block structure is not included (so we want to point to one genomic position past the "A" in ATG, regardless of whether it is actually part of the transcript).
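The off-by-one described above can be reproduced with a minimal sketch. The block coordinates and the helper `transcript_to_genomic` are hypothetical, chosen only so that transcript coordinate 9 lands on genomic position 19; they are not the package's actual data structures or function names.

```python
from typing import List, Tuple

# Hypothetical exon blocks as half-open genomic intervals [start, end),
# BED-style. Positions 20..29 form an intron.
BLOCKS: List[Tuple[int, int]] = [(10, 20), (30, 40)]


def transcript_to_genomic(blocks: List[Tuple[int, int]], t_pos: int) -> int:
    """Map a 0-based transcript position to a genomic position (forward strand)."""
    if t_pos < 0:
        raise IndexError("transcript position must be non-negative")
    offset = t_pos
    for start, end in blocks:
        length = end - start
        if offset < length:
            return start + offset
        offset -= length
    raise IndexError("transcript position lies outside the blocks")


stop_codon_start = 10  # transcript coordinate where the stop codon begins

# Converting the stop codon position directly jumps across the intron.
naive_end = transcript_to_genomic(BLOCKS, stop_codon_start)

# Converting the last ORF base and adding one stays at the exon boundary.
fixed_end = transcript_to_genomic(BLOCKS, stop_codon_start - 1) + 1
```

With these assumed blocks, using `naive_end` as the exclusive end of the ORF would make the interval run through the intron, which matches the symptom reported here, while `fixed_end` stops right after the last exonic base.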
A bad ORF on the forward strand is: ENSMUST00000134384_1:4832348-4837000:+
This "subtract-one/add-one" fix breaks for start codons at the first position in the transcript.
This addresses Issue #64. It was unnecessary to make this adjustment to starts since BED intervals are closed on that side. The current fix implicitly assumes that ORFs have a genomic length of at least 1; since the start codon alone has length 3, a valid ORF cannot be shorter than 3. Test cases have been implemented which suggest that the approach is correct, and they will soon be incorporated into the package.
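One way to express this kind of fix, as a sketch rather than the package's actual code, is to convert only bases that belong to the ORF (its first and last base) and then add one on the open side of the resulting interval. The function name `orf_to_genomic_interval`, its signature, and the block coordinates below are all hypothetical.

```python
from typing import List, Tuple


def orf_to_genomic_interval(
    blocks: List[Tuple[int, int]], t_start: int, t_end: int, strand: str
) -> Tuple[int, int]:
    """Return the half-open genomic interval spanned by an ORF.

    blocks: half-open genomic exon intervals, sorted by genomic position.
    t_start, t_end: half-open ORF in 0-based transcript coordinates (5' to 3');
        t_end is where the stop codon begins, so the stop is excluded.
    strand: "+" or "-".
    """
    if t_end - t_start < 1:
        raise ValueError("ORF must contain at least one base")
    total = sum(end - start for start, end in blocks)

    def to_genomic(t_pos: int) -> int:
        if not 0 <= t_pos < total:
            raise IndexError("transcript position lies outside the blocks")
        # On the reverse strand, transcript position 0 is the highest genomic base.
        offset = t_pos if strand == "+" else total - 1 - t_pos
        for start, end in blocks:
            length = end - start
            if offset < length:
                return start + offset
            offset -= length
        raise AssertionError("unreachable: position was range-checked")

    first_base = to_genomic(t_start)    # first base of the start codon
    last_base = to_genomic(t_end - 1)   # last base before the stop codon
    # Only the open (upper) side of the half-open interval gets the +1.
    return min(first_base, last_base), max(first_base, last_base) + 1


blocks = [(10, 20), (30, 40)]
forward = orf_to_genomic_interval(blocks, 0, 10, "+")
reverse = orf_to_genomic_interval(blocks, 0, 10, "-")
```

Because the start is never shifted in transcript space, a start codon at transcript position 0 needs no special handling in this sketch, and the length check makes the "at least one base" assumption explicit.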
It is not clear why, but ORF extraction seems to sometimes extract wrong ORFs. This appears to happen when there are start or stop codons near exon boundaries, but not exclusively.