ORF extraction occasionally extracts "wrong" ORFs #64

Closed
bmmalone opened this issue Mar 31, 2017 · 7 comments

@bmmalone
Contributor

It is not clear why, but ORF extraction sometimes extracts the wrong ORFs. This mostly appears to happen when a start or stop codon lies near an exon boundary, though not exclusively.

@bmmalone bmmalone added the bug label Mar 31, 2017
@bmmalone bmmalone self-assigned this Mar 31, 2017
@bmmalone
Contributor Author

An example of a bad ORF: ENSMUST00000159108_1:19063335-19064882

@CDieterich

On Ensembl it says:
Gm15825-001 | ENSMUST00000159108.2 | 918 | No protein | Antisense | TSL:1, GENCODE basic

http://www.ensembl.org/Mus_musculus/Transcript/Sequence_cDNA?db=core;g=ENSMUSG00000089787;r=1:19063300-19104840;t=ENSMUST00000159108

@bmmalone
Contributor Author

The "No protein" part is okay (the ORF type would just be "noncoding"). However, the extracted ORF extends into the intron of that transcript (image below; the top track is the annotation and the bottom track is the bad ORF), and there is nothing in the de novo assembly there either. There is, however, a start codon right at the end of the 3' exon in that image. I believe this is a corner case that is not handled correctly.

[image: wrong-orf]

[image: wrong-orf-start]

@bmmalone
Contributor Author

After looking more, it seems the problem for forward-strand ORFs is when an in-frame stop is the first codon for an exon.

[image: wrong-orf-forward-end]

For reverse-strand ORFs, the problem seems to be start codons at the 5' end of an exon.

[image: wrong-orf-reverse-start]

So this is a problem in misc.bio_utils.bed_utils.get_gen_pos

@bmmalone
Contributor Author

Upon even further inspection, the problem is not exactly with misc.bio_utils.bed_utils.get_gen_pos.

Consider a block structure like:

  • genomic coordinates: [10, 20), [30, 40)
  • transcript coordinates: [0, 10), [10, 20)

So, if we ask for the genomic coordinate of transcript coordinate 10, then the correct answer (which is returned by get_gen_pos) is 30.
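To make the lookup concrete, here is a minimal sketch of a transcript-to-genomic mapping over half-open blocks. It is not the actual misc.bio_utils.bed_utils.get_gen_pos code; the name `transcript_to_genomic` and its signature are hypothetical, and it only handles the forward strand.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open genomic interval [start, end)


def transcript_to_genomic(pos: int, blocks: List[Block]) -> int:
    """Map a 0-based transcript position to a genomic position.

    Hypothetical stand-in for illustration. Forward strand only;
    `blocks` must be sorted and non-overlapping.
    """
    offset = 0  # transcript coordinate at which the current block starts
    for start, end in blocks:
        length = end - start
        if offset <= pos < offset + length:
            return start + (pos - offset)
        offset += length
    raise ValueError(f"transcript position {pos} is outside the blocks")


blocks = [(10, 20), (30, 40)]
print(transcript_to_genomic(9, blocks))   # 19, last base of the first block
print(transcript_to_genomic(10, blocks))  # 30, first base of the second block
```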

However (following Ensembl conventions), stop codons are not included in ORFs. So, if a relevant (forward strand) stop codon begins at transcript coordinate 10, we really want the genomic position which comes after transcript coordinate 9 (so, 19+1=20). N.B. This genomic coordinate need not actually be part of the transcript.
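A small sketch of the difference, forward strand only. The helper names (`transcript_to_genomic`, `naive_orf_end`, `exclusive_orf_end`) are hypothetical and are not taken from the package; they only illustrate "map the last included base, then add one".

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open genomic interval [start, end)


def transcript_to_genomic(pos: int, blocks: List[Block]) -> int:
    """Hypothetical forward-strand mapping of a transcript position."""
    offset = 0
    for start, end in blocks:
        length = end - start
        if offset <= pos < offset + length:
            return start + (pos - offset)
        offset += length
    raise ValueError(f"transcript position {pos} is outside the blocks")


def naive_orf_end(stop_pos: int, blocks: List[Block]) -> int:
    """Map the first base of the stop codon directly (the problematic case)."""
    return transcript_to_genomic(stop_pos, blocks)


def exclusive_orf_end(stop_pos: int, blocks: List[Block]) -> int:
    """Map the last base before the stop codon, then step one past it."""
    if stop_pos < 1:
        raise ValueError("an ORF must contain at least one base before the stop")
    return transcript_to_genomic(stop_pos - 1, blocks) + 1


blocks = [(10, 20), (30, 40)]
# Stop codon is the first codon of the second exon (transcript coordinate 10).
print(naive_orf_end(10, blocks))      # 30: the ORF would span the intron
print(exclusive_orf_end(10, blocks))  # 20: the ORF ends with the first exon
```

When the stop codon sits inside an exon, the two functions agree (for example, both give 35 for transcript coordinate 15), so the difference only shows up at block boundaries.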

The problem is essentially the same for start codons on the reverse strand since the last base of the block structure is not included (so we want to point to one genomic position past the "A" in ATG, regardless of whether it is actually part of the transcript).

@bmmalone
Contributor Author

A bad ORF on the forward strand is: ENSMUST00000134384_1:4832348-4837000:+

bmmalone added a commit that referenced this issue Mar 31, 2017
@bmmalone bmmalone reopened this Mar 31, 2017
@bmmalone
Contributor Author

This "subtract-one/add-one" fix breaks for start codons at the first position in the transcript.

bmmalone added a commit that referenced this issue Apr 2, 2017
This addresses Issue #64.

It was unnecessary to make this adjustment to starts since bed intervals are
closed on that side. The current fix implicitly assumes that ORFs have a
genomic length of at least 1. Since the start codon alone is length 3, it is
not possible for a valid ORF to have length less than 3.

Test cases have been implemented which suggest the correctness of the approach,
and they will soon be incorporated into the package.
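
As a rough illustration of the behaviour described in this commit message (not the committed code or its test cases), the sketch below leaves the start untouched and adjusts only the end. The names `transcript_to_genomic` and `orf_to_bed_interval` are hypothetical, and only the forward strand is covered.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open genomic interval [start, end)


def transcript_to_genomic(pos: int, blocks: List[Block]) -> int:
    """Hypothetical forward-strand mapping of a transcript position."""
    offset = 0
    for start, end in blocks:
        length = end - start
        if offset <= pos < offset + length:
            return start + (pos - offset)
        offset += length
    raise ValueError(f"transcript position {pos} is outside the blocks")


def orf_to_bed_interval(t_start: int, t_end: int,
                        blocks: List[Block]) -> Tuple[int, int]:
    """Convert a half-open transcript interval to a half-open genomic one.

    The start is mapped directly because BED starts are inclusive.
    The end is found by mapping the last included base and adding one.
    """
    if t_end <= t_start:
        raise ValueError("ORF must have a length of at least 1")
    gen_start = transcript_to_genomic(t_start, blocks)
    gen_end = transcript_to_genomic(t_end - 1, blocks) + 1
    return gen_start, gen_end


blocks = [(10, 20), (30, 40)]
# A start codon at the first transcript position needs no adjustment.
print(orf_to_bed_interval(0, 10, blocks))   # (10, 20)
# An ORF running to the very end of the transcript also maps cleanly.
print(orf_to_bed_interval(10, 20, blocks))  # (30, 40)
```

Because only `t_end - 1` is ever looked up, a start at transcript position 0 never produces a negative index, which is the failure the earlier "subtract-one/add-one" attempt ran into.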