ORF extraction occasionally extracts "wrong" ORFs #64

bmmalone · 2017-03-31T11:42:13Z

It is not clear why, but ORF extraction seems to sometimes extract wrong ORFs. This appears to happen when there are start or stop codons near exon boundaries, but not exclusively.

bmmalone · 2017-03-31T11:45:28Z

An example of a bad ORF: ENSMUST00000159108_1:19063335-19064882

CDieterich · 2017-03-31T11:47:46Z

On Ensembl it says:
Gm15825-001, ENSMUST00000159108.2, 918, No protein, Antisense, TSL:1, GENCODE basic

http://www.ensembl.org/Mus_musculus/Transcript/Sequence_cDNA?db=core;g=ENSMUSG00000089787;r=1:19063300-19104840;t=ENSMUST00000159108

bmmalone · 2017-03-31T11:55:34Z

The "No protein" part is okay (the ORF type would just be "noncoding"). The problem is that the extracted ORF extends into the intron of that transcript (image below; the top track is the annotation and the bottom track is the bad ORF), and there is nothing in the de novo assembly there either. There is, however, a start codon right at the end of the 3' exon in that image. I believe this is a corner case that is not handled correctly.

bmmalone · 2017-03-31T12:28:13Z

After looking more, it seems the problem for forward-strand ORFs occurs when an in-frame stop is the first codon of an exon.

For reverse-strand ORFs, the problem seems to be start codons at the 5' end of an exon.

So this is a problem in misc.bio_utils.bed_utils.get_gen_pos

bmmalone · 2017-03-31T13:35:12Z

Upon even further inspection, the problem is not exactly with misc.bio_utils.bed_utils.get_gen_pos.

Consider a block structure like:

genomic coordinates: [10, 20), [30, 40)
transcript coordinates: [0, 10), [10, 20)

So, if we ask for the genomic coordinate of transcript coordinate 10, then the correct answer (which is returned by get_gen_pos) is 30.

However (following Ensembl conventions), stop codons are not included in ORFs. So, if a relevant (forward-strand) stop codon begins at transcript coordinate 10, we really want the genomic position that comes after transcript coordinate 9 (so, 19+1=20). N.B. This genomic coordinate need not actually be part of the transcript.

The problem is essentially the same for start codons on the reverse strand since the last base of the block structure is not included (so we want to point to one genomic position past the "A" in ATG, regardless of whether it is actually part of the transcript).
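To make the forward-strand case concrete, here is a minimal sketch using the block structure above. The names transcript_to_genomic and orf_genomic_end are hypothetical and written only for this illustration; this is not the actual code of misc.bio_utils.bed_utils.get_gen_pos, just a stand-in that behaves the same way on this example.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open genomic interval [start, end)


def transcript_to_genomic(pos: int, blocks: List[Block]) -> int:
    """Map a 0-based forward-strand transcript position to a genomic position.

    Hypothetical stand-in for a get_gen_pos-style lookup; blocks are assumed
    to be sorted by genomic start and non-overlapping.
    """
    offset = pos
    for start, end in blocks:
        length = end - start
        if offset < length:
            return start + offset
        offset -= length
    raise ValueError(f"transcript position {pos} lies outside the blocks")


def orf_genomic_end(stop_pos: int, blocks: List[Block]) -> int:
    """Exclusive genomic end of a forward-strand ORF.

    stop_pos is the transcript position where the stop codon begins. The stop
    codon is excluded, so we map the last included base and step one past it.
    """
    return transcript_to_genomic(stop_pos - 1, blocks) + 1


blocks = [(10, 20), (30, 40)]

# Mapping the stop position directly lands on the first base of the next exon.
naive_end = transcript_to_genomic(10, blocks)

# Mapping the last included base and adding one stays at the exon boundary.
fixed_end = orf_genomic_end(10, blocks)

print(naive_end, fixed_end)
```

With the naive end, the BED interval would run through the intron up to the second exon; with the adjusted end, it stops at the end of the first exon. Away from exon boundaries the two approaches agree.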

bmmalone · 2017-03-31T13:39:28Z

A bad ORF on the forward strand is: ENSMUST00000134384_1:4832348-4837000:+

This addresses Issue #64.

bmmalone · 2017-03-31T16:47:46Z

This "subtract-one/add-one" fix breaks for start codons at the first position in the transcript.

This addresses Issue #64. It was unnecessary to make this adjustment to starts since BED intervals are closed on that side. The current fix does implicitly assume that ORFs have a genomic length of at least 1. Since the start codon alone has length 3, a valid ORF cannot have length less than 3. Test cases have been implemented which suggest the correctness of the approach, and they will soon be incorporated into the package.
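As a rough illustration of the idea (map the first and last included bases directly, then add one only on the open side of the BED interval), here is a self-contained sketch. The function names, the strand handling, and the test values are assumptions made for this example; they are not the code from the referenced commit, and the blocks are the toy structure [10, 20), [30, 40) rather than real annotations.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open genomic interval [start, end)


def transcript_to_genomic(pos: int, blocks: List[Block], strand: str) -> int:
    """Map a 0-based transcript position to a genomic position.

    Hypothetical helper; blocks are assumed sorted by genomic start.
    On the reverse strand, transcript position 0 is the highest genomic base.
    """
    if pos < 0:
        raise ValueError("transcript position must be non-negative")
    ordered = blocks if strand == "+" else blocks[::-1]
    offset = pos
    for start, end in ordered:
        length = end - start
        if offset < length:
            return start + offset if strand == "+" else end - 1 - offset
        offset -= length
    raise ValueError(f"transcript position {pos} lies outside the blocks")


def orf_genomic_span(orf_start: int, orf_end: int,
                     blocks: List[Block], strand: str) -> Tuple[int, int]:
    """Return the half-open genomic (start, end) of an ORF.

    orf_start: transcript position of the first base of the start codon.
    orf_end:   transcript position where the stop codon begins (excluded).
    Only included bases are mapped, so no position is ever shifted below 0,
    and the ORF must contain at least one base.
    """
    if orf_end <= orf_start:
        raise ValueError("ORF must contain at least one base")
    first = transcript_to_genomic(orf_start, blocks, strand)
    last = transcript_to_genomic(orf_end - 1, blocks, strand)
    low, high = min(first, last), max(first, last)
    return low, high + 1  # closed start, open end


blocks = [(10, 20), (30, 40)]

# Forward strand: start codon at transcript position 0, stop at an exon start.
forward_span = orf_genomic_span(0, 10, blocks, "+")

# Reverse strand: start codon on the first transcript base of the second exon.
reverse_span = orf_genomic_span(10, 16, blocks, "-")

print(forward_span, reverse_span)
```

Because only bases that belong to the ORF are mapped, a start codon at transcript position 0 needs no special handling in this sketch, and the only "+1" is applied to the open end of the interval.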

bmmalone added the bug label Mar 31, 2017

bmmalone self-assigned this Mar 31, 2017

bmmalone added a commit that referenced this issue Mar 31, 2017

FIX coordinates of orfs with start/stop codon on exon boundary

63ca81c

This addresses Issue #64.

bmmalone closed this as completed Mar 31, 2017

bmmalone reopened this Mar 31, 2017

tmargus mentioned this issue Mar 31, 2017

strong bias of found ORF classes between strands #54

Closed

bmmalone closed this as completed Apr 25, 2017
