We're integrating MetaEuk into BUSCO and, while screening the coordinates of the resulting genes, I noticed a possible issue with how coordinates are reported when exons overlap on the negative strand.
Here is an example of a protein predicted by MetaEuk on the - strand:
From the doc: “The exon_coords are of the structure: low[taken_low]:high[taken_high]:nucleotide_length[taken_nucleotide_length]
Since MetaEuk allows for a very short overlap on T of two putative exons (see P2 and P3 in the illustration below), when joining the sequences of the exons, one of them is shortened. The coordinates of the codons taken from this exon will be in the square brackets ([taken_low], [taken_high] and [taken_nucleotide_length]).”
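To make the quoted format concrete, here is a minimal sketch of how one exon_coords field could be split into its six numbers. The names `ExonCoords` and `parse_exon_coords` are hypothetical helpers, not part of MetaEuk, and the sample field is made up for illustration (it is not taken from the header discussed here).

```python
import re
from typing import NamedTuple


class ExonCoords(NamedTuple):
    """One exon_coords field; the 'taken_*' values are those in square brackets."""
    low: int
    taken_low: int
    high: int
    taken_high: int
    length: int
    taken_length: int


# Matches one "value[taken_value]" pair, e.g. "250[247]".
_PAIR = re.compile(r"(\d+)\[(\d+)\]")


def parse_exon_coords(field: str) -> ExonCoords:
    """Parse 'low[taken_low]:high[taken_high]:length[taken_length]'."""
    parts = field.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected 3 colon-separated parts, got {len(parts)}: {field!r}")
    values = []
    for part in parts:
        match = _PAIR.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed part {part!r} in {field!r}")
        values.extend(int(group) for group in match.groups())
    return ExonCoords(*values)


# Made-up field for illustration only.
exon = parse_exon_coords("100[100]:250[247]:151[148]")
print(exon.taken_high, exon.taken_length)
```

With the fields parsed this way, the bracketed values of neighbouring exons can be compared directly when screening headers.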
But according to the coordinates in the brackets, the last two exons overlap: exon 2 ends at 50054 but exon 3 starts at 50063. The length of the shortened exon, “[123]” in the header, does not match the value you get from the reported coordinates: 50063 - 49869 = 194.
Nevertheless, the protein sequence and CDS seem to be correct when I compare them to the reference from which they were predicted. The length of the third exon is 123 bp, which corresponds to the length reported in the header.
Indeed, when I search the original scaffold with the predicted CDS using BLAST, the start coordinate of the third exon appears to be shifted by several bases with respect to what is reported in the header, e.g.:
len exon_end exon_start
Exon3 CH478315.1 Query_52157 100.000 123 49869 49991
So the problem likely affects only the coordinates in the header, not the predicted sequences.
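The comparison above can be sketched as a small consistency check between a pair of coordinates and a reported length. This assumes both coordinates are inclusive, so a span covers |a - b| + 1 bases; that convention is an assumption here (the subtraction above omits the +1), and the function names are hypothetical. Under either convention the header coordinates disagree with the length 123, while the BLAST coordinates agree with it.

```python
def inclusive_span(a: int, b: int) -> int:
    """Number of bases between two coordinates, counting both ends (assumed convention)."""
    return abs(a - b) + 1


def coords_match_length(a: int, b: int, reported_length: int) -> bool:
    """True if the span between the two coordinates equals the reported length."""
    return inclusive_span(a, b) == reported_length


# Bracketed header coordinates of exon 3 versus the reported length [123].
print(coords_match_length(50063, 49869, 123))  # False: the span is 195 bases

# BLAST coordinates of the same exon on the scaffold.
print(coords_match_length(49991, 49869, 123))  # True: the span is 123 bases
```

Running such a check over every exon in a header would flag cases where the reported coordinates and the reported length disagree.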
This looks like a problem in how MetaEuk produces the header for this case.
Could you please send me the sequences of your contig (CH478315.1) and of the reference protein (72245at7147_8)?
I apologize in advance - it may take a while - I am on maternity leave.
I managed to reproduce this on a dummy example and fix it (as of commit f32e8d). Please let me know if it still gives you trouble. Thank you for the feedback!
Hi,
Could you have a look into this?
Many thanks!
(I’m using MetaEuk version e7e2d95)