Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gtf2gff3.pl doesn't work with recent versions of Ensembl #35

Open
keiranmraine opened this issue Nov 20, 2013 · 3 comments
Open

gtf2gff3.pl doesn't work with recent versions of Ensembl #35

keiranmraine opened this issue Nov 20, 2013 · 3 comments
Assignees

Comments

@keiranmraine
Copy link
Contributor

Hi,

I've found that some transcripts don't get converted correctly when using GTF files downloaded from Ensembl since they started attaching the correct bio-type to the genes/transcripts.

I have a set of files I can email to an appropriate account, details of them are below.

In the examples e58 versions result in correct protein coding records that display in GBrowse fine. Under e71+ BRAF (plus many others) doesn't display under GBrowse if you choose a source of protein_coding . I think this is because the same ENSG can link to protein_coding and, in this example, nonsense_mediated_decay and retained_intron. In this case the gene record is tagged as retained_intron.

I think the solution is to have a gene record for each but I'm not too sure if this is possible as this will result in 3 gene entries for the same ENSG (but each would be for a different source and potentially a different length).

Hope I've provided sufficient info to get this looked at quickly as it's preventing us from upgrading our Ensembl version in the browser.

Kind regards,

Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute

BRAF_e58.gtf/gff3 are the GTF records and corresponding gff3 generated using the default cfg found on sequenceontology.org.

BRAF_e71.gtf/gff3 as above.

grep_for_ENST00000496384.txt is the result of grepping the two gff3 files for what the filename describes... this suggest this bit is fine

grep_for_ID=ENSG00000157764.txt is the result of grepping the two gff3 files for what the filename describes...in e71 the gene has been marked as a retained intron.

@ghost ghost assigned scottcain Nov 27, 2013
@scottcain
Copy link
Member

Hi Keiran,

There are actually two different versions of this script--one that is part of the GBrowse distribution and one that is hosted at the Sequence Ontology. I'm working on figuring out which one is "best" and where it will be housed going forward; in that process, I'll address this bug.

Scott

@keiranmraine
Copy link
Contributor Author

Hi Scott,

I've had a little time to have a look and I think the following will make it work for ensembl data where 'gene_biotype' has been defined (and so work with other incarnations of gtf).

diff --git a/bin/gtf2gff3.pl b/bin/gtf2gff3.pl
index 1058960..663afcf 100755
--- a/bin/gtf2gff3.pl
+++ b/bin/gtf2gff3.pl
@@ -154,6 +154,11 @@ sub parse_gtf {

                $feature_type = $INPUT_FEATURE_MAP->{$feature_type};

+    if(exists $attributes->{'gene_biotype'}) {
+      $source = $attributes->{'gene_biotype'}->[0];
+      delete $attributes->{'gene_biotype'};
+    }
+
                #Note here that we're mapping between GTF/GFF and GFF3
                #nomeclature for the hash keys
                my $feature = {seq_id     => $seqname,

It would be ideal if the 'transcript biotype' could be appended to the attributes output in the gff3 but that's a nicety.

Regards,
Keiran

@keiranmraine
Copy link
Contributor Author

There's a typo too:

@@ -1038,7 +1043,7 @@ sub validate_and_build_gene {

                print STDERR "ERROR: seq_id conflict: " .
                  "validate_and_build_gene\n" if $seq_id ne $trnsc->{seq_id};
-               print STDERR "ERROR: sourc`e conflict: " .
+               print STDERR "ERROR: source conflict: " .
                  "validate_and_build_gene\n" if $source ne $trnsc->{source};
                print STDERR "ERROR: strand conflict: " .
                  "validate_and_build_gene\n" if $strand ne $trnsc->{strand};

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants