bcftools csq - GFF Format #1078

Open
fbemm opened this issue Sep 3, 2019 · 9 comments

@fbemm

fbemm commented Sep 3, 2019

Hi,

could someone explain to me why the feature type of a GFF line is not taken from the third GFF field, and why bcftools csq instead expects each gene and transcript ID to carry a prefix (e.g., gene: or transcript:)? This inflates GFFs with redundant information and makes IDs longer than they need to be. I guess there is a rationale that I just don't get at the moment.

Thx,
Felix


@pd3
Member

pd3 commented Sep 3, 2019

GFFs provided by Ensembl use this convention: ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/

@fbemm
Author

fbemm commented Sep 3, 2019

I know that, but almost no other annotation (or annotation tool) produces such a format. Plus, the prefixes are redundant with the feature column of the GFF format specification. The feature column can probably be considered more stable in its definition than a prefix on the ID attribute. I can provide a patch if this is considered a better way of identifying GFF_TSCRIPT_LINE and GFF_GENE_LINE.

@pd3
Member

pd3 commented Sep 3, 2019

I am open to changing this as long as it continues to work with Ensembl files. A more general (and also easier) solution might be to provide a new script, gff2gff, to convert between the various flavors of GFF files.

@fbemm
Author

fbemm commented Sep 3, 2019

The latter might be a short-term fix, but in the end one either has to remove gene: or transcript: from the annotated VCF again or live with the inflated IDs. I just checked the human GFF3:

A test for GFF_GENE_LINE would need to return TRUE if the third field of an (Ensembl) GFF is one of:

  • gene
  • ncRNA_gene
  • pseudogene

grep "ID=gene:" Homo_sapiens.GRCh38.97.chromosome.1.gff3 | cut -f3 | sort | uniq -c

A test for GFF_TSCRIPT_LINE would need to return TRUE if the third field of an (Ensembl) GFF is one of:

  • lnc_RNA
  • miRNA
  • mRNA
  • ncRNA
  • pseudogenic_transcript
  • rRNA
  • scRNA
  • snoRNA
  • snRNA
  • unconfirmed_transcript

Best to stay in sync with SequenceOntology (which Ensembl promotes):

Gene --> http://www.sequenceontology.org/browser/current_release/term/SO:0000704
Transcript --> http://www.sequenceontology.org/browser/current_release/term/SO:0000673
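The feature-column test proposed above could be sketched roughly as follows. This is only an illustration in Python, not bcftools code: the function name `classify_gff_line` and the example IDs are hypothetical, and the two type sets are copied from the lists in this comment (Ensembl human GFF3, release 97), so they may be incomplete for other annotations.

```python
# Feature types taken from the lists above; other GFF flavors may use more.
GENE_TYPES = {"gene", "ncRNA_gene", "pseudogene"}
TRANSCRIPT_TYPES = {
    "lnc_RNA", "miRNA", "mRNA", "ncRNA", "pseudogenic_transcript",
    "rRNA", "scRNA", "snoRNA", "snRNA", "unconfirmed_transcript",
}


def classify_gff_line(line):
    """Return 'gene', 'transcript', or None, looking only at column 3."""
    if line.startswith("#"):
        return None  # header or directive line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None  # not a complete GFF3 feature line
    feature = fields[2]
    if feature in GENE_TYPES:
        return "gene"
    if feature in TRANSCRIPT_TYPES:
        return "transcript"
    return None


# Hypothetical lines without the "gene:" / "transcript:" ID prefixes.
print(classify_gff_line("1\tsrc\tncRNA_gene\t1\t100\t.\t+\t.\tID=G1"))
print(classify_gff_line("1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=T1;Parent=G1"))
```

With this approach the ID attribute needs no prefix, because the line type is decided before the attributes are parsed.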

@brentp

brentp commented Sep 6, 2021

Hi, it would be nice if the prefixes ("transcript:", "gene:", etc.) were optional.
This is causing problems in otherwise reasonable GFFs, e.g. marbl/CHM13#31

@pd3
Member

pd3 commented Sep 7, 2021

There are too many possible variations a GFF can have, and I don't want to burden bcftools csq with that complexity. I will accept a pull request that extends the https://github.com/samtools/bcftools/blob/develop/misc/gff2gff.py script to add the prefixes when missing.
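For anyone considering such a pull request, here is a rough sketch of what "adding the prefixes when missing" could look like. It is not taken from gff2gff.py: the function names are hypothetical, the gene and transcript type sets are the ones listed earlier in this thread, and the set of child feature types (`exon`, `CDS`, `five_prime_UTR`, `three_prime_UTR`) is an assumption about which lines need a `Parent=transcript:` prefix.

```python
GENE_TYPES = {"gene", "ncRNA_gene", "pseudogene"}
TRANSCRIPT_TYPES = {
    "lnc_RNA", "miRNA", "mRNA", "ncRNA", "pseudogenic_transcript",
    "rRNA", "scRNA", "snoRNA", "snRNA", "unconfirmed_transcript",
}
# Assumption: features whose Parent attribute points at a transcript.
CHILD_TYPES = {"exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


def _with_prefix(value, prefix):
    """Prefix every comma-separated ID in value unless it already has it."""
    ids = value.split(",")
    return ",".join(i if i.startswith(prefix) else prefix + i for i in ids)


def add_ensembl_prefixes(line):
    """Return the GFF3 line with Ensembl-style ID/Parent prefixes added."""
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line  # leave malformed lines untouched
    feature = fields[2]
    attrs = []
    for attr in fields[8].split(";"):
        key, sep, value = attr.partition("=")
        if key == "ID" and feature in GENE_TYPES:
            value = _with_prefix(value, "gene:")
        elif key == "ID" and feature in TRANSCRIPT_TYPES:
            value = _with_prefix(value, "transcript:")
        elif key == "Parent" and feature in TRANSCRIPT_TYPES:
            value = _with_prefix(value, "gene:")
        elif key == "Parent" and feature in CHILD_TYPES:
            value = _with_prefix(value, "transcript:")
        attrs.append(key + sep + value)
    fields[8] = ";".join(attrs)
    return "\t".join(fields) + newline


# Hypothetical transcript line; IDs are made up for illustration.
print(add_ensembl_prefixes("chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=T1;Parent=G1"))
```

The function leaves already-prefixed IDs alone, so running it on an Ensembl file should be a no-op.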

@brentp

brentp commented Sep 15, 2021

Petr, thanks for the reply. I'll look into making a PR for the GFF.
The latest error is:

Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).

Do you have a recommendation for this? There are no phase annotations in the GFF.
Thanks for any ideas.
-Brent

@dKlee99

dKlee99 commented Sep 17, 2021

@brentp Hi, any updates on this? I'm experiencing the same issue.

Best,
DK

@pd3
Member

pd3 commented Sep 17, 2021

Phase is the 8th column of a GFF file, see https://www.ncbi.nlm.nih.gov/datasets/docs/reference-docs/file-formats/about-ncbi-gff3.
The program detected an inconsistency between the expected and observed phase (frame).
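For readers hitting the same error, here is a minimal sketch of how phase values could be derived for a transcript's CDS segments, following the GFF3 definition of phase (the number of bases to skip from the start of a segment to reach the next codon). It assumes the first segment starts on a codon boundary and that segments do not overlap; the function name and coordinates are hypothetical, and this is not the check that bcftools csq itself performs.

```python
def expected_phases(cds_segments, strand):
    """Compute GFF3 phase for each CDS segment of one transcript.

    cds_segments: list of (start, end) tuples, 1-based inclusive.
    strand: "+" or "-".
    Returns (start, end, phase) tuples in transcription order.
    """
    # On the minus strand, translation starts at the highest coordinate.
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    result = []
    consumed = 0  # coding bases seen in earlier segments
    for start, end in ordered:
        # Bases still needed to finish the codon left open so far.
        phase = (3 - consumed % 3) % 3
        result.append((start, end, phase))
        consumed += end - start + 1
    return result


# Hypothetical two-segment CDS on the plus strand.
for start, end, phase in expected_phases([(1, 10), (21, 30)], "+"):
    print(start, end, phase)
```

Comparing these values with column 8 of the CDS lines shows which segment is out of frame. If column 8 is `.` on CDS lines, values computed this way could be written back, provided the assumptions above hold for the annotation.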
