bcftools csq - GFF Format #1078

fbemm · 2019-09-03T13:20:05Z

Hi,

could someone explain to me why the feature type for a line in a GFF is not taken from the third GFF field, and why bcftools csq instead expects each gene and transcript ID to carry a prefix (e.g., gene: or transcript:)? This inflates GFFs considerably with redundant information and introduces IDs that are longer than they have to be. I guess there is a rationale that I just don't get at the moment.

Thx,
Felix

pd3 · 2019-09-03T13:52:02Z

GFFs provided by Ensembl use this convention: ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/

fbemm · 2019-09-03T14:24:40Z

That I know, but almost no other annotation (tool) produces such a format. Plus, putting the prefixes there is redundant with the feature column of the GFF format specification. The feature column can probably be considered more stable in its definition than a prefix of the attribute ID field. I can provide a patch if this is considered a better way of identifying GFF_TSCRIPT_LINE and GFF_GENE_LINE.

pd3 · 2019-09-03T14:30:59Z

I am open to changing this as long as it continues working with Ensembl files. A more general (and also easier) solution might be to provide a new script, gff2gff, to convert between the various flavors of GFF files.

fbemm · 2019-09-03T14:45:32Z

The latter might be a short-term fix, but in the end one has to remove gene: or transcript: from the annotated VCF again or live with the inflated IDs. I just checked the human GFF3:

Testing for a GFF_GENE_LINE would need to be TRUE if the third field of an (Ensembl) GFF contains:

gene
ncRNA_gene
pseudogene

grep "ID=gene:" Homo_sapiens.GRCh38.97.chromosome.1.gff3 | cut -f3 | sort | uniq -c

Testing for a GFF_TSCRIPT_LINE would need to be TRUE if the third field of an (Ensembl) GFF contains:

lnc_RNA
miRNA
mRNA
ncRNA
pseudogenic_transcript
rRNA
scRNA
snoRNA
snRNA
unconfirmed_transcript

Best to stay in sync with SequenceOntology (which Ensembl promotes):

Gene --> http://www.sequenceontology.org/browser/current_release/term/SO:0000704
Transcript --> http://www.sequenceontology.org/browser/current_release/term/SO:0000673
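The type-based test proposed above can be sketched in a few lines of Python. This is only an illustration, not bcftools code: the function name `classify_gff_line` is hypothetical, and the two sets contain exactly the feature types listed above for the Ensembl human GFF3, so other annotations may need additional Sequence Ontology terms.

```python
# Feature types copied from the lists above (Ensembl human GFF3).
# Other annotation sources may use further SO terms; extend as needed.
GENE_TYPES = {"gene", "ncRNA_gene", "pseudogene"}
TRANSCRIPT_TYPES = {
    "lnc_RNA", "miRNA", "mRNA", "ncRNA", "pseudogenic_transcript",
    "rRNA", "scRNA", "snoRNA", "snRNA", "unconfirmed_transcript",
}


def classify_gff_line(line):
    """Return 'gene', 'transcript', or None, using only the third GFF column.

    Hypothetical helper for illustration; it does not inspect the ID prefix.
    """
    if line.startswith("#") or not line.strip():
        return None  # comment, directive, or empty line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None  # not a complete 9-column GFF record
    feature_type = fields[2]
    if feature_type in GENE_TYPES:
        return "gene"
    if feature_type in TRANSCRIPT_TYPES:
        return "transcript"
    return None
```

With this approach a record such as `ID=T1;Parent=G1` with type `mRNA` would be recognized as a transcript without needing a `transcript:` prefix.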

brentp · 2021-09-06T06:51:41Z

Hi, it would be nice if the prefixes ("transcript:", "gene:", etc.) were optional.
This is causing problems in otherwise reasonable GFFs, e.g. marbl/CHM13#31

pd3 · 2021-09-07T14:33:39Z

There are too many possible variations a GFF can have, and I don't want to burden bcftools csq with that complexity. I will accept a pull request that extends the https://github.com/samtools/bcftools/blob/develop/misc/gff2gff.py script and adds the prefixes when missing.
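A minimal sketch of what "adding the prefixes when missing" could look like follows. It is not the actual gff2gff.py and makes several assumptions: the function names are hypothetical, gene and transcript records are recognized by the feature types listed earlier in this thread, transcripts are assumed to have genes as parents, and every other feature with a Parent attribute (exon, CDS, UTR) is assumed to point at a transcript, as in Ensembl files.

```python
GENE_TYPES = {"gene", "ncRNA_gene", "pseudogene"}
TRANSCRIPT_TYPES = {
    "lnc_RNA", "miRNA", "mRNA", "ncRNA", "pseudogenic_transcript",
    "rRNA", "scRNA", "snoRNA", "snRNA", "unconfirmed_transcript",
}


def add_prefix(value, prefix):
    """Prepend prefix to each comma-separated ID unless already present."""
    ids = value.split(",")
    return ",".join(i if i.startswith(prefix) else prefix + i for i in ids)


def add_ensembl_prefixes(line):
    """Return the GFF line with Ensembl-style 'gene:'/'transcript:' prefixes.

    Hypothetical helper; comments and malformed lines are passed through.
    """
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line
    ftype = fields[2]
    attrs = []
    for attr in fields[8].split(";"):
        key, sep, value = attr.partition("=")
        if sep:
            if ftype in GENE_TYPES:
                if key == "ID":
                    value = add_prefix(value, "gene:")
            elif ftype in TRANSCRIPT_TYPES:
                if key == "ID":
                    value = add_prefix(value, "transcript:")
                elif key == "Parent":
                    value = add_prefix(value, "gene:")
            elif key == "Parent":
                # Assumption: exon/CDS/UTR parents are transcripts.
                value = add_prefix(value, "transcript:")
        attrs.append(key + sep + value)
    fields[8] = ";".join(attrs)
    return "\t".join(fields) + newline
```

Applied line by line to an input file, this leaves records that already carry the prefixes unchanged, so Ensembl files should pass through as they are.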

brentp · 2021-09-15T10:14:55Z

Petr, thanks for the reply. I'll look into making a PR for the GFF,
the latest error is:

Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).

do you have a recommendation for this? There are no phase annotations in the GFF.
thanks for any ideas.
-Brent

dKlee99 · 2021-09-17T04:55:20Z

> Petr, thanks for the reply. I'll look into making a PR for the GFF,
> the latest error is:
> Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).
> do you have a recommendation for this? There are no phase annotations in the GFF.
> thanks for any ideas.
> -Brent

@brentp Hi, any updates on this? I'm experiencing the same issue.

Best,
DK

pd3 · 2021-09-17T09:34:17Z

Phase is the 8th column of the GFF, see https://www.ncbi.nlm.nih.gov/datasets/docs/reference-docs/file-formats/about-ncbi-gff3
The program detected an inconsistency between the expected and the observed phase (frame).
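To illustrate what a consistent phase column looks like, here is a small sketch based on the GFF3 definition of phase (the number of bases to skip at the start of a CDS segment to reach the first base of the next codon). It is not the check bcftools csq performs internally, and the function name `expected_phases` is hypothetical. The CDS lengths must be given in transcription order, i.e. reversed genomic order for minus-strand transcripts.

```python
def expected_phases(cds_lengths, first_phase=0):
    """Return the phase each CDS segment should carry under the GFF3 definition.

    cds_lengths: lengths of the CDS segments in transcription order (5' to 3').
    first_phase: phase of the first segment, 0 for a complete coding sequence.
    """
    phases = []
    phase = first_phase
    for length in cds_lengths:
        phases.append(phase)
        # Bases left over after skipping `phase` bases and reading whole
        # codons determine how many bases the next segment must contribute.
        phase = (3 - (length - phase) % 3) % 3
    return phases
```

For example, `expected_phases([100, 50])` gives `[0, 2]`: the first segment ends one base into a codon, so the second segment has to skip two bases before its first complete codon. Comparing such expected values with column 8 of each CDS record is one way to locate the records that trigger the error.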

pd3 mentioned this issue Apr 29, 2020

[FR] overlapping CDS #1208

Closed

pd3 mentioned this issue Dec 14, 2021

csq chokes on GFF with invalid CDS phase information #1628

Closed
