-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suggesting <NON_REF> as the unspecified alt allele #380
Conversation
lbergelson
commented
Feb 6, 2019
- Adding <NON_REF> to the spec and recommending its use over the <*> allele which is easily confused with *.
- closes <*> allele vs <NON_REF> #352
I don't know why circle-ci ran on this and failed... Strange. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general question: I have noticed that GATK outputs VCF v4.2 but this change is only applied to v4.3, because that is when the gVCF section was added. How does this affect you @lbergelson ? Would GATK migrate to v4.3 if this was added?
VCFv4.3.tex
Outdated
@@ -1347,19 +1347,20 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)} | |||
In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format\footnote{\url{https://help.basespace.illumina.com/articles/descriptive/gvcf-files/}}. | |||
The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele. | |||
Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). | |||
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele. | |||
Either of the two symbolic alleles $<$*$>$ or $<$NON\_REF$>$ may be used to represent this unspecified alternate allele. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we are recommending the usage of NON_REF, maybe it should be listed first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I've switched them.
@cyenyxe That's a good question about 4.3 vs 4.2. Gatk doesn't output 4.3 due to technical problems in htsjdk. We're planning on fixing it but it hasn't happened yet. This PR isn't a deciding factor either way. I think we could probably backport port the gvcf blurb to the 4.2 spec since it's really more of a clarification about a specific use of VCF than a change in the spec and clearly gvcfs have existed before 4.2. If you think we should include this note in the 4.2 spec I can move that back as well. |
I could be fine with backporting the gVCF part, as it doesn't break compatibility and would make GATK VCFs (not a small number) fully compliant against the spec. What do you think @pd3? |
@lh3 came up with the idea of the space-holder for an unseen allele in samtools/bcftools and when GATK started using it as well, we agreed on the symbolic star allele |
The problem with <*> is that it is an extremely confusing notation (see comments from non-GATK developers in #352). We can always create a brief poll to understand how people are using them, and which one would they prefer in the long term. But the truth is that there are many VCF files out there already using NON_REF. |
I agree the notation is not well described in the specification, but I disagree it is extremely confusing. The correct way forward for GATK is to switch to producing |
A few thoughts:
Here's an idea: how about instead of extending the spec to officially recognize |
@tfenne, on my hard drive lives the file samtools/bam2bcf.c with a timestamp from Dec 20 2011 with the following line
and very likely it is not the first version of it. My points are following:
Although I personally would have preferred to continue using the As for your suggestion, the use of the symbolic non-ref allele is very specific and I am not convinced that it is worthwhile to complicate further the |
My take on this, as an independent non-VCF person, is that if things don't break the specification but are in common usage then it doesn't harm to document them. However we need to be clear to distinguish normative specification text and explanatory / clarificatory text. If the scale of the problem is simply too vast (eg the TLEN problem of 5' to 3' vs left-most to right-most is dominated by code doing the wrong thing, because the spec changed from one thing to another and hardly anyone updated their code), then documenting both is a pragmatic alternative but not an ideal scenario. I'd be wary of blessing one over the other though - if you go down the road of permitting both then you are just accepting both will exist forever. If you want to get only method and feel it is achievable, then don't label both as permitted - just document the unsatisfactory situation and emphasise the recommended practice is the current behaviour. |
I prefer
Bcftools was separated from samtools in samtools v0.1.9, released on 2010-10-27. Bcftools had been using Personally, I am ok to add
If we add |
* Adding <NON_REF> to the spec and recommending its use over the <*> allele which is easily confused with *. * closes samtools#352
After several offline discussions, I have updated this to acknowledge the existence of <NON_REF>, but keep <*> as the preferred allele. |
VCFv4.3.tex
Outdated
@@ -1362,7 +1362,7 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)} | |||
In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format\footnote{\url{https://help.basespace.illumina.com/articles/descriptive/gvcf-files/}}. | |||
The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele. | |||
Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). | |||
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele. | |||
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele. This is preferred over the symbolic allele $<$NON\_REF$>$, used by the Genome Analysis Toolkit (GATK). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyenyxe This seems like the right solution, the community has definitely overruled my initial suggestion to promote <NON_REF>
over <*>
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele. This is preferred over the symbolic allele $<$NON\_REF$>$, used by the Genome Analysis Toolkit (GATK). | |
A symbolic allele $<$*$>$ is used to represent this unspecified alternate allele. | |
This is preferred over the equivalent allele $<$NON\_REF$>$. |
I think we have a redundant alternate
in there, and we should split multiple sentences on multiple lines to match the formatting.
I think it's good to specify that they have the same meaning, and I don't know if it's necessary to include the GATK by name here since we don't usually refer to specific implementations.
It might also be useful to highlight in some way the term unspecified alternate allele
if that's how we intend it to be pronounced, since my major complaint with <*>
is that it's pronunciation causes confusion.
* Update CRAM spec section on substitution matrix and codes. * Respond to review comments. * CRAM Slice and Container ref seq IDs must match (samtools#401) * Code review part 2. * Fix minor typo in the predecessors of BCF2 (samtools#427) * layed -> laid * Update MAINTAINERS.md (samtools#432) Proposal to add Rasko Leinonen as refget maintainer. * add jmmut to MAINTAINERS.md and move Cristina to "Past Members" * change order of maintainers * Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436) INFO/END (when present) provides the size of the interval that the variant is located in, along with the CHROM and POS fields. This is also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field. The implications of INFO/END have not previously been clear. In the absence of clear documentation, some SV tools have been using INFO/END fields for their own semi-related purposes (using INFO/CHR2:INFO/END as the other side's position in an interchromosomal rearrangement), leading to broken .csi indexes and region queries that don't work. Fixes samtools#425. * Allow C and Java native text spellings of NaN and infinities (samtools#409) C's printf doesn't output mixed case, while Java's Double.valueOf and Double.toString parse/output only `Infinity`, not `Inf`. Rather than requiring special-case code for both input and output in both languages, relax the VCF specification to allow NAN/INF/INFINITY case-insensitively. (Add "IEEE-754" to be specific and to improve the line breaks.) * Adding a note about <NON_REF> (samtools#380) * Update PDFs (CRAM and VCF additions; others cosmetic) CRAMv3 PRs samtools#401 and samtools#412. VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END). All: Minor typo and whitespace formatting fixes. * Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385) Includes barebones authorizationCode Oauth2 flow, which should aid/inform code generation. Uses int64 with minimum 0, unsure if that is really an uint64 though. * Codify the existing policy of generally squashing PRs (PR samtools#444) * Permit AP_Delta in multi-ref slices. This means AP_delta can become negative. I have validated this decodes fine in both htsjdk and htslib. This is because AP is ITF8 and hence signed, like all other integers, so it would need explicit code to forbid this (which obviously isn't in the implementations). Hence the limitation is primarily one of an over-zealous specification. The impact of this is for position-sorted multi-ref slices AP can legally be stored efficiently. Also clarified fields in the container compression header when in multi-ref mode. Fixes samtools#431