Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggesting <NON_REF> as the unspecified alt allele #380

Merged
merged 6 commits into from
Aug 22, 2019

Conversation

lbergelson
Copy link
Member

@lbergelson lbergelson requested review from cyenyxe and pd3 February 6, 2019 20:59
@lbergelson lbergelson added the vcf label Feb 6, 2019
@lbergelson
Copy link
Member Author

I don't know why circle-ci ran on this and failed... Strange.

Copy link
Member

@cyenyxe cyenyxe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general question: I have noticed that GATK outputs VCF v4.2 but this change is only applied to v4.3, because that is when the gVCF section was added. How does this affect you @lbergelson ? Would GATK migrate to v4.3 if this was added?

VCFv4.3.tex Outdated
@@ -1347,19 +1347,20 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format\footnote{\url{https://help.basespace.illumina.com/articles/descriptive/gvcf-files/}}.
The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele.
Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise).
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele.
Either of the two symbolic alleles $<$*$>$ or $<$NON\_REF$>$ may be used to represent this unspecified alternate allele.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are recommending the usage of NON_REF, maybe it should be listed first?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've switched them.

@lbergelson
Copy link
Member Author

@cyenyxe That's a good question about 4.3 vs 4.2. Gatk doesn't output 4.3 due to technical problems in htsjdk. We're planning on fixing it but it hasn't happened yet. This PR isn't a deciding factor either way.

I think we could probably backport port the gvcf blurb to the 4.2 spec since it's really more of a clarification about a specific use of VCF than a change in the spec and clearly gvcfs have existed before 4.2.

If you think we should include this note in the 4.2 spec I can move that back as well.

@cyenyxe
Copy link
Member

cyenyxe commented Feb 7, 2019

I could be fine with backporting the gVCF part, as it doesn't break compatibility and would make GATK VCFs (not a small number) fully compliant against the spec. What do you think @pd3?

@pd3
Copy link
Member

pd3 commented Feb 7, 2019

@lh3 came up with the idea of the space-holder for an unseen allele in samtools/bcftools and when GATK started using it as well, we agreed on the symbolic star allele <*>. Samtools/bcftools swiftly changed from its original X and <X>. Atlhough GATK developers were active in drafting the specification and agreed on <*> years ago, somehow it still remains an issue. I don't understand why and in my opinion allowing <NON_REF> now would set a very bad precedent.

@cyenyxe
Copy link
Member

cyenyxe commented Feb 7, 2019

The problem with <*> is that it is an extremely confusing notation (see comments from non-GATK developers in #352). We can always create a brief poll to understand how people are using them, and which one would they prefer in the long term.

But the truth is that there are many VCF files out there already using NON_REF.

@pd3
Copy link
Member

pd3 commented Feb 7, 2019

I agree the notation is not well described in the specification, but I disagree it is extremely confusing.

The correct way forward for GATK is to switch to producing <*> and silently continue supporting also <NON_REF> for backward compatibility, rather than enforce all other tools to start supporting both. It is programmatically very easy to do and if both alleles should become a standard as in this proposal, it would have to support both anyway!

@tfenne
Copy link
Member

tfenne commented Feb 7, 2019

A few thoughts:

  1. @pd3 I have personally interacted with enough folks confused by * vs <*> that I find it hard to agree that it's not confusing.
  2. I'm not sure the historical argument holds water. As @lbergelson points out in <*> allele vs <NON_REF>  #352 at least as far as he could see GATK first started using <NON_REF> in 2013, before any of the other tools mentioned used an allele for this purpose.
  3. To be fair though, there was a lot of turnover on the GATK team IIRC in the 2013-2015 timeframe, so it's possible that someone agreed to the change to <*> and that agreement got lost in the churn. I certainly don't think anyone was acting in bad faith by not changing to <*> at the time. Is there any chance that discussion happened in email/git issues vs. phone calls?
  4. As a user of VCF I feel like VCF could adopt the PERL slogan of "there's more than one way to do it" - so many things in VCF are encoded differently by different callers, having two ways of doing it seems perfectly normal.

Here's an idea: how about instead of extending the spec to officially recognize <*> and <NON_REF>, having a way to use the ##ALT header line to explicitly declare in a VCF what if any allele is used for this purpose by adding a NONREF type? And then perhaps stating that if not present <*> and <NON_REF> are used by existing tools and may be expected?

@pd3
Copy link
Member

pd3 commented Feb 7, 2019

@tfenne, on my hard drive lives the file samtools/bam2bcf.c with a timestamp from Dec 20 2011 with the following line

kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s);

and very likely it is not the first version of it.

My points are following:

  1. The current GATK's <NON_REF> allele does not break the VCF specification, therefore you can continue using it.
  2. We tried to standardize this years ago and the consensus was that the symbolic star allele<*> would become reserved in the specification, not <X> and not <NON_REF>.
  3. If people find <*> vs * confusing, this is still only a problem of documenting it better.

Although I personally would have preferred to continue using the <X> notation, we switched to <*> for the sake of supporting the standard. I find it disconcerting that now, years after, the issue is still alive.

As for your suggestion, the use of the symbolic non-ref allele is very specific and I am not convinced that it is worthwhile to complicate further the ##ALT meta-information lines.

@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 7, 2019

My take on this, as an independent non-VCF person, is that if things don't break the specification but are in common usage then it doesn't harm to document them.

However we need to be clear to distinguish normative specification text and explanatory / clarificatory text.

If the scale of the problem is simply too vast (eg the TLEN problem of 5' to 3' vs left-most to right-most is dominated by code doing the wrong thing, because the spec changed from one thing to another and hardly anyone updated their code), then documenting both is a pragmatic alternative but not an ideal scenario. I'd be wary of blessing one over the other though - if you go down the road of permitting both then you are just accepting both will exist forever.

If you want to get only method and feel it is achievable, then don't label both as permitted - just document the unsatisfactory situation and emphasise the recommended practice is the current behaviour.

@lh3
Copy link
Member

lh3 commented Feb 27, 2019

Although I personally would have preferred to continue using the <X> notation

I prefer <X>, too, but I am fine with <*>.

on my hard drive lives the file samtools/bam2bcf.c with a timestamp from Dec 20 2011 with the following line

Bcftools was separated from samtools in samtools v0.1.9, released on 2010-10-27. Bcftools had been using X as the unspecified allele from the very beginning. As I vaguely remember, the VCF spec didn't have the concept of symbolic allele at the time. I could be wrong, though.

Personally, I am ok to add <NON_REF> as an alternative to <*> in principle, but I think <*> should be the recommended encoding in the spec. That was the decision. Also, if we add <NON_REF>, htsjdk/GATK should also support both. @pd3: would supporting <NON_REF> cause much trouble to htslib?

Here's an idea: how about instead of extending the spec to officially recognize <*> and <NON_REF>, having a way to use the ##ALT header line to explicitly declare in a VCF what if any allele is used for this purpose by adding a NONREF type? And then perhaps stating that if not present <*> and <NON_REF> are used by existing tools and may be expected?

If we add <NON_REF>, I much prefer to make them always equivalent. The VCF spec is complex enough. Having a new header line seems too complicated.

* Adding <NON_REF> to the spec and recommending its use over the <*> allele which is easily confused with *.
* closes samtools#352
@hts-specs-bot
Copy link

Changed PDFs as of b3f5e40: VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of e90af6e: VCFv4.3 (diff).

@cyenyxe
Copy link
Member

cyenyxe commented Jun 24, 2019

After several offline discussions, I have updated this to acknowledge the existence of <NON_REF>, but keep <*> as the preferred allele.

VCFv4.3.tex Outdated
@@ -1362,7 +1362,7 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format\footnote{\url{https://help.basespace.illumina.com/articles/descriptive/gvcf-files/}}.
The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele.
Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise).
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele.
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele. This is preferred over the symbolic allele $<$NON\_REF$>$, used by the Genome Analysis Toolkit (GATK).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyenyxe This seems like the right solution, the community has definitely overruled my initial suggestion to promote <NON_REF> over <*>

Suggested change
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele. This is preferred over the symbolic allele $<$NON\_REF$>$, used by the Genome Analysis Toolkit (GATK).
A symbolic allele $<$*$>$ is used to represent this unspecified alternate allele.
This is preferred over the equivalent allele $<$NON\_REF$>$.

I think we have a redundant alternate in there, and we should split multiple sentences on multiple lines to match the formatting.

I think it's good to specify that they have the same meaning, and I don't know if it's necessary to include the GATK by name here since we don't usually refer to specific implementations.

It might also be useful to highlight in some way the term unspecified alternate allele if that's how we intend it to be pronounced, since my major complaint with <*> is that it's pronunciation causes confusion.

@hts-specs-bot
Copy link

Changed PDFs as of 19df07f: VCFv4.3 (diff).

@lbergelson lbergelson merged commit fb7120f into samtools:master Aug 22, 2019
@lbergelson lbergelson deleted the lb_non_ref branch August 22, 2019 15:14
jmarshall added a commit that referenced this pull request Aug 22, 2019
CRAMv3 PRs #401 and #412.
VCFv4.3 PRs #380 (<NON_REF>), #409 (infinity/NaN), and #436 (INFO/END).
All: Minor typo and whitespace formatting fixes.
thefferon added a commit to thefferon/hts-specs that referenced this pull request Oct 9, 2019
* Update CRAM spec section on substitution matrix and codes.

* Respond to review comments.

* CRAM Slice and Container ref seq IDs must match (samtools#401)

* Code review part 2.

* Fix minor typo in the predecessors of BCF2 (samtools#427)

* layed -> laid

* Update MAINTAINERS.md (samtools#432)

Proposal to add Rasko Leinonen as refget maintainer.

* add jmmut to MAINTAINERS.md and move Cristina to "Past Members"

* change order of maintainers

* Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.

* Allow C and Java native text spellings of NaN and infinities (samtools#409)

C's printf doesn't output mixed case, while Java's Double.valueOf and
Double.toString parse/output only `Infinity`, not `Inf`. Rather than
requiring special-case code for both input and output in both languages,
relax the VCF specification to allow NAN/INF/INFINITY case-insensitively.

(Add "IEEE-754" to be specific and to improve the line breaks.)

* Adding a note about <NON_REF> (samtools#380)

* Update PDFs (CRAM and VCF additions; others cosmetic)

CRAMv3 PRs samtools#401 and samtools#412.
VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END).
All: Minor typo and whitespace formatting fixes.

* Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385)

Includes barebones authorizationCode Oauth2 flow, which should aid/inform
code generation.

Uses int64 with minimum 0, unsure if that is really an uint64 though.

* Codify the existing policy of generally squashing PRs (PR samtools#444)

* Permit AP_Delta in multi-ref slices.

This means AP_delta can become negative.  I have validated this
decodes fine in both htsjdk and htslib.  This is because AP is ITF8
and hence signed, like all other integers, so it would need explicit
code to forbid this (which obviously isn't in the implementations).
Hence the limitation is primarily one of an over-zealous specification.

The impact of this is for position-sorted multi-ref slices AP can
legally be stored efficiently.

Also clarified fields in the container compression header when in
multi-ref mode.

Fixes samtools#431
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

<*> allele vs <NON_REF>
7 participants