-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCF spec allows ALTs that are mixes of * and bases, but doesn't define how to interpret them #437
Comments
Yes, the intention was to allow either |
Some related discussion here that I missed previously: #151 |
@lbergelson, @pd3 & @jmmut pinging you guys since you are the currently listed VCF maintainers. Is there a path to making decision here in the near future? It's becoming a little awkward since Octopus uses alleles of My sense is that since this isn't explicitly dis-allowed by the spec and that @dancooke has shown a good use case for alleles that start with If folks are on board with formalizing the usage from octopus I'm happy to put together a PR to the spec. And if that's a non-starter could someone else make a PR to clarify the restrictions on use of |
@tfenne I'm not totally sold on this. It's not explicitly disallowed by the spec, but it was definitely not intended. I see how it's useful for figuring out a partially spanned allele, but it seems like it could add a lot of complication to an already error prone and poorly supported feature. Is the intent to support alleles starting with |
@lbergelson I would argue that the 'symbolic' interpretation adds even more complexity as it adds a completely distinct concept. The base- I think the definition I gave implicitly requires any |
@lbergelson What I really want is for the spec to be 100% clear on what is and isn't supported, and for HTSJDK to be in line with that, and for Octopus to generate VCFs that are spec compliant. I tend to think clarifying the spec to allow one or more |
I prefer to disallow an ALT like |
There is actually a more fatal problem with the symbolic
Both of these examples are actually inconsistent as the first record states the reference (
But this isn't particularly satisfactory as it doesn't make a strong claim about the reference between
Then the output would be
Which is consistent and makes it simple to reconstruct haplotypes as there no need to remember position information between overlapping records. |
The haplotype alignment is: Pos: 1234567890123456789012345
Ref: XXXXXXXXXGTATATAGCGAXXXXX
H1: XXXXXXXXXGTATAT-----XXXXX
H2: XXXXXXXXXG------GCGAXXXXX Your final VCF is wrong. The first line should be: chr1 1000 GTATATA G,GTATAT* ... 2|1 A VCF parser can only detect this error when it comes to the second line. To validate a VCF, you always need to look at multiple lines. Your proposal is not making things simpler. Furthermore, your proposal allows inconsistencies between the two lines. Wrong second lines could look like: chr1 1005 TAGCGA T,**GCA ... 1|2 or chr1 1005 TAGCGA T,****CA ... 1|2 A proper design should try to minimize such errors. The right way to represent this alignment is: chr1 1000 GTATATA G,GTATAT 2|1
chr1 1005 TAGCGA T,* 1|2 It is easy enough to reconstruct the alignment. There are fewer ways to create wrong VCFs. |
Nope. The
Completely disagree. You're permitting duplicate information in one direction but not the other. Not only is this inconsistent, it's also very confusing - It looks like the first record is calling a 1bp deletion, but you don't know that this isn't the case until you see the next record, and the first gives no indication you must do this (unlike the way I proposed).
I think this is a poor argument. For a start, the number of tools generating VCFs de novo is small compared with the number that parse/manipulate them, and I'd strongly argue that the base |
Would also be good to hear @Lenbok's thoughts on this - he added support for base |
Then we are interpreting your It is worth mentioning the freebayes way to encode this VCF. See a more complex example: Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1: XXXXXXXXXGTATAT------XXXXX
H2: XXXXXXXXXG------GCTGAXXXXX The freebayes way is to use CIGAR for a series of interlocking variants: chr1 10 GTATATAGCGA GTATAT,GGCTGA CIGAR=6M5D,1M6D2M1I2M This is the best way to describe both haplotypes at the same time. On this example, your attempt to describe both haplotypes will become more complex. I guess there should be ways with your method, but I can't think of one for now. With the current chr1 10 GTATATA G,GTATAT 2|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1 PS: the major problem with the freebayes method is that it becomes awkward when there are a 6kb deletion on one haplotype and multiple small variants on the other haplotype. How do you deal with that in your representation? The more I think of this problem, the stronger I feel we should forbid |
Another issue to take into account is that the current spec doesn't mention that the record where the overlapping deletion is described needs to appear before the current record, so would this be a legal VCF?
This has less redundancy and looks more straight to the point, but of course makes haplotype reconstruction harder by having to look at later records. As long as the spec is clear and easy to understand I'm ok with any approach. At the very least I think we should address the current wordings:
If we instead adopted the partial spanning notation, it should make clear the meaning of
What I understand is that the purpose of each I'm slightly in favour of just fixing the wording of the current simple approach. |
The definition I'm suggesting is:
Note that this does't only apply to deletions, or upstream records (as currently required), in addition, there is no requirement to introduce a new type of 'symbolic' allele. Using this definition, your example would be reported as:
This has several advantages over the symbolic representation you show:
Just to stress that last point, let's look how you would represent your example without the first deletion:
So not only do you end up completely changing representation (going from symbolic
Consistent with the full example. No duplicate information - and therefore less error prone. I really can't understand how you find this more complex or confusing? You already point out the main reason that the FreeBayes representation isn't viable - it doesn't scale. It's also hard to read a complex cigar string, and some complex substitutions can't be properly represented with a cigar (e.g. micro-inversions). I'd also suggest that it makes matching and annotating variants from variation databases harder since simple variation gets buried into long haplotypes. Consider one more example, a complex substitution and an SNV. The 'FreeBayes' way would be:
There's no way to 'decompose' this (without using
Now the SNV is clear, and can easily be annotated. |
@jmmut I was thinking about this on a flight. Now for my example: Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1: XXXXXXXXXGTATAT------XXXXX
H2: XXXXXXXXXG------GCTGAXXXXX my preference is: chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1 Here
When there are overlapping variations, you anyway need to read multiple records to reconstruct the haplotypes. The key information base |
@lh3 I really don't like this. It's weird to have one rule for one direction and a completely different one for the other. You're suggesting entirely changing the definition of
and we'd be right back to the situation without @jmmut This is at least consistent, but the advantages of base
The only downside (that I can think of) is that |
I believe what I described is just an elaboration of the current approach. It is simple to generate: describe each atomic allele individually. It is also simple to parse: reconstruct the haplotype alignment based on the current record, discard the part of the haplotype beyond the the next record and write a new allele. It is imperfect, but it gets the job done and is compatible with existing ecosystem. Your complex substitution example is actually more about multi-allelic sites. For example:
In bgt, I have long been using
Symbolic allele On three haplotypes, we can use
to indicate genotypes involving two overlapping deletions, if we really want. In practice, it doesn't matter either way. The parser can know the correct haplotypes based on the upstream records. Several people in this thread thought base
At position 15, the parser needs pos 18 to construct the haplotype alignment, and when it comes to pos 18, it needs to memorize the record at pos 15. The is the result of the insertion. Furthermore, the redundancy across records is particularly concerning as it gives us more ways to create wrong records. I designed sam, bam, gfa and bcfv2. Redundancy like this is one of the first things I would strive to remove. |
I strongly oppose any format (re)definition that requires any information that relies in any way on record ordering or for which the variant call for a single record cannot be determined by reading that record in isolation. Common operations that will break such a definition are:
There is definitely room for improvement in defining phasing information but breaking steps that are used in almost every downsteam processing pipeline is something I am strongly opposed to. Somewhat of a pity we didn't get around to fixing this given it was raised this as a problem 5 years ago: |
I agree with @d-cameron. It seems that this discussion is about redefining VCF to incorporate a "graph style" representation of alleles and support breaking large haplotypes into simpler nested loci. If that is the intent, then let's carefully define the semantics AND document them in the specification. e.g. allowing overlapping records in VCF has been a nightmare for tool developers because everyone basically picks the semantics that makes the most sense to them at the time and assumes everyone else in the world should think about it the same way. If we do wish to expand VCF semantics, then let's consider that most attempts at doing so have redefine either
Both of those redefinitions are incompatible with other interpretations of the standard. e.g. why can't '0' actually mean reference? e.g. GT of '1/.' has been used to indicate a "half-call" where ALT 1 is present at a diploid locus and a second allele cannot be determined. This is distinct from using the non-ref allele <*> because the uncalled allele may in fact be reference (just at an insufficient quality level to call). @dancooke : Just as a data point, I understand that you consider the multi-allelic VCF representation as "unscalable", but I would probably be using Octopus now if it supported that VCF syntax. Having to re-do my entire clinical analysis pipeline for yet another experimental VCF dialect is a non-starter. |
Everyone is complaining about this, but no one finds a satisfactory way to eliminate overlapping records. The best attempt is freebayes' CIGAR, but that is close to being useless in the presence of long deletions.
In my view, the solution to overlapping records is not to eliminate them, but to impose a consistent representation.
Let's use my example again:
Since the introduction of
My impression is most tools use the same way? Are you suggesting replacing EDIT: hmm... Rereading @kevinjacobs-progenity's comment, I realize that he only pointed out the problems with both |
@kevinjacobs-progenity I agree with most of this. I ended up using Regarding 'multi-allelic'/MNV calls. Scalability is only half the reason; more fundamentally, I think that variant calls should be interpreted as mutation event calls. So If I report @lh3 There's actually no need to 'look ahead' - you can just keep a record of which reference bases are yet to be determined and fill them in when you see the next record. I agree it's generally a good principle to reduce redundancy, but I think in this case it's worth the relatively small cost in order to avoid inconsistency and potentially reduce parsing errors. |
At pos 10, neither To close the loop: at one point, I was thinking to use something like the following at pos 10 (note that this is slightly different from the example above)
The problem is we normally assume the 2bp del on hap1 is placed at pos 11 (this is another problem with multi-allelic VCF), but here we want to put the 2bp del at pos 15, so we would need a CIGAR field to distinguish the two cases. The price to achieve technical correctness is too high. I wouldn't go this way. Base |
I went back to bgt and noticed that bgt converts my example to the following VCF:
Here
which would be technically correct under the new |
Coming from the SV world, I don't really appreciate why overlapping records are so problematic for indels. Is it because downstream analysis pipeline break on unphased overlapping records, because left-alignment of phased records results in an self-contradictory haplotype, because caller don't reliably report phasing blocks, because VCF phasing doesn't work in the presence of aneuploidy (this is a huge issue for somatic SVs), because the non-uniqueness of variant and haplotype representation makes cohort analysis and variant comparison too complex, because the 'optimal' representation (e.g. 2 SNVs vs 1 MNV call) is different for different downsteam analyses, or something else? My naive assumption has been that SNV/indel callers determing haplotypes, and reporting the left-aligned set of variants corresponding to the set of SNVs and indels in the haplotype-reference alignment would give a representation without any of the above issues. Why doesn't this work? |
@d-cameron It's not really indels specifically. There are three issues in my mind:
|
I see three viable solutions:
Note that the alignments for the examples are the same as first given by @lh3 here:
My preference is option 2 or 3, with the |
@dancooke, can you edit your last comment to show the underlying alignments so that people can comment on your proposal? In general, this problem is too complex and I don't believe it can be solved within VCF. This is in essence an attempt to represent a graph of arbitrary complexity. |
@pd3 Are you suggesting that under your proposed re-definition of |
Htsbox has been using
This changes the semantics of VCF. A record in today's VCF effectively describes the interval maximally between this and the next record. With your change, a record will describe the interval of the current record. This is a subtle but important difference. Your proposal sounds better in theory, but will have problems in practice. Suppose we have a 100kb deletion in a population. If Earlier I said "[redefining |
the spec says "overlapping deletion", did you mean change to "overlapping allele" or remain as overlapping deletion? |
@jmmut Yes, I meant "overlapping allele". Can't think of any harm it might cause, happy to be proven wrong though. |
@pd3 Then I'm lost too because this is the very issue that we were trying to avoid by staying with the "0 means not changing the REF allele at this record". This sounds to me doing both options I described, and having the disadvantages of both. (Do the rest think it's ok to change the spec to both "0 means not changing the REF allele at this record" and @lh3 I'm sorry but I don't understand what problem this scenario represents. Could you elaborate, please? Would this still apply if we did both changes?
Also, with the "0 means not changing the REF allele at this record" interpretation, I don't understand (just as @dancooke) if these are semantically equivalent to @pd3 or not, and if yes, why |
@jmmut I believe that @lh3 is describing the following situation. Suppose we have (I'm going to use the most explicit base
Now suppose that we cannot phase the variants at
Which accurately describes the situation. The solution for symbolic
|
Ok, that makes sense, thanks. But if you don't know the phase, why not using the unphased separator
As I understand it, PS is for grouping variants under the same phase when you don't know the absolute phase of that group, right? but s1 can be absolutely phased and there are no phase groups for s2 (in these variants). I'm not aware of any restriction mixing Also, I still don't see why is this an argument against |
@jmmut You could do that for this example. More generally, I like to use I don't think this is an argument against symbolic |
Here's a summary, how I see it
Then, VCF v4.2-3 could either
I would advise going with option 1, but having a new version of VCF ready that re-defines |
I was in a full-day retreat. There are three definitions for the
@jmmut let's use your example:
s2's genotype at pos 10 is in fact not determined. It could be
It is ok. Definition 3 also works. Between 2 and 3, def 2 is more convenient as we can know the REF allele count up to the next record. More importantly, as the current spec has already adopted 2, I see little reason to revoke our earlier decision. Going forward, we should just follow our current practice: clarify |
@lh3 The 'problem' you describe is exactly as I thought, but I explained quite clearly why this is not an actually problem. You are mistaken I'm afraid.
This is simply unacceptable since it completely ignores the elephant in the room that the working definition of Going forward, we can choose to define
Option 1 would be a shame since the reason why
Just to stress the first point. Let's suppose I'm calling germline/somatic variants in a tumour sample and get (using option 3)
Now I'm only really interested in somatic variants, so I filter my VCF for these (this is a very common thing to do):
So I'm now left believing that my somatic variant occurred in the context of a tandem repeat (the reference), but this is not the case. This is exactly the type of situation that will mislead analysis. Any of the choices of option 2 have none of these issues. At this point I would settle for the first 'symbolic' |
No, you didn't. Slightly modify your example:
If phasing is unavailable, do you want to turn
With |
I've been following along with this, but I'm on vacation with my family and each time I try to reply I get interrupted part way through, when I return the argument has moved. I'm unconvinced that there is anything catastrophically wrong with the current use of the * allele. We do definitely need to clarify what It does seem like it's currently difficult to specify complex exact matches to reference using the existing * in some of the cases you mentioned. In general VCF hasn't done a very clear job of distinguishing when we believe something that isn't represented is reference vs when it is not defined. How you interpret unspecified sites in the vcf mean depends on your assumptions about the provenance of the vcf and isn't explicitly encoded. For instance, in the following, what is the genotype in any spot between 10 and 20?
It's unclear if it's a confident 0|0 or if it's unknown, and which it is depends on the provenance of the vcf. If we wanted to explicitly specify it we would need to add a homref block.
Maybe we could clarify some of these examples in a similar way by adding a tag which can be used to specify exactly how long a match to the reference is called. Something like the proposed RBS. Here's an example using Haplotype-Reference-Block-Size (HRBS): a per haplotype integer value that specifies the number of reference bases included in a 0 call for that haplotype.
It seems like you want to be able to explicitly encode this information, but I think doing so in the alleles is not the best option. @dancooke Your proposed solution using * and # bases in alleles is problematic for processing large cohorts. Based on this example:
For a large deletion in 1 sample, you would need to introduce specific alleles for every other sample which act like a mask for the specific pattern of overlapping alleles in each sample. This prevents finalizing this site until the entirety of the length the deletion has been processed. In a million sample vcf holding all the relevant information in memory becomes tricky with an approach like yours. It would also require you to add up to The existing * allele instead requires you only to hold the alleles in memory which are still potential overlappers. For a 100kb symbolic deletion it's not clear what you would do with your scheme, just fall back to the existing * allele behavior? Introduce a million new symbolic alleles and record the exact bases somewhere? As a side note, I would be in favor of redefining * to be equivalent to Filtering a vcf with |
@lh3 Your modified example is ill formed because you cannot have the @lbergelson I don't believe that you can confidently claim that the mass of VCFs are assuming this definition of Regarding reference calls, while I agree this is a tricky topic, I do believe that the best way forward is in terms of explicitly called alleles. If I call
then I claim to have considered the two called haplotypes within whatever genotype model I'm using. In particular, I claim to have compared the reference base against a deleted base at each position where the reference has been called. From a haplotype perspective, I could just as well have called
since the haplotype calls are the same (I would however argue that the overall interpretation is slightly different). I would then presumably report measures of uncertainty from my model (e.g. By adopting the GATK definition for
With this set of calls you make an entirely different claim about what the model has considered at positions 13 and 14, and cannot report the same measures of uncertainty as for the explicitly called bases. You raise good points regarding the practicality of base I don't understand how you would filter my germline/somatic example while retaining the information from the germline. Under the explicit
so when I filter for
which is considerably better than having |
Why? Either way, that allele doesn’t matter. What matters is how you encode the true reference allele without phasing. |
@lh3 Because then you're asserting the values of positions at 12 and 14 but re-defining them later.
This is an oxymoron. You can't state true reference for this allele if you don't know the phase. It would be like me asking how you can assert the reference in
if I don't know the phase between the two bases. The correct way to encode the situation is
since the allele |
I still don’t understand why this is wrong (oh, a typo: it should be
Are you thinking about 4-gamete test? |
Ah sorry, I missed that you'd modified the example. The correct way to encode this would be
If the region is un-phasable for
|
Then |
@lh3 It is not indeterministic at all - it is precisely determined when the phase is known. |
It depends on how you define indeterministic. A definition of 0 depending on phasing is indeterministic IMO. Think this way. With def 2, the number of 0 alleles exactly corresponds to the number of ref alleles up to the next record. With def 1, the number of 0 alleles doesn’t mean much. Some other ref alleles may be hiding behind *. Def 3 doesn’t say about the ref allele, so it doesn’t matter. |
It is not |
Another data point. The Genome In A Bottle VCFs (v3.3.2) - arguably some of the most important and widely used VCFs publicly available - all specify VCF v4.2 yet do not use the Extract from
|
- Clarify that * can not be mixed with other bases - "* is not a base" in VCFv4.4 too - add regex for ALT - add ALT regex for VCFv4.4 too There are 2 other points where we haven't reached a conclusion: - Clarifying that a GT=0 doesn't mean "true reference", and if it's possible to state "true ref" at all. - We want to make * to mean "overlapping allele" instead of "overlapping deletion" but the previous point on GT=0 needs a conclusion before we can make a proper update to *. This also will require clarifications in relation to <*>, which now will be even more confusing to users.
The VCF spec contains this text on ALTs:
I think probably the intention here was to allow either
*
or[ACGTN]+
, but not a mixture of the two. But as written it theoretically allows alleles like*A*C*G*T*
, but has nothing to say on how such alleles should be interpreted.I've recently come across VCFs generated by Octopus that contain alleles that start with a
*
and end with base sequence, e.g.*AC
. This is technically valid VCF, but I suspect most tooling won't support it (GATK tools fail on it). In their case they seem to be using it to indicate that the anchor base in a deletion record is covered by an upstream deletion, but not the whole allele.I think the spec should either be updated to clarify this kind of usage or if the horse is already out of the gate, prohibit it.
The text was updated successfully, but these errors were encountered: