Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow partially spanned alleles. #1413

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tfenne
Copy link
Member

@tfenne tfenne commented Aug 26, 2019

Description

@lbergelson this is likely to be a slightly contentious change despite the small PR. The goal here is to support partially spanned alleles in the form they are used in octopus. See luntergroup/octopus#75 for a description of what's going on, but the short version is that Octopus outputs alleles like *AT and **GCGA to represent cases where the first or first two (respectively) bases of the allele are spanned by an upstream event.

This isn't prohibited by the VCF spec, although it seems like it's an unintentional thing that alleles like this are spec compliant. Still, I'd like to be able to use HTSJDK's VCF API to read octopus VCFs and it currently blows up on records with alleles like this.

@tfenne tfenne requested a review from lbergelson August 26, 2019 18:05
@tfenne tfenne self-assigned this Aug 26, 2019
@codecov-io
Copy link

codecov-io commented Aug 26, 2019

Codecov Report

Merging #1413 into master will increase coverage by 0.006%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #1413       +/-   ##
===============================================
+ Coverage     68.115%   68.121%   +0.006%     
- Complexity      8375      8376        +1     
===============================================
  Files            573       573               
  Lines          33978     33978               
  Branches        5668      5668               
===============================================
+ Hits           23144     23146        +2     
+ Misses          8644      8643        -1     
+ Partials        2190      2189        -1
Impacted Files Coverage Δ Complexity Δ
...ain/java/htsjdk/variant/variantcontext/Allele.java 77.703% <ø> (ø) 95 <0> (ø) ⬇️
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 90.476% <0%> (+9.524%) 4% <0%> (+1%) ⬆️

@lbergelson
Copy link
Member

Eh... I'm not a fan of this. I think we're going to make it explicitly against spec unless someone can make a REALLY compelling case.

Do you think this representation is a useful one, or do you just want to be able to read octopus files? Would the octopus authors be opposed to changing their representation in future versions?

I think it could make sense to extend the definition of * to be "spanning allele" rather than merely spanning deletion which would work for their second use case. It's not clear to me why the first case can't be represented as a complex event rather than a deletion followed by a partially deleted allele, or just a deletion followed by a change a few bases later so it's not partially deleted.

Anyway.. that is probably discussion for the specs issue instead of here. I really don't like making this wok by default because I don't think we want to encourage it. If you wanted to add it as a weird mode that could be controlled by one of the flags in Defaults I think I'd be happier with that. How would that work for you?

@tfenne
Copy link
Member Author

tfenne commented Aug 26, 2019

Yeah, I thought you might have a reaction like that @lbergelson. My primary concern at this point is that I just want to be able to read Octopus VCFs without resorting to silly things like e.g. modifying the input stream on the fly on the way to the VCF parsing code!

@dancooke can give a better answer on what he, as the author of Octopus, thinks about the long term usage of this style, but my sense is that he is committed to outputting VCFs with a single ALT per record, and using phasing and spanned alleles to stitch things together.

Bigger picture, it looks like this was first brought up on htspecs back in June 2019 (samtools/hts-specs#151) and again more recently by me about a week and a half ago (samtools/hts-specs#437). I tend to think, actually, that Dan's use of partially spanned alleles in Octopus is quite nice and once you understand it, I like it better than using just * when the allele is only partially spanned, which I have seen the GATK do. Either way though, I would just like the spec to modified to either explicitly embrace this usage or disallow it so that we can all get back to more fruitful things!

In the short term I don't have any serious issues with having to specifically enable this behaviour via a -Dallow_partially_spanned_alleles or similar if that would get a 👍 from you.

@dancooke
Copy link

I think the best case that I can make is that it provides a neater definition that is more in line with the rest of the specification. The working definition that I've used for * is something like

This base is explained by an upstream (or matching `POS`) record.

While the meaning currently being used by GATK appears to be more like

Some or all of the leading reference bases in this allele are deleted - as explained by an upstream (or matching `POS`) record.

So suddenly we go from defining ALT alleles in terms of nucleotides to defining them in terms of both nucleotides and symbolic alleles (but distinct from the other symbolic alleles the spec permits). I'm not suggesting both definitions can't result in semantically equivalent information (w.r.t describing haplotypes), but I think the former is both clearer and more expressive.

A more practical benefit of the base-* definition is that reconstructing haplotypes from VCF is a little easier as there's no need to keep track of information from previous records (other than the sequence), since each record specifies exactly how many reference bases must be ignored.

@jmarshall
Copy link
Member

jmarshall commented Aug 27, 2019

@dancooke: What do you mean by ‘upstream record’? Another record earlier in the same VCF file?

(Records referring to other records seems problematic if the VCF file later gets filtered…)

@dancooke
Copy link

@jmarshall A record with smaller or equal POS (so yes, an earlier record in the same VCF if it's co-ordinate sorted). Problematic how, for reconstructing haplotypes? How is this different to the 'symbolic' interpretation? If a previous record is filtered, the reference is implicitly implied. RTGTools vcfeval (3.10.1) supports partially overlapping alleles with * and I haven't encountered any issues with filtered records.

@tfenne
Copy link
Member Author

tfenne commented Aug 27, 2019

@jmarshall The same problem exists either way - if you use either @dancooke's representation or the standard * allele. I.e. if you have a deletion that spans a SNP variant, the GATK-style encoding would be to emit the deletion, and then a SNP with ref=A, ALT=*,T GT=1/2. If the deletion is then filtered but the SNP is not, you still have to make a decision at the SNP about how to interpret the * allele - does the genotype become 2/2 because you only observed allele 2 and now you're inferring that the other haplotype isn't deleted? Or do you make it 0/2 assuming that it's reference since the deletion no longer spans it. It's a general issue with spanning alleles, not specific to any one representation of them.

Most of the work I'm doing these days focusing on either a single sample or a very small number of samples at time where frankly combining events/alleles that overlap or are within a few bp into a single VCF record is preferred. But I understand that for larger cohorts this gets extremely messy and there's a lot of compound events, and emitting more simple records works better, hence the overall desire for spanning alleles.

I tend to agree that @dancooke's representation is easier to interpret once you understand it. I want to give a toy example that makes it clear why this is preferred sometime, and attempt to show how the representations differ.

# GATK representation
chr1  1000  GTATATA G         ... 0|1
chr1  1005  TAGCGA  T,*       ... 1|2

# Octopus representation
chr1  1000  GTATATA G         ... 0|1
chr1  1005  TAGCGA  T,**GCGA  ... 1|2

When reconstructing the two haplotypes in the sample, the latter representing makes it clear that the prior variant spans (deletes) two bases from the second ALT, and then the that allele picks back up with GCGA (i.e. the reference bases). When encountering variants like this with GATK you have to do extra work based on positions and allele lengths to determine that the * allele doesn't indicate that the second ALT is fully spanned by the preceding variant. I have been bitten by this previously where I'm trying to reconstruct the local haplotype across variants, and based on the spec had interpreted * to mean an allele was fully spanned by an upstream variant.

@lbergelson
Copy link
Member

@tfenne Sorry, I didn't mean to totally shut this down in htsjdk. I figured you were going to come back with a change that adds a flag in defaults.

I'm happy to accept a pr that adds an opt in for this, I just don't want picard/gatk suddenly falling over in confusing ways when they encounter these things. If we decide to not ban it in the spec we can come back and add better support in a later version.

I don't want to make any decision about it in the spec before I talk with a few people who are on vacation right now. People are trickling back in after labor day.

@tfenne
Copy link
Member Author

tfenne commented Aug 30, 2019

@lbergelson no worries - I wasn't taking this as 100% blocked, but rather figured it was useful to try and push the spec discussion forward some because it would be better to have the spec clarified and then just do the right thing here. If I get to the point where I'm blocked before the spec issue resolves I'll take a look at making this opt in. Sorry if I wasn't clear.

@yfarjoun
Copy link
Contributor

I like the idea of having a STAR_IS_A_BASE flag in the Defaults class to enable reading of Octopus vcfs. Concurrently there should be discussion in hts-specs about the base-status of star.

@yfarjoun
Copy link
Contributor

@tfenne this is in your court now to implement a "Defaults" parameter to unlock these "malformed" inputs.

@yfarjoun yfarjoun added the Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond label Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants