-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow partially spanned alleles. #1413
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1413 +/- ##
===============================================
+ Coverage 68.115% 68.121% +0.006%
- Complexity 8375 8376 +1
===============================================
Files 573 573
Lines 33978 33978
Branches 5668 5668
===============================================
+ Hits 23144 23146 +2
+ Misses 8644 8643 -1
+ Partials 2190 2189 -1
|
Eh... I'm not a fan of this. I think we're going to make it explicitly against spec unless someone can make a REALLY compelling case. Do you think this representation is a useful one, or do you just want to be able to read octopus files? Would the octopus authors be opposed to changing their representation in future versions? I think it could make sense to extend the definition of * to be "spanning allele" rather than merely spanning deletion which would work for their second use case. It's not clear to me why the first case can't be represented as a complex event rather than a deletion followed by a partially deleted allele, or just a deletion followed by a change a few bases later so it's not partially deleted. Anyway.. that is probably discussion for the specs issue instead of here. I really don't like making this wok by default because I don't think we want to encourage it. If you wanted to add it as a weird mode that could be controlled by one of the flags in Defaults I think I'd be happier with that. How would that work for you? |
Yeah, I thought you might have a reaction like that @lbergelson. My primary concern at this point is that I just want to be able to read Octopus VCFs without resorting to silly things like e.g. modifying the input stream on the fly on the way to the VCF parsing code! @dancooke can give a better answer on what he, as the author of Octopus, thinks about the long term usage of this style, but my sense is that he is committed to outputting VCFs with a single ALT per record, and using phasing and spanned alleles to stitch things together. Bigger picture, it looks like this was first brought up on htspecs back in June 2019 (samtools/hts-specs#151) and again more recently by me about a week and a half ago (samtools/hts-specs#437). I tend to think, actually, that Dan's use of partially spanned alleles in Octopus is quite nice and once you understand it, I like it better than using just In the short term I don't have any serious issues with having to specifically enable this behaviour via a -Dallow_partially_spanned_alleles or similar if that would get a 👍 from you. |
I think the best case that I can make is that it provides a neater definition that is more in line with the rest of the specification. The working definition that I've used for
While the meaning currently being used by GATK appears to be more like
So suddenly we go from defining A more practical benefit of the base- |
@dancooke: What do you mean by ‘upstream record’? Another record earlier in the same VCF file? (Records referring to other records seems problematic if the VCF file later gets filtered…) |
@jmarshall A record with smaller or equal |
@jmarshall The same problem exists either way - if you use either @dancooke's representation or the standard Most of the work I'm doing these days focusing on either a single sample or a very small number of samples at time where frankly combining events/alleles that overlap or are within a few bp into a single VCF record is preferred. But I understand that for larger cohorts this gets extremely messy and there's a lot of compound events, and emitting more simple records works better, hence the overall desire for spanning alleles. I tend to agree that @dancooke's representation is easier to interpret once you understand it. I want to give a toy example that makes it clear why this is preferred sometime, and attempt to show how the representations differ.
When reconstructing the two haplotypes in the sample, the latter representing makes it clear that the prior variant spans (deletes) two bases from the second ALT, and then the that allele picks back up with |
@tfenne Sorry, I didn't mean to totally shut this down in htsjdk. I figured you were going to come back with a change that adds a flag in defaults. I'm happy to accept a pr that adds an opt in for this, I just don't want picard/gatk suddenly falling over in confusing ways when they encounter these things. If we decide to not ban it in the spec we can come back and add better support in a later version. I don't want to make any decision about it in the spec before I talk with a few people who are on vacation right now. People are trickling back in after labor day. |
@lbergelson no worries - I wasn't taking this as 100% blocked, but rather figured it was useful to try and push the spec discussion forward some because it would be better to have the spec clarified and then just do the right thing here. If I get to the point where I'm blocked before the spec issue resolves I'll take a look at making this opt in. Sorry if I wasn't clear. |
I like the idea of having a STAR_IS_A_BASE flag in the Defaults class to enable reading of Octopus vcfs. Concurrently there should be discussion in hts-specs about the base-status of star. |
@tfenne this is in your court now to implement a "Defaults" parameter to unlock these "malformed" inputs. |
Description
@lbergelson this is likely to be a slightly contentious change despite the small PR. The goal here is to support partially spanned alleles in the form they are used in octopus. See luntergroup/octopus#75 for a description of what's going on, but the short version is that Octopus outputs alleles like
*AT
and**GCGA
to represent cases where the first or first two (respectively) bases of the allele are spanned by an upstream event.This isn't prohibited by the VCF spec, although it seems like it's an unintentional thing that alleles like this are spec compliant. Still, I'd like to be able to use HTSJDK's VCF API to read octopus VCFs and it currently blows up on records with alleles like this.