Is the star allele (*) considered symbolic or not? (a discussion about VC types) #151

Open
yfarjoun opened this issue Jun 11, 2016 · 4 comments

@yfarjoun
Contributor

The VCF spec discusses symbolic alleles as an angle-bracketed ID String “<ID>” (in 1.6.1.4) but the overlapping deletion allele is *. I suspect that the intention is that the star allele be considered a symbolic allele. Which deletion overlaps the site can depend on the sample/genotype, so * cannot be said to stand for one specific allele whose sequence has simply not been spelled out.

In HTSJDK a VariantContext has a "type", as does an Allele. This isn't spelled out in the VCF spec and so I'm not sure if other VCF parsers do this as well (and if they do, whether it is using the same definitions...). The classification seems to be based on this.

Currently, since the star allele isn't considered symbolic, the VariantContext with it is considered a SNP (all the alleles are of length 1). I would like to change that but am concerned that there are issues that I haven't considered.
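To make the behaviour described above concrete, here is a minimal Python sketch. It is not HTSJDK code: the function names and the type labels are hypothetical, and the "star-aware" rule is only one possible convention, not something the spec or any library prescribes.

```python
def naive_type(ref, alts):
    """Length-only rule: if every allele has length 1, call the site a SNP."""
    alleles = [ref, *alts]
    return "SNP" if all(len(a) == 1 for a in alleles) else "OTHER"


def star_aware_type(ref, alts):
    """Hypothetical alternative: set the '*' allele aside before classifying."""
    has_star = "*" in alts
    concrete = [a for a in alts if a != "*"]
    if not concrete:
        # Only '*' (or nothing) in ALT; the labels here are illustrative.
        return "SYMBOLIC" if has_star else "NO_VARIATION"
    base = naive_type(ref, concrete)
    # A site with '*' plus a concrete allele is reported as mixed.
    return "MIXED" if has_star else base


print(naive_type("A", ["G", "*"]))       # length-only rule
print(star_aware_type("A", ["G", "*"]))  # star set aside first
```

Under the length-only rule the record `A -> G,*` counts as a SNP because `*` is one character long; under the second rule it does not, which is why the two conventions would give different SNP counts for the same file.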

Since Allele type and Variation type are not specified in the VCF spec (as far as I could see), different implementations are thus free to do what they wish, but I suspect that we should decide as a community how to approach this so that we can agree on the meaning of basic things like "how many SNPs does a VCF have?"

@jmarshall jmarshall added the vcf label Sep 1, 2016
@yfarjoun
Contributor Author

Tumbleweed?

Does no one care, has no one thought about this, or does no one have a strong opinion one way or the other?

I'll put in a spec-change PR and see whether that gets more people to chime in.

@d-cameron
Contributor

There is already a star alternate allele <*> defined in VCFv4.3 section 5.5, which is different from the section 1.6.1.5 * "missing" alt allele. Unfortunately, the wording of 1.6.1.5 seems to indicate that AN*G*GG is a valid alternate allele (and that case insensitivity is explicitly allowed which, if actually implemented, breaks SV symbolic alleles).

I raised an issue with the htsjdk allele class design a while back (see samtools/htsjdk#18). My preference is for an API design that can distinguish between SNVs, SVs, and both star alleles, but I think that is an implementation issue, not a specifications issue.

@pd3
Member

pd3 commented Oct 2, 2016

The star allele * is not a SNP. For example, if one sample has a big deletion and another sample has a SNP inside the deleted sequence, the question is how to represent the first sample's genotype at the SNP site: 0/0 would mean the reference allele, which it is not. One could use the missing genotype ./., but that could also mean that the genotype could not be determined. The star allele allows us to represent this situation.
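The three genotype states in that example can be told apart with a few lines of Python. This is only a sketch: the positions, bases, and the `decode_gt` helper are made up for illustration. It assumes a deletion record at position 100 (REF `ACGTA`, ALT `A`, removing positions 101 to 104) and a SNP record at position 103 (REF `T`, ALT `G,*`).

```python
def decode_gt(ref, alts, gt):
    """Turn a GT string such as '0/2' into the allele strings it points at.

    A '.' index is returned as None, meaning the call is missing.
    """
    alleles = [ref, *alts]
    indices = gt.replace("|", "/").split("/")
    return [None if i == "." else alleles[int(i)] for i in indices]


# Hypothetical SNP record at position 103: REF=T, ALT=G,*
ref, alts = "T", ["G", "*"]

print(decode_gt(ref, alts, "0/1"))  # sample carrying the SNP
print(decode_gt(ref, alts, "2/2"))  # sample with the site deleted on both haplotypes
print(decode_gt(ref, alts, "0/0"))  # sample that really is homozygous reference
print(decode_gt(ref, alts, "./."))  # genotype could not be determined
```

With `*` available, "deleted here" (`2/2`), "reference here" (`0/0`), and "unknown" (`./.`) decode to three different results instead of being collapsed into one.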

The term "symbolic allele" refers primarily to anything enclosed in angle brackets <>. In a broader sense, the term is often used to describe situations where the sequence of the alternate allele is not or cannot be given explicitly, for example all the SV events, or the "unobserved allele" <*>, which is used as a placeholder to express all genotype likelihoods.

Strings like AN*G*GG should not be allowed. I don't really know what it'd be good for or how to interpret it.

@Lenbok

Lenbok commented Dec 14, 2018

The specification is currently vague about whether the use of * to represent a spanning deletion must use * as the whole allele.

"Options are base (sic) Strings made up of the bases A,C,G,T,N,*, ..." makes it seem like * can be freely mixed with regular bases.

"The ‘*’ allele is reserved to indicate that the allele is missing due to an overlapping deletion." makes it seem like representation of spanning deletion should use * as a whole ALT.
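The two readings quoted above can be written as two patterns, which makes the disagreement easy to test. This is a sketch under assumptions: the pattern names are hypothetical, only uppercase bases are considered, and neither pattern is taken from the spec text or from any parser.

```python
import re

# Reading 1: '*' is just another character that may appear in a base string.
PERMISSIVE = re.compile(r"[ACGTN*]+")

# Reading 2: an allele is either a plain base string or exactly '*'.
STRICT = re.compile(r"[ACGTN]+|\*")


def accepts(pattern, alt):
    """Return True if the whole ALT string matches the given pattern."""
    return pattern.fullmatch(alt) is not None


for alt in ["*", "ACG", "AN*G*GG", "A*"]:
    print(alt, accepts(PERMISSIVE, alt), accepts(STRICT, alt))
```

Both patterns accept `*` and `ACG`; they differ only on strings that mix `*` with bases, such as `AN*G*GG` or `A*`.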

In the case of an insertion or deletion that coincides with the edge of a spanning deletion, the requirement to add an anchor base would mean that either the boundary of the spanning deletion is being implicitly moved, or the anchor base must also be added to the spanning deletion allele.

Similarly in the case of two partially overlapping deletions, you might want to add bases to each spanning deletion allele to indicate where the overlapping deletion stops.

The alternative is to disallow mixing * with other bases, so that use of this allele does not imply that the entire corresponding haplotype has been deleted (although this reduces its utility).

Note that the Octopus variant caller (from @dancooke) uses this "partial spanning deletion" notation currently.


5 participants