Format of alleles in VCF #28

Open
d-cameron opened this issue May 14, 2014 · 1 comment

d-cameron commented May 14, 2014

The format of alleles in VCF is not formally defined in the specifications. In particular, the following edge cases exist in the current specifications:
A*A* is a valid alt allele but should not be.
It is unclear whether A<contig>A is (or should be) valid.

I propose the following grammar productions as a formal definition for the format of alleles:

ref:
 base-string

alt:
 allele
 allele , alt

allele:
 base-string
 missing-allele
 symbolic-allele

missing-allele:
 *

symbolic-allele:
 id-string
 symbolic-insertion
 breakend
 breakpoint

symbolic-insertion:
 base id-string
 id-string base
 id-string

breakend:
 base-string null-allele
 null-allele base-string

breakpoint:
 base-string [ breakpoint-reference [
 [ breakpoint-reference [ base-string
 base-string ] breakpoint-reference ]
 ] breakpoint-reference ] base-string

breakpoint-reference:
 contig-reference
 contig-reference : digits

contig-reference:
 id-string
 contig-identifier

digits:
 digit
 digits digit

digit: one of
 0 1 2 3 4 5 6 7 8 9

base: one of
 A C G T N a c g t n

base-string:
 base
 base-string base

id-string:
 < contig-identifier >

contig-identifier:
 string-containing-no-whitespace-or-colon
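
To make the productions above easier to test against real ALT values, here is a minimal sketch that transcribes them into Python regular expressions. It is an illustration under stated assumptions, not a reference implementation: `classify_allele` and `is_valid_alt` are hypothetical helper names, `null-allele` is assumed to be `.` because the grammar does not define it, and splitting on commas assumes contig identifiers contain no commas.

```python
import re

BASE = r"[ACGTNacgtn]"
BASES = BASE + "+"                 # base-string
CONTIG = r"[^\s:]+"                # contig-identifier: no whitespace or colon
ID = "<" + CONTIG + ">"            # id-string
NULL = r"\."                       # ASSUMPTION: null-allele is "." (undefined in the grammar)
BREAKPOINT_REF = rf"(?:{ID}|{CONTIG})(?::[0-9]+)?"

# One pattern per allele production. A bare id-string is listed under both
# symbolic-allele and symbolic-insertion in the grammar; it is reported
# here as symbolic-insertion.
ALLELE_PATTERNS = {
    "base-string": BASES,
    "missing-allele": r"\*",
    "symbolic-insertion": rf"{BASE}{ID}|{ID}{BASE}|{ID}",
    "breakend": rf"{BASES}{NULL}|{NULL}{BASES}",
    "breakpoint": (
        rf"{BASES}\[{BREAKPOINT_REF}\[|\[{BREAKPOINT_REF}\[{BASES}"
        rf"|{BASES}\]{BREAKPOINT_REF}\]|\]{BREAKPOINT_REF}\]{BASES}"
    ),
}


def classify_allele(allele):
    """Return the names of every production that matches the whole allele."""
    return [name for name, pattern in ALLELE_PATTERNS.items()
            if re.fullmatch(pattern, allele)]


def is_valid_alt(alt):
    """An alt is a comma-separated list of alleles, each matching a production."""
    return all(classify_allele(allele) for allele in alt.split(","))


print(classify_allele("A*A*"))        # [] : rejected, as intended
print(classify_allele("A<contig>A"))  # [] : rejected under this grammar
print(classify_allele("N[<[>[>["))    # ['breakpoint'] : still accepted
print(is_valid_alt("A,<DEL>,*"))      # True
```

Under this transcription both edge cases from the top of the issue are rejected, while `N[<[>[>[` is still accepted because `contig-identifier` permits brackets.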

contig-identifier is a problematic definition. Currently the spec allows [ ] < > . * as part of contig identifiers. Including the brackets as valid characters is especially likely to cause difficulties for implementations, because alleles such as N[<[>[>[ are currently valid, and the string <CHR> is an ambiguous reference to either a (contig-identifier) reference contig "<CHR>" or an (id-string) named contig CHR.
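
The ambiguity can be shown concretely with a small sketch that lists every way a breakpoint reference (without the optional `: digits` suffix) can be read under the `contig-reference` production. The function name `breakpoint_reference_readings` is hypothetical, and the character class simply restates "no whitespace or colon" from the grammar.

```python
import re

CONTIG = r"[^\s:]+"  # contig-identifier: no whitespace or colon


def breakpoint_reference_readings(ref):
    """List each (production, contig name) reading of a contig-reference."""
    readings = []
    # Reading 1: the whole string is a contig-identifier in the reference.
    if re.fullmatch(CONTIG, ref):
        readings.append(("contig-identifier", ref))
    # Reading 2: the string is an id-string, i.e. <contig-identifier>.
    match = re.fullmatch("<(" + CONTIG + ")>", ref)
    if match:
        readings.append(("id-string", match.group(1)))
    return readings


print(breakpoint_reference_readings("chr1"))
# [('contig-identifier', 'chr1')]
print(breakpoint_reference_readings("<CHR>"))
# [('contig-identifier', '<CHR>'), ('id-string', 'CHR')]
print(breakpoint_reference_readings("<[>[>"))
# [('contig-identifier', '<[>[>'), ('id-string', '[>[')]
```

Any reference wrapped in angle brackets yields two readings, so a parser cannot tell which contig is meant unless `<` and `>` are excluded from `contig-identifier` or one reading is given priority.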

@d-cameron

What is the most appropriate mechanism for contributing to the specifications? Should I provide pull requests? Is there a mailing list for spec discussion?
