Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corner case: cannot encode a SNP next to an indel in standard VCF encoding #893

Open
hyanwong opened this issue Jan 23, 2024 · 1 comment

Comments

@hyanwong
Copy link
Member

hyanwong commented Jan 23, 2024

In the VCF format, indels are encoded by using adding the letter before the variant to the allele list, and reporting the position of the variant as the indel position minus one (see 5.1.1 in http://samtools.github.io/hts-specs/VCFv4.4.pdf). E.g.:


ATGC---CC
ATGCGGGCC
123456789 (position)

Would be reported as a variant at position 4, with alleles "C" and "CGGG", rather than a variant at position 5 with alleles "" and "GGG". The initial C in the allele string is taken from the reference genome sequence. However, if there is also a SNP at position 4, we would register two sites, both at position 4. Tskit does not allow sites with duplicate positions, so this situation can't be easily encoded in a tree sequence, even though it's a valid representation of genetic variation.

Other than mask such sites out (if there are any), we could re-encode such sites as at position 5 with one blank allele (""). However, this could cause problems for interpreting the SGkit format, as we currently calculate allele numbers etc on the assumption that trailing "" values in the alleles list can be removed.

This is probably not much of an issue for human data. But in some other datasets, such as Anopheles, every other base is a SNP, so I imagine this situation occurs quite often.

Note that there is a corner case of this corner case: when the indel is at the start of the sequence, the reference letter after the site is appended.

@hyanwong
Copy link
Member Author

hyanwong commented Feb 2, 2024

On reflection, I think that the SGkit format should use the VCF style (i.e. use the position one base pair before the actual indel, e.g. pos=9, REF="ATCT", ALT="A"), whereas tsinfer should translate that to a tree sequence which has a position at the indel site (pos=10, REF="TCT", ALT=""). That way we can reasonably encode indels with an adjacent SNP.

The TreeSequence.to_vcf() method should be able to deal with tree sequences of both forms, however (outputting the VCF-compatible form in both cases). I'm unclear what we would use as the reference sequence letter to append to the allele string in this case, however.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant