Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SgkitSampleData format missing data bug #892

Open
hyanwong opened this issue Jan 22, 2024 · 2 comments
Open

SgkitSampleData format missing data bug #892

hyanwong opened this issue Jan 22, 2024 · 2 comments
Labels

Comments

@hyanwong
Copy link
Member

hyanwong commented Jan 22, 2024

In a normal SampleData file we guarantee that if the genotypes contain -1 for missing data, the last item in the alleles array is None, (so that site.alleles[missing_genotype] is None). This is done when adding the alleles during add_site, here:

alleles = list(alleles) + [None]

Since we don't use add_sites for an SGkit file, this no longer holds. In other words, SgkitSampleData files can have missing data but the last item of the allele list may not be None. This could easily become a source of subtle bugs. I don't know how we should correct it? Perhaps override the site.alleles accessor in the SgkitSampleData class?

import tskit
ds = sg.simulate_genotype_call_dataset(n_variant=7, n_sample=3, missing_pct=.5, phased=True)
ds.update({'variant_ancestral_allele': ds['variant_allele'][:,0]})
sg.save_dataset(ds, "output.zarr", mode="w")
sd = tsinfer.SgkitSampleData(path="output.zarr")
for v in sd.variants():
    if tskit.MISSING_DATA in v.genotypes:
        print("This allele list should have a `None` as the last item:", v.alleles)

Giving

This allele list should have a `None` as the last item: [b'A', b'G']
@hyanwong hyanwong added the bug label Jan 22, 2024
@hyanwong
Copy link
Member Author

hyanwong commented Jan 23, 2024

This is a little complicated, because sg_sd.sites_alleles[:] is taken straight from the sgkit file, and will include empty strings if e.g. some sites have 4 alleles and others have 2.

I suggest that in a SgkitSampleData file, we:

  1. do not change the sd.sites_alleles zarr data
  2. when we iterate over sites, we report sites.alleles without the trailing "" entries in the zarr array. Moreover, we ignore the presence/absence of missing data in the genotypes for that site, and either report them without a None appended (which could cause user errors) or simply append a None all the time.
  3. When iterating over variants, we know whether or not there is missing data in the genotype, so in the variant.alleles list (which can already differ from site.alleles, e.g. if recode_ancestral=True), we retain the behaviour seen in a "normal" SampleData file, which is to append None only if there is missing data at that site (this could also include if there is a missing ancestral allele, I guess)?

@jeromekelleher
Copy link
Member

Just a note here that the SgkitSample data is a temporary way to let use use data in sgkit format internally, without breaking too much code. We're not going to propose it as an interface for users to work with, and we will ultimately remove the SampleData interface entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants