Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check all alleles at a site are unique before inference #927

Open
hyanwong opened this issue Jun 6, 2024 · 6 comments
Open

Check all alleles at a site are unique before inference #927

hyanwong opened this issue Jun 6, 2024 · 6 comments
Labels
Milestone

Comments

@hyanwong
Copy link
Member

hyanwong commented Jun 6, 2024

I'm finding that tsinfer isn't correctly inferring the genotypes for Sgkit files in the trivial instance below. This seems pretty worrying. What do I have wrong?

import sgkit as sg
import tsinfer

ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4, missing_pct=0, phased=True, seed=1)
ds['variant_allele'] = ds['variant_allele'].astype(str) # hack around https://github.com/tskit-dev/tsinfer/issues/810
ds.update({'variant_ancestral_allele': ds['variant_allele'][:,0]})
ds.to_zarr('/tmp/ds.zarr', mode='w')
sd = tsinfer.SgkitSampleData('/tmp/ds.zarr')
ts = tsinfer.infer(sd)
print("Inferred_genos    Zarr_genotypes    SampleData_genos  Ts alleles")
for v, sgv, sdv in zip(ts.variants(), ds.call_genotype, sd.sites_genotypes):
    bad = "<<<< BAD!!!" if not (v.genotypes == sdv).all() else ""
    print(v.genotypes, sgv.values.flatten(), sdv, v.alleles, bad)

Shows that the 7th and 8th site are incorrect:

Inferred_genos    Zarr_genotypes    SampleData_genos  Ts alleles
[1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] ('G', 'C') 
[0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] ('G', 'A') 
[1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] ('T', 'A') 
[1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] ('G', 'A') 
[1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] ('C', 'G') 
[0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] ('G', 'A') 
[0 0 0 0 0 0 0 0] [0 1 0 1 1 0 0 1] [0 1 0 1 1 0 0 1] ('T',) <<<< BAD!!!
[0 0 0 0 0 0 0 0] [1 0 0 0 0 1 0 1] [1 0 0 0 0 1 0 1] ('C',) <<<< BAD!!!
[0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] ('T', 'G') 
[0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] ('A', 'G') 
@jeromekelleher
Copy link
Member

Can you show the zarr alleles fields as well please (and ancestestral state?)

@hyanwong
Copy link
Member Author

hyanwong commented Jun 7, 2024

Aha! The ds['variant_allele'] array created by sg.simulate_genotype_call_dataset is wrong: it has duplicate allele states (e.g. two Ts or two Cs)

ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4, missing_pct=0, phased=True, seed=1)
print(ds['variant_allele'])
ds['variant_allele'] = ds['variant_allele'].astype(str) # hack around https://github.com/tskit-dev/tsinfer/issues/810
print(ds['variant_allele'])
ds.update({'variant_ancestral_allele': ds['variant_allele'][:,0]})
ds.to_zarr('/tmp/ds.zarr', mode='w')
sd = tsinfer.SgkitSampleData('/tmp/ds.zarr')
ts = tsinfer.infer(sd)
print("Inferred_genos    Zarr_genotypes    SampleData_genos  SD alleles")
for v, sgv, sdv, sda, sdaa in zip(
    ts.variants(), ds.call_genotype, sd.sites_genotypes, sd.sites_alleles, sd.sites_ancestral_allele, 
):
    bad = "<<<< BAD!!!" if not (v.genotypes == sdv).all() else ""
    print(v.genotypes, sgv.values.flatten(), sdv, sda, sdaa, bad)
<xarray.DataArray 'variant_allele' (variants: 10, alleles: 2)> Size: 20B
array([[b'G', b'C'],
       [b'G', b'A'],
       [b'T', b'A'],
       [b'G', b'A'],
       [b'C', b'G'],
       [b'G', b'A'],
       [b'T', b'T'],
       [b'C', b'C'],
       [b'T', b'G'],
       [b'A', b'G']], dtype='|S1')
Dimensions without coordinates: variants, alleles
Attributes:
    comment:  The possible alleles for the variant.
<xarray.DataArray 'variant_allele' (variants: 10, alleles: 2)> Size: 80B
array([['G', 'C'],
       ['G', 'A'],
       ['T', 'A'],
       ['G', 'A'],
       ['C', 'G'],
       ['G', 'A'],
       ['T', 'T'],
       ['C', 'C'],
       ['T', 'G'],
       ['A', 'G']], dtype='<U1')
Dimensions without coordinates: variants, alleles
Attributes:
    comment:  The possible alleles for the variant.
Inferred_genos    Zarr_genotypes    SampleData_genos  SD alleles
[1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] ['G' 'C'] 0 
[0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] ['G' 'A'] 0 
[1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] ['T' 'A'] 0 
[1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] ['G' 'A'] 0 
[1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] ['C' 'G'] 0 
[0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] ['G' 'A'] 0 
[0 0 0 0 0 0 0 0] [0 1 0 1 1 0 0 1] [0 1 0 1 1 0 0 1] ['T' 'T'] 0 <<<< BAD!!!
[0 0 0 0 0 0 0 0] [1 0 0 0 0 1 0 1] [1 0 0 0 0 1 0 1] ['C' 'C'] 0 <<<< BAD!!!
[0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] ['T' 'G'] 0 
[0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] ['A' 'G'] 0 

Edit - reported as a bug in sgkit-dev/sgkit#1221

@hyanwong
Copy link
Member Author

hyanwong commented Jun 7, 2024

I didn't realise this could happen. Perhaps it's worth bugging out if there are duplicated allelic states for a site, as this is pretty confusing.

@jeromekelleher
Copy link
Member

Nothing actually checks this I guess, so worth detecting and bugging out here.

Clearly a bug in the sgkit simulator, though.

@benjeffery
Copy link
Member

Phew, had me worried there for a moment! As Jerome says we should error out on this condition.

@hyanwong hyanwong changed the title Tsinfer not replicating genotype matrix Check all alleles at a site are unique before inference Jun 7, 2024
@hyanwong
Copy link
Member Author

hyanwong commented Jun 7, 2024

I changed the name of the issue to reflect what needs doing.

@hyanwong hyanwong mentioned this issue Jul 25, 2024
@hyanwong hyanwong added this to the Release 0.4.0 milestone Aug 30, 2024
hyanwong added a commit to hyanwong/tsinfer that referenced this issue Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants