sgkit: Support call_genotype_mask? #786

Closed
benjeffery opened this issue Dec 21, 2022 · 2 comments · Fixed by #791

@benjeffery
Member

Often when running tsinfer, only a subset of the variants is wanted for inference. Creating a subset sgkit dataset isn't ideal, as it means copying and rechunking the dataset (see https://github.com/pystatgen/sgkit/issues/981).

One option is for tsinfer to support a mask as part of the input dataset. I think I'm right in saying that a variant_mask would make the most sense for tsinfer, but call_genotype_mask is the standard in sgkit. We could use call_genotype_mask and raise an error if a variant has a mixture of masked and unmasked calls?
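
A minimal sketch of that check, assuming call_genotype_mask is a boolean array whose first dimension is variants and where True means "masked". The function name `variant_mask_from_call_mask` is hypothetical and not part of tsinfer or sgkit:

```python
import numpy as np


def variant_mask_from_call_mask(call_genotype_mask):
    """Collapse a per-call mask into one flag per variant.

    Hypothetical helper: assumes axis 0 is variants and True means masked.
    Raises ValueError if any variant mixes masked and unmasked calls.
    """
    mask = np.asarray(call_genotype_mask, dtype=bool)
    # Reduce over every axis except the variants axis.
    other_axes = tuple(range(1, mask.ndim))
    all_masked = mask.all(axis=other_axes)
    any_masked = mask.any(axis=other_axes)
    mixed = any_masked & ~all_masked
    if mixed.any():
        bad = np.flatnonzero(mixed).tolist()
        raise ValueError(
            f"Variants with a mixture of masked and unmasked calls: {bad}"
        )
    return all_masked
```

This keeps call_genotype_mask as the only stored array, at the cost of scanning every call to find out which variants are excluded.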

The main changes to enable this are in the chunk iterator, which would need to skip the masked rows.
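
Roughly, the skipping could look like the sketch below. It assumes a genotype array sliceable along the variants axis and a one-dimensional boolean mask where True means "skip this variant". `iter_unmasked_variants` and its parameters are hypothetical names, not the existing tsinfer chunk iterator:

```python
import numpy as np


def iter_unmasked_variants(genotypes, variant_mask, chunk_size):
    """Yield (variant_index, genotype_row) for each unmasked variant.

    Hypothetical sketch: reads `chunk_size` variants at a time and never
    loads a chunk whose variants are all masked.
    """
    num_variants = genotypes.shape[0]
    for start in range(0, num_variants, chunk_size):
        stop = min(start + chunk_size, num_variants)
        chunk_mask = np.asarray(variant_mask[start:stop], dtype=bool)
        if chunk_mask.all():
            # Nothing wanted from this chunk, so avoid reading it at all.
            continue
        chunk = np.asarray(genotypes[start:stop])
        for offset in np.flatnonzero(~chunk_mask):
            # Report the index in the original, unfiltered dataset.
            yield start + int(offset), chunk[offset]
```

Yielding the original variant index means downstream code can still refer back to positions in the unfiltered dataset.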

@benjeffery benjeffery added this to the Release 0.4.0 milestone Jan 11, 2023
@jeromekelleher
Member

I think we need to support masks, because we want to support incremental exploration of the data with extra levels of filtering, without needing to store copies.

@benjeffery
Member Author

After some discussion, it seems best to add a variant_mask rather than use call_genotype_mask, and to use that array whenever it is present rather than have the user specify it on each call to tsinfer.
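
As a sketch of the "use it if present" behaviour, assuming the dataset supports `in` and key lookup the way a dict or an xarray Dataset does. `get_variant_mask` is a hypothetical helper, and treating True as "masked" is an assumption:

```python
import numpy as np


def get_variant_mask(dataset, num_variants):
    """Return the dataset's variant_mask, or an all-False mask if absent.

    Hypothetical helper: `dataset` is any mapping-like object; True is
    assumed to mean the variant is excluded from inference.
    """
    if "variant_mask" in dataset:
        mask = np.asarray(dataset["variant_mask"], dtype=bool)
        if mask.shape != (num_variants,):
            raise ValueError(
                f"variant_mask has shape {mask.shape}, "
                f"expected ({num_variants},)"
            )
        return mask
    # No mask stored: every variant is used.
    return np.zeros(num_variants, dtype=bool)
```

With this default, datasets without a variant_mask behave exactly as before, and no extra argument is needed on each call.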
