sgkit: Support `call_genotype_mask`? #786

benjeffery · 2022-12-21T13:56:02Z

Often when running tsinfer, only a subset of the variants is wanted for inference. Making a subset sgkit dataset isn't ideal, as it means making a copy of the dataset and rechunking (see https://github.com/pystatgen/sgkit/issues/981).

One option is to have tsinfer support a mask as part of the input dataset. I think I'm right in saying that for tsinfer variant_mask would make the most sense - but in sgkit call_genotype_mask is standard. We could use call_genotype_mask and error if a variant has a mixture of masked and unmasked calls?
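As a rough illustration of the "error on a mixture" idea, here is a minimal sketch that collapses a per-call mask into a per-variant mask. It assumes the mask is a boolean array with variants on the first axis and that `True` means "masked out"; the function name `variant_mask_from_call_mask` is hypothetical and not part of tsinfer or sgkit.

```python
import numpy as np


def variant_mask_from_call_mask(call_genotype_mask):
    """Collapse a per-call mask into a per-variant mask (hypothetical helper).

    Assumes axis 0 indexes variants and True means "masked out".
    Raises ValueError if any variant has both masked and unmasked calls.
    """
    mask = np.asarray(call_genotype_mask, dtype=bool)
    # Flatten all non-variant axes (e.g. samples, ploidy) into one.
    calls_per_variant = int(np.prod(mask.shape[1:]))
    flat = mask.reshape(mask.shape[0], calls_per_variant)
    all_masked = flat.all(axis=1)
    any_masked = flat.any(axis=1)
    mixed = any_masked & ~all_masked
    if mixed.any():
        bad = np.flatnonzero(mixed).tolist()
        raise ValueError(f"Variants with partially masked calls: {bad}")
    return all_masked


# Three variants, two diploid samples; variant 1 is fully masked.
example_mask = np.zeros((3, 2, 2), dtype=bool)
example_mask[1] = True
example_variant_mask = variant_mask_from_call_mask(example_mask)
```

The per-variant result is what the inference would actually consume, which is part of why a dedicated variant-level mask may be the simpler input.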

The main changes to enable this are in the chunk iterator, which would need to skip the masked rows.
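To make the chunk-iterator change concrete, a minimal sketch follows. It is not tsinfer's actual iterator: it works on in-memory NumPy arrays rather than chunked storage, the name `iter_unmasked_chunks` is hypothetical, and it assumes `True` in the mask means "skip this variant".

```python
import numpy as np


def iter_unmasked_chunks(genotypes, variant_mask, chunk_size):
    """Yield chunks of genotype rows, dropping masked variants (sketch).

    genotypes: array with variants on axis 0.
    variant_mask: boolean array, True means the variant is skipped (assumed).
    Chunks in which every variant is masked are not yielded at all.
    """
    genotypes = np.asarray(genotypes)
    variant_mask = np.asarray(variant_mask, dtype=bool)
    if variant_mask.shape[0] != genotypes.shape[0]:
        raise ValueError("Mask length must match the number of variants")
    for start in range(0, genotypes.shape[0], chunk_size):
        stop = start + chunk_size
        keep = ~variant_mask[start:stop]
        if keep.any():
            # Boolean indexing copies only the rows that are kept.
            yield genotypes[start:stop][keep]


example_genotypes = np.arange(10).reshape(5, 2)
example_skip = np.array([False, True, True, False, True])
example_chunks = list(iter_unmasked_chunks(example_genotypes, example_skip, 2))
```

Because the mask is applied per chunk, the underlying dataset is never copied or rechunked; only the retained rows of each chunk are materialised.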

jeromekelleher · 2023-01-11T13:26:02Z

I think we need to support masks, because we want to support incremental exploration of the data with extra levels of filtering, without needing to store copies.

benjeffery · 2023-01-11T13:31:42Z

After some discussion it seems best to add a variant_mask rather than use call_genotype_mask. That array would be used automatically if present, rather than having the user specify a mask on each call to tsinfer.
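The "use the array if present" behaviour could look something like the sketch below. It treats the dataset as a plain mapping from array names to arrays; the helper `get_variant_mask` and the fallback to an all-`False` mask are assumptions for illustration, not the implementation merged in #791.

```python
import numpy as np


def get_variant_mask(dataset, num_variants):
    """Return the dataset's variant_mask, or an all-False mask if absent.

    `dataset` is any mapping of array names to arrays (an assumption made
    for this sketch); True is taken to mean "masked out".
    """
    if "variant_mask" in dataset:
        mask = np.asarray(dataset["variant_mask"], dtype=bool)
        if mask.shape != (num_variants,):
            raise ValueError("variant_mask must have one entry per variant")
        return mask
    # No mask stored: every variant is used.
    return np.zeros(num_variants, dtype=bool)


mask_when_present = get_variant_mask({"variant_mask": [True, False, False]}, 3)
mask_when_absent = get_variant_mask({}, 3)
```

Storing the mask alongside the data means the same filtering applies to every run without extra arguments.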

benjeffery added this to the Release 0.4.0 milestone Jan 11, 2023

benjeffery mentioned this issue Jan 17, 2023

Add mask to sgkit SampleData #791

Merged

mergify bot closed this as completed in #791 May 15, 2023
