Often when running tsinfer, only a subset of the variants is wanted for inference. Making a subsetted sgkit dataset isn't ideal, as it means copying the dataset and rechunking it (see https://github.com/pystatgen/sgkit/issues/981).
One option is to have tsinfer support a mask as part of the input dataset. I think I'm right in saying that a `variant_mask` would make the most sense for tsinfer, but in sgkit `call_genotype_mask` is the standard. We could use `call_genotype_mask` and raise an error if a variant has a mixture of masked and unmasked calls?
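To make the `call_genotype_mask` option concrete, here is a minimal sketch of collapsing a per-call mask to one flag per variant and erroring on mixed variants. It assumes a boolean array with variants on the first axis and `True` meaning "masked out"; the function name `variant_mask_from_call_mask` is hypothetical and not part of tsinfer or sgkit.

```python
import numpy as np


def variant_mask_from_call_mask(call_genotype_mask):
    """Collapse a per-call mask to one boolean per variant.

    Hypothetical helper: assumes variants are on axis 0 and that True
    means the call is masked out.
    """
    mask = np.asarray(call_genotype_mask, dtype=bool)
    # Reduce over every axis except the variants axis.
    other_axes = tuple(range(1, mask.ndim))
    all_masked = mask.all(axis=other_axes)
    any_masked = mask.any(axis=other_axes)
    # A variant is "mixed" if some, but not all, of its calls are masked.
    mixed = any_masked & ~all_masked
    if mixed.any():
        raise ValueError(
            "Variants with a mixture of masked and unmasked calls: "
            f"{np.flatnonzero(mixed).tolist()}"
        )
    return all_masked
```

The downside of this approach is that the full per-call mask has to be read just to recover one bit per variant.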
The main changes to enable this are in the chunk iterator, which would need to skip the masked rows.
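As a rough sketch of what skipping masked rows in a chunk iterator could look like (this is not tsinfer's actual iterator; `iter_unmasked_chunks` and its parameters are made up for illustration, and `True` in the mask is assumed to mean "skip this variant"):

```python
import numpy as np


def iter_unmasked_chunks(genotypes, variant_mask, chunk_size):
    """Yield blocks of genotype rows, dropping rows whose mask is True.

    Hypothetical sketch: `genotypes` is any array-like sliceable along
    the variants axis (axis 0); `variant_mask` has one boolean per variant.
    """
    variant_mask = np.asarray(variant_mask, dtype=bool)
    num_variants = genotypes.shape[0]
    if variant_mask.shape != (num_variants,):
        raise ValueError("variant_mask must have one entry per variant")
    for start in range(0, num_variants, chunk_size):
        stop = start + chunk_size
        keep = ~variant_mask[start:stop]
        if not keep.any():
            # Every row in this chunk is masked, so nothing needs loading.
            continue
        # Load the chunk once, then drop the masked rows in memory.
        chunk = np.asarray(genotypes[start:stop])
        yield chunk[keep]
```

The underlying chunks are read as stored, so no copy of the dataset or rechunking is needed; chunks that are entirely masked are never loaded.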
I think we need to support masks, because we want to support incremental exploration of the data with extra levels of filtering, without needing to store copies.
After some discussion, it seems best to add a `variant_mask` array rather than use `call_genotype_mask`, and to use that array whenever it is present in the dataset rather than have the user specify it on each call to tsinfer.
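The "use it if present" behaviour might look something like the sketch below. The helper name `get_variant_mask` is hypothetical, the dataset is treated as a plain mapping from array names to arrays, and `True` is assumed to mean "masked out":

```python
import numpy as np


def get_variant_mask(dataset, num_variants, name="variant_mask"):
    """Return the dataset's variant mask, or an all-False mask if absent.

    Hypothetical helper: `dataset` is any mapping from array names to
    array-likes; True is assumed to mean the variant is excluded.
    """
    if name in dataset:
        mask = np.asarray(dataset[name], dtype=bool)
        if mask.shape != (num_variants,):
            raise ValueError(f"{name} must have shape ({num_variants},)")
        return mask
    # No mask stored: every variant is used.
    return np.zeros(num_variants, dtype=bool)
```

With this, a dataset without a mask behaves exactly as before, and adding the array changes which variants are used without any change to the calling code.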