-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query/vds] Add LEN field to VDS #14675
[query/vds] Add LEN field to VDS #14675
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a few small things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Chris!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I've left a few minor comments but happy to switch over to Approve if you prefer to leave things as they are.
.or_error( | ||
hl.str( | ||
'cannot create VDS from merged representation -' ' found END field with non-reference genotype at ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL python lets you concatenate strings by just putting them next to each other like this 🤯
After split_multi, LGT is dropped from the variant data of a VDS. After PR hail-is#14560, LGT is added to datasets after creation via the combiner. After hail-is#14675 the same is true for `from_merged_representation`. We should keep the GT/LGT field consistent across ref and var data. This change does so for split_multi. Resolves hail-is#14694
After split_multi, LGT is dropped from the variant data of a VDS. After PR hail-is#14560, LGT is added to datasets after creation via the combiner. After hail-is#14675 the same is true for `from_merged_representation`. We should keep the GT/LGT field consistent across ref and var data. This change does so for split_multi. Resolves hail-is#14694
After split_multi, LGT is dropped from the variant data of a VDS. After PR #14560, LGT is added to datasets after creation via the combiner. After #14675 the same is true for `from_merged_representation`. We should keep the GT/LGT field consistent across ref and var data. This change does so for split_multi. Resolves #14694
9bce5c9
to
d7d31cb
Compare
We can't do this anymore since genotype may be something other than diploid. Missed this in the original VDS ploidy changes
This is the beginning of a series of changes to support export of VDS to VCF 4.5, the version of VCF that contains the standardized form of our work that culminated in SVCR/VDS. Reference blocks were standardized with a LEN rather than an END. So, now, by default, add LEN to all VDS reads and drop END in favor of LEN on all VDS writes. Our optimizer will be able to take care of pruning away the dead field. We make sure that all VDS creation (other than the combiner), such as read_vds and from_merged_representation, contains both LEN and END preserving user code that depends on the presence of the END field.
Co-authored-by: Patrick Schultz <pschultz@broadinstitute.org>
473e90f
to
f39364c
Compare
dismissing this since I'd like another look after working around one of the core issues discoverd here
a8961c7
to
56b6c23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks Chris!
This is the beginning of a series of changes to support export of VDS to VCF 4.5, the version of VCF that contains the standardized form of our work that culminated in SVCR/VDS.
Reference blocks were standardized with a LEN rather than an END. So, now, by default, add LEN to all VDS reads and drop END in favor of LEN on all VDS writes. Our optimizer will be able to take care of pruning away the dead field in pipelines that don't use it.
We make sure that all VDS creation (other than the combiner), such as read_vds and from_merged_representation, contains both LEN and END preserving user code that depends on the presence of the END field.
Furthermore, this change contains necessary combiner updates to prefer LEN over END, and to use LEN in the combiner itself.