[query/vds] Add LEN field to VDS #14675

chrisvittal · 2024-09-09T16:21:25Z

This is the beginning of a series of changes to support export of VDS to VCF 4.5, the version of VCF that contains the standardized form of our work that culminated in SVCR/VDS.

Reference blocks were standardized with a LEN field rather than an END field. So now, by default, we add LEN on all VDS reads and drop END in favor of LEN on all VDS writes. Our optimizer will be able to prune away the dead field in pipelines that don't use it.

We make sure that all VDS creation paths other than the combiner, such as read_vds and from_merged_representation, produce both LEN and END, preserving user code that depends on the presence of the END field.
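To make the read/write behavior described above concrete, here is a minimal sketch in plain Python. It is not the Hail implementation: plain dicts stand in for reference-block entries, the helper names `on_read` and `on_write` are hypothetical, and it assumes END is the inclusive last position of the block, so that LEN = END - POS + 1.

```python
# Illustrative only: dicts stand in for reference-block entries.
# Assumption: END is the inclusive last position covered by the block.


def end_to_len(pos: int, end: int) -> int:
    """Length of a reference block starting at pos with inclusive end."""
    return end - pos + 1


def len_to_end(pos: int, length: int) -> int:
    """Inclusive end position of a block starting at pos with the given length."""
    return pos + length - 1


def on_read(pos: int, entry: dict) -> dict:
    """Hypothetical helper: return a copy of entry carrying both LEN and END."""
    out = dict(entry)
    if "LEN" not in out and "END" in out:
        out["LEN"] = end_to_len(pos, out["END"])
    if "END" not in out and "LEN" in out:
        out["END"] = len_to_end(pos, out["LEN"])
    return out


def on_write(pos: int, entry: dict) -> dict:
    """Hypothetical helper: keep LEN and drop END before writing."""
    out = on_read(pos, entry)
    out.pop("END", None)
    return out


legacy_entry = {"END": 149}           # an entry written before this change
print(on_read(100, legacy_entry))     # {'END': 149, 'LEN': 50}
print(on_write(100, legacy_entry))    # {'LEN': 50}
```

User code that reads END keeps working after `on_read`, while anything written through `on_write` carries only LEN.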

Furthermore, this change contains necessary combiner updates to prefer LEN over END, and to use LEN in the combiner itself.
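As a rough illustration of "prefer LEN over END", the sketch below picks LEN when an entry has it and falls back to END otherwise. It is not the combiner's code: the function names are hypothetical, dicts stand in for entries, and END is assumed to be inclusive.

```python
from typing import Iterable, Optional, Tuple


def ref_block_length(pos: int, entry: dict) -> Optional[int]:
    """Prefer LEN; fall back to an inclusive END; None if neither is present."""
    if entry.get("LEN") is not None:
        return entry["LEN"]
    if entry.get("END") is not None:
        return entry["END"] - pos + 1
    return None


def max_ref_block_length(blocks: Iterable[Tuple[int, dict]]) -> int:
    """Largest block length over (position, entry) pairs, ignoring non-blocks."""
    lengths = (ref_block_length(pos, entry) for pos, entry in blocks)
    return max((n for n in lengths if n is not None), default=0)


blocks = [
    (10, {"LEN": 5}),     # new-style entry
    (20, {"END": 39}),    # old-style entry, length 20
    (50, {}),             # not a reference block
]
print(max_ref_block_length(blocks))  # 20
```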

patrick-schultz

Looking good, just a few small things

hail/python/hail/vds/combiner/variant_dataset_combiner.py

hail/python/hail/vds/variant_dataset.py

patrick-schultz

Thanks Chris!

cjllanwarne

Looks good. I've left a few minor comments but happy to switch over to Approve if you prefer to leave things as they are.

hail/python/hail/vds/variant_dataset.py

cjllanwarne · 2024-09-17T16:40:38Z

hail/python/hail/vds/variant_dataset.py

            .or_error(
                hl.str(
-                    'cannot create VDS from merged representation -' ' found END field with non-reference genotype at '


TIL Python lets you concatenate strings by just putting them next to each other like this 🤯
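For readers who have not seen this before, a small standalone example of adjacent string literal concatenation, using the visible part of the message from the diff above. The `fields` list is made up to show the usual pitfall.

```python
# Adjacent string literals are joined into a single string by the compiler.
implicit = 'cannot create VDS from merged representation -' ' found END field'
explicit = 'cannot create VDS from merged representation -' + ' found END field'
assert implicit == explicit

# Pitfall: a missing comma between list items silently merges them.
fields = ['LEN' 'END', 'GT']
print(fields)  # ['LENEND', 'GT']
```

This only works for literals; joining two variables still needs `+` or an f-string.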

After split_multi, LGT is dropped from the variant data of a VDS. After PR #14560, LGT is added to datasets after creation via the combiner. After #14675 the same is true for `from_merged_representation`. We should keep the GT/LGT field consistent across ref and var data. This change does so for split_multi. Resolves #14694

dismissing this since I'd like another look after working around one of the core issues discovered here

patrick-schultz

Looks great, thanks Chris!

chrisvittal mentioned this pull request Sep 9, 2024

[query] Support VCF v4.5 #14655

Open

7 tasks

chrisvittal marked this pull request as ready for review September 9, 2024 20:44

chrisvittal requested a review from patrick-schultz September 9, 2024 21:52

patrick-schultz requested changes Sep 11, 2024

hail/python/hail/vds/combiner/variant_dataset_combiner.py (outdated, resolved)

hail/python/hail/vds/variant_dataset.py (outdated, resolved)

hail/python/hail/vds/variant_dataset.py (outdated, resolved)

chrisvittal requested a review from patrick-schultz September 11, 2024 17:23

patrick-schultz previously approved these changes Sep 11, 2024

chrisvittal requested a review from cjllanwarne September 12, 2024 13:40

cjllanwarne requested changes Sep 17, 2024

chrisvittal requested a review from cjllanwarne September 17, 2024 16:59

cjllanwarne approved these changes Sep 17, 2024

chrisvittal mentioned this pull request Sep 18, 2024

[vds] Unify GT/LGT after split_multi for reference data #14695

Merged

chrisvittal force-pushed the vds/len-end-translation branch from 9bce5c9 to d7d31cb Compare September 20, 2024 17:14

[query/vds] Fix removal of ref GT in from_merged_representation

57406e6

We can't do this anymore since genotype may be something other than diploid. Missed this in the original VDS ploidy changes
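One possible reading of this commit message, sketched in plain Python: if GT is removed from reference entries on the assumption that it can always be restored as a diploid 0/0 call, a haploid reference call would come back with the wrong ploidy. This is an interpretation, not the actual Hail code; calls are modelled as tuples of allele indices and both helper names are hypothetical.

```python
# Assumption: a call is a tuple of allele indices, e.g. (0, 0) for diploid
# hom-ref and (0,) for a haploid reference call. Not Hail's call type.


def is_hom_ref(call: tuple) -> bool:
    """True when every allele is the reference allele, at any ploidy."""
    return len(call) > 0 and all(allele == 0 for allele in call)


def restore_assuming_diploid(call):
    """Hypothetical lossy step: a dropped GT (None) is restored as 0/0."""
    return (0, 0) if call is None else call


haploid_ref = (0,)
diploid_ref = (0, 0)

# Both are reference calls, so a "drop GT when it is reference" rule hits both.
assert is_hom_ref(haploid_ref) and is_hom_ref(diploid_ref)

# After dropping, the haploid call cannot be recovered.
restored = restore_assuming_diploid(None)
print(restored == diploid_ref)  # True
print(restored == haploid_ref)  # False
```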

chrisvittal mentioned this pull request Sep 23, 2024

[query/vds] Fix removal of ref GT in from_merged_representation #14699

Closed

chrisvittal and others added 9 commits September 23, 2024 14:20

Update combiner

76f981f

More fixes in the combiner

338ebf4

use LEN for store_ref_block_max_length

8f35dba

Apply suggestion

6af5cce

Co-authored-by: Patrick Schultz <pschultz@broadinstitute.org>

rename ref_block_field to ref_block_indicator_field

3a94f56

No need to check for END if it is dropped

3719d9a

Updates after review

2e3bb73

missed some stuff, very minor

f39364c

chrisvittal force-pushed the vds/len-end-translation branch from 473e90f to f39364c Compare September 23, 2024 18:21

chrisvittal requested a review from patrick-schultz September 30, 2024 16:26

chrisvittal mentioned this pull request Sep 30, 2024

It's #14675 but different #14703

Closed

Don't add, then immediately drop LEN

56b6c23

chrisvittal force-pushed the vds/len-end-translation branch from a8961c7 to 56b6c23 Compare September 30, 2024 16:40

chrisvittal mentioned this pull request Sep 30, 2024

[query] Memory error in VDS combiner after adding, then immidiately dropping a field. #14705

Open

patrick-schultz approved these changes Oct 1, 2024

hail-ci-robot merged commit 97e7833 into hail-is:main Oct 5, 2024
2 of 3 checks passed
