-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] resolve VCF array field ambiguity #8
Conversation
45094ff
to
1f3e5a7
Compare
1f3e5a7
to
75305bf
Compare
rfc/0008-template.rst
Outdated
2. Add ``disambiguate_single_dot: Dict[str, Callable[[Expression], Expression]]`` which, for each | ||
INFO field f, accepts a row at which field f was "." and returns a value for field f. The value | ||
of other fields with a "." in this row is unspecified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you flesh this out a bit? I didn't understand on my first reading. Offline, we also discussed specifying the argument to the Callable to be a struct containing the entire current row, but with all fields in disambiguate_single_dot
missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@patrick-schultz addressed
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details. See hail-is/hail-rfcs#8 for details on the problem and the solution.
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details. See hail-is/hail-rfcs#8 for details on the problem and the solution.
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details. See hail-is/hail-rfcs#8 for details on the problem and the solution. I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it three times per commit and report each individual time as well as the average. ``` In [1]: import hail as hl In [2]: %%time ...: mt = hl.import_vcf( ...: '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz' ...: ) ...: mt._force_count_rows() ``` | commit | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) | | `cd06c248e4` (this PR) | 116s | 80s | 77s | 91+-18 | 78.5 +- 1.5 | | `f00f916faf` (`0.2.120`) | 112s | 80s | 79s | 90+-15 | 79.5 +- 0.5 | This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very minor amount of new work.
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details. See hail-is/hail-rfcs#8 for details on the problem and the solution. I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it three times per commit and report each individual time as well as the average. ``` In [1]: import hail as hl In [2]: %%time ...: mt = hl.import_vcf( ...: '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz' ...: ) ...: mt._force_count_rows() ``` | commit | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) | |--------------------------|-----------|-----------|-----------|-------------|------------------| | `cd06c248e4` (this PR) | 116s | 80s | 77s | 91+-18 | 78.5 +- 1.5 | | `f00f916faf` (`0.2.120`) | 112s | 80s | 79s | 90+-15 | 79.5 +- 0.5 | This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very minor amount of new work.
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details. See hail-is/hail-rfcs#8 for details on the problem and the solution. I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it three times per commit and report each individual time as well as the average. ``` In [1]: import hail as hl In [2]: %%time ...: mt = hl.import_vcf( ...: '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz' ...: ) ...: mt._force_count_rows() ``` | commit | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) | |--------------------------|-----------|-----------|-----------|-------------|------------------| | `cd06c248e4` (this PR) | 116s | 80s | 77s | 91+-18 | 78.5 +- 1.5 | | `f00f916faf` (`0.2.120`) | 112s | 80s | 79s | 90+-15 | 79.5 +- 0.5 | This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very minor amount of new work.
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details. See hail-is/hail-rfcs#8 for details on the problem and the solution. I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it three times per commit and report each individual time as well as the average. ``` In [1]: import hail as hl In [2]: %%time ...: mt = hl.import_vcf( ...: '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz' ...: ) ...: mt._force_count_rows() ``` | commit | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) | |--------------------------|-----------|-----------|-----------|-------------|------------------| | `cd06c248e4` (this PR) | 116s | 80s | 77s | 91+-18 | 78.5 +- 1.5 | | `f00f916faf` (`0.2.120`) | 112s | 80s | 79s | 90+-15 | 79.5 +- 0.5 | This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very minor amount of new work.
No description provided.