Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query][rfc-0008] permit more INFO and FORMAT array-fields with missing elements #13465

Closed
wants to merge 1 commit into from

Conversation

danking
Copy link
Contributor

@danking danking commented Aug 21, 2023

CHANGELOG: Fixed #13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new hl.import_vcf parameter to resolve the ambiguity. See the hl.import_vcf docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the array_elements_required=True fast path by evaluating the following code against this PR's tip commit cd06c248e4 and 0.2.120 (f00f916faf). I ran it three times per commit and report each individual time as well as the average.

In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
commit run 1 (s) run 2 (s) run 3 (s) average (s) warm average (s)
cd06c248e4 (this PR) 116s 80s 77s 91+-18 78.5 +- 1.5
f00f916faf (0.2.120) 112s 80s 79s 90+-15 79.5 +- 0.5

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very minor amount of new work.


Note that I had to specifically override the Number setting for certain FORMAT and INFO fields because they were set to . in the benchmarked VCF. If this error appears in a 1KG VCF, it must be fairly common.

CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
|--------------------------|-----------|-----------|-----------|-------------|------------------|
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
@danking danking changed the title [query][rfc-0008] permit more array INFO fields with missing elements [query][rfc-0008] permit more INFO and FORMAT array-fields with missing elements Aug 21, 2023
@danking
Copy link
Contributor Author

danking commented Aug 21, 2023

Need more thought.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[query] bad error message when user needs to use array_elements_required=False
2 participants