Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] resolve VCF array field ambiguity #8

Merged
merged 3 commits into from
Aug 21, 2023
Merged

Conversation

danking
Copy link
Contributor

@danking danking commented Aug 2, 2023

No description provided.

@danking danking force-pushed the vcf-array-field-ambiguity branch 2 times, most recently from 45094ff to 1f3e5a7 Compare August 17, 2023 23:22
@danking danking force-pushed the vcf-array-field-ambiguity branch from 1f3e5a7 to 75305bf Compare August 17, 2023 23:26
Comment on lines 90 to 92
2. Add ``disambiguate_single_dot: Dict[str, Callable[[Expression], Expression]]`` which, for each
INFO field f, accepts a row at which field f was "." and returns a value for field f. The value
of other fields with a "." in this row is unspecified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you flesh this out a bit? I didn't understand on my first reading. Offline, we also discussed specifying the argument to the Callable to be a struct containing the entire current row, but with all fields in disambiguate_single_dot missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patrick-schultz addressed

danking pushed a commit to danking/hail that referenced this pull request Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.
danking pushed a commit to danking/hail that referenced this pull request Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.
danking pushed a commit to danking/hail that referenced this pull request Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
danking pushed a commit to danking/hail that referenced this pull request Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
|--------------------------|-----------|-----------|-----------|-------------|------------------|
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
danking pushed a commit to danking/hail that referenced this pull request Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
|--------------------------|-----------|-----------|-----------|-------------|------------------|
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
danking pushed a commit to danking/hail that referenced this pull request Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
|--------------------------|-----------|-----------|-----------|-------------|------------------|
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
@danking danking merged commit a23806f into main Aug 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants