Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41035][SQL] Don't patch foldable children of aggregate functions in RewriteDistinctAggregates #38565

Closed

Conversation

bersprockets
Copy link
Contributor

@bersprockets bersprockets commented Nov 8, 2022

What changes were proposed in this pull request?

RewriteDistinctAggregates doesn't typically patch foldable children of the distinct aggregation functions except in one odd case (and seemingly by accident). This PR extends the policy of not patching foldables to that odd case.

Why are the changes needed?

This query produces incorrect results:

select a, count(distinct 100) as cnt1, count(distinct b, 100) as cnt2
from values (1, 2), (4, 5) as data(a, b)
group by a;

+---+----+----+
|a  |cnt1|cnt2|
+---+----+----+
|1  |1   |0   |
|4  |1   |0   |
+---+----+----+

The values for cnt2 should be 1 and 1 (not 0 and 0).

If you change the literal used in the first aggregate function, the second aggregate function now works correctly:

select a, count(distinct 101) as cnt1, count(distinct b, 100) as cnt2
from values (1, 2), (4, 5) as data(a, b)
group by a;

+---+----+----+
|a  |cnt1|cnt2|
+---+----+----+
|1  |1   |1   |
|4  |1   |1   |
+---+----+----+

The bug is in the rule RewriteDistinctAggregates. When a distinct aggregation has only foldable children, RewriteDistinctAggregates uses the first child as the grouping key (grouping key in this context means the function children of distinct aggregate functions: RewriteDistinctAggregates groups distinct aggregations by function children to determine the Expand projections it needs to create). Therefore, the first foldable child gets included in the Expand projection associated with the aggregation, with a corresponding output attribute that is also included in the map for patching aggregate functions in the final aggregation.

The Expand projections for all other distinct aggregate groups will have null in the slot associated with that output attribute. If the same foldable expression is used in a distinct aggregation associated with a different group, RewriteDistinctAggregates will improperly patch the associated aggregate function to use the previous aggregation's output attribute. Since the output attribute is associated with a different group, the value of that slot in the Expand projection will always be null.

In the example above, count(distinct 100) as cnt1 is the aggregation with only foldable children, and count(distinct b, 100) as cnt2 is the aggregation that gets inappropriately patched with the wrong group's output attribute. As a result count(distinct b, 100) as cnt2 (from the first example above) essentially becomes count(distinct b, null) as cnt2, which is always zero.

RewriteDistinctAggregates doesn't typically patch foldable children of the distinct aggregation functions in the final aggregation. It potentially patches foldable expressions only when there is a distinct aggregation with only foldable children, and even then it doesn't patch the aggregation that has only foldable children, but instead some other unlucky aggregate function that happened to use the same foldable expression.

This PR skips patching any foldable expressions in the aggregate functions to avoid patching an aggregation with a different group's output attribute.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test.

@github-actions github-actions bot added the SQL label Nov 8, 2022
@HyukjinKwon
Copy link
Member

Merged to master, branch-3.3, and branch-3.2.

@bersprockets
Copy link
Contributor Author

@HyukjinKwon Thanks!

@bersprockets bersprockets deleted the distinct_literal_issue branch November 11, 2022 02:40
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…ns in `RewriteDistinctAggregates`

### What changes were proposed in this pull request?

`RewriteDistinctAggregates` doesn't typically patch foldable children of the distinct aggregation functions except in one odd case (and seemingly by accident). This PR extends the policy of not patching foldables to that odd case.

### Why are the changes needed?

This query produces incorrect results:
```
select a, count(distinct 100) as cnt1, count(distinct b, 100) as cnt2
from values (1, 2), (4, 5) as data(a, b)
group by a;

+---+----+----+
|a  |cnt1|cnt2|
+---+----+----+
|1  |1   |0   |
|4  |1   |0   |
+---+----+----+
```
The values for `cnt2` should be 1 and 1 (not 0 and 0).

If you change the literal used in the first aggregate function, the second aggregate function now works correctly:
```
select a, count(distinct 101) as cnt1, count(distinct b, 100) as cnt2
from values (1, 2), (4, 5) as data(a, b)
group by a;

+---+----+----+
|a  |cnt1|cnt2|
+---+----+----+
|1  |1   |1   |
|4  |1   |1   |
+---+----+----+
```
The bug is in the rule `RewriteDistinctAggregates`. When a distinct aggregation has only foldable children, `RewriteDistinctAggregates` uses the first child as the grouping key (_grouping key_ in this context means the function children of distinct aggregate functions: `RewriteDistinctAggregates` groups distinct aggregations by function children to determine the `Expand` projections  it needs to create). Therefore, the first foldable child gets included in the `Expand` projection associated with the aggregation, with a corresponding output attribute that is also included in the map for patching aggregate functions in the final aggregation.

The `Expand` projections for all other distinct aggregate groups will have `null` in the slot associated with that output attribute. If the same foldable expression is used in a distinct aggregation associated with a different group, `RewriteDistinctAggregates` will improperly patch the associated aggregate function to use the previous aggregation's output attribute. Since the output attribute is associated with a different group, the value of that slot in the `Expand` projection will always be `null`.

In the example above, `count(distinct 100) as cnt1` is the aggregation with only foldable children, and `count(distinct b, 100) as cnt2` is the aggregation that gets inappropriately patched with the wrong group's output attribute. As a result `count(distinct b, 100) as cnt2` (from the first example above) essentially becomes `count(distinct b, null) as cnt2`, which is always zero.

`RewriteDistinctAggregates` doesn't typically patch foldable children of the distinct aggregation functions in the final aggregation. It potentially patches foldable expressions only when there is a distinct aggregation with only foldable children, and even then it doesn't patch the aggregation that has only foldable children, but instead some other unlucky aggregate function that happened to use the same foldable expression.

This PR skips patching any foldable expressions in the aggregate functions to avoid patching an aggregation with a different group's output attribute.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

Closes apache#38565 from bersprockets/distinct_literal_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…ns in `RewriteDistinctAggregates`

`RewriteDistinctAggregates` doesn't typically patch foldable children of the distinct aggregation functions except in one odd case (and seemingly by accident). This PR extends the policy of not patching foldables to that odd case.

This query produces incorrect results:
```
select a, count(distinct 100) as cnt1, count(distinct b, 100) as cnt2
from values (1, 2), (4, 5) as data(a, b)
group by a;

+---+----+----+
|a  |cnt1|cnt2|
+---+----+----+
|1  |1   |0   |
|4  |1   |0   |
+---+----+----+
```
The values for `cnt2` should be 1 and 1 (not 0 and 0).

If you change the literal used in the first aggregate function, the second aggregate function now works correctly:
```
select a, count(distinct 101) as cnt1, count(distinct b, 100) as cnt2
from values (1, 2), (4, 5) as data(a, b)
group by a;

+---+----+----+
|a  |cnt1|cnt2|
+---+----+----+
|1  |1   |1   |
|4  |1   |1   |
+---+----+----+
```
The bug is in the rule `RewriteDistinctAggregates`. When a distinct aggregation has only foldable children, `RewriteDistinctAggregates` uses the first child as the grouping key (_grouping key_ in this context means the function children of distinct aggregate functions: `RewriteDistinctAggregates` groups distinct aggregations by function children to determine the `Expand` projections  it needs to create). Therefore, the first foldable child gets included in the `Expand` projection associated with the aggregation, with a corresponding output attribute that is also included in the map for patching aggregate functions in the final aggregation.

The `Expand` projections for all other distinct aggregate groups will have `null` in the slot associated with that output attribute. If the same foldable expression is used in a distinct aggregation associated with a different group, `RewriteDistinctAggregates` will improperly patch the associated aggregate function to use the previous aggregation's output attribute. Since the output attribute is associated with a different group, the value of that slot in the `Expand` projection will always be `null`.

In the example above, `count(distinct 100) as cnt1` is the aggregation with only foldable children, and `count(distinct b, 100) as cnt2` is the aggregation that gets inappropriately patched with the wrong group's output attribute. As a result `count(distinct b, 100) as cnt2` (from the first example above) essentially becomes `count(distinct b, null) as cnt2`, which is always zero.

`RewriteDistinctAggregates` doesn't typically patch foldable children of the distinct aggregation functions in the final aggregation. It potentially patches foldable expressions only when there is a distinct aggregation with only foldable children, and even then it doesn't patch the aggregation that has only foldable children, but instead some other unlucky aggregate function that happened to use the same foldable expression.

This PR skips patching any foldable expressions in the aggregate functions to avoid patching an aggregation with a different group's output attribute.

No.

New unit test.

Closes apache#38565 from bersprockets/distinct_literal_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 0add57a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants