Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() #40473

Closed
wants to merge 3 commits into from

Conversation

rednaxelafx
Copy link
Contributor

What changes were proposed in this pull request?

In EquivalentExpressions.addExpr(), add a guard supportedExpression() to make it consistent with addExprTree() and getExprState().

Why are the changes needed?

This fixes a regression caused by #39010 which added the supportedExpression() to addExprTree() and getExprState() but not addExpr().

One example of a use case affected by the inconsistency is the PhysicalAggregation pattern in physical planning. There, it calls addExpr() to deduplicate the aggregate expressions, and then calls getExprState() to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error).

Does this PR introduce any user-facing change?

This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example running the SQL command:

select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)

example error message before the fix:

java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]

after the fix this error is gone.

How was this patch tested?

Added new test cases to SubexpressionEliminationSuite for the immediate issue, and to DataFrameAggregateSuite for an example of user-visible symptom.

@github-actions github-actions bot added the SQL label Mar 18, 2023
@rednaxelafx
Copy link
Contributor Author

cc @peter-toth @cloud-fan
Also cc @xinrong-meng for this being a potential Spark 3.4.0 release blocker.

@peter-toth
Copy link
Contributor

peter-toth commented Mar 18, 2023

Thanks @rednaxelafx for the fix and pinging me.
I think you are right that EquivalentExpressions.addExpr() should be guarded by supportedExpression() if we guard getExprState(). But, I'm not sure it is right that we don't deduplicate the max(transform(array(id), x -> x)) in your example query.
Probably the real issue here is that in PhysicalAggregation the class EquivalentExpressions is used for simply deduplicating whole expressions while on executors we use it for common subexpression elimination. In the former case we don't need the LambdaVariable guard but in the latter one we need it. So maybe we should add a argument to EquivalentExpressions to enable/disable the guards and in PhysicalAggregation we should disable it?

@Kimahriman
Copy link
Contributor

Kimahriman commented Mar 18, 2023

Just seeing this group of PRs (most notably #39046), was there a real reason NamedLambdaVariable was added into all this mix? If I understand right, it effectively eliminates all subexpression elimination involving any expressions containing higher-order functions at any nested level, even though it's perfectly valid to pull out a complete high-order function, you just can't pull out the LambdaFunction by itself.

Currently the check for CodegenFallback is what prevents the LambdaFunction's from being considered for subexpression elimination. Quick plug for my 1.5 year old PR for adding codegen to HOFs #34558 simply adds HigherOrderFunction as a special case to only consider the arguments and not the functions themselves for subexpression elimination

@peter-toth
Copy link
Contributor

peter-toth commented Mar 18, 2023

Hm, I think you are right @Kimahriman, LambdaVariable and NamedLambdaVariable are very different and NamedLambdaVariable seem to be used only in LambdaFunctions, so #39046 doesn't make sense and actually it can prevent pulling out higher order functions and so cause performance regression... I think that PR should be reverted.
Update: I've filed a revert PR here: #40475

But I feel that is orthogonal to the issue that we use EquivalentExpressions for different purposes in PhysicalAggregation (the only place where we use .addExpr()) and in executors (.addExprTree() for subexpression elimination).

HyukjinKwon pushed a commit that referenced this pull request Mar 20, 2023
…uivalentExpressions

### What changes were proposed in this pull request?
This PR reverts the follow-up PR of SPARK-41468: #39046

### Why are the changes needed?
These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: #40473 (comment)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Mar 20, 2023
…uivalentExpressions

### What changes were proposed in this pull request?
This PR reverts the follow-up PR of SPARK-41468: #39046

### Why are the changes needed?
These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: #40473 (comment)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ce3b03d)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ArrayTransform(arr, lambda)
}
val equivalence = new EquivalentExpressions
val isNewExpr = equivalence.addExpr(tx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.addExpr()'s boolean result is counter-intuitive to other collection's .add() methods so IMO isNewExpr should be !equivalence.addExpr(tx) here. This is the reasons why I think getting rid of .addExpr() is probably the most straightforward fix here: #40488

@rednaxelafx
Copy link
Contributor Author

@peter-toth could you please clarify why supportedExpression() was needed in getExprState() in the first place? i.e. why isn't it sufficient to add it to addExprTree()?

@rednaxelafx
Copy link
Contributor Author

@Kimahriman I'd love to see a good CSE implementation for higher-order functions too. But for backporting the fix (which is this PR's primary intent) that would have been too much. For this one (or the one @peter-toth forked off) we're just aiming for a narrow fix that allows the aggregate to work again.

@Kimahriman
Copy link
Contributor

@Kimahriman I'd love to see a good CSE implementation for higher-order functions too. But for backporting the fix (which is this PR's primary intent) that would have been too much. For this one (or the one @peter-toth forked off) we're just aiming for a narrow fix that allows the aggregate to work again.

Yeah I was just commenting on the related PR that broke CSE for anything using a HOF. I had plans for trying to do CSE inside a HOF but that stalled when I didn't get any traction on the initial adding codegen support

@cloud-fan
Copy link
Contributor

The check was added to getExprState in #39010, which is to avoid canonicalizing a subquery expression and leading to NPE.

I agree that we should be consistent and this PR LGTM. Can we update the test case to use LambdaVariable as NamedLambdaVariable has been removed in #40475 ?

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in ef0a76e Mar 21, 2023
cloud-fan pushed a commit that referenced this pull request Mar 21, 2023
…edExpression()

### What changes were proposed in this pull request?

In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`.

### Why are the changes needed?

This fixes a regression caused by #39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`.

One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error).

### Does this PR introduce _any_ user-facing change?

This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example running the SQL command:
```sql
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)
```
example error message before the fix:
```
java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]
```
after the fix this error is gone.

### How was this patch tested?

Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom.

Closes #40473 from rednaxelafx/spark-42851.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ef0a76e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…uivalentExpressions

### What changes were proposed in this pull request?
This PR reverts the follow-up PR of SPARK-41468: apache#39046

### Why are the changes needed?
These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: apache#40473 (comment)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes apache#40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ce3b03d)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…edExpression()

### What changes were proposed in this pull request?

In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`.

### Why are the changes needed?

This fixes a regression caused by apache#39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`.

One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error).

### Does this PR introduce _any_ user-facing change?

This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example running the SQL command:
```sql
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)
```
example error message before the fix:
```
java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))apache#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))apache#3]
```
after the fix this error is gone.

### How was this patch tested?

Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom.

Closes apache#40473 from rednaxelafx/spark-42851.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ef0a76e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
peter-toth added a commit to peter-toth/spark that referenced this pull request Nov 26, 2024
…uivalentExpressions

This PR reverts the follow-up PR of SPARK-41468: apache#39046

These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: apache#40473 (comment)

No.

Existing UTs.

Closes apache#40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes.

Change-Id: Ia5ce83848956254664d9c51a2f0079bb968f5433
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants