[SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() #40473

rednaxelafx · 2023-03-18T01:16:16Z

What changes were proposed in this pull request?

In EquivalentExpressions.addExpr(), add a guard supportedExpression() to make it consistent with addExprTree() and getExprState().

Why are the changes needed?

This fixes a regression caused by #39010 which added the supportedExpression() to addExprTree() and getExprState() but not addExpr().

One example of a use case affected by the inconsistency is the PhysicalAggregation pattern in physical planning. There, it calls addExpr() to deduplicate the aggregate expressions, and then calls getExprState() to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error).

Does this PR introduce any user-facing change?

This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example running the SQL command:

select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)

example error message before the fix:

java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]

after the fix this error is gone.

How was this patch tested?

Added new test cases to SubexpressionEliminationSuite for the immediate issue, and to DataFrameAggregateSuite for an example of user-visible symptom.

…ession()

rednaxelafx · 2023-03-18T01:25:34Z

cc @peter-toth @cloud-fan
Also cc @xinrong-meng for this being a potential Spark 3.4.0 release blocker.

peter-toth · 2023-03-18T08:38:47Z

Thanks @rednaxelafx for the fix and pinging me.
I think you are right that EquivalentExpressions.addExpr() should be guarded by supportedExpression() if we guard getExprState(). But, I'm not sure it is right that we don't deduplicate the max(transform(array(id), x -> x)) in your example query.
Probably the real issue here is that in PhysicalAggregation the class EquivalentExpressions is used for simply deduplicating whole expressions while on executors we use it for common subexpression elimination. In the former case we don't need the LambdaVariable guard but in the latter one we need it. So maybe we should add a argument to EquivalentExpressions to enable/disable the guards and in PhysicalAggregation we should disable it?

Kimahriman · 2023-03-18T13:11:32Z

Just seeing this group of PRs (most notably #39046), was there a real reason NamedLambdaVariable was added into all this mix? If I understand right, it effectively eliminates all subexpression elimination involving any expressions containing higher-order functions at any nested level, even though it's perfectly valid to pull out a complete high-order function, you just can't pull out the LambdaFunction by itself.

Currently the check for CodegenFallback is what prevents the LambdaFunction's from being considered for subexpression elimination. Quick plug for my 1.5 year old PR for adding codegen to HOFs #34558 simply adds HigherOrderFunction as a special case to only consider the arguments and not the functions themselves for subexpression elimination

peter-toth · 2023-03-18T15:36:06Z

Hm, I think you are right @Kimahriman, LambdaVariable and NamedLambdaVariable are very different and NamedLambdaVariable seem to be used only in LambdaFunctions, so #39046 doesn't make sense and actually it can prevent pulling out higher order functions and so cause performance regression... I think that PR should be reverted.
Update: I've filed a revert PR here: #40475

But I feel that is orthogonal to the issue that we use EquivalentExpressions for different purposes in PhysicalAggregation (the only place where we use .addExpr()) and in executors (.addExprTree() for subexpression elimination).

…uivalentExpressions ### What changes were proposed in this pull request? This PR reverts the follow-up PR of SPARK-41468: #39046 ### Why are the changes needed? These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: #40473 (comment) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…uivalentExpressions ### What changes were proposed in this pull request? This PR reverts the follow-up PR of SPARK-41468: #39046 ### Why are the changes needed? These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: #40473 (comment) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ce3b03d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

peter-toth · 2023-03-20T12:52:53Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

+      ArrayTransform(arr, lambda)
+    }
+    val equivalence = new EquivalentExpressions
+    val isNewExpr = equivalence.addExpr(tx)


.addExpr()'s boolean result is counter-intuitive to other collection's .add() methods so IMO isNewExpr should be !equivalence.addExpr(tx) here. This is the reasons why I think getting rid of .addExpr() is probably the most straightforward fix here: #40488

rednaxelafx · 2023-03-21T00:39:37Z

@peter-toth could you please clarify why supportedExpression() was needed in getExprState() in the first place? i.e. why isn't it sufficient to add it to addExprTree()?

rednaxelafx · 2023-03-21T00:47:38Z

@Kimahriman I'd love to see a good CSE implementation for higher-order functions too. But for backporting the fix (which is this PR's primary intent) that would have been too much. For this one (or the one @peter-toth forked off) we're just aiming for a narrow fix that allows the aggregate to work again.

Kimahriman · 2023-03-21T00:55:19Z

@Kimahriman I'd love to see a good CSE implementation for higher-order functions too. But for backporting the fix (which is this PR's primary intent) that would have been too much. For this one (or the one @peter-toth forked off) we're just aiming for a narrow fix that allows the aggregate to work again.

Yeah I was just commenting on the related PR that broke CSE for anything using a HOF. I had plans for trying to do CSE inside a HOF but that stalled when I didn't get any traction on the initial adding codegen support

cloud-fan · 2023-03-21T01:35:42Z

The check was added to getExprState in #39010, which is to avoid canonicalizing a subquery expression and leading to NPE.

I agree that we should be consistent and this PR LGTM. Can we update the test case to use LambdaVariable as NamedLambdaVariable has been removed in #40475 ?

…amedLambdaVariable

cloud-fan · 2023-03-21T13:27:47Z

thanks, merging to master/3.4!

…edExpression() ### What changes were proposed in this pull request? In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`. ### Why are the changes needed? This fixes a regression caused by #39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`. One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error). ### Does this PR introduce _any_ user-facing change? This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions. Example running the SQL command: ```sql select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2) ``` example error message before the fix: ``` java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3] ``` after the fix this error is gone. ### How was this patch tested? Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom. Closes #40473 from rednaxelafx/spark-42851. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ef0a76e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…uivalentExpressions ### What changes were proposed in this pull request? This PR reverts the follow-up PR of SPARK-41468: apache#39046 ### Why are the changes needed? These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: apache#40473 (comment) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes apache#40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ce3b03d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…edExpression() ### What changes were proposed in this pull request? In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`. ### Why are the changes needed? This fixes a regression caused by apache#39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`. One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error). ### Does this PR introduce _any_ user-facing change? This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions. Example running the SQL command: ```sql select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2) ``` example error message before the fix: ``` java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))apache#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))apache#3] ``` after the fix this error is gone. ### How was this patch tested? Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom. Closes apache#40473 from rednaxelafx/spark-42851. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ef0a76e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…uivalentExpressions This PR reverts the follow-up PR of SPARK-41468: apache#39046 These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: apache#40473 (comment) No. Existing UTs. Closes apache#40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Change-Id: Ia5ce83848956254664d9c51a2f0079bb968f5433 Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

SPARK-42851: guard EquivalentExpressions.addExpr() with supportedExpr…

08309e7

…ession()

github-actions bot added the SQL label Mar 18, 2023

peter-toth mentioned this pull request Mar 18, 2023

[SPARK-42852][SQL] Revert NamedLambdaVariable related changes from EquivalentExpressions #40475

Closed

peter-toth mentioned this pull request Mar 20, 2023

[SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation #40488

Closed

peter-toth reviewed Mar 20, 2023

View reviewed changes

adjust the test case

28d101e

address comments, adjust test case to use LambdaVariable instead of N…

d9f8bbd

…amedLambdaVariable

cloud-fan approved these changes Mar 21, 2023

View reviewed changes

cloud-fan closed this in ef0a76e Mar 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() #40473

[SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() #40473

rednaxelafx commented Mar 18, 2023

rednaxelafx commented Mar 18, 2023

peter-toth commented Mar 18, 2023 •

edited

Loading

Kimahriman commented Mar 18, 2023 •

edited

Loading

peter-toth commented Mar 18, 2023 •

edited

Loading

peter-toth Mar 20, 2023

rednaxelafx commented Mar 21, 2023

rednaxelafx commented Mar 21, 2023

Kimahriman commented Mar 21, 2023

cloud-fan commented Mar 21, 2023

cloud-fan commented Mar 21, 2023

[SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() #40473

[SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() #40473

Conversation

rednaxelafx commented Mar 18, 2023

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

rednaxelafx commented Mar 18, 2023

peter-toth commented Mar 18, 2023 • edited Loading

Kimahriman commented Mar 18, 2023 • edited Loading

peter-toth commented Mar 18, 2023 • edited Loading

peter-toth Mar 20, 2023

Choose a reason for hiding this comment

rednaxelafx commented Mar 21, 2023

rednaxelafx commented Mar 21, 2023

Kimahriman commented Mar 21, 2023

cloud-fan commented Mar 21, 2023

cloud-fan commented Mar 21, 2023

peter-toth commented Mar 18, 2023 •

edited

Loading

Kimahriman commented Mar 18, 2023 •

edited

Loading

peter-toth commented Mar 18, 2023 •

edited

Loading