-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() #40473
Conversation
cc @peter-toth @cloud-fan |
Thanks @rednaxelafx for the fix and pinging me. |
Just seeing this group of PRs (most notably #39046), was there a real reason Currently the check for |
Hm, I think you are right @Kimahriman, But I feel that is orthogonal to the issue that we use |
…uivalentExpressions ### What changes were proposed in this pull request? This PR reverts the follow-up PR of SPARK-41468: #39046 ### Why are the changes needed? These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: #40473 (comment) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…uivalentExpressions ### What changes were proposed in this pull request? This PR reverts the follow-up PR of SPARK-41468: #39046 ### Why are the changes needed? These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: #40473 (comment) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ce3b03d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ArrayTransform(arr, lambda) | ||
} | ||
val equivalence = new EquivalentExpressions | ||
val isNewExpr = equivalence.addExpr(tx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.addExpr()
's boolean result is counter-intuitive to other collection's .add()
methods so IMO isNewExpr
should be !equivalence.addExpr(tx)
here. This is the reasons why I think getting rid of .addExpr()
is probably the most straightforward fix here: #40488
@peter-toth could you please clarify why |
@Kimahriman I'd love to see a good CSE implementation for higher-order functions too. But for backporting the fix (which is this PR's primary intent) that would have been too much. For this one (or the one @peter-toth forked off) we're just aiming for a narrow fix that allows the aggregate to work again. |
Yeah I was just commenting on the related PR that broke CSE for anything using a HOF. I had plans for trying to do CSE inside a HOF but that stalled when I didn't get any traction on the initial adding codegen support |
…amedLambdaVariable
thanks, merging to master/3.4! |
…edExpression() ### What changes were proposed in this pull request? In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`. ### Why are the changes needed? This fixes a regression caused by #39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`. One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error). ### Does this PR introduce _any_ user-facing change? This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions. Example running the SQL command: ```sql select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2) ``` example error message before the fix: ``` java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3] ``` after the fix this error is gone. ### How was this patch tested? Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom. Closes #40473 from rednaxelafx/spark-42851. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ef0a76e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…uivalentExpressions ### What changes were proposed in this pull request? This PR reverts the follow-up PR of SPARK-41468: apache#39046 ### Why are the changes needed? These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: apache#40473 (comment) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes apache#40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ce3b03d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…edExpression() ### What changes were proposed in this pull request? In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`. ### Why are the changes needed? This fixes a regression caused by apache#39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`. One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error). ### Does this PR introduce _any_ user-facing change? This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions. Example running the SQL command: ```sql select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2) ``` example error message before the fix: ``` java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))apache#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))apache#3] ``` after the fix this error is gone. ### How was this patch tested? Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom. Closes apache#40473 from rednaxelafx/spark-42851. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ef0a76e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…uivalentExpressions This PR reverts the follow-up PR of SPARK-41468: apache#39046 These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: apache#40473 (comment) No. Existing UTs. Closes apache#40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Change-Id: Ia5ce83848956254664d9c51a2f0079bb968f5433 Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
In
EquivalentExpressions.addExpr()
, add a guardsupportedExpression()
to make it consistent withaddExprTree()
andgetExprState()
.Why are the changes needed?
This fixes a regression caused by #39010 which added the
supportedExpression()
toaddExprTree()
andgetExprState()
but notaddExpr()
.One example of a use case affected by the inconsistency is the
PhysicalAggregation
pattern in physical planning. There, it callsaddExpr()
to deduplicate the aggregate expressions, and then callsgetExprState()
to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error).Does this PR introduce any user-facing change?
This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.
Example running the SQL command:
example error message before the fix:
after the fix this error is gone.
How was this patch tested?
Added new test cases to
SubexpressionEliminationSuite
for the immediate issue, and toDataFrameAggregateSuite
for an example of user-visible symptom.