
[SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan #29643

Closed

Conversation

@cloud-fan (Contributor) commented Sep 3, 2020

What changes were proposed in this pull request?

This is a followup of #29485

It moves the plan rewriting methods from `Analyzer` to `QueryPlan`, so that they can work with `SparkPlan` as well. This PR also improves them to support a corner case (the attribute to be replaced appears together with an unresolved attribute) and makes them more general, so that `WidenSetOperationTypes` can rewrite the plan in one shot, as before.

Why are the changes needed?

Code cleanup and generalization.

Does this PR introduce _any_ user-facing change?

No.

How was this patch tested?

Existing tests.
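
For context, here is a minimal sketch of how a rule plugs into the moved method. The signature is paraphrased from this PR's diff, and `shouldReplace`/`replacement` are hypothetical placeholders, so treat it as an illustration rather than the actual code:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def rewrite(plan: LogicalPlan): LogicalPlan = {
  def shouldReplace(p: LogicalPlan): Boolean = false // placeholder predicate
  def replacement(p: LogicalPlan): LogicalPlan = p   // placeholder rewrite

  plan.transformUpWithNewOutput {
    case p if shouldReplace(p) =>
      val newP = replacement(p)
      // The rule returns the new node together with the old -> new attribute
      // pairs; QueryPlan then patches references in parent nodes on the way up.
      newP -> p.output.zip(newP.output)
  }
}
```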

```scala
  p -> attrMapping
} else {
  // Just passes through unresolved nodes
  plan.mapChildren {
```
Contributor Author

This means that we won't replace attributes in an unresolved plan, which is not sufficient. See the updated test: https://github.com/apache/spark/pull/29643/files#diff-01ecdd038c5c2f53f38118912210fef8R1425
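
To make the corner case concrete, here is a toy plan in the same spirit as the updated test (an illustration only, not the test itself):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, Project}
import org.apache.spark.sql.types.IntegerType

// This Project is unresolved because of `UnresolvedAttribute("c")`, yet it
// still carries the resolved attribute `a`; simply passing through unresolved
// plans would skip rewriting `a` when its expr ID changes further down.
val a = AttributeReference("a", IntegerType)()
val plan = Project(Seq(UnresolvedAttribute("c"), a), LocalRelation(a))
assert(!plan.resolved)
```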

Member

Good catch! In this unresolved plan, there might be other resolved and replaced attributes.

Member

Ah, I see. Nice catch.

```scala
 * The outer plan may have old references and the function below updates the
 * outer references to refer to the new attributes.
 *
 * For example (SQL):
```
Contributor Author

The example here is not useful at all. The first sentence already explains the reason very well, while the query plan example is hard to read.

@cloud-fan (Contributor Author)

cc @maropu

```scala
 * This method also updates all the related references in this plan tree accordingly, in case
 * the replaced node has different output expr ID than the old node.
 */
def rewriteWithPlanMapping(
```
Member

Actually, I think we may not have a chance to do such complicated replacements at the physical plan level, but there is no harm in moving this here.

@SparkQA commented Sep 3, 2020

Test build #128267 has finished for PR 29643 at commit 76cf567.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
// the `oldAttr` must be part of either `plan.references` (so that it can be used to
// replace attributes of the current `plan`) or `plan.outputSet` (so that it can be
// used by those parent plans).
(plan.outputSet ++ plan.references).contains(oldAttr)
```
Member

Oh, we don't check whether `plan` is resolved here, so `plan.outputSet` can throw an error.
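
For illustration, one way `outputSet` can blow up on an unresolved plan (a toy example assuming Catalyst's `Union.output` computation, which transposes child outputs; not taken from this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, Union}
import org.apache.spark.sql.types.IntegerType

// While the two children disagree on arity, the Union is unresolved and its
// `output` cannot be computed, so calling `outputSet` here throws.
val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()
val u = Union(Seq(LocalRelation(a), LocalRelation(a, b)))
assert(!u.resolved)
// u.outputSet // throws: child outputs cannot be transposed into one schema
```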

@HyukjinKwon changed the title from [SPARK-32638][SQL][FOLLOWUP] move the plan rewriting methods to QueryPlan to [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan on Sep 4, 2020
```scala
    planMapping: Map[PlanType, PlanType],
    canGetOutput: PlanType => Boolean = _ => true): PlanType = {
  def internalRewrite(plan: PlanType): (PlanType, Seq[(Attribute, Attribute)]) = {
    if (planMapping.contains(plan)) {
```
Member

IIUC this check cannot correctly handle nested cases in `planMapping`; for example:

```
Project
 +- Union
    :+- (1) Project
    :   +- Union
    :   :  :+- (2) Project
    :   :
    :   +- Project
    :
    +- Project
       +- ...
```

If the two nested Projects above, (1) and (2), are both stored in `planMapping`, I think only case (1) is matched by this condition and case (2) is simply ignored. That is why I rewrote the logic a bit so that plans are replaced in a bottom-up way in the previous PR:

```scala
val attrMapping = new mutable.ArrayBuffer[(Attribute, Attribute)]()
val newChildren = plan.children.map { child =>
  // If not, we'd rewrite child plan recursively until we find the
  // conflict node or reach the leaf node.
  val (newChild, childAttrMapping) = rewritePlan(child, rewritePlanMap)
```

Contributor Author

Hmm, is this a real-world case? I think it is too complicated if we need to replace nodes in the values of `planMapping`.

@maropu (Member) Sep 4, 2020

Hm, yea, this is complicated, but I remember the existing tests failing for this reason. It might be the case below:

```
SQLQueryTestSuite.sql
org.scalatest.exceptions.TestFailedException: union.sql
Expected "struct<[c1:decimal(11,1),c2:string]>", but got "struct<[]>" Schema did not match for query #3
SELECT *
FROM   (SELECT * FROM t1
        UNION ALL
        SELECT * FROM t2
        UNION ALL
        SELECT * FROM t2): -- !query
SELECT *
FROM   (SELECT * FROM t1
        UNION ALL
        SELECT * FROM t2
        UNION ALL
        SELECT * FROM t2)
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.catalyst.errors.package$TreeNodeException
After applying rule org.apache.spark.sql.catalyst.optimizer.RemoveNoopOperators in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken., tree:
'Union false, false
```

Contributor Author

You are right, this is a valid use case.

@cloud-fan force-pushed the cleanup branch 2 times, most recently from e796ea7 to 30e6c4a on September 4, 2020 06:48
@SparkQA commented Sep 4, 2020

Test build #128282 has finished for PR 29643 at commit 30e6c4a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 4, 2020

Test build #128281 has finished for PR 29643 at commit e796ea7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

retest this please

@SparkQA commented Sep 4, 2020

Test build #128284 has finished for PR 29643 at commit 30e6c4a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

@maropu I think it's too tricky if rewriting the plan with `planMapping` is recursive. The reason is that `WidenSetOperationTypes` does the work by traversing the plan tree twice. I made the plan rewriting method more general, so that `WidenSetOperationTypes` only needs to traverse the plan tree once, and now the logic is simpler. Please take a look, thanks!
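
For a sense of the one-shot shape, a heavily simplified sketch (the `widenedTypes` parameter stands in for the real type-coercion computation, and the actual rule only casts the columns whose types change):

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Union}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.DataType

def widenSetOperations(plan: LogicalPlan, widenedTypes: Seq[DataType]): LogicalPlan =
  plan.resolveOperatorsUpWithNewOutput {
    case u: Union if u.childrenResolved && !u.resolved =>
      // One bottom-up pass both inserts the widening Projects and reports the
      // old -> new attribute pairs, so parent plans are patched in the same
      // traversal instead of a second one.
      val newChildren = u.children.map { child =>
        val casts = child.output.zip(widenedTypes).map { case (e, dt) =>
          Alias(Cast(e, dt, Some(SQLConf.get.sessionLocalTimeZone)), e.name)()
        }
        Project(casts, child)
      }
      val attrMapping: Seq[(Attribute, Attribute)] =
        u.children.zip(newChildren).flatMap { case (oldChild, newChild) =>
          oldChild.output.zip(newChild.output)
        }
      u.withNewChildren(newChildren) -> attrMapping
  }
```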

@SparkQA commented Sep 4, 2020

Test build #128306 has finished for PR 29643 at commit 5cce482.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 4, 2020

Test build #128307 has finished for PR 29643 at commit be7e864.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Sep 7, 2020

> The reason is that `WidenSetOperationTypes` does the work by traversing the plan tree twice.

Ah, good point. Yea, the current approach in this PR looks okay to me if all the existing tests pass. Btw, could we backport the previous PR and this PR into branch-3.0? The branch also has the issue described in SPARK-32638.

@cloud-fan (Contributor Author)

Yea, we can backport.

@SparkQA commented Sep 7, 2020

Test build #128343 has finished for PR 29643 at commit 0857791.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

GitHub Actions has passed.

```scala
    e -> e
}.unzip
Project(casted._1, plan) -> Project(casted._2, plan)

Alias(Cast(e, dt, Some(SQLConf.get.sessionLocalTimeZone)), e.name)()
```
Member

Out of curiosity: why do we need to set the timezone here?

Contributor Author

Otherwise `WidenSetOperationTypes` will return an invalid attribute mapping (an unresolved `Alias` with an unresolved `Cast`) when calling `resolveOperatorsUpWithNewOutput`.
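
A small illustration of that resolution behavior (assuming Catalyst's rule that timezone-aware casts only resolve once `timeZoneId` is set):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.TimestampType

// A string-to-timestamp cast is timezone-aware: without a `timeZoneId` it is
// not `resolved`, so any Alias built on top of it stays unresolved as well.
val withoutTz = Cast(Literal("2020-09-03"), TimestampType)
val withTz    = Cast(Literal("2020-09-03"), TimestampType, Some("UTC"))
assert(!withoutTz.resolved && withTz.resolved)
```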

@maropu (Member) left a comment

LGTM cc: @viirya @Ngone51

```scala
 * with a new one that has different output expr IDs, by updating the attribute references in
 * the parent nodes accordingly.
 *
 * @param rule the function to transform plan nodes, and return new nodes with attributes mapping
```
Member

Just a question: why do we need to return the attribute mapping from old to new? Can we just detect whether the output of the new plan differs from the old plan, and then create the mapping inside `transformUpWithNewOutput`?

Contributor Author

Because that's too hard. For example, `WidenSetOperationTypes` returns the attribute mapping according to the replaced children, not the node itself, because the node itself may not be resolved yet. For self-join dedup, on the other hand, we return the attribute mapping according to the current node.

@viirya (Member) left a comment

LGTM, just one question.

@maropu closed this in 117a6f1 on Sep 8, 2020
@maropu (Member) commented Sep 8, 2020

Thanks! Merged to master.

maropu pushed a commit to maropu/spark that referenced this pull request Sep 8, 2020
…Plan

### What changes were proposed in this pull request?

This is a followup of apache#29485

It moves the plan rewriting methods from `Analyzer` to `QueryPlan`, so that they can work with `SparkPlan` as well. This PR also improves them to support a corner case (the attribute to be replaced appears together with an unresolved attribute) and makes them more general, so that `WidenSetOperationTypes` can rewrite the plan in one shot, as before.

### Why are the changes needed?

Code cleanup and generalization.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes apache#29643 from cloud-fan/cleanup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu added a commit that referenced this pull request Sep 8, 2020
…denSetOperationTypes

### What changes were proposed in this pull request?

This PR intends to fix a bug where references can go missing when adding aliases to widen data types in `WidenSetOperationTypes`. For example:
```
CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v);
SELECT t.v FROM (
  SELECT v FROM t3
  UNION ALL
  SELECT v + v AS v FROM t3
) t;

org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1]. Attribute(s) with the same name appear in the operation: v. Please check if the right attribute(s) are used.;;
!Project [v#1]  <------ the reference got missing
+- SubqueryAlias t
   +- Union
      :- Project [cast(v#1 as decimal(11,0)) AS v#3]
      :  +- Project [v#1]
      :     +- SubqueryAlias t3
      :        +- SubqueryAlias tbl
      :           +- LocalRelation [v#1]
      +- Project [v#2]
         +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0)))), DecimalType(11,0), true) AS v#2]
            +- SubqueryAlias t3
               +- SubqueryAlias tbl
                  +- LocalRelation [v#1]
```
In this case, `WidenSetOperationTypes` added the alias `cast(v#1 as decimal(11,0)) AS v#3`, and the reference in the top `Project` then went missing. This PR corrects the reference (the `exprId`, and widens the `dataType`) after adding aliases in the rule.

This backport for 3.0 comes from #29485 and #29643

### Why are the changes needed?

Bug fixes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests

Closes #29680 from maropu/SPARK-32638-BRANCH3.0.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…denSetOperationTypes

cloud-fan pushed a commit that referenced this pull request Jan 5, 2021
…kingTransformsInAnalyzer

### What changes were proposed in this pull request?

In #29643, we moved the plan rewriting methods to `QueryPlan`. We need to override `transformUpWithNewOutput` to add `allowInvokingTransformsInAnalyzer`, because it and `resolveOperatorsUpWithNewOutput` are called in the analyzer.
For example, `PaddingAndLengthCheckForCharVarchar` could fail a query when calling `resolveOperatorsUpWithNewOutput`:
```
[info] - char/varchar resolution in sub query  *** FAILED *** (367 milliseconds)
[info]   java.lang.RuntimeException: This method should not be called in the analyzer
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule(AnalysisHelper.scala:150)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule$(AnalysisHelper.scala:146)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.assertNotAnalysisRule(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:161)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:160)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$updateOuterReferencesInSubquery(QueryPlan.scala:267)
```
### Why are the changes needed?

A trivial bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #31013 from yaooqinn/SPARK-33992.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
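
The shape of that fix, roughly (paraphrased; the exact override lives in `LogicalPlan` and may differ in detail from #31013):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{AnalysisHelper, LogicalPlan}

// Inside LogicalPlan: re-wrap the inherited method so the analyzer's
// "no raw transforms" assertion is not tripped by transformDown calls
// made on its behalf.
override def transformUpWithNewOutput(
    rule: PartialFunction[LogicalPlan, (LogicalPlan, Seq[(Attribute, Attribute)])],
    skipCond: LogicalPlan => Boolean,
    canGetOutput: LogicalPlan => Boolean): LogicalPlan =
  AnalysisHelper.allowInvokingTransformsInAnalyzer {
    super.transformUpWithNewOutput(rule, skipCond, canGetOutput)
  }
```
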
cloud-fan pushed a commit that referenced this pull request Jan 5, 2021
…kingTransformsInAnalyzer

(cherry picked from commit f0ffe0c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
```scala
val attrMappingForCurrentPlan = attrMapping.filter {
  // The `attrMappingForCurrentPlan` is used to replace the attributes of the
  // current `plan`, so the `oldAttr` must be part of `plan.references`.
  case (oldAttr, _) => plan.references.contains(oldAttr)
```
Contributor

Shall we skip if the child is not resolved? Although that would break the one-shot rewrite idea. The reason is that calling `.references` on an unresolved plan is dangerous: the plan may use `child.outputSet` as its references.

Contributor Author

Can we add a base trait for plans that override `references` with `child.outputSet`? Then we can match on this trait here and skip calling `references`.

Contributor Author

Ideally, a plan should determine its references from its expressions, not from its child's output attributes.

Contributor

Adding a new base trait sounds good.

Contributor

I sent a PR for it: #40154.
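
A hypothetical shape for such a trait (illustrative only; the name and details of the real change belong to #40154):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Plans whose `references` are, by definition, the whole child output can mix
// in this marker; the attribute-rewriting code can then match on the trait and
// skip computing `references` on not-yet-resolved trees.
trait ReferencesChildOutput { self: LogicalPlan =>
  override def references: AttributeSet = children.head.outputSet
}
```

A rewrite could then pattern-match `case p: ReferencesChildOutput =>` and skip the dangerous call.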
