-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6550][SQL] Add PreAnalyzer to keep logical plan consistent across DataFrame #5203
Conversation
Test build #29219 has finished for PR 5203 at commit
|
The test failure is caused by another commit and just fixed in #5198. Please test it again. |
Test build #29236 has finished for PR 5203 at commit
|
Thanks for finding this issue! I'm hoping there is a simpler solution than adding a whole new analyzer phase though. In #5217 I propose we simply hang on to the analyze plan instead of the logical one. This doesn't add a new phase and reduces the overhead of analyzing a plan over and over again. What do you think? |
Looks good and it can solve this. Agree that is better to use simpler way for this bug. Thanks. |
I will close this once #5217 is merged. |
This is based on bug and test case proposed by viirya. See #5203 for a excellent description of the problem. TLDR; The problem occurs because the function `groupBy(String)` calls `resolve`, which returns an `AttributeReference`. However, this `AttributeReference` is based on an analyzed plan which is thrown away. At execution time, we once again analyze the plan. However, in the case of self-joins, each call to analyze will produce a new tree for the left side of the join, rendering the previously returned `AttributeReference` invalid. As a fix, I propose we keep the analyzed plan instead of the unresolved plan inside of a `DataFrame`. Author: Michael Armbrust <michael@databricks.com> Closes #5217 from marmbrus/preanalyzer and squashes the following commits: 1f98e2d [Michael Armbrust] revert change dd4dec1 [Michael Armbrust] Use the analyzed plan in DataFrame 089c52e [Michael Armbrust] WIP (cherry picked from commit 5d9c37c) Signed-off-by: Michael Armbrust <michael@databricks.com>
This is based on bug and test case proposed by viirya. See #5203 for a excellent description of the problem. TLDR; The problem occurs because the function `groupBy(String)` calls `resolve`, which returns an `AttributeReference`. However, this `AttributeReference` is based on an analyzed plan which is thrown away. At execution time, we once again analyze the plan. However, in the case of self-joins, each call to analyze will produce a new tree for the left side of the join, rendering the previously returned `AttributeReference` invalid. As a fix, I propose we keep the analyzed plan instead of the unresolved plan inside of a `DataFrame`. Author: Michael Armbrust <michael@databricks.com> Closes #5217 from marmbrus/preanalyzer and squashes the following commits: 1f98e2d [Michael Armbrust] revert change dd4dec1 [Michael Armbrust] Use the analyzed plan in DataFrame 089c52e [Michael Armbrust] WIP
Thanks, I've merged #5217. |
Problems
In some cases, the expressions in a logical plan will be modified to new ones during analysis, e.g. the handling for self-join cases. If some expressions are resolved based on the analyzed plan, they are referring to changed expression ids, not original ids.
But the transformation of DataFrame will use logical plan to construct new DataFrame, e.g.
groupBy
and aggregation. So in such cases, the expressions in these DataFrames will be inconsistent.The problems are specified as following:
When we try to run the following codes:
Because
groupBy
andmin
will perform resolving based on the analyzed logical plan, their expression ids refer to analyzed plan, instead of logical plan.So the logical plan of df2 looks like:
As you see, the expression ids in
Aggregate
are different to the expression ids inSubquery y
. This is the first problem.df2
can't be performedThe showing logical plan of
df2
can't be performed. Because the expression ids ofSubquery y
will be modified for self-join handling during analysis, the analyzed plan ofdf2
becomes:The expressions referred in
Aggregate
are not matching to these inSubquery y
. This is the second problem.Proposed solution
We try to add a
PreAnalyzer
. When a logical planrawPlan
is given to SQLContext, it uses PreAnalyzer to modify the logical plan before assigning toQueryExecution.logical
. Then later operations will based on the pre-analyzed logical plan, instead of the originalrawPlan
.