[SPARK-24478][SQL] Move projection and filter push down to physical conversion #21503
Conversation
Test build #91507 has finished for PR 21503 at commit

@cloud-fan, this is the PR for moving push-down to the physical plan conversion and reporting the stats correctly. Sorry for the confusion because I sent a link to just the second commit.
```scala
(projectSet ++ filterSet).toSeq
}

val reader = relation.newReader
```
To confirm: do we have to do operator pushdown twice now? Once in the plan visitor to calculate statistics, and once here to build the physical plan, right?
It will configure two readers: one for the pushdown when converting to a physical plan, and one for stats. The stats one should be temporary, though, since we want to eventually move stats collection to the physical plan. Configuring two readers instead of one decouples the two problems, so we can move forward with pushdown that works like the other data sources.
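In code, the idea is roughly the following (a minimal sketch, not the PR's exact code; `RelationLike` stands in for `DataSourceV2Relation`, and `newReader` for the reader factory this PR adds):

```scala
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsReportStatistics}

object TwoReadersSketch {
  // Stand-in for DataSourceV2Relation with this PR's reader factory.
  trait RelationLike {
    def newReader: DataSourceReader
  }

  // Reader 1: built at physical-plan conversion; pushdown is applied to this
  // reader and it produces the actual scan.
  def readerForScan(relation: RelationLike): DataSourceReader =
    relation.newReader

  // Reader 2: built transiently, only to ask the source for statistics, then
  // discarded. This is the temporary part mentioned above.
  def reportedSizeInBytes(relation: RelationLike, defaultSize: Long): Long =
    relation.newReader match {
      case r: SupportsReportStatistics => r.getStatistics.sizeInBytes.orElse(defaultSize)
      case _                           => defaultSize
    }
}
```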
It's nice to decouple the problem and do pushdown during planning, but I feel the cost of this approach is too high. For file-based data sources, we need to query the Hive metastore to apply partition pruning during filter pushdown, and this can be very expensive. Doing it twice looks scary to me.
cc @gatorsmile @dongjoon-hyun @mallman, please correct me if my understanding is wrong.
Also cc @wzhfy: do you have an estimate of how long it will take to move statistics to the physical plan?
@cloud-fan, there's nothing forcing other data sources to implement the new trait. Other sources can continue to report stats for the entire table and not account for filters (the code assumes that row counts don't change). This just opens the option of reporting stats that are more accurate using the filters and projection that will be pushed.
Ideally, I think that stats-based decisions would happen after pushdown so we get data that is as accurate as possible. But for now, this fixes the regression for v2 sources that happens because we move pushdown to a later step (conversion to physical plan like the other sources).
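For illustration, a source opting in might combine the pushdown and stats traits along these lines (a hedged sketch; `estimateBytesAfter` is a hypothetical helper, not part of the API, and the scan-planning methods of `DataSourceReader` are omitted by leaving the class abstract):

```scala
import java.util.OptionalLong
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{
  DataSourceReader, Statistics, SupportsPushDownFilters, SupportsReportStatistics}
import org.apache.spark.sql.types.StructType

abstract class ExampleReader(schema: StructType) extends DataSourceReader
    with SupportsPushDownFilters with SupportsReportStatistics {

  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    Array.empty // pretend the source can evaluate every filter itself
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType = schema

  // Stats that account for the pushed filters, e.g. from partition metadata.
  override def getStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong = OptionalLong.of(estimateBytesAfter(pushed))
    override def numRows(): OptionalLong = OptionalLong.empty()
  }

  // Hypothetical helper: consult source metadata to estimate post-filter size.
  private def estimateBytesAfter(filters: Array[Filter]): Long = ???
}
```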
> there's nothing forcing other data sources to implement the new trait ...

Hmm, I'm a little confused here. All v2 data sources (which will be `DataSourceV2Relation`s) would have to apply pushdown twice, right? Or are you suggesting we should not migrate the file-based data sources to data source v2?
I don't mind either option 1 or 2. Option 2 is basically what happens for non-v2 data sources right now. Plus, both should be temporary.
I think it is a bad idea to continue with hacky code that uses the reader in the logical plan. It is much cleaner otherwise, and we've spent too much time making sure that everything still works. The main example that comes to mind is setting the requested projection and finding out what the output is after pushdown. I think hacks like these are slowing progress on the v2 sources.
Yea, the second proposal is what happens for the v1 data sources. For file-based data sources we kind of pick the third proposal and add an optimizer rule, `PruneFileSourcePartitions`, to push down some of the filters to the data source at the logical phase, to get precise stats.
I'd like to pick from the 2nd and 3rd proposals (the 3rd proposal is also temporary, until we move stats to the physical plan). Applying pushdown twice is hard to work around (we would need to cache results), while in the 2nd proposal we can keep the `PruneFileSourcePartitions` rule to work around the issue for file-based data sources.
Let's also get more input from other people.
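For context, the shape of such a logical-phase rule is roughly the following (a simplified sketch, not the actual `PruneFileSourcePartitions` code; `PartitionedRelation` and its methods are hypothetical stand-ins):

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression, PredicateHelper}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical relation that can prune itself given partition predicates and
// then report smaller, more precise stats from computeStats().
trait PartitionedRelation extends LogicalPlan {
  def isPartitionPredicate(e: Expression): Boolean
  def withPrunedPartitions(preds: Seq[Expression]): LogicalPlan
}

object PrunePartitionsSketch extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case f @ Filter(condition, rel: PartitionedRelation) =>
      // split the conjuncts into partition predicates vs. everything else
      val (partitionPreds, rest) =
        splitConjunctivePredicates(condition).partition(rel.isPartitionPredicate)
      if (partitionPreds.isEmpty) {
        f // nothing to prune; leave the filter alone
      } else {
        // prune at the logical phase so stats reflect the pruned partitions
        val pruned = rel.withPrunedPartitions(partitionPreds)
        rest.reduceOption(And).map(Filter(_, pruned)).getOrElse(pruned)
      }
  }
}
```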
I'm not strongly opposed to any of the options, but based on the description above, option 2 would be my choice if I had to pick one. A temporary state where functionality is missing is easier to reason about than temporary states where we deliberately impose a fuzzy lifecycle.
@rdblue do you have time to prepare a PR for the 2nd proposal? I can do that too if you are busy with other stuff.
Yeah, no problem. I can just remove the stats commit from this one.
Force-pushed from 9d3a11e to c8517e1.
```diff
  }

- override def computeStats(): Statistics = reader match {
+ override def computeStats(): Statistics = newReader match {
    case r: SupportsReportStatistics =>
      Statistics(sizeInBytes = r.getStatistics.sizeInBytes().orElse(conf.defaultSizeInBytes))
    case _ =>
      Statistics(sizeInBytes = conf.defaultSizeInBytes)
  }

  override def newInstance(): DataSourceV2Relation = {
```
I think we don't need to override `newInstance` now.
I thought that initially, but the canonicalization test was failing without this.
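For reference, the override in question is roughly this shape (a sketch; it assumes the relation keeps its output attributes in a `projection: Seq[AttributeReference]` field, as elsewhere in this PR):

```scala
// Inside DataSourceV2Relation. newInstance must produce fresh expression IDs
// so that two scans of the same table (e.g. in a self-join) do not share
// attribute identities when plans are canonicalized.
override def newInstance(): DataSourceV2Relation = {
  // AttributeReference.newInstance() copies the attribute with a new exprId
  copy(projection = projection.map(_.newInstance()))
}
```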
Let's also update the classdoc of the stats interface.
Updated the stats interface.
Test build #91789 has finished for PR 21503 at commit

Test build #91782 has finished for PR 21503 at commit
@cloud-fan, tests are passing for c8517e1, which has all of the functional changes. The Jenkins job ran out of memory for the last commit, but the only change in it is in comments to add the note you requested. Should be good to go.

cc @rxin if you are interested.

thanks, merging to master!

Thank you for reviewing this, @cloud-fan!
[SPARK-24478][SQL] Move projection and filter push down to physical conversion (follow-up)

This is a followup of #21503, to completely move operator pushdown to the planner rule. The code is mostly from #21319. Tested with existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #21574 from cloud-fan/followup.
What changes were proposed in this pull request?
This removes the v2 optimizer rule for push-down and instead pushes filters and required columns when converting to a physical plan, as suggested by @marmbrus. This makes the v2 relation cleaner because the output and filters do not change in the logical plan.
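A rough sketch of the planning-side shape (simplified, not the PR's exact code; `pushFilters` and `pruneColumns` below are hypothetical stand-ins for the real helper logic, and `newReader` is the reader factory this PR adds to the relation):

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
import org.apache.spark.sql.sources.v2.reader.DataSourceReader

object V2ScanStrategySketch extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
      // the reader is created here, at physical-plan conversion,
      // not held in the logical plan
      val reader = relation.newReader
      // push filters into the reader; keep those the source cannot handle
      val postScanFilters: Seq[Expression] = pushFilters(reader, filters)
      // prune columns to what the project list and residual filters reference
      pruneColumns(reader, project, postScanFilters)
      // ... build the scan exec, then re-apply postScanFilters and project
      ???
    case _ => Nil
  }

  // hypothetical helpers standing in for the real pushdown logic
  private def pushFilters(r: DataSourceReader, fs: Seq[Expression]): Seq[Expression] = ???
  private def pruneColumns(r: DataSourceReader, p: Seq[Expression], fs: Seq[Expression]): Unit = ???
}
```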
A side-effect of this change is that the stats from the logical (optimized) plan no longer reflect pushed filters and projection. This is a temporary state, until the planner gathers stats from the physical plan instead. An alternative to this approach is rdblue@9d3a11e.
The first commit was proposed in #21262. This PR replaces #21262.
How was this patch tested?
Existing tests.