[EPIC] Substrait: Add producer and consumer for physical plans #5173
Comments
What is the expected behavior for converting " |
I renamed this ticket to be an epic and started collecting tasks needed for better support |
@andygrove @alamb Could you recommend the best path for implementing these tasks? Since we’re building a distributed query engine based on DataFusion, which requires splitting a physical plan into pipelines, we’re willing to contribute to enhancing the current Substrait functionality in DataFusion. |
Hi @niebayes, I recommend coordinating with @vbarua, @Blizzara, and @wackywendell, others who I think use Substrait with physical plans. I think we may already have physical consumers/producers, see: https://docs.rs/datafusion-substrait/45.0.0/datafusion_substrait/physical_plan/index.html The first task might be to go through the existing tickets and see which ones are still relevant |
@alamb Thanks for your advice. I will first pick a few small tickets to become more familiar with the codebase. |
There is indeed some kind of producer and consumer for physical plans, but from a quick check they seem very limited. My interest is currently only in logical plans, and I think the same applies to Victor, at least from what I've seen (but I may be wrong there). I don't know much about physical plans overall, so dunno if they could reuse parts of the logical plan work, but at least the logical plan consumer/producer can be used as inspiration :) |
@Blizzara Thanks for your reply. I initially chose the physical plan because more of the computation can be distributed to executors in a distributed query engine. Take a SQL query:
The corresponding logical plan might be:
And the physical plan might look like:
From studying the datafusion-ballista project, I learned that we can split the execution plan at pipeline breakers (including RepartitionExec, SortPreservingMergeExec, CoalescePartitionsExec, etc.). So the above execution plan would be split into two parts (a.k.a. pipelines):
As you can see, the first stage of the parallel aggregation algorithm can be distributed to multiple executors, which improves resource utilization. By the way, datafusion-ballista is good for OLAP workloads and it assumes executors are stateless. However, in my scenario, executors are stateful and each executor maintains an in-memory buffer containing the most recently written data (historical data is stored in shared object storage). So, when the scheduler is about to construct a physical plan, it has to query each executor for the latest statistics, which are required for query optimization. I wonder whether this is the standard approach to distributed queries based on DataFusion, since the implementation seems complicated. I really hope the DataFusion community can provide some recommendations on building a distributed query engine based on DataFusion. |
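The stage splitting described in the comment above can be illustrated with a small, self-contained sketch. It does not use DataFusion or Ballista APIs: `PlanNode`, `split_into_stages`, `render`, the `StageInput(n)` placeholder, and the `ScanExec` and `AggregateExec(...)` labels are all hypothetical names chosen for illustration, and the choice to cut directly below each pipeline breaker is an assumption about how such a splitter might work, not a description of Ballista's implementation.

```python
from dataclasses import dataclass, field
from typing import List

# Operator names treated as pipeline breakers in this toy model (assumption:
# plain strings stand in for real physical operators).
PIPELINE_BREAKERS = {"RepartitionExec", "CoalescePartitionsExec"}


@dataclass
class PlanNode:
    """Hypothetical stand-in for a physical plan operator."""
    name: str
    children: List["PlanNode"] = field(default_factory=list)


def split_into_stages(root: PlanNode) -> List[PlanNode]:
    """Cut the plan below every pipeline breaker.

    Each input subtree of a breaker becomes its own stage, and the breaker
    reads from a StageInput(n) placeholder instead. The last element of the
    returned list is the final stage.
    """
    stages: List[PlanNode] = []

    def rewrite(node: PlanNode) -> PlanNode:
        children = [rewrite(child) for child in node.children]
        if node.name in PIPELINE_BREAKERS:
            placeholders = []
            for child in children:
                stages.append(child)  # upstream stage, can run on executors
                placeholders.append(PlanNode(f"StageInput({len(stages) - 1})"))
            children = placeholders
        return PlanNode(node.name, children)

    stages.append(rewrite(root))
    return stages


def render(node: PlanNode, indent: int = 0) -> List[str]:
    """Return the plan as indented lines, one operator per line."""
    lines = ["  " * indent + node.name]
    for child in node.children:
        lines.extend(render(child, indent + 1))
    return lines


# A two-phase aggregation: partial aggregate, repartition, final aggregate.
plan = PlanNode(
    "AggregateExec(Final)",
    [PlanNode("RepartitionExec",
              [PlanNode("AggregateExec(Partial)", [PlanNode("ScanExec")])])],
)

stages = split_into_stages(plan)
for i, stage in enumerate(stages):
    print(f"stage {i}:")
    print("\n".join(render(stage, 1)))
```

In this toy plan the partial aggregation and the scan form the first stage, which is the part that could be distributed across executors, while the final aggregation reads the repartitioned output of that stage.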
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to use Substrait with physical plans. I plan on having an initial PR up this weekend.
Describe the solution you'd like
Describe alternatives you've considered
Additional context
Substrait to DataFusion's logical plan is tracked at #8149
Tasks:
ExecutionPlan to substrait #9299
truncate #9727
ExecutionPlan to substrait #9299
ReadType::LocalFiles #10864