Improvements to Ballista extensibility #8

thinkharderdev · 2022-01-25T20:09:19Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, we are working with DataFusion/Ballista as a query execution engine. One of the primary selling points for DataFusion is extensibility, but it is not currently possible to use DataFusion's many extension points with Ballista.

This is primarily due to the constraints of serializing all logical and physical plans as Protobuf messages.

Ideally we would like to use Ballista to execute:

Scans using custom object stores
User-defined logical plan extensions
User-defined physical plan extensions
User-defined scalar and aggregation functions

Describe the solution you'd like

There are two things ideally:

We would like to decouple the core Ballista functionality from the serializable representations of plans so that the serde layer can become pluggable/extensible.
Serde should be aware of a user-defined ExecutionContext so that we can leverage optimizers, extension planners, and UDFs/UDAFs.
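
To make the two points above concrete, here is a minimal sketch of a serde layer that sits behind a trait and receives a user-supplied context when decoding. Everything in it is hypothetical and simplified for illustration: `Context`, `PlanNode`, `PlanCodec`, `TextCodec`, and the text wire format are assumptions, not the DataFusion or Ballista API, and real plans would use Protobuf or a similar binary format.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a user-defined execution context. It only
/// tracks the scalar functions the user registered, keyed by name.
#[derive(Default)]
pub struct Context {
    scalar_fns: HashMap<String, fn(i64) -> i64>,
}

impl Context {
    pub fn register_udf(&mut self, name: &str, f: fn(i64) -> i64) {
        self.scalar_fns.insert(name.to_string(), f);
    }
}

/// Hypothetical plan node: a built-in literal or a call to a named UDF.
#[derive(Debug, PartialEq)]
pub enum PlanNode {
    Literal(i64),
    UdfCall { name: String, arg: Box<PlanNode> },
}

/// The pluggable part: core code depends only on this trait, so the wire
/// representation can be swapped without touching the execution engine.
pub trait PlanCodec {
    fn encode(&self, node: &PlanNode) -> String;
    /// Decoding takes the context so UDF names can be checked against it.
    fn decode(&self, wire: &str, ctx: &Context) -> Result<PlanNode, String>;
}

/// One possible codec, using a toy text format ("lit:N" and "udf:NAME|ARG").
pub struct TextCodec;

impl PlanCodec for TextCodec {
    fn encode(&self, node: &PlanNode) -> String {
        match node {
            PlanNode::Literal(v) => format!("lit:{v}"),
            PlanNode::UdfCall { name, arg } => format!("udf:{name}|{}", self.encode(arg)),
        }
    }

    fn decode(&self, wire: &str, ctx: &Context) -> Result<PlanNode, String> {
        if let Some(v) = wire.strip_prefix("lit:") {
            return v.parse::<i64>().map(PlanNode::Literal).map_err(|e| e.to_string());
        }
        if let Some(rest) = wire.strip_prefix("udf:") {
            let (name, inner) = rest
                .split_once('|')
                .ok_or_else(|| "malformed udf node".to_string())?;
            // Reject plans that reference functions this context does not know.
            if !ctx.scalar_fns.contains_key(name) {
                return Err(format!("unknown UDF: {name}"));
            }
            let arg = Box::new(self.decode(inner, ctx)?);
            return Ok(PlanNode::UdfCall { name: name.to_string(), arg });
        }
        Err(format!("unrecognized node: {wire}"))
    }
}

/// Evaluate a decoded plan against the same context.
pub fn eval(node: &PlanNode, ctx: &Context) -> Option<i64> {
    match node {
        PlanNode::Literal(v) => Some(*v),
        PlanNode::UdfCall { name, arg } => {
            let f = ctx.scalar_fns.get(name)?;
            Some(f(eval(arg, ctx)?))
        }
    }
}
```

In this sketch a scheduler would call `encode`, and an executor holding the same registrations would call `decode` followed by `eval`. Decoding against a context that lacks the function fails early instead of at execution time.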

Describe alternatives you've considered

There currently is no workaround for this but we have been prototyping possible solutions which we'd be interested in upstreaming.

alamb · 2022-01-25T20:48:28Z

cc @realno @gaojun2048 @yahoNanJing

andygrove · 2022-01-25T21:55:39Z

It may be useful to see how substrait is handling extensions as well - https://substrait.io/extensions/

realno · 2022-01-25T23:04:35Z

This would be a great improvement 👍 I will follow the design and PRs

yahoNanJing · 2022-01-26T02:20:20Z

Thanks @thinkharderdev for proposing these potential improvements.

Scans using custom object stores

Our team has actually implemented this for HDFS. To avoid registering a new object store, our workaround is to make the path self-describing through its scheme, like hdfs://localhost:15050/..../file.parquet. From the scheme, we then know which kind of remote object store we need.
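
A rough sketch of the scheme-based lookup described above, assuming a hypothetical `StoreKind` enum and `store_for_path` function (neither is part of Ballista). Paths without a scheme are treated as local files here, which is an assumption of the sketch and not something stated in the comment.

```rust
/// Hypothetical set of object store kinds an executor could instantiate.
#[derive(Debug, PartialEq)]
pub enum StoreKind {
    LocalFs,
    Hdfs,
}

/// Pick the object store from the path's scheme, so that no separate
/// registration step is required before a scan can run.
pub fn store_for_path(path: &str) -> Result<StoreKind, String> {
    match path.split_once("://") {
        // Assumption: a bare path with no scheme refers to the local filesystem.
        None => Ok(StoreKind::LocalFs),
        Some(("file", _)) => Ok(StoreKind::LocalFs),
        Some(("hdfs", _)) => Ok(StoreKind::Hdfs),
        Some((other, _)) => Err(format!("no object store for scheme '{other}'")),
    }
}
```

The trade-off of this design is that every supported scheme has to be compiled into the executor. An unknown scheme can only be reported as an error.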

User Defined logical plan extensions, physical plan extensions, scalar and aggregation functions

Maybe it's better to introduce the substrait integration into the roadmap.

realno · 2022-01-26T03:32:11Z

Maybe it's better to introduce the substrait integration into the roadmap.

+1

houqp · 2022-01-26T05:11:52Z

FWIW, Andy wrote a substrait rust implementation: https://github.com/andygrove/substrait-rs.

thinkharderdev · 2022-01-26T12:56:31Z

Agree on the substrait integration. It would definitely be nice to have a universal serializable representation and a way to configure extensions declaratively.

I posted a draft PR apache/datafusion#1677 which I think can solve the immediate-term issues with extensibility and will also be useful in migrating to a substrait-based implementation. By decoupling the representation from the core execution engine we can avoid a "Big Bang" migration (not to mention an endless parade of painful rebases while in development :))

alamb · 2022-01-26T18:38:53Z

I had a comment on apache/datafusion#1677 (review) which I think is worth considering

thinkharderdev · 2022-01-26T21:11:20Z

Related questions after tinkering a bit more today:

Should SchemaProvider methods be async? It could be useful to support integration with external metadata catalogs (AWS Glue, etc). This could also simplify the serialization of LogicalPlan::TableScans by just passing a TableReference and resolving the TableProvider using the context.
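
The second idea (serialize only a table reference and resolve the provider from the context on the receiving side) could look roughly like the sketch below. The types are simplified stand-ins that borrow DataFusion's names for readability. They are not the real `TableReference` or `TableProvider`, and `Catalog`, `WireScan`, and `rehydrate` are hypothetical. The lookup is synchronous here, so the async question is left open.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in: a table is referenced by name only.
#[derive(Debug, Clone, PartialEq)]
pub struct TableReference(pub String);

/// Simplified stand-in for a provider: just the column names it exposes.
pub struct TableProvider {
    pub columns: Vec<String>,
}

/// Hypothetical catalog held by the executor's context.
#[derive(Default)]
pub struct Catalog {
    tables: HashMap<String, Arc<TableProvider>>,
}

impl Catalog {
    pub fn register(&mut self, name: &str, provider: TableProvider) {
        self.tables.insert(name.to_string(), Arc::new(provider));
    }

    pub fn resolve(&self, table: &TableReference) -> Option<Arc<TableProvider>> {
        self.tables.get(&table.0).cloned()
    }
}

/// What travels over the wire: a reference plus projection indices,
/// with no serialized provider state.
pub struct WireScan {
    pub table: TableReference,
    pub projection: Vec<usize>,
}

/// What the executor runs after resolving the reference locally.
pub struct TableScan {
    pub provider: Arc<TableProvider>,
    pub projection: Vec<usize>,
}

/// Turn a wire-level scan back into an executable one using the catalog.
pub fn rehydrate(scan: WireScan, catalog: &Catalog) -> Result<TableScan, String> {
    let provider = catalog
        .resolve(&scan.table)
        .ok_or_else(|| format!("table '{}' not found", scan.table.0))?;
    // Guard against a projection that does not match the local schema.
    if let Some(bad) = scan.projection.iter().find(|&&i| i >= provider.columns.len()) {
        return Err(format!("projection index {bad} out of range"));
    }
    Ok(TableScan { provider, projection: scan.projection })
}
```

This only works if the scheduler and every executor register the same tables under the same names, which is the cost of not shipping the provider itself.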

alamb · 2022-01-27T21:31:43Z

Should SchemaProvider methods be async? It could be useful to support integration with external metadata catalogs (AWS Glue, etc). This could also simplify the serialization of LogicalPlan::TableScans by just passing a TableReference and resolving the TableProvider using the context.

I can see a rationale for making schema provider methods async if the use case is a one-time query across a pile of parquet files.

However, in general, if one has to do network IO to figure out what tables exist or what their schemas are, it may be hard to get adequate performance.

EricJoy2048 · 2022-05-20T14:58:23Z

cc @realno @gaojun2048 @yahoNanJing

Thanks for the CC.
I read through the substrait documentation, and I think it would be useful for making plan serialization pluggable/extensible.

thinkharderdev · 2022-05-20T15:18:06Z

I think all the points in this have been addressed so we can close this issue.

thinkharderdev added the enhancement New feature or request label Jan 25, 2022

thinkharderdev mentioned this issue Jan 25, 2022

Abstract over logical and physical plan representations in Ballista apache/datafusion#1677

Merged

andygrove added the ballista label Feb 5, 2022

andygrove transferred this issue from apache/datafusion May 19, 2022

andygrove removed the ballista label May 20, 2022

thinkharderdev closed this as completed May 20, 2022
