Improvements to Ballista extensibility #8

Closed
thinkharderdev opened this issue Jan 25, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@thinkharderdev
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We are currently using DataFusion/Ballista as a query execution engine. One of the primary selling points of DataFusion is extensibility, but it is not currently possible to use DataFusion's many extension points with Ballista.

This is primarily due to the constraints of serializing all logical and physical plans as Protobuf messages.

Ideally we would like to use Ballista to execute:

  • Scans using custom object stores
  • User-defined logical plan extensions
  • User-defined physical plan extensions
  • User-defined scalar and aggregation functions

Describe the solution you'd like

There are two things ideally:

  1. We would like to decouple the core Ballista functionality from the serializable representations of plans so that the serde layer can become pluggable/extensible.
  2. Serde should be aware of a user-defined ExecutionContext so we can leverage optimizers, extension planners, and UDFs/UDAFs.
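
As a rough illustration of the two points above, here is a minimal sketch of a serde layer hidden behind a trait whose decode step receives a user-configured context. Everything in it (`Context`, `Plan`, `PlanCodec`, `TextCodec`, `round_trip`) is a hypothetical stand-in invented for this example, not an actual DataFusion or Ballista API, and the toy text format only replaces Protobuf to keep the code dependency-free.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a user-configured execution context:
/// here it only holds scalar UDFs keyed by name.
struct Context {
    udfs: HashMap<String, fn(i64) -> i64>,
}

/// Hypothetical, heavily simplified plan: a built-in scan or a UDF call.
#[derive(Debug, PartialEq)]
enum Plan {
    Scan { table: String },
    CallUdf { name: String, arg: i64 },
}

/// Hypothetical pluggable serde layer. Core code depends only on this
/// trait, never on a concrete wire format.
trait PlanCodec {
    fn encode(&self, plan: &Plan) -> Result<Vec<u8>, String>;
    /// Decoding receives the context so user-defined names can be resolved.
    fn decode(&self, bytes: &[u8], ctx: &Context) -> Result<Plan, String>;
}

/// Toy text-based codec (an assumption made for the example).
struct TextCodec;

impl PlanCodec for TextCodec {
    fn encode(&self, plan: &Plan) -> Result<Vec<u8>, String> {
        let text = match plan {
            Plan::Scan { table } => format!("scan|{table}"),
            Plan::CallUdf { name, arg } => format!("udf|{name}|{arg}"),
        };
        Ok(text.into_bytes())
    }

    fn decode(&self, bytes: &[u8], ctx: &Context) -> Result<Plan, String> {
        let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
        let parts: Vec<&str> = text.split('|').collect();
        match parts.as_slice() {
            ["scan", table] => Ok(Plan::Scan { table: table.to_string() }),
            ["udf", name, arg] => {
                // Reject plans that reference functions this context lacks.
                if !ctx.udfs.contains_key(*name) {
                    return Err(format!("unknown UDF: {name}"));
                }
                let arg = arg.parse::<i64>().map_err(|e| e.to_string())?;
                Ok(Plan::CallUdf { name: name.to_string(), arg })
            }
            _ => Err(format!("malformed plan: {text}")),
        }
    }
}

/// Core code is written against the trait object, so codecs can be swapped.
fn round_trip(codec: &dyn PlanCodec, plan: &Plan, ctx: &Context) -> Result<Plan, String> {
    let bytes = codec.encode(plan)?;
    codec.decode(&bytes, ctx)
}
```

With this shape, a plan that calls a UDF only decodes successfully on a node whose context has that UDF registered, and a different wire format could be introduced by adding another `PlanCodec` implementation without touching `round_trip`.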

Describe alternatives you've considered

There is currently no workaround for this, but we have been prototyping possible solutions that we would be interested in upstreaming.

@alamb
Contributor

alamb commented Jan 25, 2022

cc @realno @gaojun2048 @yahoNanJing

@andygrove
Member

It may be useful to see how substrait is handling extensions as well - https://substrait.io/extensions/

@realno
Contributor

realno commented Jan 25, 2022

This would be a great improvement 👍 I will follow the design and PRs

@yahoNanJing
Contributor

yahoNanJing commented Jan 26, 2022

Thanks @thinkharderdev for proposing these potential improvements.

Scans using custom object stores

For this, our team has actually implemented HDFS support. To avoid registering a new object store, our workaround is to make the path self-describing through its scheme, like hdfs://localhost:15050/..../file.parquet. With the scheme, we know which kind of remote object store we need.
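
A minimal sketch of that scheme-based lookup, assuming a hypothetical `StoreKind` enum and `resolve_store` helper (neither is an actual Ballista or DataFusion API). It treats a path without a scheme as local, which is also an assumption of this example.

```rust
/// Hypothetical kinds of object store a node might know about.
#[derive(Debug, PartialEq)]
enum StoreKind {
    LocalFile,
    Hdfs,
}

/// Pick the object store from the path's scheme and return the remainder
/// of the path. Paths without a scheme are assumed to be local files.
fn resolve_store(path: &str) -> Result<(StoreKind, &str), String> {
    match path.split_once("://") {
        None => Ok((StoreKind::LocalFile, path)),
        Some(("file", rest)) => Ok((StoreKind::LocalFile, rest)),
        Some(("hdfs", rest)) => Ok((StoreKind::Hdfs, rest)),
        Some((other, _)) => Err(format!("unsupported scheme: {other}")),
    }
}
```

Because the path carries everything needed to pick the store, nothing extra has to be serialized with the plan. The trade-off is that every executor must already know how to handle each scheme it may encounter.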

User Defined logical plan extensions, physical plan extensions, scalar and aggregation functions

Maybe it's better to introduce the substrait integration into the roadmap.

@realno
Contributor

realno commented Jan 26, 2022

Maybe it's better to introduce the substrait integration into the roadmap.

+1

@houqp
Member

houqp commented Jan 26, 2022

FWIW, Andy wrote a substrait rust implementation: https://github.com/andygrove/substrait-rs.

@thinkharderdev
Contributor Author

Agreed on the substrait integration. It would definitely be nice to have a universal serializable representation and a way to configure extensions declaratively.

I posted a draft PR apache/datafusion#1677 which I think can solve the immediate-term issues with extensibility and will also be useful in migrating to a substrait-based implementation. By decoupling the representation from the core execution engine we can avoid a "Big Bang" migration (not to mention an endless parade of painful rebases while in development :))

@alamb
Contributor

alamb commented Jan 26, 2022

I had a comment on apache/datafusion#1677 (review) which I think is worth considering

@thinkharderdev
Contributor Author

Related questions after tinkering a bit more today:

Should SchemaProvider methods be async? It could be useful to support integration with external metadata catalogs (AWS Glue, etc). This could also simplify the serialization of LogicalPlan::TableScans by just passing a TableReference and resolving the TableProvider using the context.
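
To make the second idea concrete, here is a small synchronous sketch in which only a table name crosses the wire and the receiving side resolves it through its own context. `Provider`, `Catalog`, `SerializedScan`, and `resolve_scan` are hypothetical names invented for this example, not DataFusion types, and an async variant would additionally need an async runtime.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a table provider.
#[derive(Debug, Clone, PartialEq)]
struct Provider {
    location: String,
}

/// Hypothetical stand-in for the catalog held by the receiving context.
struct Catalog {
    tables: HashMap<String, Provider>,
}

/// What would be serialized for a table scan: only the table reference,
/// not the provider itself.
struct SerializedScan {
    table_ref: String,
}

/// Resolve the reference against the local catalog after deserialization.
fn resolve_scan(scan: &SerializedScan, catalog: &Catalog) -> Result<Provider, String> {
    catalog
        .tables
        .get(&scan.table_ref)
        .cloned()
        .ok_or_else(|| format!("table not found: {}", scan.table_ref))
}
```

This only works if every node that deserializes the plan has the same tables registered, or can look them up in a shared catalog.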

@alamb
Contributor

alamb commented Jan 27, 2022

Should SchemaProvider methods be async? It could be useful to support integration with external metadata catalogs (AWS Glue, etc). This could also simplify the serialization of LogicalPlan::TableScans by just passing a TableReference and resolving the TableProvider using the context.

I can see a rationale for making schema provider methods async if the use case is a one-time query across a pile of parquet files.

However, in general, if one has to do network IO to figure out what tables exist or what their schemas are, it may be hard to get adequate performance.

andygrove transferred this issue from apache/datafusion May 19, 2022

@EricJoy2048
Member

cc @realno @gaojun2048 @yahoNanJing

Thanks for the CC.
I read the substrait documentation, and I think it would be useful for making plan serialization pluggable/extensible.

@thinkharderdev
Contributor Author

I think all the points in this issue have been addressed, so we can close it.
