Discussion: Is Ballista a standalone system or framework #1916
Comments
For what it is worth, my mental model is similar to @gaojun2048's: DataFusion is mostly designed to be a library for building other systems, whereas I have always thought of Ballista as intended to be a standalone system (one that people would deploy without having to build it from source). I am curious to hear what others think. cc @edrevo @andygrove @thinkharderdev @matthewmturner @liukun4515 @Ted-Jiang @yjshen
I agree that the goal should be for Ballista to be a standalone system (or at least something that CAN be deployed as a standalone system), but one of the unique and valuable aspects of DataFusion is its extensibility. It is a really important differentiator from other solutions in this space, and I think we should ensure that the extensibility of DataFusion extends to Ballista.
I also agree that Ballista should be a standalone system, with a client API that can be used as a library.
Extensibility is a major selling point; the only thing I'd be worried about is having stable interfaces.
I agree Ballista should be a standalone system like Spark SQL or Presto.
So it's not a Spark Core on RDD but a Spark Core on the SQL physical plan, right?
I agree with you.
+1 Ballista should be a standalone system. I think there is another side to this question: do we consider DataFusion a pure library? Would this change how things are organized? The first thought that came to mind is to have the physical plan be the interface, so that different clients have the option to extend the SQL parser, optimizer, and planner as needed. Most can reuse the existing implementation; an example for Ballista is extending the planner to support UDFs.
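To make the "physical plan as the interface" idea concrete, here is a minimal sketch in Rust. Everything in it (`PhysicalPlan`, `Planner`, `DefaultPlanner`, `UdfPlanner`, `describe`, and the toy query strings) is hypothetical and invented for illustration; none of it is DataFusion's or Ballista's actual API. It only shows the shape of the idea: front ends differ, the plan type is the shared contract, and an extended planner reuses the default one for everything it does not handle itself.

```rust
// Hypothetical sketch: these types are NOT DataFusion or Ballista APIs.

/// A toy physical plan: the shared contract between planners and the executor.
#[derive(Debug, Clone, PartialEq)]
pub enum PhysicalPlan {
    Scan { table: String },
    CallUdf { name: String, input: Box<PhysicalPlan> },
}

/// Any front end only has to produce a `PhysicalPlan`.
pub trait Planner {
    fn plan(&self, query: &str) -> Result<PhysicalPlan, String>;
}

/// Default planner: understands only the toy query "scan <table>".
pub struct DefaultPlanner;

impl Planner for DefaultPlanner {
    fn plan(&self, query: &str) -> Result<PhysicalPlan, String> {
        match query.split_whitespace().collect::<Vec<_>>().as_slice() {
            ["scan", table] => Ok(PhysicalPlan::Scan { table: table.to_string() }),
            _ => Err(format!("unsupported query: {}", query)),
        }
    }
}

/// Extended planner: adds "udf <name> <table>" and delegates everything else.
pub struct UdfPlanner<P: Planner> {
    pub inner: P,
}

impl<P: Planner> Planner for UdfPlanner<P> {
    fn plan(&self, query: &str) -> Result<PhysicalPlan, String> {
        match query.split_whitespace().collect::<Vec<_>>().as_slice() {
            ["udf", name, table] => Ok(PhysicalPlan::CallUdf {
                name: name.to_string(),
                input: Box::new(PhysicalPlan::Scan { table: table.to_string() }),
            }),
            // Reuse the existing implementation for all other queries.
            _ => self.inner.plan(query),
        }
    }
}

/// Stand-in for an executor: it depends only on `PhysicalPlan`,
/// never on a concrete planner.
pub fn describe(plan: &PhysicalPlan) -> String {
    match plan {
        PhysicalPlan::Scan { table } => format!("Scan({})", table),
        PhysicalPlan::CallUdf { name, input } => format!("{}({})", name, describe(input)),
    }
}
```

With this layout, `UdfPlanner { inner: DefaultPlanner }` accepts both the toy `scan` and `udf` queries, while `DefaultPlanner` alone rejects the `udf` one. The executor side does not change in either case.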
I think I agree with @realno. For example, the DataFusion Python bindings effectively reuse the Rust implementation, which thus far has worked well for my (limited) use cases. The Python bindings also have UDFs, but I haven't had a chance to look into how they're implemented and how that may contrast with Ballista's needs.
Yes, I think so. And actually, we are already heading in that direction. The first step toward making each module substitutable is to put each part (SQL parsing, optimizing, and executing) in its own crate. @jimexist is leading that effort in #1750. Once this separation of functionality is done, the next step would be introducing interfaces between adjacent parts, depending on how we would substitute modules. One possibility, I'm sure, is that once we have #1887 or even https://github.com/andygrove/substrait-rs, we can link Apache Calcite to DataFusion's physical plan execution layer.
I think this is a good direction.
Good point.
Agree with that; I expressed the same opinion in #1881 (comment).
standalone system +1
I am now back working on Ballista after a bit of a break and can now contribute a bit more to this discussion.
Although I see Ballista as a standalone system that users will likely install as Docker images or from a tarball, I think it is still important to publish the crates to crates.io so that
It is also important that we publish the
I noticed that we cannot publish the Ballista crates from the recent 7.0.0 release tarball and have filed #1980 to fix that for the next release.
I understand that Ballista is currently heading toward being a standalone system, but I am wondering whether that is what the ecosystem needs. I feel that being a pluggable library is a big part of DataFusion's success. But the projects that embed DataFusion today as a single-node compute engine, are they not going to need to be distributed tomorrow? If Ballista is really designed as a standalone system, those growing projects might use it as an example of how to distribute a DataFusion query plan, but they might not be able to reuse much code. Also, as a standalone system, Ballista will compete with the heavyweights in the category (Spark, Presto...). That is an interesting but very ambitious goal 😄
DataFusion is not JVM based, which could be an interesting differentiator. I think making a generic embedded distributed framework will be challenging, as there are so many dimensions to consider (catalog structure, local caching, etc.) that may differ between systems. By comparison, I think a single-node, column-oriented analytic query engine is a fairly well understood pattern (though I do think the DataFusion implementation is very good). One thing I personally hope is that Ballista drives features into DataFusion so that making a new distributed engine using DataFusion becomes easier over time. Some examples of this technical flow I think are:
Agreed! I was thinking more about designing the engine in modules that can be reused more easily by other distributed systems. For instance, using [https://github.com/substrait-io/substrait] to serialize the tasks for the workers, instead of the current custom protos, could help other systems reuse Ballista workers with minimal effort.
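I feel some opportunities/differentiators for Ballista are the following:
1. Non-JVM: this brings a lot of benefits, such as a lower footprint, memory efficiency, and no GC cost (Rust specific).
2. A chance for more modern design principles: for example, Spark was originally architected to be best deployed on bare metal, and it is hard to make some changes to be more cloud friendly.
3. Utilizing modern resource management and orchestration technologies: reusing mature tools like k8s will simplify Ballista's implementation and integrate easily with modern systems.
4. Using Arrow as the backbone opens doors for more advanced use cases such as ML: it may be efficiently integrated with Pandas or TensorFlow through Arrow.
We heavily use systems like Spark for analytics and ML, and the above points are pain points that make switching worth considering. I feel the pivot point for having a new reference distributed compute platform is getting closer. :)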
DataFusion has a huge advantage, but the user story needs to be improved. Also, horizontal scalability and using it for ML in a distributed setting are still an issue, because it needs to be able to map partitions and have multi-stage jobs.
On Sun, Mar 13, 2022 at 22:51, Lin Ma wrote:
Also, as a standalone system, Ballista will compete with the heavyweights
in the category (Spark, Presto...). That is an interesting but very
ambitious goal 😄
I feel some opportunities/differentiators for Ballista are the following:
1. Non-JVM: this brings a lot of benefits, such as a lower footprint,
memory efficiency, and no GC cost (Rust specific).
2. A chance for more modern design principles: for example, Spark was
originally architected to be best deployed on bare metal, and it is hard
to make some changes to be more cloud friendly.
3. Utilizing modern resource management and orchestration technologies:
reusing mature tools like k8s will simplify Ballista's implementation and
integrate easily with modern systems.
4. Using Arrow as the backbone opens doors for more advanced use cases
such as ML: it may be efficiently integrated with Pandas or TensorFlow
through Arrow.
We heavily use systems like Spark for analytics and ML, and the above points
are pain points that make switching worth considering.
I think the discussion on this issue is closed (the prevailing consensus seems to be that Ballista is a system). Closing ticket.
This ticket is an attempt to pull some of the great discussion from #1881 into its own ticket and get consensus on a topic that is larger than user-defined functions (the topic of #1881).
The main question is about how Ballista is/should be used. Is it a framework like DataFusion or a standalone system (like Spark)? The answer to this question would help drive technical choices such as #1881.
It would also be great to update the Ballista docs once we have consensus on this question.
Originally posted by @gaojun2048 in #1881 (comment)
I don’t know if my understanding is wrong. I have always thought that DF is just a computing library, which cannot be directly deployed in production. Those who use DF add it as a dependency of their project and then develop their own computing engine based on it. For example, Ballista is a distributed computing engine built on DF. Ballista is a mature computing engine just like Presto or Spark. People who use Ballista only need to download and deploy it to their machines to start the Ballista service. They rarely care about how Ballista is implemented, so a UDF plugin that supports dynamic loading would allow these people to define their own UDFs without modifying Ballista's source code.
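As a rough illustration of the plugin idea described above, the following Rust sketch shows a registry that resolves UDFs by name at run time, so new functions can be added without touching the engine's own code. All names here (`ScalarUdf`, `UdfRegistry`, `AddOne`) are hypothetical and not Ballista's plugin API. The sketch also assumes plugins are registered in-process; actually loading them from a shared library is left out.

```rust
use std::collections::HashMap;

// Hypothetical sketch: this is NOT Ballista's plugin API.

/// The contract a user-supplied function implements. Integer arguments
/// keep the example small; a real engine would work on columnar data.
pub trait ScalarUdf {
    fn name(&self) -> &str;
    fn call(&self, args: &[i64]) -> Result<i64, String>;
}

/// The engine side: it knows UDFs only by name and through the trait.
pub struct UdfRegistry {
    udfs: HashMap<String, Box<dyn ScalarUdf>>,
}

impl UdfRegistry {
    pub fn new() -> Self {
        Self { udfs: HashMap::new() }
    }

    /// Register a plugin under the name it reports.
    pub fn register(&mut self, udf: Box<dyn ScalarUdf>) {
        self.udfs.insert(udf.name().to_string(), udf);
    }

    /// Look the function up by name and invoke it.
    pub fn call(&self, name: &str, args: &[i64]) -> Result<i64, String> {
        match self.udfs.get(name) {
            Some(udf) => udf.call(args),
            None => Err(format!("unknown UDF: {}", name)),
        }
    }
}

/// The user side: a plugin written without modifying the engine.
pub struct AddOne;

impl ScalarUdf for AddOne {
    fn name(&self) -> &str {
        "add_one"
    }

    fn call(&self, args: &[i64]) -> Result<i64, String> {
        match args {
            [x] => Ok(x + 1),
            _ => Err("add_one expects exactly one argument".to_string()),
        }
    }
}
```

After `registry.register(Box::new(AddOne))`, a call such as `registry.call("add_one", &[41])` is dispatched through the trait object, and an unregistered name yields an error instead of a panic.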
@realno says #1881 (comment)