Arrow enabled implementation and comparison with pySpark #45
Comments
@voltcode: Thank you for your comment! We totally agree and we are working on more benchmarks as I type this :) For a comparison against Arrow enabled Pandas in Python, having efficient .NET support for Arrow is a must. We are already working very closely with the Arrow community towards getting the initial version done over the next few weeks. @eerhardt from our team has been contributing several PRs lately to Apache Arrow:
Our plan is that once we have a reasonable Arrow .NET integration, we will run more benchmarks. This is a great point that you made. We are very open to hearing your take on any configurations we should pay special attention to. Since benchmarking is a sensitive topic, we want to be totally transparent about what we are running and how we are running it: you can find all the benchmarking code we are using here. Our goal is to have community members guide us towards sensible benchmarks.
@voltcode - Arrow enabled Pandas UDFs are something we are actively working on. The preliminary numbers are promising when there are lots of records being serialized between the JVM and .NET, and we are working on making them better.
A few months ago @chutchinson contributed the initial implementation of the .NET Apache Arrow library and we have been contributing to it as well. The .NET Arrow library shipped on nuget.org (https://www.nuget.org/packages/Apache.Arrow/) as part of the
@rapoth thank you for the clarification. The benchmark link doesn't work - can you provide a new one?
@voltcode Here is the benchmark link: https://github.com/dotnet/spark/tree/master/benchmark
@voltcode: Made a linking mistake 🤦 I fixed my original post. Thanks for catching!
cc: @Niharikadutta
It seems that using Arrow enabled Pandas could greatly improve pySpark efficiency. Did you compare against Arrow enabled Pandas or the version without? Is this detail something you could highlight in the comparison? As far as I know, .NET does not have proper Arrow bindings, and it would be of great benefit if it did. Did you consider creating them? There's a bit of a discussion in dotnet/machinelearning#69 if you need more context.