Arrow enabled implementation and comparison with pySpark #45

Open
voltcode opened this issue Apr 24, 2019 · 6 comments

Comments

@voltcode

It seems that using Arrow-enabled Pandas could greatly improve pySpark efficiency. Did you compare against Arrow-enabled Pandas or the version without? Is this detail something you could highlight in the comparison? As far as I know, .NET does not have proper Arrow bindings, and it would be of great benefit if it did. Did you consider creating them? There's a bit of a discussion in dotnet/machinelearning#69 if you need more context.

@rapoth
Contributor

rapoth commented Apr 24, 2019

@voltcode: Thank you for your comment! We totally agree and we are working on more benchmarks as I type this :) For a comparison against Arrow enabled Pandas in Python, having efficient .NET support for Arrow is a must.

We are already working very closely with the Arrow community towards getting the initial version done over the next few weeks. @eerhardt from our team has been contributing several PRs lately to Apache Arrow:

Our plan is to run more benchmarks once we have a reasonable Arrow .NET integration. This is a great point that you made. We are very open to hearing your take on any configurations we should pay special attention to. Since benchmarking is a sensitive topic, we want to be totally transparent about what we are running and how we are running it - you can find all the benchmarking code we are using here. Our goal is to have community members guide us towards getting sensible benchmarks.

@eerhardt
Member

@voltcode - Arrow enabled Pandas UDFs are something we are actively working on. The preliminary numbers are promising when there are lots of records being serialized between the JVM and .NET, and we are working on making them better.
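
As a rough illustration of why batching helps when many records cross a process boundary, here is a minimal Python sketch that uses only the standard library. It is not Arrow, and it is not the Spark or .NET for Apache Spark implementation. The function names (`serialize_row_at_a_time`, `serialize_as_batch`, `deserialize_batch`) and the payload layout are hypothetical, chosen only to contrast one payload per record with a single contiguous columnar buffer.

```python
import pickle
import struct
from array import array


def serialize_row_at_a_time(values):
    """Pickle every record on its own, producing one payload per row."""
    return [pickle.dumps(v) for v in values]


def serialize_as_batch(values):
    """Pack all records into one buffer: a count header, then the raw doubles.

    The header is a little-endian 64-bit count; the doubles use the
    machine's native byte order, which is enough for a same-host sketch.
    """
    column = array("d", values)
    header = struct.pack("<q", len(column))
    return header + column.tobytes()


def deserialize_batch(payload):
    """Rebuild the column of doubles from a payload made by serialize_as_batch."""
    header_size = struct.calcsize("<q")
    (count,) = struct.unpack_from("<q", payload, 0)
    column = array("d")
    column.frombytes(payload[header_size:header_size + count * column.itemsize])
    return column


if __name__ == "__main__":
    values = [float(i) for i in range(100_000)]
    rows = serialize_row_at_a_time(values)
    batch = serialize_as_batch(values)
    print("row-at-a-time bytes:", sum(len(r) for r in rows))
    print("batch bytes:", len(batch))
    print("round trip ok:", list(deserialize_batch(batch)) == values)
```

The batch payload is one count header plus the raw values, while every per-row pickle carries its own framing, so the gap grows with the number of records. Arrow's actual format is much richer (schemas, null bitmaps, nested types), so treat this only as intuition for the record-count effect described above.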

As far as I know, .NET does not have proper Arrow bindings, and it would be of great benefit if it did, did you consider creating it?

A few months ago @chutchinson contributed the initial implementation of the .NET Apache Arrow library and we have been contributing to it as well. The .NET Arrow library shipped on nuget.org (https://www.nuget.org/packages/Apache.Arrow/) as part of the 0.13 release of Apache Arrow earlier this month, and more improvements are on the way. If anyone is interested in helping make the .NET Apache Arrow library better, check out the open .NET Arrow issues, or try out the Arrow NuGet package and log any suggestions/bugs/feedback. Contributions are always welcome.

@voltcode
Author

@rapoth thank you for the clarification. The benchmark link doesn't work - can you provide a new one?

@imback82
Contributor

@voltcode Here is the benchmark link: https://github.com/dotnet/spark/tree/master/benchmark

@rapoth
Contributor

rapoth commented Apr 25, 2019

@voltcode: Made a linking mistake 🤦 I fixed my original post. Thanks for catching!

@imback82
Contributor

cc: @Niharikadutta
