Arrow enabled implementation and comparison with pySpark #45

voltcode · 2019-04-24T18:56:43Z

It seems that using Arrow-enabled Pandas could greatly improve pySpark efficiency. Did you compare against Arrow-enabled Pandas or the version without? Is this detail something you could highlight in the comparison? As far as I know, .NET does not have proper Arrow bindings, and it would be of great benefit if it did. Did you consider creating them? There's a bit of discussion in dotnet/machinelearning#69 if you need more context.

rapoth · 2019-04-24T19:31:26Z

@voltcode: Thank you for your comment! We totally agree and we are working on more benchmarks as I type this :) For a comparison against Arrow enabled Pandas in Python, having efficient .NET support for Arrow is a must.

We are already working very closely with the Arrow community towards getting the initial version done over the next few weeks. @eerhardt from our team has been contributing several PRs lately to Apache Arrow.

Our plan is that once we have a reasonable Arrow .NET integration, we will run more benchmarks. This is a great point that you made. We are very open to hearing your take on any configurations we should pay special attention to. Since benchmarking is a sensitive topic, we want to be totally transparent about what we are running and how we are running it. You can find all the benchmarking code we are using here. Our goal is to have community members guide us towards getting sensible benchmarks.

eerhardt · 2019-04-24T19:40:41Z

@voltcode - Arrow enabled Pandas UDFs are something we are actively working on. The preliminary numbers are promising when there are lots of records being serialized between the JVM and .NET, and we are working on making them better.
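To illustrate why batching tends to help when many records cross a process boundary, here is a minimal Python sketch. It is not Arrow and not the .NET for Apache Spark implementation: the function names are hypothetical, and a plain contiguous buffer of doubles from the standard library stands in for a columnar format. It only shows that sending one packed column avoids the per-record framing that row-at-a-time serialization pays.

```python
import pickle
import struct
from array import array


def serialize_row_by_row(values):
    """Pickle every record on its own, as a row-at-a-time protocol might."""
    return [pickle.dumps(v) for v in values]


def serialize_as_batch(values):
    """Pack all records into one buffer: an 8-byte count, then raw doubles."""
    column = array("d", values)
    return struct.pack("<q", len(column)) + column.tobytes()


def deserialize_batch(buffer):
    """Rebuild the column from the buffer written by serialize_as_batch."""
    (count,) = struct.unpack_from("<q", buffer, 0)
    column = array("d")
    column.frombytes(buffer[8:8 + count * column.itemsize])
    return column


# Hypothetical workload: 10,000 floating-point records.
values = [float(i) for i in range(10_000)]

rows = serialize_row_by_row(values)
batch = serialize_as_batch(values)
row_bytes = sum(len(r) for r in rows)

print(f"row-by-row: {len(rows)} messages, {row_bytes} bytes")
print(f"batched:    1 message, {len(batch)} bytes")
```

The batched form produces a single message whose payload is just the values plus a small header, while the row-by-row form repeats the serializer's framing for every record. Real Arrow record batches add schema, validity bitmaps, and alignment on top of this idea.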

> As far as I know, .NET does not have proper Arrow bindings, and it would be of great benefit if it did. Did you consider creating them?

A few months ago @chutchinson contributed the initial implementation of the .NET Apache Arrow library and we have been contributing to it as well. The .NET Arrow library shipped on nuget.org (https://www.nuget.org/packages/Apache.Arrow/) as part of the 0.13 release of Apache Arrow earlier this month, and more improvements are on the way. If anyone is interested in helping make the .NET Apache Arrow library better, check out the open .NET Arrow issues, or try out the Arrow NuGet package and log any suggestions/bugs/feedback. Contributions are always welcome.

voltcode · 2019-04-25T06:27:43Z

@rapoth thank you for the clarification. The benchmark link doesn't work - can you provide a new one?

imback82 · 2019-04-25T06:32:22Z

@voltcode Here is the benchmark link: https://github.com/dotnet/spark/tree/master/benchmark

rapoth · 2019-04-25T15:45:34Z

@voltcode: Made a linking mistake 🤦 I fixed my original post. Thanks for catching!

imback82 · 2019-08-15T23:50:45Z

cc: @Niharikadutta

eerhardt mentioned this issue May 31, 2019

Implement VectorUdf and use it in Queries 1 and 8 of TPCH benchmarks. #127

Merged

eerhardt mentioned this issue Jun 14, 2019

Implement Grouped Map UDFs #143

Merged

imback82 assigned imback82 and unassigned imback82 Aug 31, 2019
