
[SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame #23760

Closed
wants to merge 5 commits into apache:master from HyukjinKwon:SPARK-26762

Conversation

@HyukjinKwon (Member) commented Feb 12, 2019

What changes were proposed in this pull request?

This PR adds support for Arrow optimization in the conversion from a Spark DataFrame to an R DataFrame.
Like the PySpark side, it falls back to the non-optimized code path when Arrow optimization cannot be used.
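As a minimal sketch of that fallback decision (hypothetical shape, not the actual SparkR internals), the optimized path runs only when the conf is enabled and the arrow R package loads:

```r
# Hedged sketch, reusing the conf name from this PR; not the real code path.
useArrow <- identical(sparkR.conf("spark.sql.execution.arrow.enabled", "false")[[1]], "true") &&
  requireNamespace("arrow", quietly = TRUE)
if (useArrow) {
  # read Arrow record batches from the JVM side and bind them into one data.frame
} else {
  # fall back to the existing row-by-row SerDe collection path
}
```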

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

Requirements

  • R 3.5.x
  • Arrow package 0.12+, installable from GitHub:

```bash
Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
```

Note: currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.
Note: currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

Benchmarks

Shell (each run is preceded by `sync && sudo purge` to flush the OS filesystem cache; `purge` is a macOS command)

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

R code

```r
df <- cache(createDataFrame(read.csv("500000.csv")))
count(df)  # materialize the cache before timing

test <- function() {
  options(digits.secs = 6)  # print fractional seconds
  start.time <- Sys.time()
  collect(df)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

Data (350 MB):

```r
object.size(read.csv("500000.csv"))
# 350379504 bytes
```

"500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

Results

Without Arrow optimization:

```
Time difference of 221.32014 secs
```

With Arrow optimization:

```
Time difference of 15.51145 secs
```

The performance improvement was around 1426% (221.32 s / 15.51 s ≈ 14.3×).

Limitations:

  • For now, Arrow optimization with R does not support data containing a raw column, or a schema where the user explicitly gives a float type; both produce corrupt values. In these cases we fall back to the non-optimized code path (see the sketch after this list).

  • Due to ARROW-4512, batches cannot be sent and received one by one; all batches have to be sent at once in the Arrow stream format. This needs improvement later.
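For illustration, a hedged example of the float-type case (illustrative only; the exact fallback trigger lives in the SparkR internals):

```r
# An explicit float type in the schema: per the limitation above, this
# should fall back to the non-optimized path even with Arrow enabled.
schema <- structType(structField("x", "float"))
df <- createDataFrame(data.frame(x = c(1.0, 2.0)), schema = schema)
collect(df)  # collected via the row-based SerDe, not Arrow
```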

How was this patch tested?

Existing tests related to Arrow optimization cover this change. It was also manually tested.

@HyukjinKwon (Member, Author):

cc @BryanCutler, @viirya, @felixcheung, @icexelloss, @rxin, @gatorsmile, @shivaram, @falaki, @yanboliang

Looks like the previous collect code wasn't performant enough. This optimization applies to head and take as well; see the example below.
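For example (hedged; these are the standard SparkR APIs, which go through the same collection path as collect):

```r
# With spark.sql.execution.arrow.enabled=true, these should also benefit:
df <- createDataFrame(mtcars)
head(df, 5)  # first 5 rows, returned as an R data.frame
take(df, 5)  # same rows via take
```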

@HyukjinKwon (Member, Author):

I am going to update SQLConf and the documentation later, when this job is finished. I also need to deduplicate some logic across the R Arrow code paths once that is done.

@HyukjinKwon reopened this Feb 12, 2019

@SparkQA commented Feb 12, 2019

Test build #102242 has finished for PR 23760 at commit 10c3f11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 12, 2019

Test build #102249 has finished for PR 23760 at commit 8c84556.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a review comment:

LGTM, I just skimmed the R code but the rest seems reasonable and the speedup looks awesome!

@BryanCutler (Member):

It might be a good idea to add a test that forces a delay on one partition's execution, so you can verify that the R side receives the batches in the correct order. This was discussed here and added in bf2feec.
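A hedged sketch of such a test in SparkR (illustrative only; dapply does not expose the partition index, so the delay is keyed off the data instead):

```r
# Force the partition holding part == 1 to finish last, so any
# out-of-order batch handling on the R side would become visible.
df <- createDataFrame(data.frame(part = rep(1:2, each = 100)))
df <- repartition(df, 2L, df$part)
delayed <- dapply(df, function(rdf) {
  if (nrow(rdf) > 0 && rdf$part[1] == 1) Sys.sleep(3)
  rdf
}, schema(df))
collected <- collect(delayed)  # rows should still arrive in partition order
```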

@HyukjinKwon (Member, Author):

BTW, I am speeding this up and planning to write a blog post on the Apache Arrow site like https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/ (thanks for letting me know, @felixcheung).

Looks like sparklyr added Arrow optimization already.

@HyukjinKwon (Member, Author):

re: #23760 (comment)

Thing is, nowadays SparkR doesn't have RDD APIs, and they are in transition to being removed completely. Maybe I can try to test with dapply, but I think it's difficult to get the partition index. It should be okay since the code was restored from the previous code almost as-is.


@falaki (Contributor) left a review comment:

Thanks for doing this. I did a first pass.

@HyukjinKwon (Member, Author) commented Feb 15, 2019

I hope we can go ahead as-is if there are no notable comments, to avoid conflict hell.

Currently, I have intentionally not started on some items at #23787 (comment) to avoid conflicts, but I am trying to complete them all as soon as possible.

FWIW, sparklyr already added Arrow optimization; see sparklyr#1611 and https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/

@HyukjinKwon (Member, Author):

gentle ping. Would you guys mind if I go ahead?

@felixcheung (Member):

Looks like it would help to break off the remaining tasks in JIRA? #23787 (comment)

@felixcheung reopened this Feb 19, 2019
@felixcheung (Member):

Sorry, clicked the wrong button.

@HyukjinKwon (Member, Author):

Yup, will add the obvious ones first.

@felixcheung (Member):

Pending follow-ups, it's OK with me to merge this first; it's getting hard to track what should be done and what has changed.

@HyukjinKwon (Member, Author):

I have added the test, manually ran the tests, and created JIRAs under https://issues.apache.org/jira/browse/SPARK-26759 for follow-ups. I will get this in soon if there are no further comments.

@SparkQA commented Feb 19, 2019

Test build #102502 has finished for PR 23760 at commit cfe947c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

Merged to master.

Thank you all, @BryanCutler, @vanzin, @felixcheung, @viirya, @falaki

Arnoldosmium pushed a commit to palantir/spark that referenced this pull request Apr 10, 2019
vinooganesh pushed a commit to palantir/spark that referenced this pull request Jun 10, 2019
@HyukjinKwon deleted the SPARK-26762 branch March 3, 2020 01:19