[SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame #23760
Conversation
cc @BryanCutler, @viirya, @felixcheung, @icexelloss, @rxin, @gatorsmile, @shivaram, @falaki, @yanboliang
Test build #102242 has finished for PR 23760 at commit
Test build #102249 has finished for PR 23760 at commit
LGTM, I just skimmed the R code but the rest seems reasonable and the speedup looks awesome!
BTW, I am speeding up and planning to make a blog post at Apache Arrow like https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/ (thanks for letting me know, @felixcheung). Looks like sparklyr added Arrow optimization already.
re: #23760 (comment) The thing is, nowadays SparkR doesn't have RDD APIs and they are in transition to being removed completely. Maybe I can try to test with
Thanks for doing this. I did a first pass.
I hope we can go ahead as is if there are no notable comments, to avoid conflict hell. Currently I intentionally haven't started to work on some items at #23787 (comment) to avoid conflicts, but I am trying to complete them all as soon as possible. FWIW, sparklyr already added Arrow optimization: sparklyr#1611 and https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/
Gentle ping. Would you guys mind if I go ahead?
Looks like it would help to break off the remaining tasks in JIRA? #23787 (comment)
Sorry, clicked wrong.
Yup, will add the obvious ones first.
Pending follow-ups, it's OK with me to merge this first. It's getting hard to track what should be done and what's changed.
79f204e to cfe947c
I have added the test, manually ran the tests, and created JIRAs under https://issues.apache.org/jira/browse/SPARK-26759 for follow-ups. Will get this in soon if there are no more particular comments.
Test build #102502 has finished for PR 23760 at commit
Merged to master. Thank you all, @BryanCutler, @vanzin, @felixcheung, @viirya, @falaki
What changes were proposed in this pull request?
This PR adds Arrow optimization for the conversion from a Spark DataFrame to an R DataFrame.
Like the PySpark side, it falls back to the non-optimized code path when it is unable to use the Arrow optimization.
This can be tested as below, with the shell started as `./bin/sparkR --conf spark.sql.execution.arrow.enabled=true`:

```r
collect(createDataFrame(mtcars))
```
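If you prefer to set the flag from inside an R session rather than on the `sparkR` command line, a minimal sketch (assuming SparkR is already on `.libPaths()`) looks like this:

```r
library(SparkR)

# Enable the Arrow optimization through the session config instead of a
# command-line --conf flag, then collect a Spark DataFrame back into R.
sparkR.session(sparkConfig = list(spark.sql.execution.arrow.enabled = "true"))

collect(createDataFrame(mtcars))
```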
Requirements
- R 3.5.x
- Arrow package 0.12+

```bash
Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
```
Note: currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.
Note: currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.
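Before enabling the optimization, a quick sanity check (a sketch, not part of the PR) that the package installed correctly:

```r
# Verify the arrow R package (0.12+) is installed and loadable; SparkR falls
# back to the non-Arrow code path when it is missing.
if (requireNamespace("arrow", quietly = TRUE)) {
  print(utils::packageVersion("arrow"))
} else {
  message("arrow package not found")
}
```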
Benchmarks
Shell

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

(`sync && sudo purge` drops the macOS filesystem cache so both runs start cold.)
R code
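The timing script run in each of the two shells above:

```r
df <- cache(createDataFrame(read.csv("500000.csv")))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  collect(df)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```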
Data (350 MB): the "500000 Records" sample CSV from http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/; `object.size(read.csv("500000.csv"))` reports 350379504 bytes.
Results
- Without Arrow optimization: Time difference of 221.32014 secs
- With Arrow optimization: Time difference of 15.51145 secs

The performance improvement was around 1426% (roughly a 14x speedup).
Limitations:
- For now, Arrow optimization with R is not supported when the data contains the `raw` type or when the user explicitly gives a float type in the schema, since these produce corrupt values; in those cases it falls back to the non-optimized code path. A hypothetical example is sketched below.
- Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in the Arrow stream format at once. This needs improvement later.
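As a hypothetical illustration of the first limitation (the column names and values below are made up, not from the PR), an explicit float field in the schema means `collect()` takes the fallback path even with Arrow enabled:

```r
# Hypothetical example: an explicit "float" column in the user-supplied schema
# makes SparkR fall back to the non-Arrow path to avoid corrupt values.
schema <- structType(structField("x", "float"), structField("y", "string"))
df <- createDataFrame(data.frame(x = c(1.5, 2.5), y = c("a", "b")), schema = schema)
collect(df)  # expected to go through the non-optimized code path
```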
How was this patch tested?
Existing tests related to Arrow optimization cover this change. Also manually tested.