[SPARK-25274][PYTHON][SQL] In toPandas with Arrow send un-ordered record batches to improve performance #22275
Conversation
if (partitionCount == numPartitions) {
  batchWriter.end()
  out.writeInt(batchOrder.length)
  // Batch order indices are from 0 to N-1 batches, sorted by order they arrived
nit: Batch order indices are from 0 to N-1 batches, sorted by order they arrived. Re-sort indices to the correct order to build a table.
How about a slight change? // Re-order batches according to these indices to build a table.
How about something like // Sort by the output global batch indexes (partition index, partition batch index) tuple?
When I first read this code path I got confused myself, so I think we should spend a bit of time on the comment here.
yeah, sounds good
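Since the wording tripped people up, here is a tiny illustration of what that bookkeeping amounts to (plain Python used only as pseudocode for the JVM side; the data and names are made up):

```python
# Arrival order of batches at the driver, as (partition index, batch index within
# that partition). Here partition 1 happened to finish before partition 0.
batch_order = [(1, 0), (1, 1), (0, 0), (0, 1)]

# Sorting by the (partition index, partition batch index) tuple recovers the logical
# order; keeping each batch's arrival position gives the indices that get sent to
# Python so it can re-index the batches it already received.
indices_to_send = [pos for _, pos in sorted(zip(batch_order, range(len(batch_order))))]
assert indices_to_send == [2, 3, 0, 1]
```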
python/pyspark/serializers.py
Outdated
@@ -187,9 +187,15 @@ def loads(self, obj):

class ArrowStreamSerializer(Serializer):
    """
    Serializes Arrow record batches as a stream.
    Serializes Arrow record batches as a stream. Optionally load the ordering of the batches as a
This is optional. Do we have other usage of this ArrowStreamSerializer without the ordering?
Yeah, it's also used in the createDataFrame path, but that only uses dump_stream. Still, it seemed best to make this an optional feature of the serializer.
Test build #95441 has finished for PR 22275 at commit
Thanks @viirya! What are your thoughts @HyukjinKwon? I consolidated the batch order serializer from before into the ArrowStreamSerializer to simplify a little.
@holdenk I was wondering if you had any thoughts on this? Thanks!
Sure, I'll take a look on Friday if it's not urgent.
Generally, is this going to limit how much data we can pass along because of the bit length of the index?
So the index passed to Python is the RecordBatch index, not an element index, and it would limit the number of batches to Int.MAX. I wouldn't expect that to be likely, and you can always set the number of batches to 1 per partition, so that would be the limiting factor then. WDYT @felixcheung?
Got it. So the size of each batch could grow.
python/pyspark/sql/tests.py
Outdated
df = self.spark.range(64, numPartitions=8).toDF("a")
with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": 4}):
    pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
    self.assertPandasEqual(pdf, pdf_arrow)
Hm, is this test case "enough" to trigger any possible problem just by randomness? Would increasing the number of batches or the number of records per batch increase the chance of a streaming order or concurrency issue, perhaps?
This looks pretty similar to the kind of test case we could verify with something like hypothesis. Integrating hypothesis is probably too much work, but we could at least explore num partitions space in a loop quickly here. Would that help do you think @felixcheung ?
that sounds good
Thanks for working on this, reducing memory pressure in the driver is always welcome :) I got a bit confused reading the PR the first time through, so I think it might make sense to look at how we can improve the readability a bit.
python/pyspark/serializers.py
Outdated
index = read_int(stream)
self.batch_order.append(index)

def get_batch_order_and_reset(self):
Looking at _load_from_socket I think I understand why this was done as a separate function here, but what if the serializer itself returned either a tuple or re-ordered the batches itself?
I'm just trying to get a better understanding, not saying those are better designs.
python/pyspark/serializers.py
Outdated
@@ -208,8 +214,26 @@ def load_stream(self, stream):
    for batch in reader:
        yield batch

    if self.load_batch_order:
        num = read_int(stream)
        self.batch_order = []
If we're going to have get_batch_order_and_reset as a separate function, could we verify batch_order is None before we reset and throw here if it's not? Just thinking of future folks who might have to debug something here.
// After last batch, end the stream
if (lastIndex == results.length) {
  batchWriter.end()
  arrowBatches.indices.foreach { i => batchOrder.append((index, i)) }
Could we call i something more descriptive, like partition_batch_num or similar?
yup!
Thanks for the review @holdenk! I haven't had time to follow up, but I'll take a look through this and see what I can do about making things clearer.
retest this please
Test build #98202 has finished for PR 22275 at commit
Force-pushed from d6fefee to 7d19977.
Apologies for the delay in circling back to this. I reorganized a little to simplify and expanded the comments to hopefully better describe the code. A quick summary of the changes: I changed the ArrowStreamSerializer to not have any state - that seemed to complicate things. So instead of saving the batch order indices, they are loaded on the last iteration of load_stream.
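For readers following along, here is a minimal sketch of that stateless design, assuming the wire format described in this PR (an Arrow stream of batches, then an int count, then one int per batch); the class name and details below are illustrative, not the exact code:

```python
import pyarrow as pa
from pyspark.serializers import Serializer, read_int


class UnorderedBatchSerializer(Serializer):
    """Illustrative sketch of a serializer with no ordering state."""

    def load_stream(self, stream):
        # Yield Arrow record batches in the order they arrived from the JVM,
        # which may not match the logical partition order.
        reader = pa.ipc.open_stream(stream)
        for batch in reader:
            yield batch

        # Once the Arrow stream ends, read the batch count followed by one
        # arrival-position index per batch, and yield that list as the final item.
        num = read_int(stream)
        yield [read_int(stream) for _ in range(num)]
```

The caller then treats the last yielded item as the ordering and re-indexes the batches, so no state needs to live on the serializer between calls.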
python/pyspark/sql/tests.py
Outdated
(64, 1, 64),  # Test single partition, single batch
(64, 1, 8),   # Test single partition, multiple batches
(30, 7, 2),   # Test different sized partitions
]
@holdenk and @felixcheung , I didn't do a loop but chose some different levels of partition numbers to be a bit more sure that partitions won't end up in order. I also added some other cases of different partition/batch ratios. Let me know if you think we need more to be sure here.
I don't see how we're guaranteeing out-of-order from the JVM. Could we delay on one of the early partitions to guarantee out of order?
Yeah it's not a guarantee, but with a large num of partitions, it's a pretty slim chance they will all be in order. I can also add a case with some delay. My only concern is how big to make the delay to be sure it's enough without adding wasted time to the tests.
How about we keep the case with a large number of partitions and add a case with 100ms delay on the first partition?
@holdenk , I updated the tests, please take another look when you get a chance. Thanks!
I like the new tests, I think 0.1 on one of the partitions is enough.
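For reference, a rough sketch of what such a parameterized test could look like; the helpers (`_toPandas_arrow_toggle`, `sql_conf`, `assertPandasEqual`) come from the snippets above, while the exact cases and structure here are illustrative rather than the final test:

```python
import time


def test_toPandas_batch_order(self):
    # Each case is (number of rows, number of partitions, maxRecordsPerBatch).
    cases = [
        (64, 1, 64),   # single partition, single batch
        (64, 1, 8),    # single partition, multiple batches
        (64, 64, 1),   # many partitions, so in-order arrival is very unlikely
        (30, 7, 2),    # different sized partitions
    ]

    def delay_first_part(partition_index, iterator):
        # Slow down partition 0 so later partitions are collected first.
        if partition_index == 0:
            time.sleep(0.1)
        return iterator

    for num_rows, num_parts, records_per_batch in cases:
        df = self.spark.range(num_rows, numPartitions=num_parts).toDF("a")
        df = df.rdd.mapPartitionsWithIndex(delay_first_part).toDF()
        conf = {"spark.sql.execution.arrow.maxRecordsPerBatch": records_per_batch}
        with self.sql_conf(conf):
            pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
            self.assertPandasEqual(pdf, pdf_arrow)
```

Applying the delay to every case keeps the sketch simple; the real test can limit it to a single case to avoid adding wasted time to the suite.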
Test build #98284 has finished for PR 22275 at commit
Test build #98285 has finished for PR 22275 at commit
Test build #98538 has finished for PR 22275 at commit
Test build #98624 has finished for PR 22275 at commit
Test build #98630 has finished for PR 22275 at commit
Test build #98629 has finished for PR 22275 at commit
ping @HyukjinKwon and @viirya to maybe take another look at the recent changes to make this cleaner, if you are able to. Thanks!
python/pyspark/sql/tests.py
Outdated
def delay_first_part(partition_index, iterator):
    if partition_index == 0:
        time.sleep(0.1)
I like this :)
I like the change to the tests, thanks @BryanCutler
LGTM, the current change looks clearer. Thanks @BryanCutler
Thanks for asking me. Will take a look within a few days. Don't block on me for clarification. I can take a look even after it is merged, and we can make a follow-up change if there was an issue we missed.
…-batches-SPARK-25274
Test build #99743 has finished for PR 22275 at commit
merged to master, thanks @holdenk @viirya and @felixcheung!
Sorry @BryanCutler for my super super late input. LGTM to me as well :D.
Thanks @HyukjinKwon!
…ord batches to improve performance

## What changes were proposed in this pull request?

When executing `toPandas` with Arrow enabled, partitions that arrive in the JVM out of order must be buffered before they can be sent to Python. This causes excess memory to be used in the driver JVM and increases the time it takes to complete, because data must sit in the JVM waiting for preceding partitions to come in.

This change sends un-ordered partitions to Python as soon as they arrive in the JVM, followed by a list of partition indices so that Python can assemble the data in the correct order. This way, data is not buffered in the JVM and there is no waiting on particular partitions, so performance is increased.

Followup to apache#21546

## How was this patch tested?

Added a new test with a large number of batches per partition, and a test that forces a small delay in the first partition. These verify that partitions are collected out of order and then put back in the correct order in Python.

## Performance Tests - toPandas

Tests were run on a 4 node standalone cluster with 32 cores total, 14.04.1-Ubuntu and OpenJDK 8; measured wall clock time to execute `toPandas()` and took the average best time of 5 runs/5 loops each.

Test code
```python
df = spark.range(1 << 25, numPartitions=32).toDF("id").withColumn("x1", rand()).withColumn("x2", rand()).withColumn("x3", rand()).withColumn("x4", rand())

for i in range(5):
    start = time.time()
    _ = df.toPandas()
    elapsed = time.time() - start
```

Spark config
```
spark.driver.memory 5g
spark.executor.memory 5g
spark.driver.maxResultSize 2g
spark.sql.execution.arrow.enabled true
```

Current Master w/ Arrow stream | This PR
---------------------|------------
5.16207 | 4.342533
5.133671 | 4.399408
5.147513 | 4.468471
5.105243 | 4.36524
5.018685 | 4.373791

Avg Master | Avg This PR
------------------|--------------
5.1134364 | 4.3898886

Speedup of **1.164821449**

Closes apache#22275 from BryanCutler/arrow-toPandas-oo-batches-SPARK-25274.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
… Arrow enabled

## What changes were proposed in this pull request?

apache#22275 introduced a performance improvement where we send partitions out of order to Python and then, as a last step, send the partition order as well. However, if there are no partitions we will never send the partition order, and we will get an "EofError" on the Python side. This PR fixes this by also sending the partition order if there are no partitions present.

## How was this patch tested?

New unit test added.

Closes apache#24650 from dvogelbacher/dv/fixNoPartitionArrowConversion.

Authored-by: David Vogelbacher <dvogelbacher@palantir.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
When executing toPandas with Arrow enabled, partitions that arrive in the JVM out of order must be buffered before they can be sent to Python. This causes excess memory to be used in the driver JVM and increases the time it takes to complete, because data must sit in the JVM waiting for preceding partitions to come in.

This change sends un-ordered partitions to Python as soon as they arrive in the JVM, followed by a list of partition indices so that Python can assemble the data in the correct order. This way, data is not buffered in the JVM and there is no waiting on particular partitions, so performance is increased.
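To make that concrete, here is a rough sketch of the driver-side reassembly, assuming a loader like the serializer sketch earlier in the thread, where the final yielded item is the list of ordering indices (function and variable names here are illustrative):

```python
import pyarrow as pa


def to_pandas_from_unordered(stream_iter):
    # stream_iter yields record batches in arrival order, then one final item
    # holding the ordering indices.
    results = list(stream_iter)
    batches, batch_order = results[:-1], results[-1]

    # Re-index the batches into logical partition order before building the table.
    ordered = [batches[i] for i in batch_order]
    return pa.Table.from_batches(ordered).to_pandas()
```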
Followup to #21546
How was this patch tested?
Added a new test with a large number of batches per partition, and a test that forces a small delay in the first partition. These verify that partitions are collected out of order and then put back in the correct order in Python.
Performance Tests - toPandas
Tests were run on a 4 node standalone cluster with 32 cores total, 14.04.1-Ubuntu and OpenJDK 8; measured wall clock time to execute toPandas() and took the average best time of 5 runs/5 loops each.

Test code
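The snippet below is reproduced from the squashed commit message above; it assumes `from pyspark.sql.functions import rand` and `import time`.

```python
df = spark.range(1 << 25, numPartitions=32).toDF("id").withColumn("x1", rand()).withColumn("x2", rand()).withColumn("x3", rand()).withColumn("x4", rand())

for i in range(5):
    start = time.time()
    _ = df.toPandas()
    elapsed = time.time() - start
```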
Spark config
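The configuration and timing results, also from the commit message above:

```
spark.driver.memory 5g
spark.executor.memory 5g
spark.driver.maxResultSize 2g
spark.sql.execution.arrow.enabled true
```

Current Master w/ Arrow stream | This PR
---------------------|------------
5.16207 | 4.342533
5.133671 | 4.399408
5.147513 | 4.468471
5.105243 | 4.36524
5.018685 | 4.373791

Avg Master | Avg This PR
------------------|--------------
5.1134364 | 4.3898886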
Speedup of 1.164821449