[ARROW] Arrow serialization should not introduce extra shuffle for outermost limit #4662
Conversation
```scala
def executeArrowBatchCollect: SparkPlan => Array[Array[Byte]] = {
  case adaptiveSparkPlan: AdaptiveSparkPlanExec =>
    executeArrowBatchCollect(adaptiveSparkPlan.finalPhysicalPlan)
```
The `AdaptiveSparkPlanExec.finalPhysicalPlan` function was introduced in SPARK-41914, and it may present compatibility issues if the underlying Spark runtime lacks the corresponding patch.
Shall we call the related private method via reflection as a workaround? It's unacceptable if we break compatibility.
I have changed it to call `adaptiveSparkPlan.finalPhysicalPlan` via reflection.
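For context, a minimal sketch of what such a reflective call could look like (illustration only; the helper below is hypothetical and not necessarily how Kyuubi implements it):

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// Hypothetical helper: invoke AdaptiveSparkPlanExec.finalPhysicalPlan via reflection,
// avoiding a compile-time dependency on a method that only exists after SPARK-41914.
// On older Spark runtimes getMethod throws NoSuchMethodException, so callers would
// need a fallback (e.g. using adaptivePlan.executedPlan instead).
def finalPhysicalPlan(adaptivePlan: AdaptiveSparkPlanExec): SparkPlan = {
  val method = adaptivePlan.getClass.getMethod("finalPhysicalPlan")
  method.setAccessible(true)
  method.invoke(adaptivePlan).asInstanceOf[SparkPlan]
}
```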
```scala
      }
      i += 1
    }
    result.toArray
```
Offset support will be added in a separate PR to adapt to Spark 3.4.0.
Can we unify the class names a bit more? I don't see the difference between xxxUtils and xxxHelper... Logically, most of those methods should be private.
```scala
estimatedBatchSize += (row match {
  case ur: UnsafeRow => ur.getSizeInBytes
  // Trying to estimate the size of the current row, assuming 16 bytes per value.
  case ir: InternalRow => ir.numFields * 16
```
In general, we can infer the row size from `schema.defaultSize`.
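For illustration, a tiny sketch of that suggestion (my own example, not project code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.StructType

// Sketch: use the exact byte size when the row is an UnsafeRow; otherwise fall back
// to the schema's declared default size, a static type-based per-row estimate.
def estimateRowSize(row: InternalRow, schema: StructType): Long = row match {
  case ur: UnsafeRow => ur.getSizeInBytes
  case _             => schema.defaultSize
}
```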
Sorry for the lack of documentation. This class `ArrowBatchIterator` is derived from `org.apache.spark.sql.execution.arrow.ArrowConverters.ArrowBatchWithSchemaIterator`, with two key differences:
- there is no requirement to write the schema at the batch header
- iteration halts when `rowCount` equals `limit`
Here is the diff compared with the latest Spark master branch (https://github.com/apache/spark/blob/3c189abd73afa998e8573cbfdaf0f72445284314/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala):

```diff
- private[sql] class ArrowBatchWithSchemaIterator(
+ private[sql] class ArrowBatchIterator(
      rowIter: Iterator[InternalRow],
      schema: StructType,
      maxRecordsPerBatch: Long,
      maxEstimatedBatchSize: Long,
+     limit: Long,
      timeZoneId: String,
      context: TaskContext)
-   extends ArrowBatchIterator(
-     rowIter, schema, maxRecordsPerBatch, timeZoneId, context) {
+   extends Iterator[Array[Byte]] {
+
-   private val arrowSchemaSize = SizeEstimator.estimate(arrowSchema)
    var rowCountInLastBatch: Long = 0
+   var rowCount: Long = 0

    override def next(): Array[Byte] = {
      val out = new ByteArrayOutputStream()
      val writeChannel = new WriteChannel(Channels.newChannel(out))

      rowCountInLastBatch = 0
-     var estimatedBatchSize = arrowSchemaSize
+     var estimatedBatchSize = 0
      Utils.tryWithSafeFinally {
-       // Always write the schema.
-       MessageSerializer.serialize(writeChannel, arrowSchema)

        // Always write the first row.
        while (rowIter.hasNext && (
@@ -31,15 +30,17 @@
          estimatedBatchSize < maxEstimatedBatchSize ||
          // If the size of rows are 0 or negative, unlimit it.
          maxRecordsPerBatch <= 0 ||
-         rowCountInLastBatch < maxRecordsPerBatch)) {
+         rowCountInLastBatch < maxRecordsPerBatch ||
+         rowCount < limit)) {
          val row = rowIter.next()
          arrowWriter.write(row)
          estimatedBatchSize += (row match {
            case ur: UnsafeRow => ur.getSizeInBytes
-           // Trying to estimate the size of the current row, assuming 16 bytes per value.
-           case ir: InternalRow => ir.numFields * 16
+           // Trying to estimate the size of the current row
+           case _: InternalRow => schema.defaultSize
          })
          rowCountInLastBatch += 1
+         rowCount += 1
        }
        arrowWriter.finish()
        val batch = unloader.getRecordBatch()
```
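As a side note, a small sketch (an assumption about intended usage, not Kyuubi code) of why skipping the schema header is safe: the consumer can serialize the schema once and then append the raw batch payloads to form a single Arrow IPC stream.

```scala
import java.io.ByteArrayOutputStream
import java.nio.channels.Channels
import org.apache.arrow.vector.ipc.WriteChannel
import org.apache.arrow.vector.ipc.message.MessageSerializer
import org.apache.arrow.vector.types.pojo.Schema

// Sketch: write the Arrow schema exactly once, then concatenate the schema-less
// batch payloads produced by an iterator like the one above.
def toIpcStream(arrowSchema: Schema, batches: Iterator[Array[Byte]]): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  val channel = new WriteChannel(Channels.newChannel(out))
  MessageSerializer.serialize(channel, arrowSchema) // schema header, written once
  batches.foreach(b => out.write(b))                // batch payloads appended as-is
  out.toByteArray
}
```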
Codecov Report

```
@@             Coverage Diff              @@
##             master    #4662      +/-   ##
============================================
+ Coverage     57.60%   57.85%   +0.25%
  Complexity       13       13
============================================
  Files           579      580       +1
  Lines         31951    32212     +261
  Branches       4269     4304      +35
============================================
+ Hits          18404    18635     +231
+ Misses        11785    11780       -5
- Partials       1762     1797      +35
```

... and 23 files with indirect coverage changes
I have completed the refactoring; please take another look when you have time. Thank you. @ulysses-you
Thank you @cfmcgrady, LGTM if tests pass.
```scala
val in = new ByteArrayInputStream(bytes)
val out = new ByteArrayOutputStream(bytes.length)

val rootAllocator = ArrowUtils.rootAllocator.newChildAllocator(
```
The name "rootAllocator" is not suitable then, and why should we create a child allocator?
I think it's for debugging; a named child allocator makes a leak like this easy to attribute:

```
Memory was leaked by query. Memory leaked: (128)
Allocator(slice) 0/128/128/9223372036854775807 (res/actual/peak/limit)
java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (128)
Allocator(slice) 0/128/128/9223372036854775807 (res/actual/peak/limit)
	at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437)
	at org.apache.spark.sql.execution.arrow.KyuubiArrowConverters$.slice(KyuubiArrowConverters.scala:91)
	at org.apache.spark.sql.kyuubi.SparkDatasetHelper$.doCollectLimit(SparkDatasetHelper.scala:170)
	at org.apache.spark.sql.kyuubi.SparkDatasetHelper$.$anonfun$executeArrowBatchCollect$1(SparkDatasetHelper.scala:51)
	at org.apache.spark.sql.kyuubi.SparkDatasetHelper$
```
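For reference, a minimal sketch of that pattern (illustrative only, not the Kyuubi code): a dedicated, named child allocator scopes the buffers of a single operation, and closing it is what surfaces a leak like the one above.

```scala
import org.apache.arrow.memory.{BufferAllocator, RootAllocator}

// Sketch: allocate from a named child allocator; if any buffer from it is still
// outstanding when close() is called, Arrow throws IllegalStateException
// ("Memory was leaked by query ..."), naming the child ("slice") rather than
// the shared root allocator.
val root: BufferAllocator = new RootAllocator(Long.MaxValue)
val child: BufferAllocator = root.newChildAllocator("slice", 0, Long.MaxValue)
try {
  val buf = child.buffer(128) // 128 bytes from the child allocator
  // ... fill and read the buffer ...
  buf.close()                 // release before the allocator is closed
} finally {
  child.close()               // the leak check happens here
  root.close()
}
```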
LGTM if CI passes
Great work, thanks
Thanks, merging to master/branch-1.7
[ARROW] Arrow serialization should not introduce extra shuffle for outermost limit

### _Why are the changes needed?_

The fundamental concept is to execute a job similar to the way in which `CollectLimitExec.executeCollect()` operates.

```sql
select * from parquet.`parquet/tpcds/sf1000/catalog_sales` limit 20;
```

Before this PR: (query plan screenshots omitted)

After this PR: (query plan screenshots omitted)

### _How was this patch tested?_

- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [x] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #4662 from cfmcgrady/arrow-collect-limit-exec-2.

Closes #4662

82c912e [Fu Chen] close vector
130bcb1 [Fu Chen] finally close
facc13f [Fu Chen] exclude rule OptimizeLimitZero
3700839 [Fu Chen] SparkArrowbasedOperationSuite adapt Spark-3.1.x
6064ab9 [Fu Chen] limit = 0 test case
6d596fc [Fu Chen] address comment
8280783 [Fu Chen] add `isStaticConfigKey` to adapt Spark-3.1.x
22cc70f [Fu Chen] add ut
b72bc6f [Fu Chen] add offset support to adapt Spark-3.4.x
9ffb44f [Fu Chen] make toBatchIterator private
c83cf3f [Fu Chen] SparkArrowbasedOperationSuite adapt Spark-3.1.x
573a262 [Fu Chen] fix
4cef204 [Fu Chen] SparkArrowbasedOperationSuite adapt Spark-3.1.x
d70aee3 [Fu Chen] SparkPlan.session -> SparkSession.active to adapt Spark-3.1.x
e3bf84c [Fu Chen] refactor
81886f0 [Fu Chen] address comment
2286afc [Fu Chen] reflective calla AdaptiveSparkPlanExec.finalPhysicalPlan
03d0747 [Fu Chen] address comment
25e4f05 [Fu Chen] add docs
885cf2c [Fu Chen] infer row size by schema.defaultSize
4e7ca54 [Fu Chen] unnecessarily changes
ee5a756 [Fu Chen] revert unnecessarily changes
6c5b1eb [Fu Chen] add ut
4212a89 [Fu Chen] refactor and add ut
ed8c692 [Fu Chen] refactor
0088671 [Fu Chen] refine
8593d85 [Fu Chen] driver slice last batch
a584943 [Fu Chen] arrow take

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: ulyssesyou <ulyssesyou@apache.org>
(cherry picked from commit 1a65125)
Signed-off-by: ulyssesyou <ulyssesyou@apache.org>
Why are the changes needed?
The fundamental concept is to execute a job similar to the way in which `CollectLimitExec.executeCollect()` operates.

Before this PR: (query plan screenshots omitted)

After this PR: (query plan screenshots omitted)
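As an aside, here is a self-contained sketch of that idea (illustration only, using plain Scala collections in place of RDD partitions): scan a few partitions at a time and stop once `limit` rows are gathered, rather than shuffling every partition into one.

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch: incrementally "run jobs" over small groups of partitions, in the spirit of
// CollectLimitExec.executeCollect()/SparkPlan.executeTake(), stopping early once the
// limit is reached so untouched partitions are never scanned.
def takeUpToLimit[T](partitions: Seq[Seq[T]], limit: Int, partsPerJob: Int = 1): Seq[T] = {
  val collected = ArrayBuffer.empty[T]
  var next = 0
  while (collected.size < limit && next < partitions.size) {
    val upTo = math.min(next + partsPerJob, partitions.size)
    partitions.slice(next, upTo).foreach { part =>
      collected ++= part.take(limit - collected.size)
    }
    next = upTo
  }
  collected.toSeq
}

// Example: only the first partitions are touched when the limit is small.
// takeUpToLimit(Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6)), limit = 4) == Seq(1, 2, 3, 4)
```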
How was this patch tested?
- [ ] Add some test cases that check the changes thoroughly, including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [x] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before making a pull request