
[SPARK-43082][CONNECT][PYTHON] Arrow-optimized Python UDFs in Spark Connect #40725

Closed · wants to merge 13 commits

Conversation

xinrong-meng (Member) commented Apr 10, 2023

What changes were proposed in this pull request?

Implement Arrow-optimized Python UDFs in Spark Connect.

Please see #39384 for motivation and performance improvements of Arrow-optimized Python UDFs.

Why are the changes needed?

Parity with vanilla PySpark.

Does this PR introduce any user-facing change?

Yes. In the Spark Connect Python client, users can:

  1. Set the useArrow parameter to True to enable Arrow optimization for a specific Python UDF.
>>> df = spark.range(2)
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).show()
+------------+                                                                  
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#18 AS <lambda>(id)#16]
+- ArrowEvalPython [<lambda>(id#14L)#15], [pythonUDF0#18], 200
   +- *(1) Range (0, 2, step=1, splits=1)
  2. Enable the spark.sql.execution.pythonUDF.arrow.enabled Spark configuration to make all Python UDFs Arrow-optimized.
>>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
>>> df.select(udf(lambda x : x + 1)('id')).show()
+------------+                                                                  
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#30 AS <lambda>(id)#28]
+- ArrowEvalPython [<lambda>(id#26L)#27], [pythonUDF0#30], 200
   +- *(1) Range (0, 2, step=1, splits=1)
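Conceptually, the ArrowEvalPython node means the UDF is fed columnar batches instead of one pickled row at a time. A minimal pure-Python sketch of the two evaluation modes (plain lists stand in for serialized rows and Arrow record batches; this is an illustration, not Spark's actual code):

```python
# Illustrative sketch: row-at-a-time vs. Arrow-batched UDF evaluation.
# Plain lists stand in for serialized rows and Arrow record batches.

def eval_row_at_a_time(func, rows):
    # Pickled Python UDF path: func is invoked once per row, and each
    # row is (de)serialized individually.
    return [func(r) for r in rows]

def eval_arrow_batched(func, batches):
    # Arrow-optimized path: data arrives in columnar batches, so the
    # (de)serialization cost is paid once per batch rather than per row.
    return [[func(v) for v in batch] for batch in batches]

rows = [0, 1]
batches = [[0, 1]]  # one Arrow batch holding both rows
inc = lambda x: x + 1

print(eval_row_at_a_time(inc, rows))     # [1, 2]
print(eval_arrow_batched(inc, batches))  # [[1, 2]]
```

Both modes compute the same result; the Arrow path amortizes serialization over whole batches, which is where the performance gain comes from.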

How was this patch tested?

Parity unit tests.

SPARK-40307

else:
return regular_udf


def _create_arrow_py_udf(f, regular_udf): # type: ignore
Member Author

Ignoring the type annotations of _create_arrow_py_udf because it is shared between vanilla PySpark and Spark Connect Python Client.

Member Author

The function is only an extraction of the original code (lines 142-179) for code reuse.
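Roughly, such a wrapper takes an already-built regular UDF and re-registers its function as a pandas-style UDF that applies it elementwise over each Arrow batch. A simplified, self-contained sketch of that idea (the stub RegularUDF type and the function name are illustrative stand-ins for PySpark's internals; real signatures differ):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RegularUDF:
    # Stub for PySpark's UserDefinedFunction: only the fields such a
    # wrapper reads (func and returnType) are modeled here.
    func: Callable
    returnType: str

def create_arrow_py_udf(regular_udf: RegularUDF) -> Callable[[List], List]:
    # Wrap the scalar function so it maps over a whole batch
    # (a pandas Series in real PySpark; a plain list in this sketch).
    scalar = regular_udf.func
    def batched(series: List) -> List:
        return [scalar(v) for v in series]
    return batched

udf = RegularUDF(func=lambda x: x + 1, returnType="bigint")
arrow_udf = create_arrow_py_udf(udf)
print(arrow_udf([0, 1]))  # [1, 2]
```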

else useArrow
)

regular_udf = _create_udf(f, returnType, evalType)
Member Author

There is duplicated code in _create_py_udf between the Spark Connect Python client and vanilla PySpark, except for fetching the active SparkSession.
However, for clear code-path separation and abstraction, I decided not to refactor it for now.
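For illustration, the decision both code paths share (an explicit useArrow argument wins, otherwise the session conf is consulted) can be sketched like this; the helper name and the plain dict standing in for the session conf are hypothetical:

```python
def resolve_use_arrow(use_arrow, session_conf):
    # An explicit useArrow argument takes precedence; otherwise fall
    # back to the spark.sql.execution.pythonUDF.arrow.enabled conf.
    if use_arrow is not None:
        return bool(use_arrow)
    value = session_conf.get(
        "spark.sql.execution.pythonUDF.arrow.enabled", "false"
    )
    return str(value).lower() == "true"

print(resolve_use_arrow(True, {}))   # True  (explicit argument wins)
print(resolve_use_arrow(None, {"spark.sql.execution.pythonUDF.arrow.enabled": "true"}))  # True
print(resolve_use_arrow(None, {}))   # False (conf defaults to false)
```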

@xinrong-meng
Member Author

CI failed because of:

Run echo "APACHE_SPARK_REF=$(git rev-parse HEAD)" >> $GITHUB_ENV
fatal: detected dubious ownership in repository at '/__w/spark/spark'
To add an exception for this directory, call:

	git config --global --add safe.directory /__w/spark/spark
Error: Process completed with exit code 128.

xinrong-meng force-pushed the connect_arrow_py_udf branch from 95cad25 to f6fc6e1 on April 17, 2023 at 20:56
@xinrong-meng
Member Author

@HyukjinKwon @zhengruifeng Would you please take a look? Thank you!

@HyukjinKwon
Member

cc @ueshin FYI

import pandas as pd
from pyspark.sql.pandas.functions import _create_pandas_udf

return_type = regular_udf.returnType
Contributor

It seems that regular_udf is only used to pass the returnType and evalType?

Member Author

And regular_udf.func based on the updated code.

zhengruifeng changed the title from [SPARK-43082][Connect][PYTHON] Arrow-optimized Python UDFs in Spark Connect to [SPARK-43082][CONNECT][PYTHON] Arrow-optimized Python UDFs in Spark Connect on Apr 20, 2023
@HyukjinKwon
Member

Merged to master.
