
[SPARK-40852][CONNECT][PYTHON] Introduce StatFunction in proto and implement DataFrame.summary #38318

Closed
wants to merge 2 commits into from

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Oct 20, 2022

What changes were proposed in this pull request?

Implement DataFrame.summary

there is a set of DataFrame APIs implemented in `StatFunctions`, `DataFrameStatFunctions` and `DataFrameNaFunctions` which I think cannot be implemented in the Connect client, because they:

1. depend on Catalyst's analysis (most of them);
2. ~~are implemented via RDD operations (like `summary`, `approxQuantile`);~~ (resolved by reimplementation)
3. ~~internally trigger jobs (like `summary`);~~ (resolved by reimplementation)

This PR introduces a new proto message, `StatFunction`, to support the `StatFunctions` methods.
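Roughly, the shape of the new message and its server-side dispatch can be sketched in plain Python (a toy illustration of the proto `oneof` pattern; these dataclasses are stand-ins, not the generated protobuf code):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Summary:
    # stand-in for the proto Summary message: the statistics to compute
    statistics: List[str] = field(default_factory=list)

@dataclass
class StatFunction:
    # stand-in for a proto `oneof function`: exactly one variant is set
    summary: Optional[Summary] = None

def dispatch(rel: StatFunction) -> str:
    # a server-side planner matches on whichever variant is set
    if rel.summary is not None:
        return "summary(" + ", ".join(rel.summary.statistics) + ")"
    raise ValueError("unknown StatFunction variant")
```

More variants (e.g. for other stat functions) would be added as further optional fields of the oneof, with one `case` per variant on the server.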

Why are the changes needed?

for Connect API coverage

Does this PR introduce any user-facing change?

yes, new API

How was this patch tested?

added UT

@zhengruifeng zhengruifeng changed the title [SPARK-40852][CONNECT][PYTHON][WIP] Implement DataFrame.summary [SPARK-40852][CONNECT][PYTHON][WIP] Introduce DataFrameFunction in proto and implement DataFrame.summary Oct 21, 2022
@zhengruifeng zhengruifeng changed the title [SPARK-40852][CONNECT][PYTHON][WIP] Introduce DataFrameFunction in proto and implement DataFrame.summary [SPARK-40852][CONNECT][PYTHON] Introduce DataFrameFunction in proto and implement DataFrame.summary Oct 21, 2022
@zhengruifeng zhengruifeng marked this pull request as ready for review October 21, 2022 06:56
@zhengruifeng
Contributor Author

rel.getFunctionCase match {
  case proto.DataFrameFunction.FunctionCase.SUMMARY =>
    StatFunctions
      .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)
Member

This is fine for now, but it's going to truncate the SQL plans, which disables further optimization. We should probably add dedicated plans for def summary in Dataset itself.

For now, LGTM

Contributor Author

Yes, then it will have more room for optimization. Let us add a new plan for it. Thanks!

Contributor

+1!

I don't know how to add a new plan. It would be very useful to have a PR as an example.

Contributor Author

cc @cloud-fan @HyukjinKwon
Since we reimplemented df.summary in 6a0713a, are there any differences in SQL optimization between this method (directly invoking df.summary) and adding a dedicated plan?

Contributor

Some rules may not work as they don't recognize the new plan.

Contributor

What do you mean by "truncate the SQL plans"? DataFrame transformations just accumulate the logical plan.

Contributor Author

@cloud-fan the old df.summary eagerly computes the statistics and always returns a LocalRelation
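The distinction under discussion can be sketched with a toy example in plain Python (an illustration only, not Spark internals): an eager API materializes the result immediately, discarding the upstream plan, while a lazy API only records the operation so the full chain stays available to an optimizer:

```python
from typing import Callable, List

class LazyPlan:
    """Toy lazy evaluator: accumulates transformations; nothing runs until collect()."""

    def __init__(self, data: List[int]) -> None:
        self.data = data
        self.ops: List[Callable[[List[int]], List[int]]] = []

    def transform(self, op: Callable[[List[int]], List[int]]) -> "LazyPlan":
        # just record the step, like appending a logical plan node
        self.ops.append(op)
        return self

    def collect(self) -> List[int]:
        out = self.data
        for op in self.ops:
            out = op(out)
        return out

# eager: the result is computed now; nothing upstream survives for later optimization
eager_result = sorted([3, 1, 2])

# lazy: the "plan" still carries every recorded operation until collect()
lazy = LazyPlan([3, 1, 2]).transform(sorted).transform(lambda xs: xs[:2])
```

In Spark terms, the eager version is like returning a precomputed LocalRelation, while the lazy one keeps a logical plan that rules can still rewrite.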

Contributor

Oh, that's an issue. Can it be solved by updating the df.summary implementation?

Contributor Author

yes, it has been resolved

Contributor Author

In the new implementation

@zhengruifeng zhengruifeng deleted the connect_df_summary branch October 22, 2022 01:52
@zhengruifeng zhengruifeng restored the connect_df_summary branch October 27, 2022 03:05
@zhengruifeng zhengruifeng reopened this Oct 27, 2022
@zhengruifeng zhengruifeng changed the title [SPARK-40852][CONNECT][PYTHON] Introduce DataFrameFunction in proto and implement DataFrame.summary [SPARK-40852][CONNECT][PYTHON] Introduce StatFunction in proto and implement DataFrame.summary Nov 7, 2022
@zhengruifeng
Contributor Author

@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
def where(self, condition: Expression) -> "DataFrame":
    return self.filter(condition)

def summary(self, *statistics: str) -> "DataFrame":
    _statistics: List[str] = list(statistics)
Contributor Author

This is different from the legacy preprocessing:

if len(statistics) == 1 and isinstance(statistics[0], list):
    statistics = statistics[0]

since I think that preprocessing is weird.
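The two preprocessing styles can be contrasted in a standalone sketch (plain Python; `summary_legacy` mirrors the quoted legacy unpacking, `summary_new` the simpler conversion in this PR — both function names are illustrative):

```python
from typing import List

def summary_legacy(*statistics) -> List[str]:
    # legacy behavior: a single list argument is silently unpacked
    if len(statistics) == 1 and isinstance(statistics[0], list):
        statistics = statistics[0]
    return list(statistics)

def summary_new(*statistics: str) -> List[str]:
    # this PR's approach: every positional argument is one statistic name
    return list(statistics)
```

The legacy form accepts both `summary_legacy("count", "mean")` and `summary_legacy(["count", "mean"])`; the new form keeps the signature honest by accepting only string varargs.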

Contributor

@amaliujia amaliujia left a comment

LGTM with one comment

@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
def where(self, condition: Expression) -> "DataFrame":
    return self.filter(condition)

def summary(self, *statistics: str) -> "DataFrame":
    _statistics: List[str] = list(statistics)
    assert all(isinstance(s, str) for s in _statistics)
Contributor

Given that def summary(self, *statistics: str) -> "DataFrame": is a public API, there could be misuse passing non-str parameters for statistics. Is it common practice to just assert without giving a message?

Contributor

@amaliujia amaliujia Nov 7, 2022

In contrast, I guess the assert in plan.py is OK because that is an internal API: we can implement it correctly and just assert on unexpected things (and a developer can fix it when it really happens).

    def __init__(self, child: Optional["LogicalPlan"], function: str, **kwargs: Any) -> None:
        super().__init__(child)
        assert function in ["summary"]
        self.function = function

Contributor Author

Agreed, it should give an error message; will update.
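A minimal sketch of the kind of messaged validation the review asks for on a public API (the helper name and message wording here are hypothetical, not the exact code that was merged):

```python
from typing import List, Tuple

def validate_statistics(statistics: Tuple) -> List[str]:
    """Check the varargs like a public summary() would, with an explicit error message."""
    _statistics = list(statistics)
    for s in _statistics:
        if not isinstance(s, str):
            # an explicit TypeError with context, instead of a bare assert
            raise TypeError(
                f"'statistics' must be strings, but got {type(s).__name__}: {s!r}"
            )
    return _statistics
```

Unlike `assert`, this check cannot be stripped by `python -O`, and the message tells the caller which argument was wrong.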

oneof function {
  Summary summary = 2;

  Unknown unknown = 999;
Contributor

why do we need this?

Contributor

I think that's for enums, but here it is an optional field... cc @amaliujia

Contributor

Question: will we add new functions under this oneof?

Contributor Author

Yes, such as crosstab, cov, corr, etc.

Contributor

ok then this makes sense

import org.apache.spark.sql.connect.dsl.commands._
import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}

class SparkConnectStatFunctionSuite
Contributor

why do we need a new test suite?

Contributor Author

I encountered some problems adding tests to the existing suites, since summary requires a session and needs to analyze the plan.
This suite will also cover some eagerly computed stat functions (cov, corr) in the future.

Contributor Author

let me take another look

Contributor Author

Removed; added another test in the existing suites.

Contributor

Initially I thought a separate suite was OK because that suite actually just executed the proto plan to check the expected result.

Now it's changed to a plan-comparison-based test, which is OK.

Contributor Author

Tests have improved a lot since the time I sent this PR :)

xxx

add scala test

fix lint

resolve conflict

fix scala tests

add error msg
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4f096db Nov 9, 2022
@zhengruifeng zhengruifeng deleted the connect_df_summary branch November 9, 2022 01:23
@zhengruifeng
Contributor Author

thanks @cloud-fan @HyukjinKwon @amaliujia for reviews!

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…implement `DataFrame.summary`


Closes apache#38318 from zhengruifeng/connect_df_summary.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>