[SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic #46129
Conversation
```python
def explain(
    self, extended: Optional[Union[bool, str]] = None, mode: Optional[str] = None
) -> None:
    if extended is not None and mode is not None:
```
We can move such complex preprocessing to the superclasses later.
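For illustration, a rough sketch of what hoisting that preprocessing into the parent class could look like. The helper name `_resolve_explain_mode` and the plain `ValueError` are assumptions for the sketch, not the actual implementation:

```python
from typing import Optional, Union


class DataFrame:
    @staticmethod
    def _resolve_explain_mode(
        extended: Optional[Union[bool, str]], mode: Optional[str]
    ) -> str:
        # Hypothetical shared helper: collapse the (extended, mode)
        # combinations into a single explain-mode string so that both the
        # Classic and Connect subclasses can reuse the same validation.
        if extended is not None and mode is not None:
            raise ValueError("extended and mode should not be set together.")
        if isinstance(extended, str):
            # explain("cost")-style call: the string itself is the mode.
            return extended
        if mode is not None:
            return mode
        return "extended" if extended else "simple"
```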
Will fix up the tests soon.
```diff
@@ -325,7 +325,7 @@ def active(cls) -> "SparkSession":

     active.__doc__ = PySparkSession.active.__doc__

-    def table(self, tableName: str) -> DataFrame:
+    def table(self, tableName: str) -> ParentDataFrame:
```
I guess we can leave it as-is? And the following changes?
This was the way MyPy complained the least, IIRC. Let me take a look again.
It seems that override argument types cannot be made more specific, and return types cannot be made wider (https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides), so MyPy complains about the argument. Let me just keep them all as the parent DataFrame for simplicity, because those types aren't user-facing anyway.
Here is one example of the error:
python/pyspark/sql/classic/dataframe.py:276: error: Argument 1 of "exceptAll" is incompatible with supertype "DataFrame"; supertype defines the argument type as "DataFrame" [override]
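A minimal, self-contained sketch (with hypothetical class names) that reproduces the same `[override]` complaint:

```python
class DataFrame:
    def exceptAll(self, other: "DataFrame") -> "DataFrame":
        return self


class ClassicDataFrame(DataFrame):
    # MyPy flags this override: the argument type is narrowed from
    # "DataFrame" to "ClassicDataFrame", which breaks substitutability.
    # Narrowing the *return* type, by contrast, is allowed.
    def exceptAll(self, other: "ClassicDataFrame") -> "ClassicDataFrame":
        return self
```

A caller holding the supertype may pass any `DataFrame`, so overrides may only widen argument types (contravariance) while return types may only narrow (covariance).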
```python
@overload
def repartition(self, numPartitions: int, *cols: "ColumnOrName") -> "ParentDataFrame":
    ...
```
I'm wondering if we need the `@overload` definitions in the subclasses?
I initially added them, but removed them because MyPy complained too much. I will take another look.
It seems that, strictly speaking, we should redefine the overloads here (python/mypy#5146, python/mypy#10699). However, we're using `pyspark.sql.DataFrame` type hints even within our own codebase, so I think it's better not to define them here for now.
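A rough sketch of the trade-off being discussed, with simplified signatures rather than the actual PySpark ones: if a subclass overrides an overloaded method without repeating the `@overload` stubs, MyPy accepts the override but type-checks calls on the subclass against the single broad implementation signature only.

```python
from typing import Union, overload


class Parent:
    @overload
    def repartition(self, numPartitions: int, *cols: str) -> "Parent": ...

    @overload
    def repartition(self, *cols: str) -> "Parent": ...

    def repartition(self, *args: Union[int, str]) -> "Parent":
        return self


class Child(Parent):
    # No @overload stubs repeated here: MyPy accepts this override, but
    # calls on Child now resolve to this one broad signature, losing the
    # more precise overload variants declared on Parent.
    def repartition(self, *args: Union[int, str]) -> "Child":
        return self
```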
Should be ready for a look. All tests passed. I squashed/rebased the commits.
```diff
@@ -333,6 +333,11 @@ def test_observe(self):
     from pyspark.sql.connect.observation import Observation

+    class MockDF(DataFrame):
```
This might be a breaking change if somebody inherited `pyspark.sql.DataFrame` before and it has its own `__init__`. However, `__init__` is not really an API, and users shouldn't customize, use, or invoke it directly.
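For illustration, a hypothetical user subclass written against the old pattern that this change could affect. The `(jdf, sql_ctx)` constructor arguments reflect the pre-change Classic `DataFrame`; the subclass itself is made up:

```python
from pyspark.sql import DataFrame


class MyDataFrame(DataFrame):
    # Hypothetical user code written against the old class layout, where
    # pyspark.sql.DataFrame.__init__ took (jdf, sql_ctx) directly. Once
    # pyspark.sql.DataFrame becomes the common parent class, this direct
    # super().__init__ call may no longer match and would need updating.
    def __init__(self, jdf, sql_ctx):
        super().__init__(jdf, sql_ctx)
        self._tag = "custom"
```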
Merged to master. I will follow up if there are more comments to address.
Can we have a different name than `classic`?
…c` references

### What changes were proposed in this pull request?

This PR is a followup of #46129 that moves `pyspark.classic` references into the actual test methods so they are not referenced during the `pyspark-connect`-only test (which does not have the `pyspark.classic` package).

### Why are the changes needed?

To recover the CI: https://github.com/apache/spark/actions/runs/8789489804/job/24119356874

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46171 from HyukjinKwon/SPARK-47909-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… Classic

### What changes were proposed in this pull request?

Same as #46129 but for the `Column` class.

### Why are the changes needed?

Same as #46129

### Does this PR introduce _any_ user-facing change?

Same as #46129

### How was this patch tested?

Manually tested, and CI should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46155 from HyukjinKwon/SPARK-47933.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…and Spark Classic

### What changes were proposed in this pull request?

Parent Window class for Spark Connect and Spark Classic

### Why are the changes needed?

Same as #46129

### Does this PR introduce _any_ user-facing change?

Same as #46129

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46841 from zhengruifeng/py_parent_window.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR proposes to have a parent `pyspark.sql.DataFrame` class which `pyspark.sql.connect.dataframe.DataFrame` and `pyspark.sql.classic.dataframe.DataFrame` inherit.

**Note** that, for backward compatibility, `pyspark.sql.DataFrame(...)` will still return a Spark Classic DataFrame.

Before:
1. `pyspark.sql.DataFrame` (Spark Classic): docstrings, Spark Classic logic
2. `pyspark.sql.connect.dataframe.DataFrame` (Spark Connect): Spark Connect logic
3. Users can only see the type hints from `pyspark.sql.DataFrame`.

After:
1. `pyspark.sql.DataFrame` (Common): docstrings, classmethod support (dispatching to either Spark Connect or Spark Classic)
2. `pyspark.sql.classic.dataframe.DataFrame` (Spark Classic): Spark Classic logic
3. `pyspark.sql.connect.dataframe.DataFrame` (Spark Connect): Spark Connect logic
4. Users can only see the type hints from `pyspark.sql.DataFrame`.

### Why are the changes needed?

This fixes two issues in the current Spark Connect structure:
Support usage of regular methods as class methods, e.g.,
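```python
from pyspark.sql import DataFrame

df = spark.range(10)
DataFrame.union(df, df)
```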
Before:
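```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 4809, in union
    return DataFrame(self._jdf.union(other._jdf), self.sparkSession)
           ^^^^^^^^^
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1724, in __getattr__
    raise PySparkAttributeError(
pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.
```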
After:
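```
DataFrame[id: bigint]
```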
Supports `isinstance` calls:
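```python
from pyspark.sql import DataFrame

isinstance(spark.range(1), DataFrame)
```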
Before:
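```
False
```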
After:
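```
True
```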
### Does this PR introduce _any_ user-facing change?

Yes, as described above.

### How was this patch tested?

Manually tested, and CI should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No.