-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6055] [PySpark] fix incorrect __eq__ of DataType #4808
Conversation
This PR works for 1.3+, will create another PR for 1.2 and 1.1 |
Test build #28053 has started for PR 4808 at commit
|
@@ -620,93 +619,6 @@ def _get_hive_ctx(self): | |||
return self._jvm.HiveContext(self._jsc.sc()) | |||
|
|||
|
|||
def _create_row(fields, values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are duplicated, also in types.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, good catch.
Test build #28053 has finished for PR 4808 at commit
|
Test FAILed. |
Test build #28072 has started for PR 4808 at commit
|
Test build #28072 has finished for PR 4808 at commit
|
Test PASSed. |
Test build #28079 has started for PR 4808 at commit
|
Test build #28079 has finished for PR 4808 at commit
|
Test FAILed. |
Test build #28084 has started for PR 4808 at commit
|
Test build #28084 has finished for PR 4808 at commit
|
Test FAILed. |
@@ -242,11 +240,12 @@ def __init__(self, elementType, containsNull=True): | |||
:param elementType: the data type of elements. | |||
:param containsNull: indicates whether the list contains None values. | |||
|
|||
>>> ArrayType(StringType) == ArrayType(StringType, True) | |||
>>> ArrayType(StringType()) == ArrayType(StringType(), True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a breaking API change? Or were the old doctests showing incorrect usage of the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old tests are incorrect.
It looks like |
@JoshRosen Because we serialized the objects in batch, and pickle memorize the multiple occurrences of same object in the batch, finally we will get single DataType object (even for StructType), we can benefits from this optimization, no |
Test build #28094 has started for PR 4808 at commit
|
Makes sense; LGTM. I'll take a look at the backport patches, too. |
Test build #28094 has finished for PR 4808 at commit
|
Test PASSed. |
The _eq_ of DataType is not correct, class cache is not use correctly (created class can not be find by dataType), then it will create lots of classes (saved in _cached_cls), never released. Also, all same DataType have same hash code, there will be many object in a dict with the same hash code, end with hash attach, it's very slow to access this dict (depends on the implementation of CPython). This PR also improve the performance of inferSchema (avoid the unnecessary converter of object). cc pwendell JoshRosen Author: Davies Liu <davies@databricks.com> Closes #4808 from davies/leak and squashes the following commits: 6a322a4 [Davies Liu] tests refactor 3da44fc [Davies Liu] fix __eq__ of Singleton 534ac90 [Davies Liu] add more checks 46999dc [Davies Liu] fix tests d9ae973 [Davies Liu] fix memory leak in sql (cherry picked from commit e0e64ba) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
LGTM, so I've merged this into |
The eq of DataType is not correct, class cache is not use correctly (created class can not be find by dataType), then it will create lots of classes (saved in _cached_cls), never released.
Also, all same DataType have same hash code, there will be many object in a dict with the same hash code, end with hash attach, it's very slow to access this dict (depends on the implementation of CPython).
This PR also improve the performance of inferSchema (avoid the unnecessary converter of object).
cc @pwendell @JoshRosen