Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6055] [PySpark] fix incorrect __eq__ of DataType #4808

Closed
wants to merge 5 commits into from

Conversation

davies
Copy link
Contributor

@davies davies commented Feb 27, 2015

The eq of DataType is not correct, class cache is not use correctly (created class can not be find by dataType), then it will create lots of classes (saved in _cached_cls), never released.

Also, all same DataType have same hash code, there will be many object in a dict with the same hash code, end with hash attach, it's very slow to access this dict (depends on the implementation of CPython).

This PR also improve the performance of inferSchema (avoid the unnecessary converter of object).

cc @pwendell @JoshRosen

@davies
Copy link
Contributor Author

davies commented Feb 27, 2015

This PR works for 1.3+, will create another PR for 1.2 and 1.1

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28053 has started for PR 4808 at commit d9ae973.

  • This patch merges cleanly.

@@ -620,93 +619,6 @@ def _get_hive_ctx(self):
return self._jvm.HiveContext(self._jsc.sc())


def _create_row(fields, values):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are duplicated, also in types.py.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good catch.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28053 has finished for PR 4808 at commit d9ae973.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28053/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28072 has started for PR 4808 at commit 46999dc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28072 has finished for PR 4808 at commit 46999dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28072/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28079 has started for PR 4808 at commit 534ac90.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28079 has finished for PR 4808 at commit 534ac90.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28079/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28084 has started for PR 4808 at commit 3da44fc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28084 has finished for PR 4808 at commit 3da44fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28084/
Test FAILed.

@@ -242,11 +240,12 @@ def __init__(self, elementType, containsNull=True):
:param elementType: the data type of elements.
:param containsNull: indicates whether the list contains None values.

>>> ArrayType(StringType) == ArrayType(StringType, True)
>>> ArrayType(StringType()) == ArrayType(StringType(), True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a breaking API change? Or were the old doctests showing incorrect usage of the API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old tests are incorrect.

@JoshRosen
Copy link
Contributor

It looks like _restore_object still tries to use DataType instance ids as _cached_cls dictionary keys during unpickling; is this still necessary if the DataTypes aren't singletons after unpickling?

@davies
Copy link
Contributor Author

davies commented Feb 27, 2015

@JoshRosen Because we serialized the objects in batch, and pickle memorize the multiple occurrences of same object in the batch, finally we will get single DataType object (even for StructType), we can benefits from this optimization, no __hash__ and __eq__ for later row in the batch.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28094 has started for PR 4808 at commit 6a322a4.

  • This patch merges cleanly.

@JoshRosen
Copy link
Contributor

Makes sense; LGTM. I'll take a look at the backport patches, too.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28094 has finished for PR 4808 at commit 6a322a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28094/
Test PASSed.

asfgit pushed a commit that referenced this pull request Feb 28, 2015
The _eq_ of DataType is not correct, class cache is not use correctly (created class can not be find by dataType), then it will create lots of classes (saved in _cached_cls), never released.

Also, all same DataType have same hash code, there will be many object in a dict with the same hash code, end with hash attach, it's very slow to access this dict (depends on the implementation of CPython).

This PR also improve the performance of inferSchema (avoid the unnecessary converter of object).

cc pwendell  JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #4808 from davies/leak and squashes the following commits:

6a322a4 [Davies Liu] tests refactor
3da44fc [Davies Liu] fix __eq__ of Singleton
534ac90 [Davies Liu] add more checks
46999dc [Davies Liu] fix tests
d9ae973 [Davies Liu] fix memory leak in sql

(cherry picked from commit e0e64ba)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in e0e64ba Feb 28, 2015
@JoshRosen
Copy link
Contributor

LGTM, so I've merged this into branch-1.3 (1.3.0) and master (1.4.0). Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants