[SPARK-6055] [PySpark] fix incorrect __eq__ of DataType #4808

davies · 2015-02-27T07:31:14Z

The `__eq__` of DataType is not correct, so the class cache is not used correctly (a created class cannot be found by its dataType). As a result, lots of classes are created (saved in `_cached_cls`) and never released.
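A minimal sketch of this failure mode, not the code from the patch: `BrokenType`, `FixedType`, and `make_row_class` are hypothetical names, and the broken variant simply assumes identity-based equality. It shows how a cache keyed by a type object grows without bound when equal types do not compare equal.

```python
class BrokenType:
    """Hypothetical type: inherits identity-based __eq__/__hash__ from object."""

    def __init__(self, name):
        self.name = name


class FixedType:
    """Hypothetical type with value-based equality and a matching hash."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.__class__.__name__, self.name))


def make_row_class(data_type, cache):
    """Return the cached class for data_type, creating one on a cache miss."""
    if data_type not in cache:
        cache[data_type] = type("Row_" + data_type.name, (tuple,), {})
    return cache[data_type]


broken_cache, fixed_cache = {}, {}
for _ in range(1000):
    # A fresh but logically identical type object arrives on every call.
    make_row_class(BrokenType("struct"), broken_cache)
    make_row_class(FixedType("struct"), fixed_cache)

print(len(broken_cache))  # 1000: every lookup missed, every class is kept alive
print(len(fixed_cache))   # 1: equal types share one cached class
```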

Also, all instances of the same DataType have the same hash code, so the dict ends up holding many objects with the same hash code. This is effectively a hash-collision attack, and accessing the dict becomes very slow (depending on the implementation of CPython).
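To make the collision cost concrete, here is a small sketch under assumed names (`SameHash` is hypothetical and only mimics keys that share one hash code while comparing unequal). It counts how often `__eq__` runs while filling and reading a dict of such keys.

```python
class SameHash:
    """Hypothetical key type whose instances all share one hash code."""

    eq_calls = 0

    def __init__(self, ident):
        self.ident = ident

    def __hash__(self):
        return 1  # every key collides with every other key

    def __eq__(self, other):
        SameHash.eq_calls += 1
        return self.ident == other.ident


table = {}
for i in range(200):
    table[SameHash(i)] = i

# Each insert has to be compared against the colliding keys already stored,
# so the total number of comparisons grows roughly quadratically.
inserts_cost = SameHash.eq_calls

SameHash.eq_calls = 0
value = table[SameHash(199)]
lookup_cost = SameHash.eq_calls

print(inserts_cost, lookup_cost, value)
```

With well-distributed hash codes the same loop would need almost no `__eq__` calls, which is why the shared hash code hurts so much here.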

This PR also improves the performance of inferSchema (by avoiding the unnecessary conversion of objects).

cc @pwendell @JoshRosen

davies · 2015-02-27T07:31:34Z

This PR works for 1.3+; I will create separate PRs for 1.2 and 1.1.

SparkQA · 2015-02-27T07:32:45Z

Test build #28053 has started for PR 4808 at commit d9ae973.

This patch merges cleanly.

davies · 2015-02-27T07:35:57Z

python/pyspark/sql/context.py

@@ -620,93 +619,6 @@ def _get_hive_ctx(self):
        return self._jvm.HiveContext(self._jsc.sc())


-def _create_row(fields, values):


These are duplicated, also in types.py.

Yep, good catch.

SparkQA · 2015-02-27T08:28:15Z

Test build #28053 has finished for PR 4808 at commit d9ae973.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-27T08:28:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28053/
Test FAILed.

SparkQA · 2015-02-27T16:37:37Z

Test build #28072 has started for PR 4808 at commit 46999dc.

This patch merges cleanly.

SparkQA · 2015-02-27T17:55:42Z

Test build #28072 has finished for PR 4808 at commit 46999dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-27T17:55:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28072/
Test PASSed.

SparkQA · 2015-02-27T18:28:02Z

Test build #28079 has started for PR 4808 at commit 534ac90.

This patch merges cleanly.

SparkQA · 2015-02-27T19:29:41Z

Test build #28079 has finished for PR 4808 at commit 534ac90.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-27T19:29:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28079/
Test FAILed.

SparkQA · 2015-02-27T20:17:43Z

Test build #28084 has started for PR 4808 at commit 3da44fc.

This patch merges cleanly.

SparkQA · 2015-02-27T21:13:02Z

Test build #28084 has finished for PR 4808 at commit 3da44fc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-27T21:13:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28084/
Test FAILed.

JoshRosen · 2015-02-27T21:16:40Z

python/pyspark/sql/types.py

@@ -242,11 +240,12 @@ def __init__(self, elementType, containsNull=True):
        :param elementType: the data type of elements.
        :param containsNull: indicates whether the list contains None values.

-        >>> ArrayType(StringType) == ArrayType(StringType, True)
+        >>> ArrayType(StringType()) == ArrayType(StringType(), True)


Is this a breaking API change? Or were the old doctests showing incorrect usage of the API?

The old doctests were incorrect.

JoshRosen · 2015-02-27T21:41:54Z

It looks like _restore_object still tries to use DataType instance ids as _cached_cls dictionary keys during unpickling; is this still necessary if the DataTypes aren't singletons after unpickling?

davies · 2015-02-27T21:50:43Z

@JoshRosen Because we serialize the objects in batches, and pickle memoizes multiple occurrences of the same object within a batch, we end up with a single DataType object per batch (even for StructType). We can benefit from this optimization: no `__hash__` or `__eq__` calls are needed for later rows in the batch.
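The pickle behaviour described here can be checked with the standard library alone. The sketch below is an illustration, not PySpark's serializer: a plain dict stands in for the schema object, and the id-keyed `cache` is a hypothetical simplification of the lookup discussed above.

```python
import pickle

# A plain dict stands in for a schema object shared by every row in a batch.
schema = {"fields": ["a", "b"]}
batch = [(schema, (i, str(i))) for i in range(5)]

# pickle memoizes repeated references within one dumps() call, so the
# restored rows all point at one restored schema object.
restored = pickle.loads(pickle.dumps(batch, protocol=2))
restored_schemas = [row[0] for row in restored]
same_object = all(s is restored_schemas[0] for s in restored_schemas)
print(same_object)  # True

# An id-keyed cache therefore misses once per batch, and later rows skip
# __hash__/__eq__ on the schema entirely.
cache = {}
misses = 0
for schema_obj, values in restored:
    key = id(schema_obj)
    if key not in cache:
        misses += 1
        cache[key] = tuple(schema_obj["fields"])
print(misses)  # 1
```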

SparkQA · 2015-02-27T21:52:43Z

Test build #28094 has started for PR 4808 at commit 6a322a4.

This patch merges cleanly.

JoshRosen · 2015-02-27T21:53:05Z

Makes sense; LGTM. I'll take a look at the backport patches, too.

SparkQA · 2015-02-27T23:09:15Z

Test build #28094 has finished for PR 4808 at commit 6a322a4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-27T23:09:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28094/
Test PASSed.

The __eq__ of DataType is not correct, so the class cache is not used correctly (a created class cannot be found by its dataType). As a result, lots of classes are created (saved in _cached_cls) and never released.

Also, all instances of the same DataType have the same hash code, so the dict ends up holding many objects with the same hash code. This is effectively a hash-collision attack, and accessing the dict becomes very slow (depending on the implementation of CPython).

This PR also improves the performance of inferSchema (by avoiding the unnecessary conversion of objects).

cc pwendell JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #4808 from davies/leak and squashes the following commits:

6a322a4 [Davies Liu] tests refactor
3da44fc [Davies Liu] fix __eq__ of Singleton
534ac90 [Davies Liu] add more checks
46999dc [Davies Liu] fix tests
d9ae973 [Davies Liu] fix memory leak in sql

(cherry picked from commit e0e64ba)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

JoshRosen · 2015-02-28T05:03:24Z

LGTM, so I've merged this into branch-1.3 (1.3.0) and master (1.4.0). Thanks!

fix memory leak in sql

d9ae973

davies reviewed Feb 27, 2015

fix tests

46999dc

add more checks

534ac90

fix __eq__ of Singleton

3da44fc

JoshRosen reviewed Feb 27, 2015

tests refactor

6a322a4

asfgit closed this in e0e64ba Feb 28, 2015
