[SPARK-4860][pyspark][sql] speeding up `sample()` and `takeSample()` #3764

jbencook · 2014-12-22T16:50:30Z

This PR modifies the Python SchemaRDD to use sample() and takeSample() from Scala instead of the slower Python implementations in rdd.py. This is worthwhile because the Rows are already serialized as Java objects.

In order to use the faster takeSample(), a takeSampleToPython() method was implemented in SchemaRDD.scala following the pattern of collectToPython().
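As a rough illustration of the delegation described above, here is a minimal pure-Python sketch. It does not use Spark: `BackendRDD` and `SchemaRDDWrapper` are hypothetical stand-ins for the JVM-side SchemaRDD and the Python-side wrapper, and the sampling logic is deliberately simplified (only the without-replacement case is covered). The point is only that the wrapper forwards the call and does no per-row work itself.

```python
import random


class BackendRDD:
    """Hypothetical stand-in for the JVM-side SchemaRDD; not a Spark class."""

    def __init__(self, rows):
        self.rows = list(rows)

    def sample(self, with_replacement, fraction, seed):
        # Simplified Bernoulli sampling: keep each row with probability `fraction`.
        if with_replacement:
            raise NotImplementedError("not covered by this sketch")
        rng = random.Random(seed)
        return BackendRDD(r for r in self.rows if rng.random() < fraction)

    def take_sample(self, with_replacement, num, seed):
        # Fixed-size sample: at most `num` rows, returned as a plain list.
        if with_replacement:
            raise NotImplementedError("not covered by this sketch")
        rng = random.Random(seed)
        return rng.sample(self.rows, min(num, len(self.rows)))


class SchemaRDDWrapper:
    """Hypothetical Python-side wrapper that forwards sampling to the backend."""

    def __init__(self, backend):
        self._backend = backend

    def sample(self, with_replacement, fraction, seed):
        assert fraction >= 0.0, "Negative fraction value: %s" % fraction
        # No rows are touched here; the backend does the sampling.
        return SchemaRDDWrapper(self._backend.sample(with_replacement, fraction, seed))

    def takeSample(self, with_replacement, num, seed):
        return self._backend.take_sample(with_replacement, num, seed)

    def collect(self):
        return list(self._backend.rows)
```

In this sketch, sample() returns another wrapper (a fraction of the rows, size not fixed), while takeSample() returns a list of at most `num` rows, mirroring the distinction between the two methods named in the PR.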


AmplabJenkins · 2014-12-22T16:52:11Z

Can one of the admins verify this patch?

marmbrus · 2014-12-22T19:00:34Z

ok to test

SparkQA · 2014-12-22T19:02:33Z

Test build #24706 has started for PR 3764 at commit 020cbdf.

This patch merges cleanly.

SparkQA · 2014-12-22T20:51:59Z

Test build #24706 has finished for PR 3764 at commit 020cbdf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-22T20:52:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24706/
Test PASSed.

pwendell · 2014-12-22T21:16:05Z

@davies can you take a look?

davies · 2014-12-22T23:26:37Z

python/pyspark/sql.py

+        """
+        assert fraction >= 0.0, "Negative fraction value: %s" % fraction
+        seed = seed if seed is not None else random.randint(0, sys.maxint)
+        rdd = self._jschema_rdd.baseSchemaRDD().sample(


Could you add sample() for JavaSchemaRDD()? Then this line could be changed to use self._jschema_rdd.sample()
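The seed-defaulting line in the diff above can be tried in isolation. This is a minimal sketch, assuming a hypothetical helper name `resolve_seed` that does not appear in the PR. Note that `sys.maxint` exists only on Python 2, so the sketch falls back to `sys.maxsize` when it is missing.

```python
import random
import sys


def resolve_seed(seed=None):
    """Return the caller's seed, or a random one if none was given.

    Hypothetical helper for illustration; the PR inlines this logic.
    """
    # sys.maxint is Python 2 only; sys.maxsize is the closest Python 3 equivalent.
    upper = getattr(sys, "maxint", sys.maxsize)
    # `is not None` matters: an explicit seed of 0 must be kept, not replaced.
    return seed if seed is not None else random.randint(0, upper)
```

Picking the seed on the Python side means a concrete value is always passed to the underlying sample() call, so the caller never has to handle a missing seed across the Python/JVM boundary.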

davies · 2014-12-22T23:45:52Z

@jbencook LGTM, just one minor comment.

@marmbrus There are a few APIs missing in JavaSchemaRDD (such as sample() and randomSplit()).

jbencook · 2014-12-23T00:21:22Z

Thanks for the comment @davies. Do you want me to add the sample() method to the JavaSchemaRDD in this PR? Or make a ticket for both sample() and randomSplit()?

davies · 2014-12-23T00:42:29Z

@jbencook Maybe we could just add sample() in this PR and leave the others for later.

SparkQA · 2014-12-23T02:17:28Z

Test build #24720 has started for PR 3764 at commit de22f70.

This patch merges cleanly.

jbencook · 2014-12-23T02:19:23Z

OK @davies, should be good to go now.

SparkQA · 2014-12-23T04:10:24Z

Test build #24720 has finished for PR 3764 at commit de22f70.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-23T04:10:28Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24720/
Test PASSed.

davies · 2014-12-23T06:07:22Z

sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala

@@ -218,4 +218,10 @@ class JavaSchemaRDD(
   */
  def subtract(other: JavaSchemaRDD, p: Partitioner): JavaSchemaRDD =
    this.baseSchemaRDD.subtract(other.baseSchemaRDD, p).toJavaSchemaRDD
+
+  /**
+   * Return an RDD with a sampled version of the underlying dataset.


Return a SchemaRDD

davies · 2014-12-23T06:08:24Z

LGTM, just two minor comments. After fixing them, I think it's ready to merge.


SparkQA · 2014-12-23T09:47:30Z

Test build #24737 has started for PR 3764 at commit 6fbc769.

This patch merges cleanly.

SparkQA · 2014-12-23T11:29:16Z

Test build #24737 has finished for PR 3764 at commit 6fbc769.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-23T11:29:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24737/
Test FAILed.

jbencook · 2014-12-23T11:35:02Z

Oops, I shouldn't have tried to fix this before I had my coffee. Will fix in a bit.

davies · 2014-12-23T16:59:02Z

Jenkins, test this please.

SparkQA · 2014-12-23T17:02:24Z

Test build #24740 has started for PR 3764 at commit 6fbc769.

This patch merges cleanly.

SparkQA · 2014-12-23T18:43:52Z

Test build #24740 has finished for PR 3764 at commit 6fbc769.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-23T18:43:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24740/
Test FAILed.

jbencook · 2014-12-23T22:29:22Z

@davies I can't seem to replicate this failure on my machine. Is there any chance these tests are timing out non-deterministically? Can you think of any reason why the indentation would cause this?

Looks like everything was pretty quick in the successful builds, e.g. #24720:

[info] CliSuite:
[info] - Simple commands (29 seconds, 879 milliseconds)
[info] - Single command with -e (22 seconds, 480 milliseconds)
[info] HiveThriftServer2Suite:
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - Test JDBC query execution (31 seconds, 4 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - Test JDBC query execution in Http Mode (29 seconds, 606 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - SPARK-3004 regression: result set containing NULL (28 seconds, 833 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - GetInfo Thrift API (26 seconds, 231 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - Checks Hive version (26 seconds, 886 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - Checks Hive version in Http Mode (27 seconds, 400 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - SPARK-4292 regression: result set iterator issue (31 seconds, 443 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - SPARK-4309 regression: Date type support (26 seconds, 997 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - SPARK-4407 regression: Complex type support (27 seconds, 714 milliseconds)

SparkQA · 2014-12-23T22:40:25Z

Test build #554 has started for PR 3764 at commit 6fbc769.

This patch merges cleanly.

davies · 2014-12-23T22:40:40Z

@jbencook Maybe this case is flaky; let's test it again.

SparkQA · 2014-12-24T00:31:48Z

Test build #554 has finished for PR 3764 at commit 6fbc769.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2014-12-24T01:46:05Z

This looks good to me, too, so I'm going to merge it into master (1.3.0). Thanks!

[SPARK-4860][pyspark][sql] using Scala implementations of sample() …

020cbdf

…and `takeSample()`

davies reviewed Dec 22, 2014

jbencook added 2 commits December 22, 2014 20:15

[SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD

b916442

[SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD

de22f70

davies reviewed Dec 23, 2014

J. Benjamin Cook added 2 commits December 23, 2014 03:42

[SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD

5170da2

[SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleTo…

6fbc769

…Python() arguments

asfgit closed this in fd41eb9 Dec 24, 2014
