Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4860][pyspark][sql] speeding up sample() and takeSample() #3764

Closed
wants to merge 5 commits into from
Closed

[SPARK-4860][pyspark][sql] speeding up sample() and takeSample() #3764

wants to merge 5 commits into from

Conversation

jbencook
Copy link

This PR modifies the python SchemaRDD to use sample() and takeSample() from Scala instead of the slower python implementations from rdd.py. This is worthwhile because the Row's are already serialized as Java objects.

In order to use the faster takeSample(), a takeSampleToPython() method was implemented in SchemaRDD.scala following the pattern of collectToPython().

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@marmbrus
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Dec 22, 2014

Test build #24706 has started for PR 3764 at commit 020cbdf.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 22, 2014

Test build #24706 has finished for PR 3764 at commit 020cbdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24706/
Test PASSed.

@pwendell
Copy link
Contributor

@davies can you take a look?

"""
assert fraction >= 0.0, "Negative fraction value: %s" % fraction
seed = seed if seed is not None else random.randint(0, sys.maxint)
rdd = self._jschema_rdd.baseSchemaRDD().sample(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add sample() for JavaSchemaRDD()? Then this line could be changed to use self._jschema_rdd.sample()

@davies
Copy link
Contributor

davies commented Dec 22, 2014

@jbencook LGTM, just one minor comment.

@marmbrus There are few APIs missing in JavaSchemaRDD (such as sample(), randomSplit()).

@jbencook
Copy link
Author

Thanks for the comment @davies. Do you want me to add the sample() method to the JavaSchemaRDD in this PR? Or make a ticket for both sample() and randomSplit()?

@davies
Copy link
Contributor

davies commented Dec 23, 2014

@jbencook Maybe we could just add sample() in this PR, leave others later.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24720 has started for PR 3764 at commit de22f70.

  • This patch merges cleanly.

@jbencook
Copy link
Author

OK @davies, should be good to go now.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24720 has finished for PR 3764 at commit de22f70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24720/
Test PASSed.

@@ -218,4 +218,10 @@ class JavaSchemaRDD(
*/
def subtract(other: JavaSchemaRDD, p: Partitioner): JavaSchemaRDD =
this.baseSchemaRDD.subtract(other.baseSchemaRDD, p).toJavaSchemaRDD

/**
* Return an RDD with a sampled version of the underlying dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return a SchemaRDD

@davies
Copy link
Contributor

davies commented Dec 23, 2014

LGTM, just two minor comments. After fixing them, I think it's ready to merge.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24737 has started for PR 3764 at commit 6fbc769.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24737 has finished for PR 3764 at commit 6fbc769.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24737/
Test FAILed.

@jbencook
Copy link
Author

Oops- shouldn't have tried to fix this before I had my coffee. Will fix in a bit.

@davies
Copy link
Contributor

davies commented Dec 23, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24740 has started for PR 3764 at commit 6fbc769.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24740 has finished for PR 3764 at commit 6fbc769.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24740/
Test FAILed.

@jbencook
Copy link
Author

@davies I can't seem to replicate this failure on my machine. Is there any chance these tests are timing out non-deterministically? Can you think of any reason why the indentation would cause this?

Looks like everything was pretty quick in the successful builds, e.g. #24720:

[info] CliSuite:
[info] - Simple commands (29 seconds, 879 milliseconds)
[info] - Single command with -e (22 seconds, 480 milliseconds)
[info] HiveThriftServer2Suite:
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - Test JDBC query execution (31 seconds, 4 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - Test JDBC query execution in Http Mode (29 seconds, 606 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - SPARK-3004 regression: result set containing NULL (28 seconds, 833 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - GetInfo Thrift API (26 seconds, 231 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - Checks Hive version (26 seconds, 886 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - Checks Hive version in Http Mode (27 seconds, 400 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - SPARK-4292 regression: result set iterator issue (31 seconds, 443 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - SPARK-4309 regression: Date type support (26 seconds, 997 milliseconds)
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
[info] - SPARK-4407 regression: Complex type support (27 seconds, 714 milliseconds)

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #554 has started for PR 3764 at commit 6fbc769.

  • This patch merges cleanly.

@davies
Copy link
Contributor

davies commented Dec 23, 2014

@jbencook Maybe this case is flaky, let's test it again.

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #554 has finished for PR 3764 at commit 6fbc769.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

This looks good to me, too, so I'm going to merge it into master (1.3.0). Thanks!

@asfgit asfgit closed this in fd41eb9 Dec 24, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants