-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4860][pyspark][sql] speeding up sample()
and takeSample()
#3764
Conversation
…and `takeSample()`
Can one of the admins verify this patch? |
ok to test |
Test build #24706 has started for PR 3764 at commit
|
Test build #24706 has finished for PR 3764 at commit
|
Test PASSed. |
@davies can you take a look? |
""" | ||
assert fraction >= 0.0, "Negative fraction value: %s" % fraction | ||
seed = seed if seed is not None else random.randint(0, sys.maxint) | ||
rdd = self._jschema_rdd.baseSchemaRDD().sample( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add sample() for JavaSchemaRDD()? Then this line could be changed to use self._jschema_rdd.sample()
Thanks for the comment @davies. Do you want me to add the sample() method to the JavaSchemaRDD in this PR? Or make a ticket for both sample() and randomSplit()? |
@jbencook Maybe we could just add sample() in this PR, leave others later. |
Test build #24720 has started for PR 3764 at commit
|
OK @davies, should be good to go now. |
Test build #24720 has finished for PR 3764 at commit
|
Test PASSed. |
@@ -218,4 +218,10 @@ class JavaSchemaRDD( | |||
*/ | |||
def subtract(other: JavaSchemaRDD, p: Partitioner): JavaSchemaRDD = | |||
this.baseSchemaRDD.subtract(other.baseSchemaRDD, p).toJavaSchemaRDD | |||
|
|||
/** | |||
* Return an RDD with a sampled version of the underlying dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return a SchemaRDD
LGTM, just two minor comments. After fixing them, I think it's ready to merge. |
Test build #24737 has started for PR 3764 at commit
|
Test build #24737 has finished for PR 3764 at commit
|
Test FAILed. |
Oops- shouldn't have tried to fix this before I had my coffee. Will fix in a bit. |
Jenkins, test this please. |
Test build #24740 has started for PR 3764 at commit
|
Test build #24740 has finished for PR 3764 at commit
|
Test FAILed. |
@davies I can't seem to replicate this failure on my machine. Is there any chance these tests are timing out non-deterministically? Can you think of any reason why the indentation would cause this? Looks like everything was pretty quick in the successful builds, e.g. #24720:
|
Test build #554 has started for PR 3764 at commit
|
@jbencook Maybe this case is flaky, let's test it again. |
Test build #554 has finished for PR 3764 at commit
|
This looks good to me, too, so I'm going to merge it into |
This PR modifies the python
SchemaRDD
to usesample()
andtakeSample()
from Scala instead of the slower python implementations fromrdd.py
. This is worthwhile because theRow
's are already serialized as Java objects.In order to use the faster
takeSample()
, atakeSampleToPython()
method was implemented inSchemaRDD.scala
following the pattern ofcollectToPython()
.