[SPARK-7156][SQL] support RandomSplit in DataFrames #5761

Closed
brkyvz wants to merge 8 commits from brkyvz/random-sample

@brkyvz brkyvz commented Apr 28, 2015

This is built on top of @kaka1992 's PR #5711 using Logical plans.

case class Sample(fraction: Double, withReplacement: Boolean, seed: Long, child: LogicalPlan)
extends UnaryNode {
case class Sample(
lb: Double,

Review comment: rename these to lowerBound and upperBound, and document this.
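
For readers following the lowerBound/upperBound suggestion, here is a minimal pure-Python sketch of how a weighted random split can be expressed as range filters. It is an illustration under stated assumptions, not the code in this PR: `split_bounds` and `random_split` are hypothetical helper names, and a plain list stands in for the DataFrame.

```python
import random
from itertools import accumulate


def split_bounds(weights):
    """Turn weights into consecutive [lower_bound, upper_bound) ranges on [0, 1)."""
    if not weights or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    total = float(sum(weights))
    # Normalize so the weights sum to 1.0, then take the running total.
    cuts = [0.0] + list(accumulate(w / total for w in weights))
    cuts[-1] = 1.0  # guard against floating-point drift in the last cut
    return list(zip(cuts[:-1], cuts[1:]))


def random_split(rows, weights, seed):
    """Hypothetical stand-in for a range-based split over an in-memory list."""
    rows = list(rows)
    splits = []
    for lower_bound, upper_bound in split_bounds(weights):
        # Re-seeding per pass gives every row the same draw in every pass,
        # so each row lands in exactly one range.
        rng = random.Random(seed)
        splits.append(
            [row for row in rows if lower_bound <= rng.random() < upper_bound]
        )
    return splits
```

With weights `[0.6, 0.4]` the ranges are `[0.0, 0.6)` and `[0.6, 1.0)`. Because each pass replays the same seeded sequence over the same row order, the splits are disjoint and together cover the input. That only holds if the input is evaluated identically on every pass.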

SparkQA commented Apr 28, 2015

Test build #31186 has finished for PR 5761 at commit 2384266.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sample(
    • case class Sample(
  • This patch does not change any dependencies.

rxin commented Apr 29, 2015

LGTM.

SparkQA commented Apr 29, 2015

Test build #31184 has finished for PR 5761 at commit e98ebac.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sample(
    • case class Sample(
  • This patch does not change any dependencies.

@kaka1992

Hi, is there another way to avoid executing the original DataFrame repeatedly?

SparkQA commented Apr 29, 2015

Test build #31193 has finished for PR 5761 at commit f400ade.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sample(
    • case class Sample(
  • This patch does not change any dependencies.

rxin commented Apr 29, 2015

@kaka1992 you'd need to cache it.
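
To make the caching point concrete, here is a small pure-Python sketch, not Spark code. `CountingSource` and `filter_range` are hypothetical names, and the class stands in for an expensive, lazily evaluated DataFrame. Each split is a separate pass over the source, so an uncached source is evaluated once per split. Materializing it first, which is roughly what calling `cache()` on the DataFrame before splitting would do, evaluates it once.

```python
import random


class CountingSource:
    """Hypothetical stand-in for an expensive, lazily evaluated dataset."""

    def __init__(self, n):
        self.n = n
        self.evaluations = 0

    def rows(self):
        # Every call recomputes the data, like an uncached plan.
        self.evaluations += 1
        return list(range(self.n))


def filter_range(get_rows, lower_bound, upper_bound, seed):
    """Keep the rows whose seeded draw falls in [lower_bound, upper_bound)."""
    rng = random.Random(seed)
    return [r for r in get_rows() if lower_bound <= rng.random() < upper_bound]


bounds = [(0.0, 0.6), (0.6, 1.0)]
source = CountingSource(1000)

# Without caching: each split triggers its own evaluation of the source.
uncached = [filter_range(source.rows, lo, hi, seed=42) for lo, hi in bounds]
uncached_evaluations = source.evaluations

# With caching: evaluate once, then run every split over the stored rows.
source.evaluations = 0
cached_rows = source.rows()
cached = [filter_range(lambda: cached_rows, lo, hi, seed=42) for lo, hi in bounds]
cached_evaluations = source.evaluations
```

Here the uncached path evaluates the source twice (once per split) and the cached path evaluates it once, while both produce the same splits.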

SparkQA commented Apr 29, 2015

Test build #31226 has finished for PR 5761 at commit 3c11d1b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sample(
    • case class Sample(
  • This patch does not change any dependencies.

rxin commented Apr 29, 2015

Oh we need to add this to Python also.

SparkQA commented Apr 29, 2015

Test build #31306 has finished for PR 5761 at commit 6000328.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sample(
    • case class Sample(
  • This patch does not change any dependencies.

SparkQA commented Apr 29, 2015

Test build #31317 has finished for PR 5761 at commit a1fb0aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sample(
    • case class Sample(
  • This patch does not change any dependencies.

@@ -433,6 +433,22 @@ def sample(self, withReplacement, fraction, seed=None):
        rdd = self._jdf.sample(withReplacement, fraction, long(seed))
        return DataFrame(rdd, self.sql_ctx)

    def randomSplit(self, weights, seed=None):
        """Randomly splits this :class:`DataFrame` with the provided weights.

would be great to add params doc

rxin commented Apr 29, 2015

Can you submit a followup PR to address the 3 minor comments?

@asfgit asfgit closed this in d7dbce8 Apr 29, 2015
asfgit pushed a commit that referenced this pull request Apr 30, 2015
small fixes regarding comments in PR #5761

cc rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5795 from brkyvz/split-followup and squashes the following commits:

369c522 [Burak Yavuz] changed wording a little
1ea456f [Burak Yavuz] Addressed follow up comments
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
This is built on top of kaka1992 's PR apache#5711 using Logical plans.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#5761 from brkyvz/random-sample and squashes the following commits:

a1fb0aa [Burak Yavuz] remove unrelated file
69669c3 [Burak Yavuz] fix broken test
1ddb3da [Burak Yavuz] copy base
6000328 [Burak Yavuz] added python api and fixed test
3c11d1b [Burak Yavuz] fixed broken test
f400ade [Burak Yavuz] fix build errors
2384266 [Burak Yavuz] addressed comments v0.1
e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
@brkyvz brkyvz deleted the random-sample branch February 3, 2019 20:55