[SPARK-7156][SQL] add randomSplit to DataFrame. #5711

Closed
wants to merge 2 commits into from

kaka1992

SPARK-7156 add randomSplit to DataFrame.

@AmplabJenkins

Can one of the admins verify this patch?

val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
normalizedCumWeights.sliding(2).map { x =>
  this.sqlContext.createDataFrame(new PartitionwiseSampledRDD[Row, Row](
    rdd, new BernoulliCellSampler[Row](x(0), x(1)), true, seed), schema)

This actually breaks the plan. Can we create a logical operator (or generalize the existing Sample operator) so that the returned DataFrame correctly preserves the logical plan?
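
For readers following the review, the cell boundaries computed in the snippet above can be illustrated without Spark. The following is only a minimal sketch on plain Scala collections, not the code from this PR: the object `RandomSplitSketch` and its two methods are hypothetical names, and a single seeded `scala.util.Random` stands in for Spark's per-partition sampler. It does not address the logical-plan concern raised in the comment; it only shows how normalized cumulative weights carve [0, 1) into disjoint ranges.

```scala
import scala.util.Random

// Hypothetical helper for illustration only; not part of Spark.
object RandomSplitSketch {

  /** Normalize the weights and accumulate them into range boundaries,
    * mirroring weights.map(_ / sum).scanLeft(0.0d)(_ + _) in the diff. */
  def cumulativeBounds(weights: Seq[Double]): Seq[Double] = {
    require(weights.nonEmpty && weights.forall(_ >= 0.0), "weights must be non-negative")
    val sum = weights.sum
    require(sum > 0.0, "weights must sum to a positive value")
    val bounds = weights.map(_ / sum).scanLeft(0.0)(_ + _)
    // Assumption of this sketch (not in the diff): pin the last boundary to 1.0
    // so floating-point rounding cannot leave a gap at the top of the range.
    bounds.init :+ 1.0
  }

  /** Give every element one uniform draw in [0, 1) and keep it in the split
    * whose half-open range [lb, ub) contains that draw. */
  def randomSplit[T](data: Seq[T], weights: Seq[Double], seed: Long): Seq[Seq[T]] = {
    val bounds = cumulativeBounds(weights)
    val rng = new Random(seed)
    // The draws are generated once and reused for every split, which is what
    // makes the splits disjoint and jointly exhaustive.
    val draws = data.map(_ => rng.nextDouble())
    bounds.zip(bounds.tail).map { case (lb, ub) =>
      data.zip(draws).collect { case (elem, x) if x >= lb && x < ub => elem }
    }
  }
}
```

With weights `Seq(1.0, 3.0)` the boundaries are `0.0, 0.25, 1.0`, so roughly a quarter of the elements fall into the first split. In the diff, the same effect relies on every `BernoulliCellSampler` being given the same `seed`, so that each row receives the same draw in every sampling pass.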

@rxin

rxin commented Apr 28, 2015

Thanks for working on this, @kaka1992. It would be great if we could do it in a way that doesn't break the existing logical plan for DataFrames.

@kaka1992

Can I add an InMemoryRelation on top of the base logicalPlan? Then I could create several randomSplit plans over the same data. @rxin I'm not sure whether this approach would break something.

@rxin

rxin commented Apr 29, 2015

#5761

Somebody else submitted a PR based on your change and my review feedback.

@rxin

rxin commented Apr 29, 2015

@kaka1992 mind closing the PR since #5761 subsumes this?

@kaka1992

@rxin No problem. I'll close the PR.

kaka1992 closed this Apr 29, 2015
asfgit pushed a commit that referenced this pull request Apr 29, 2015
This is built on top of kaka1992's PR #5711 using Logical plans.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5761 from brkyvz/random-sample and squashes the following commits:

a1fb0aa [Burak Yavuz] remove unrelated file
69669c3 [Burak Yavuz] fix broken test
1ddb3da [Burak Yavuz] copy base
6000328 [Burak Yavuz] added python api and fixed test
3c11d1b [Burak Yavuz] fixed broken test
f400ade [Burak Yavuz] fix build errors
2384266 [Burak Yavuz] addressed comments v0.1
e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames