[SPARK-5890][ML] Add feature discretizer #5779

yinxusen · 2015-04-29T14:09:02Z

JIRA issue: SPARK-5890.

I borrowed the findSplits code from RandomForest. I don't think it's a good idea to call it from RandomForest directly.

SparkQA · 2015-04-29T15:46:50Z

Test build #31281 has finished for PR 5779 at commit f6be730.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class FeatureDiscretizer extends Transformer with HasInputCol with HasOutputCol
This patch does not change any dependencies.

jkbradley · 2015-05-05T17:54:03Z

FeatureDiscretizer should produce a Bucketizer as its Model. Could you please refactor this accordingly? I'd be happy to help as needed, so please let me know.

Alternatively, we could do a PR for Bucketizer first and then update this PR to use the Bucketizer.
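The requested refactoring follows the estimator/model split: fitting learns the split points, and the resulting model only applies them. A minimal sketch of that shape is below. It is written in plain Scala rather than against the Spark ML API, so `SimpleQuantileDiscretizer` and `SimpleBucketizer` are hypothetical stand-ins, and the rank-based choice of split points is an assumption for illustration, not the logic in this PR.

```scala
// Plain-Scala stand-ins for illustration; these are NOT the Spark ML classes.

// The "model": holds learned split points and maps each value to the index
// of the half-open bucket [splits(i), splits(i + 1)) that contains it.
final case class SimpleBucketizer(splits: Array[Double]) {
  def transform(values: Seq[Double]): Seq[Int] =
    values.map(v => splits.lastIndexWhere(_ <= v))
}

// The "estimator": fitting only learns the split points, then hands them
// to the model, which does the actual bucketing.
final class SimpleQuantileDiscretizer(numBuckets: Int) {
  require(numBuckets >= 2, "numBuckets must be >= 2")

  def fit(values: Seq[Double]): SimpleBucketizer = {
    require(values.nonEmpty, "values must be non-empty")
    val sorted = values.toArray
    java.util.Arrays.sort(sorted)
    // Assumed rule: take the values at evenly spaced ranks and drop repeats.
    val inner = (1 until numBuckets)
      .map(i => sorted(i * sorted.length / numBuckets))
      .distinct
    SimpleBucketizer(
      Array(Double.NegativeInfinity) ++ inner ++ Array(Double.PositiveInfinity))
  }
}
```

With this split, any other discretization method added later could reuse the same model and differ only in how `fit` picks the split points.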

yinxusen · 2015-05-05T23:43:35Z

@jkbradley Sure, I'll refactor it ASAP.

jkbradley · 2015-05-06T03:52:30Z

Thanks!

jkbradley · 2015-05-06T18:43:58Z

@yinxusen It would be great if we could get this in by tomorrow. If it would be helpful, I'd be happy to do part of it, such as writing the Bucketizer PR.

Also, can we rename this to QuantileDiscretizer? There will likely be other feature discretization methods added in the future. Thanks!

yinxusen · 2015-05-07T07:45:52Z

I am writing the Bucketizer now, and I will refactor this PR to use it soon afterwards. I think you will be able to check it tomorrow. Sorry for the delay; I just got back from vacation.

jkbradley · 2015-05-07T08:13:48Z

@yinxusen No problem, I know it's last minute & appreciate your efforts! I'll check tomorrow morning.

SparkQA · 2015-05-15T09:13:41Z

Test build #32810 has finished for PR 5779 at commit 5ffa167.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class QuantileDiscretizer(override val uid: String)

jkbradley · 2015-05-18T17:47:14Z

Reviewing now

jkbradley · 2015-05-18T18:38:12Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

+private[feature] trait QuantileDiscretizerBase extends Params with HasInputCol with HasOutputCol {
+
+  /**
+   * Number of buckets to collect data points, which should be a positive integer.


Maximum number of buckets (quantiles, or categories) into which data points are grouped. Must be >= 2. (Please copy to IntParam doc below as well.)

SparkQA · 2015-05-18T19:37:04Z

Test build #820 has finished for PR 5779 at commit 5ffa167.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class QuantileDiscretizer(override val uid: String)

jkbradley · 2015-08-18T01:00:45Z

Ping. It'd be nice to get this into 1.6

jkbradley · 2015-09-09T23:21:56Z

Checking for updates here too : )

jkbradley · 2015-09-22T17:37:39Z

@yinxusen If you won't be able to keep working on this, please let me know. Someone else or I can take over. Thanks.

yinxusen · 2015-09-22T18:19:26Z

@jkbradley No, I'll keep working on this. Sorry for the delay.

SparkQA · 2015-09-23T05:05:39Z

Test build #42886 has finished for PR 5779 at commit 99d46ee.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class QuantileDiscretizer(override val uid: String)

SparkQA · 2015-09-23T06:17:28Z

Test build #42889 has finished for PR 5779 at commit 954d7b9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class QuantileDiscretizer(override val uid: String)

jkbradley · 2015-09-24T00:04:40Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

+  /**
+   * Maximum number of buckets (quantiles, or categories) into which data points are grouped. Must
+   * be >= 2.
+   * @group param


State default value here (in Scala doc only, not IntParam doc)

jkbradley · 2015-09-24T00:04:53Z

@yinxusen Thanks for the updates!

SparkQA · 2015-09-24T22:17:43Z

Test build #42983 has finished for PR 5779 at commit 0e807bc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class QuantileDiscretizer(override val uid: String)

jkbradley · 2015-10-01T20:30:08Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.Logging


organize imports

jkbradley · 2015-10-01T20:30:58Z

I just made minor comments. It looks good, though I'm curious about whether that test caught bugs in the previous version of this PR.

yinxusen · 2015-10-01T21:25:22Z

Yes, it catches the previous bug. With the previous version, the check

checkDiscretizedData(sc,
      Array[Double](1, 2, 3, 3, 3, 3, 3, 3, 3),
      3,
      Array[Double](0, 1, 2, 2, 2, 2, 2, 2, 2),
      Array("-Infinity, 2.0", "2.0, 3.0", "3.0, Infinity"))

throws an error because the split point at 3.0 is missed.
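The case above can be worked through outside Spark. The following is a minimal plain-Scala sketch, not the code from this PR: `findSplits` and `bucketOf` are hypothetical helpers, and the rule of giving each distinct value its own bucket when there are no more distinct values than buckets is an assumption chosen to match the expected output of the check.

```scala
object QuantileSplitsSketch {

  // Returns bucket boundaries, including -Infinity and +Infinity.
  // numBuckets is treated as a maximum; fewer buckets may be produced.
  def findSplits(data: Seq[Double], numBuckets: Int): Array[Double] = {
    require(numBuckets >= 2, "numBuckets must be >= 2")
    require(data.nonEmpty, "data must be non-empty")
    val sorted = data.toArray
    java.util.Arrays.sort(sorted)
    val distinct = sorted.distinct
    val inner: Array[Double] =
      if (distinct.length <= numBuckets) {
        // Few distinct values: each gets its own bucket, so every distinct
        // value except the smallest becomes a split point.
        distinct.drop(1)
      } else {
        // Otherwise take the values at evenly spaced ranks, skip the minimum
        // (it would create an empty first bucket), and drop repeats.
        (1 until numBuckets)
          .map(i => sorted(i * sorted.length / numBuckets))
          .filter(_ > sorted.head)
          .distinct
          .toArray
      }
    Array(Double.NegativeInfinity) ++ inner ++ Array(Double.PositiveInfinity)
  }

  // Index of the half-open bucket [splits(i), splits(i + 1)) containing x,
  // for finite x.
  def bucketOf(splits: Array[Double], x: Double): Int =
    splits.lastIndexWhere(_ <= x)
}
```

For the data `1, 2, 3, 3, 3, 3, 3, 3, 3` with 3 buckets, this sketch yields the splits `-Infinity, 2.0, 3.0, Infinity` and the bucket indices `0, 1, 2, 2, 2, 2, 2, 2, 2`, which agrees with the expected values in the check. A split finder that returned only `2.0` would put both 2 and 3 in the same bucket and fail it.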

SparkQA · 2015-10-01T22:10:39Z

Test build #43159 has finished for PR 5779 at commit 9f878e9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class QuantileDiscretizer(override val uid: String)

jkbradley · 2015-10-02T17:18:55Z

OK thanks for confirming! LGTM, merging with master.

add feature discretizer

f6be730

refactor it into an estimator

91e6802

yinxusen mentioned this pull request May 7, 2015

[SPARK-5893][ML] Add bucketizer #5980

Closed

yinxusen added 4 commits May 7, 2015 21:43

Merge branch 'master' into SPARK-5890

c35108e

take Bucketizer as model

38f73f2

Merge branch 'master' into SPARK-5890

7373b98

merge with Bucketizer

5ffa167

jkbradley reviewed May 18, 2015

yinxusen added 2 commits September 22, 2015 20:17

fix find splits

74f1b65

add more tests

b2bd98f

yinxusen added 3 commits September 22, 2015 21:31

Merge branch 'master' into SPARK-5890

136a194

merge with current master

7fadccd

fix minor error

99d46ee

fix style error

954d7b9

jkbradley reviewed Sep 24, 2015

yinxusen added 2 commits September 24, 2015 13:15

minor changes

b5d90e7

add get splits test

0e807bc

jkbradley reviewed Oct 1, 2015

minor changes

9f878e9

asfgit closed this in 23a9448 Oct 2, 2015
