-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5890][ML] Add feature discretizer #5779
Conversation
Test build #31281 has finished for PR 5779 at commit
|
FeatureDiscretizer should produce a Bucketizer as its Model. Could you please refactor this accordingly? I'd be happy to help as needed, so please let me know. Alternatively, we could do a PR for Bucketizer first and then update this PR to use the Bucketizer. |
@jkbradley Sure, I'll refactor it ASAP. |
Thanks! |
@yinxusen It will be great if we can get this in by tomorrow. If it would be helpful, I'd be happy to do part, such as writing the Bucketizer PR. Also, can we rename this to QuantileDiscretizer? There will likely be other feature discretization methods added in the future. Thanks! |
I am writing the |
@yinxusen No problem, I know it's last minute & appreciate your efforts! I'll check tomorrow morning. |
Test build #32810 has finished for PR 5779 at commit
|
Reviewing now |
private[feature] trait QuantileDiscretizerBase extends Params with HasInputCol with HasOutputCol { | ||
|
||
/** | ||
* Number of buckets to collect data points, which should be a positive integer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maximum number of buckets (quantiles, or categories) into which data points are grouped. Must be >= 2.
(Please copy to IntParam doc below as well.)
Test build #820 has finished for PR 5779 at commit
|
Ping. It'd be nice to get this into 1.6 |
Checking for updates here too : ) |
@yinxusen If you won't be able to keep working on this, please let me know. Someone else or I can take over. Thanks. |
@jkbradley No, I'll keep work on this, and sorry for the delay. |
Test build #42886 has finished for PR 5779 at commit
|
Test build #42889 has finished for PR 5779 at commit
|
/** | ||
* Maximum number of buckets (quantiles, or categories) into which data points are grouped. Must | ||
* be >= 2. | ||
* @group param |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State default value here (in Scala doc only, not IntParam doc)
@yinxusen Thanks for the updates! |
Test build #42983 has finished for PR 5779 at commit
|
|
||
package org.apache.spark.ml.feature | ||
|
||
import org.apache.spark.Logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
organize imports
I just made minor comments. It looks good, though I'm curious about whether that test caught bugs in the previous version of this PR. |
Yes, it can catch the previous bug. In the previous setting, the check checkDiscretizedData(sc,
Array[Double](1, 2, 3, 3, 3, 3, 3, 3, 3),
3,
Array[Double](0, 1, 2, 2, 2, 2, 2, 2, 2),
Array("-Infinity, 2.0", "2.0, 3.0", "3.0, Infinity")) throws error because it misses the split point of 3.0. |
Test build #43159 has finished for PR 5779 at commit
|
OK thanks for confirming! LGTM, merging with master. |
JIRA issue here.
I borrow the code of
findSplits
fromRandomForest
. I don't think it's good to call it fromRandomForest
directly.