-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5893][ML] Add bucketizer #5980
Conversation
Merged build triggered. |
Merged build started. |
Test build #32110 has started for PR 5980 at commit |
@jkbradley, I will push the refactored #5779 as soon as this PR merged. |
Merged build finished. Test FAILed. |
Test FAILed. |
i'll retrigger this build once jenkins is back up... |
Merged build triggered. |
I change the |
Merged build started. |
jenkins, test this please |
Test build #32114 has started for PR 5980 at commit |
Merged build triggered. |
Merged build started. |
Test build #32115 has started for PR 5980 at commit |
Test build #32114 has finished for PR 5980 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
Test build #32115 has finished for PR 5980 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
* `Bucketizer` maps a column of continuous features to a column of feature buckets. | ||
*/ | ||
@AlphaComponent | ||
final class Bucketizer(override val parent: Estimator[Bucketizer] = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use 2 constructors (1 with an argument, 1 without) to be Java-friendly. Also, can we make the constructor with 1 argument be private[ml]
for now?
Test PASSed. |
@yinxusen @jkbradley Instead of having three parameters to control the buckets, is it simpler to let the users provide the boundaries directly? For example, |
@mengxr How to represent val inf = Double.MaxValue
val negInf = Double.MinValue
splits = Array(negInf, 0, 1, 2, 3, inf) Or we can define val inf = new DoubleParam(this, "inf", "")
setDefault(inf -> Double.MaxValue)
def getInf = $(inf)
val negInf = new DoubleParam(this, "negInf", "")
setDefault(negInf -> Double.MinValue)
def getNegInf = $(negInf) However, it would break the meaning of |
|
@mengxr I thought about this, but here's the comment I posted above:
|
Sorry that I didn't see the hidden GitHub comments. I think it depends on the use cases of bucketizer. My guess is that in most cases users use bucketizer on values that they already know the range. For example, time -> hourOfDay, dayOfMonth, or age -> 10-year buckets. In those cases, accepting |
It's hard to say for use cases. It's true empty buckets might be a problem. I'm OK with requiring explicit buckets for -inf, inf; let's just do that, but we'll have to make sure we document that in the Scala doc + examples. For values out of range, there are several behaviors the user might want:
We could eventually support all of these via a new parameter, but for now, shall we choose 1? I suppose I'd vote for putting values in the closest bucket (which will mean the first and last split values will be changed to -inf, inf under the hood). |
Yes, we need document all cases. I kinda like 3 (throw an error) as the default, because it is the most strict. If there is an exception, users should decide to switch to 1 or 2 by changing the bucket boundaries. |
@yinxusen OK, so I think we've converged. Can you please update it to take splits, but no "lowerInclusive, upperInclusive" parameters? Let's start with 3 (throw an error) for handling bad values. It will be great if the driver catches the SparkException and prints a more useful error message, recommending the user set the bin boundaries to -inf, inf. A later PR can add an extra parameter for supporting all 3 behaviors. |
@mengxr @jkbradley I see. I will be soon, say, in 2 days. It's really a little busy here recently. Sorry for the delay. |
@yinxusen I'm worried about this missing 1.4. Would you mind if I sent a PR to you to update this PR? |
@jkbradley Yes, thanks! Sorry for the delay again. |
…ed splits instead.
Removed lowerInclusive, upperInclusive params from Bucketizer
Merged build triggered. |
Merged build started. |
Test build #32430 has started for PR 5980 at commit |
@jkbradley I have merged your PR. Thanks! |
@yinxusen Thanks! I'll merge once tests pass |
Test build #32430 has finished for PR 5980 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
Merging into master and branch-1.4. @yinxusen Thanks! |
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893). One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say, ```scala buckets = Array(-0.5, 0.0, 0.5) ``` splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3. Author: Xusen Yin <yinxusen@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits: dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893 1ca973a [Joseph K. Bradley] one more bucketizer test 34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead. eacfcfa [Xusen Yin] change ML attribute from splits into buckets c3cc770 [Xusen Yin] add more unit test for binary search 3a16cc2 [Xusen Yin] refine comments and names ac77859 [Xusen Yin] fix style error fb30d79 [Xusen Yin] fix and test binary search 2466322 [Xusen Yin] refactor Bucketizer 11fb00a [Xusen Yin] change it into an Estimator 998bc87 [Xusen Yin] check buckets 4024cf1 [Xusen Yin] add test suite 5fe190e [Xusen Yin] add bucketizer (cherry picked from commit 35fb42a) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893). One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say, ```scala buckets = Array(-0.5, 0.0, 0.5) ``` splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3. Author: Xusen Yin <yinxusen@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#5980 from yinxusen/SPARK-5893 and squashes the following commits: dc8c843 [Xusen Yin] Merge pull request apache#4 from jkbradley/yinxusen-SPARK-5893 1ca973a [Joseph K. Bradley] one more bucketizer test 34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead. eacfcfa [Xusen Yin] change ML attribute from splits into buckets c3cc770 [Xusen Yin] add more unit test for binary search 3a16cc2 [Xusen Yin] refine comments and names ac77859 [Xusen Yin] fix style error fb30d79 [Xusen Yin] fix and test binary search 2466322 [Xusen Yin] refactor Bucketizer 11fb00a [Xusen Yin] change it into an Estimator 998bc87 [Xusen Yin] check buckets 4024cf1 [Xusen Yin] add test suite 5fe190e [Xusen Yin] add bucketizer
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893). One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say, ```scala buckets = Array(-0.5, 0.0, 0.5) ``` splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3. Author: Xusen Yin <yinxusen@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#5980 from yinxusen/SPARK-5893 and squashes the following commits: dc8c843 [Xusen Yin] Merge pull request apache#4 from jkbradley/yinxusen-SPARK-5893 1ca973a [Joseph K. Bradley] one more bucketizer test 34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead. eacfcfa [Xusen Yin] change ML attribute from splits into buckets c3cc770 [Xusen Yin] add more unit test for binary search 3a16cc2 [Xusen Yin] refine comments and names ac77859 [Xusen Yin] fix style error fb30d79 [Xusen Yin] fix and test binary search 2466322 [Xusen Yin] refactor Bucketizer 11fb00a [Xusen Yin] change it into an Estimator 998bc87 [Xusen Yin] check buckets 4024cf1 [Xusen Yin] add test suite 5fe190e [Xusen Yin] add bucketizer
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893). One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say, ```scala buckets = Array(-0.5, 0.0, 0.5) ``` splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3. Author: Xusen Yin <yinxusen@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#5980 from yinxusen/SPARK-5893 and squashes the following commits: dc8c843 [Xusen Yin] Merge pull request apache#4 from jkbradley/yinxusen-SPARK-5893 1ca973a [Joseph K. Bradley] one more bucketizer test 34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead. eacfcfa [Xusen Yin] change ML attribute from splits into buckets c3cc770 [Xusen Yin] add more unit test for binary search 3a16cc2 [Xusen Yin] refine comments and names ac77859 [Xusen Yin] fix style error fb30d79 [Xusen Yin] fix and test binary search 2466322 [Xusen Yin] refactor Bucketizer 11fb00a [Xusen Yin] change it into an Estimator 998bc87 [Xusen Yin] check buckets 4024cf1 [Xusen Yin] add test suite 5fe190e [Xusen Yin] add bucketizer
JIRA issue here.
One thing to make clear, the
buckets
parameter, which is an array ofDouble
, performs as split points. Say,splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.