Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5890][ML] Add feature discretizer #5779

Closed
wants to merge 15 commits into from

Conversation

yinxusen
Copy link
Contributor

JIRA issue here.

I borrow the code of findSplits from RandomForest. I don't think it's good to call it from RandomForest directly.

@SparkQA
Copy link

SparkQA commented Apr 29, 2015

Test build #31281 has finished for PR 5779 at commit f6be730.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FeatureDiscretizer extends Transformer with HasInputCol with HasOutputCol
  • This patch does not change any dependencies.

@jkbradley
Copy link
Member

FeatureDiscretizer should produce a Bucketizer as its Model. Could you please refactor this accordingly? I'd be happy to help as needed, so please let me know.

Alternatively, we could do a PR for Bucketizer first and then update this PR to use the Bucketizer.

@yinxusen
Copy link
Contributor Author

yinxusen commented May 5, 2015

@jkbradley Sure, I'll refactor it ASAP.

@jkbradley
Copy link
Member

Thanks!

@jkbradley
Copy link
Member

@yinxusen It will be great if we can get this in by tomorrow. If it would be helpful, I'd be happy to do part, such as writing the Bucketizer PR.

Also, can we rename this to QuantileDiscretizer? There will likely be other feature discretization methods added in the future. Thanks!

@yinxusen
Copy link
Contributor Author

yinxusen commented May 7, 2015

I am writing the Bucketizer now. I will refactor this PR with Bucketizer soon afterwards. You can check it in tomorrow, I think. Sorry for the delay, because I just finished my vacation.

@jkbradley
Copy link
Member

@yinxusen No problem, I know it's last minute & appreciate your efforts! I'll check tomorrow morning.

@SparkQA
Copy link

SparkQA commented May 15, 2015

Test build #32810 has finished for PR 5779 at commit 5ffa167.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class QuantileDiscretizer(override val uid: String)

@jkbradley
Copy link
Member

Reviewing now

private[feature] trait QuantileDiscretizerBase extends Params with HasInputCol with HasOutputCol {

/**
* Number of buckets to collect data points, which should be a positive integer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maximum number of buckets (quantiles, or categories) into which data points are grouped. Must be >= 2. (Please copy to IntParam doc below as well.)

@SparkQA
Copy link

SparkQA commented May 18, 2015

Test build #820 has finished for PR 5779 at commit 5ffa167.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class QuantileDiscretizer(override val uid: String)

@jkbradley
Copy link
Member

Ping. It'd be nice to get this into 1.6

@jkbradley
Copy link
Member

Checking for updates here too : )

@jkbradley
Copy link
Member

@yinxusen If you won't be able to keep working on this, please let me know. Someone else or I can take over. Thanks.

@yinxusen
Copy link
Contributor Author

@jkbradley No, I'll keep work on this, and sorry for the delay.

@SparkQA
Copy link

SparkQA commented Sep 23, 2015

Test build #42886 has finished for PR 5779 at commit 99d46ee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class QuantileDiscretizer(override val uid: String)

@SparkQA
Copy link

SparkQA commented Sep 23, 2015

Test build #42889 has finished for PR 5779 at commit 954d7b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class QuantileDiscretizer(override val uid: String)

/**
* Maximum number of buckets (quantiles, or categories) into which data points are grouped. Must
* be >= 2.
* @group param
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

State default value here (in Scala doc only, not IntParam doc)

@jkbradley
Copy link
Member

@yinxusen Thanks for the updates!

@SparkQA
Copy link

SparkQA commented Sep 24, 2015

Test build #42983 has finished for PR 5779 at commit 0e807bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class QuantileDiscretizer(override val uid: String)


package org.apache.spark.ml.feature

import org.apache.spark.Logging
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

organize imports

@jkbradley
Copy link
Member

I just made minor comments. It looks good, though I'm curious about whether that test caught bugs in the previous version of this PR.

@yinxusen
Copy link
Contributor Author

yinxusen commented Oct 1, 2015

Yes, it can catch the previous bug. In the previous setting, the check

checkDiscretizedData(sc,
      Array[Double](1, 2, 3, 3, 3, 3, 3, 3, 3),
      3,
      Array[Double](0, 1, 2, 2, 2, 2, 2, 2, 2),
      Array("-Infinity, 2.0", "2.0, 3.0", "3.0, Infinity"))

throws error because it misses the split point of 3.0.

@SparkQA
Copy link

SparkQA commented Oct 1, 2015

Test build #43159 has finished for PR 5779 at commit 9f878e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class QuantileDiscretizer(override val uid: String)

@jkbradley
Copy link
Member

OK thanks for confirming! LGTM, merging with master.

@asfgit asfgit closed this in 23a9448 Oct 2, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants