[MLLIB] [WIP] [SPARK-3702] Standardizing abstractions and developer API for prediction #3427

Closed
wants to merge 2 commits into from

jkbradley
Member

This is a WIP effort to standardize the abstractions and developer API for prediction tasks (classification and regression) in the new ML API (org.apache.spark.ml).

  • Please comment on:
    • abstractions, class hierarchy
    • functionality required by each abstraction
    • naming of types and methods
    • ease of use for developers
    • ease of use for users migrating from org.apache.spark.mllib
  • Please ignore for now:
    • missing tests and examples
    • private/public API (I will make more things private[ml] after writing tests and examples.)
    • style and other details
    • the many TODO items noted in the code

Please refer to [https://issues.apache.org/jira/browse/SPARK-3702] for some discussion on design, and this design doc for major design decisions.

This is not intended to cover all algorithms; e.g., one big missing item is porting the GeneralizedLinearModel class to the new API. But it hopefully lays a fair amount of groundwork.

I have included a limited number of concrete classes in this WIP PR, for purposes of illustration:

  • LogisticRegression (edited, to show effects of abstract classes)
  • NaiveBayes (simple to show ease of use for developers)
  • AdaBoost (demonstration of meta-algorithms taking advantage of abstractions)
    • (Note discussion of strong vs. weak types for ensemble methods in design doc.)
    • This implementation is very incomplete but illustrates using the abstractions.
  • LinearRegression (example of Regressor, for completeness)
  • evaluators (to provide default evaluators in the class hierarchy)
  • IterativeSolver and IterativeEstimator (to expose iterative algorithms)
  • LabeledPoint (Q: Should this include an instance weight?)
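
To make the LabeledPoint question concrete, here is a minimal sketch of what an instance weight could look like. It is an illustration only: Scala's built-in `Vector[Double]` stands in for the MLlib vector type, and the default weight of 1.0 and the `weightedMeanLabel` helper are assumptions, not part of this PR.

```scala
// Sketch only: Vector[Double] is a stand-in for org.apache.spark.mllib.linalg.Vector,
// and the default weight of 1.0 is an assumption, not taken from the PR.
final case class LabeledPoint(label: Double, features: Vector[Double], weight: Double = 1.0) {
  require(weight >= 0.0, s"Instance weight must be non-negative but was $weight")
}

object LabeledPointExample {
  // Hypothetical helper showing how a weight-aware algorithm might use the field:
  // each label contributes in proportion to its instance weight.
  def weightedMeanLabel(points: Seq[LabeledPoint]): Double = {
    val totalWeight = points.map(_.weight).sum
    require(totalWeight > 0.0, "Total weight must be positive")
    points.map(p => p.label * p.weight).sum / totalWeight
  }
}
```

With a default weight, existing call sites that pass only a label and features would keep compiling, which may ease migration from org.apache.spark.mllib.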

Items remaining:

  • helper method for simulating a distribution over weighted instances by subsampling (for algorithms which do not support instance weights)
  • several TODO items noted in the code
  • add tests and examples
  • general cleanup
  • make more of hierarchy private[ml]
  • split into several smaller PRs
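
The subsampling helper in the first item could take several forms. The sketch below uses rejection sampling on a local `Seq` purely for illustration: each instance is kept with probability weight / maxWeight. The object name, method name, and signature are hypothetical, and a real version would operate on a distributed dataset.

```scala
import scala.util.Random

// Hypothetical helper, not from the PR: simulates a weighted distribution for
// algorithms that cannot consume instance weights directly.
object WeightedSubsampling {
  /** Keeps each instance with probability weight / maxWeight (rejection sampling). */
  def subsample[T](instances: Seq[(T, Double)], seed: Long): Seq[T] = {
    require(instances.forall(_._2 >= 0.0), "Weights must be non-negative")
    val maxWeight = if (instances.isEmpty) 0.0 else instances.map(_._2).max
    if (maxWeight == 0.0) {
      // No instances, or all weights are zero: nothing can be sampled.
      Seq.empty
    } else {
      val rng = new Random(seed)
      // nextDouble() is in [0, 1), so max-weight instances are always kept
      // and zero-weight instances are never kept.
      instances.filter { case (_, w) => rng.nextDouble() < w / maxWeight }.map(_._1)
    }
  }
}
```

The resulting sample is unweighted, so it can be passed to any learner. The trade-off is that low-weight instances are often dropped, which increases variance compared with native weight support.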

General plan for splitting into multiple PRs, in order:

  1. Simple class hierarchy
  2. Evaluators
  3. IterativeEstimator
  4. AdaBoost
  5. NaiveBayes (Any time after Evaluators)

Thanks to @epahomov and @BigCrunsh for input, including from [https://github.com//pull/2137] which improves upon the org.apache.spark.mllib APIs.

CC: @etrain @shivaram @mengxr

Abstract classes for learning algorithms:
* Classifier
* Regressor
* Predictor

Traits for learning algorithms:
* HasDefaultEvaluator
* IterativeEstimator
* IterativeSolver
* ProbabilisticClassificationModel
* WeakLearner

Concrete classes: learning algorithms
* AdaBoost (partly implemented)
* NaiveBayes (rough implementation)
* LinearRegression
* LogisticRegression (updated to use new abstract classes)

Concrete classes: evaluation
* ClassificationEvaluator
* RegressionEvaluator
* PredictionEvaluator

Concrete classes: other
* LabeledPoint (adding weight to the old LabeledPoint)
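
A rough, self-contained sketch of how the abstract classes listed above might relate. The two type parameters mirror the `Classifier[NaiveBayes, NaiveBayesModel]` signature reported by the build, but everything else is assumed for illustration: the `train` and `predict` method names, the `Instance` type, and the use of a local `Seq` in place of a Spark dataset. The toy `MeanRegressor` is not part of the PR.

```scala
object HierarchySketch {
  // Simplified stand-ins; the PR's classes work with Spark datasets and Params.
  type Features = Vector[Double]
  final case class Instance(label: Double, features: Features)

  // F-bounded type parameters let methods refer to the concrete subclass type.
  abstract class PredictionModel[M <: PredictionModel[M]] {
    def predict(features: Features): Double
  }

  abstract class Predictor[L <: Predictor[L, M], M <: PredictionModel[M]] {
    def train(data: Seq[Instance]): M
  }

  abstract class Regressor[L <: Regressor[L, M], M <: PredictionModel[M]]
    extends Predictor[L, M]

  abstract class Classifier[L <: Classifier[L, M], M <: PredictionModel[M]]
    extends Predictor[L, M]

  // Toy concrete pair (hypothetical): always predicts the mean training label.
  class MeanRegressionModel(val mean: Double) extends PredictionModel[MeanRegressionModel] {
    override def predict(features: Features): Double = mean
  }

  class MeanRegressor extends Regressor[MeanRegressor, MeanRegressionModel] {
    override def train(data: Seq[Instance]): MeanRegressionModel = {
      require(data.nonEmpty, "Training data must not be empty")
      new MeanRegressionModel(data.map(_.label).sum / data.size)
    }
  }
}
```

In this arrangement a developer adding a new algorithm only implements `train` and `predict`, and a meta-algorithm can accept any `Predictor` without knowing the concrete model type.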
@SparkQA

SparkQA commented Nov 24, 2014

Test build #23780 has started for PR 3427 at commit 79f9fbc.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23780 has finished for PR 3427 at commit 79f9fbc.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LabeledPoint(label: Double, features: Vector, weight: Double)
    • class NaiveBayes extends Classifier[NaiveBayes, NaiveBayesModel] with NaiveBayesParams
    • class ClassificationEvaluator extends PredictionEvaluator
    • class RegressionEvaluator extends PredictionEvaluator
    • trait HasDefaultEvaluator
    • trait IterativeEstimator[M <: Model[M]]
    • abstract class IterativeSolver[M <: Model[M]]
    • trait WeakLearner[M <: PredictionModel[M]]
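
For readers unfamiliar with the iterative abstractions named in this list, the following sketch shows one way an `IterativeSolver` and `IterativeEstimator` pair could fit together. Only the two type signatures come from the build output. The `step`, `currentModel`, `createSolver`, and `fit` members, the local `Model` class, and the toy `MeanEstimator` are assumptions made up for illustration.

```scala
object IterativeSketch {
  // Stand-in for the ML API's Model class.
  abstract class Model[M <: Model[M]]

  abstract class IterativeSolver[M <: Model[M]] {
    /** Runs one iteration; returns false once the update is negligible. */
    def step(): Boolean
    /** The model as of the most recent iteration. */
    def currentModel: M
  }

  trait IterativeEstimator[M <: Model[M]] {
    def createSolver(data: Seq[Double]): IterativeSolver[M]

    // Default driver loop: callers who want per-iteration control can use the solver directly.
    def fit(data: Seq[Double], maxIterations: Int): M = {
      val solver = createSolver(data)
      var iteration = 0
      while (iteration < maxIterations && solver.step()) {
        iteration += 1
      }
      solver.currentModel
    }
  }

  // Toy example (hypothetical): moves an estimate toward the data mean a fraction at a time.
  class MeanModel(val value: Double) extends Model[MeanModel]

  class MeanEstimator(stepSize: Double, tolerance: Double = 1e-12)
      extends IterativeEstimator[MeanModel] {
    override def createSolver(data: Seq[Double]): IterativeSolver[MeanModel] = {
      require(data.nonEmpty, "Need at least one value")
      val target = data.sum / data.size
      new IterativeSolver[MeanModel] {
        private var estimate = 0.0
        override def step(): Boolean = {
          val update = stepSize * (target - estimate)
          estimate += update
          math.abs(update) > tolerance
        }
        override def currentModel: MeanModel = new MeanModel(estimate)
      }
    }
  }
}
```

Exposing the solver separately is what would let a caller inspect or evaluate intermediate models, for example to stop early on a validation metric.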

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23780/

@jkbradley
Member Author

I just submitted the first part of this PR: [https://github.com//pull/3637]

@jkbradley
Member Author

I'm closing this since I've begun breaking it into smaller PRs. I copied the PR description to the JIRA and will leave my WIP branch intact.

@jkbradley jkbradley closed this Dec 16, 2014