[MLLIB] [WIP] [SPARK-3702] Standardizing abstractions and developer API for prediction #3427

jkbradley · 2014-11-24T08:47:46Z

This is a WIP effort to standardize abstractions and the developer API for prediction tasks (classification and regression) in the new ML API (org.apache.spark.ml).

Please comment on:
- abstractions, class hierarchy
- functionality required by each abstraction
- naming of types and methods
- ease of use for developers
- ease of use for users migrating from org.apache.spark.mllib
Please ignore for now:
- missing tests and examples
- private/public API (I will make more things private[ml] after writing tests and examples.)
- style and other details
- the many TODO items noted in the code

Please refer to [https://issues.apache.org/jira/browse/SPARK-3702] for some discussion on design, and this design doc for major design decisions.

This is not intended to cover all algorithms; e.g., one big missing item is porting the GeneralizedLinearModel class to the new API. But it hopefully lays a fair amount of groundwork.

I have included a limited number of concrete classes in this WIP PR, for purposes of illustration:

- LogisticRegression (edited, to show effects of abstract classes)
- NaiveBayes (simple to show ease of use for developers)
- AdaBoost (demonstration of meta-algorithms taking advantage of abstractions)
  - (Note discussion of strong vs. weak types for ensemble methods in design doc.)
  - This implementation is very incomplete but illustrates using the abstractions.
- LinearRegression (example of Regressor, for completeness)
- evaluators (to provide default evaluators in the class hierarchy)
- IterativeSolver and IterativeEstimator (to expose iterative algorithms)
- LabeledPoint (Q: Should this include an instance weight?)
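To make the IterativeSolver / IterativeEstimator item more concrete, here is a minimal sketch of how a step-wise solver might be exposed. Only the names `Model`, `IterativeSolver`, and `IterativeEstimator` and their `M <: Model[M]` bounds come from this PR. The methods `step`, `currentModel`, `createSolver`, and `fit`, and the toy `ConstantEstimator`, are hypothetical and use plain Scala collections rather than Spark datasets.

```scala
object IterativeSketch {
  // Stand-in for the PR's Model type; only the F-bounded parameter is taken from the PR.
  abstract class Model[M <: Model[M]]

  // Hypothetical interface: a solver that is advanced one iteration at a time.
  abstract class IterativeSolver[M <: Model[M]] {
    /** Runs one iteration; returns false once the solver considers itself converged. */
    def step(): Boolean
    /** The model corresponding to the solver's current state. */
    def currentModel: M
  }

  // Hypothetical interface: an estimator that exposes its solver to callers.
  trait IterativeEstimator[M <: Model[M]] {
    def createSolver(labels: Seq[Double]): IterativeSolver[M]

    /** Default fit: iterate until convergence or until maxIterations is reached. */
    def fit(labels: Seq[Double], maxIterations: Int): M = {
      val solver = createSolver(labels)
      var i = 0
      while (i < maxIterations && solver.step()) i += 1
      solver.currentModel
    }
  }

  final class ConstantModel(val value: Double) extends Model[ConstantModel]

  // Toy estimator: gradient descent on squared error for a constant predictor.
  final class ConstantEstimator(stepSize: Double, tol: Double)
    extends IterativeEstimator[ConstantModel] {

    def createSolver(labels: Seq[Double]): IterativeSolver[ConstantModel] = {
      require(labels.nonEmpty, "labels must be non-empty")
      new IterativeSolver[ConstantModel] {
        private var value = 0.0

        def step(): Boolean = {
          // Gradient of 0.5 * mean((value - y)^2) with respect to value.
          val gradient = labels.map(value - _).sum / labels.size
          value -= stepSize * gradient
          math.abs(gradient) > tol
        }

        def currentModel: ConstantModel = new ConstantModel(value)
      }
    }
  }
}
```

A caller who wants control over the loop (for example, to evaluate after every iteration) could call `createSolver` and `step()` directly instead of `fit`.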

Items remaining:

- helper method for simulating a distribution over weighted instances by subsampling (for algorithms which do not support instance weights)
- several TODO items noted in the code
- add tests and examples
- general cleanup
- make more of hierarchy private[ml]
- split into several smaller PRs
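The subsampling helper in the first item could look roughly like the sketch below. It is not code from this PR: `WeightedInstance` and `resample` are hypothetical names, and it works on a local `IndexedSeq` rather than a distributed dataset. It draws instances with replacement, each with probability proportional to its weight, so that an algorithm without instance-weight support sees approximately the weighted distribution.

```scala
import scala.util.Random

object WeightedSubsampling {
  // Hypothetical local stand-in for a weighted labeled point.
  final case class WeightedInstance(label: Double, features: Vector[Double], weight: Double)

  /** Draws n instances with replacement, each with probability proportional to its weight. */
  def resample(
      data: IndexedSeq[WeightedInstance],
      n: Int,
      seed: Long): IndexedSeq[WeightedInstance] = {
    require(data.nonEmpty, "data must be non-empty")
    require(data.forall(_.weight >= 0.0), "weights must be non-negative")

    // Running totals of the weights: cumulative(i) = weight(0) + ... + weight(i).
    val cumulative = data.scanLeft(0.0)(_ + _.weight).tail
    val total = cumulative.last
    require(total > 0.0, "total weight must be positive")

    val rng = new Random(seed)
    IndexedSeq.fill(n) {
      val u = rng.nextDouble() * total
      // First index whose cumulative weight exceeds u; zero-weight instances are never chosen.
      val idx = cumulative.indexWhere(_ > u)
      data(if (idx >= 0) idx else data.length - 1)
    }
  }
}
```

The linear scan in `indexWhere` keeps the sketch short; a binary search over `cumulative` would be the natural choice for large inputs.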

General plan for splitting into multiple PRs, in order:

1. Simple class hierarchy
2. Evaluators
3. IterativeEstimator
4. AdaBoost
5. NaiveBayes (Any time after Evaluators)

Thanks to @epahomov and @BigCrunsh for input, including from [https://github.com//pull/2137] which improves upon the org.apache.spark.mllib APIs.

CC: @etrain @shivaram @mengxr

Abstract classes for learning algorithms:
* Classifier
* Regressor
* Predictor

Traits for learning algorithms:
* HasDefaultEstimator
* IterativeEstimator
* IterativeSolver
* ProbabilisticClassificationModel
* WeakLearner

Concrete classes: learning algorithms
* AdaBoost (partly implemented)
* NaiveBayes (rough implementation)
* LinearRegression
* LogisticRegression (updated to use new abstract classes)

Concrete classes: evaluation
* ClassificationEvaluator
* RegressionEvaluator
* PredictionEvaluator

Concrete classes: other
* LabeledPoint (adding weight to the old LabeledPoint)
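For readers who want a feel for how Predictor, Classifier, and Regressor relate, here is a simplified, Spark-free sketch. The two-parameter shape follows the `Classifier[NaiveBayes, NaiveBayesModel]` signature reported for this PR, and the three-field `LabeledPoint` mirrors the reported case class. Everything else is an assumption for illustration: the `train` and `predict` signatures, the use of Scala's `Vector[Double]` and `Seq` in place of Spark's types, and the toy `MeanRegressor`.

```scala
object PredictionSketch {
  // Local stand-in: the PR's LabeledPoint uses Spark's Vector type for features.
  final case class LabeledPoint(label: Double, features: Vector[Double], weight: Double = 1.0)

  // A fitted model that maps a feature vector to a prediction (assumed signature).
  abstract class PredictionModel[M <: PredictionModel[M]] {
    def predict(features: Vector[Double]): Double
  }

  // A learning algorithm that produces a PredictionModel (assumed signature).
  abstract class Predictor[Learner <: Predictor[Learner, M], M <: PredictionModel[M]] {
    def train(data: Seq[LabeledPoint]): M
  }

  // Regression and classification specialize Predictor without changing its shape.
  abstract class Regressor[Learner <: Regressor[Learner, M], M <: PredictionModel[M]]
    extends Predictor[Learner, M]

  abstract class Classifier[Learner <: Classifier[Learner, M], M <: PredictionModel[M]]
    extends Predictor[Learner, M]

  // Toy model: always predicts the same value.
  final class MeanRegressionModel(val mean: Double)
    extends PredictionModel[MeanRegressionModel] {
    def predict(features: Vector[Double]): Double = mean
  }

  // Toy regressor: fits the weighted mean of the labels.
  final class MeanRegressor extends Regressor[MeanRegressor, MeanRegressionModel] {
    def train(data: Seq[LabeledPoint]): MeanRegressionModel = {
      val totalWeight = data.map(_.weight).sum
      require(totalWeight > 0.0, "total weight must be positive")
      val weightedSum = data.map(p => p.label * p.weight).sum
      new MeanRegressionModel(weightedSum / totalWeight)
    }
  }
}
```

The self-referential type parameters let each concrete learner state its own model type, so `new MeanRegressor().train(data)` is typed as `MeanRegressionModel` without a cast.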

SparkQA · 2014-11-24T08:55:03Z

Test build #23780 has started for PR 3427 at commit 79f9fbc.

This patch merges cleanly.

SparkQA · 2014-11-24T08:55:15Z

Test build #23780 has finished for PR 3427 at commit 79f9fbc.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class LabeledPoint(label: Double, features: Vector, weight: Double)
- class NaiveBayes extends Classifier[NaiveBayes, NaiveBayesModel] with NaiveBayesParams
- class ClassificationEvaluator extends PredictionEvaluator
- class RegressionEvaluator extends PredictionEvaluator
- trait HasDefaultEvaluator
- trait IterativeEstimator[M <: Model[M]]
- abstract class IterativeSolver[M <: Model[M]]
- trait WeakLearner[M <: PredictionModel[M]]

AmplabJenkins · 2014-11-24T08:55:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23780/
Test FAILed.

jkbradley · 2014-12-08T20:02:07Z

I just submitted the first part of this PR: [https://github.com//pull/3637]

jkbradley · 2014-12-16T01:44:48Z

I'm closing this since I've begun breaking it into smaller PRs. I copied the PR description to the JIRA and will leave my WIP branch intact.

jkbradley added 2 commits November 23, 2014 23:19

fixed compilation issues, but have not added tests yet

79f9fbc

jkbradley mentioned this pull request Dec 8, 2014

[SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs #3637

Closed

jkbradley closed this Dec 16, 2014
