Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs #3637

Closed
wants to merge 30 commits into from

Conversation

jkbradley
Copy link
Member

This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]

UPDATE: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public:

  • new output columns: rawPrediction, probabilities
    • The “score” column is now called “rawPrediction”
  • Classifiers now provide numClasses
  • Params.get and .set are now protected instead of private[ml].
  • ParamMap now has a size method.
  • new classes: LinearRegression, LinearRegressionModel
  • LogisticRegression now has an intercept.

Sketch of APIs (most of which are private[spark] for now)

Abstract classes for learning algorithms (+ corresponding Model abstractions):

  • Classifier (+ ClassificationModel)
  • ProbabilisticClassifier (+ ProbabilisticClassificationModel)
  • Regressor (+ RegressionModel)
  • Predictor (+ PredictionModel)
  • For all of these:
    • There is no strongly typed training-time API.
    • There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.

Concrete classes: learning algorithms

  • LinearRegression
  • LogisticRegression (updated to use new abstract classes)
    • Also, removed "score" in favor of "probability" output column. Changed BinaryClassificationEvaluator to match. (SPARK-5031)

Other updates:

  • params.scala: Changed Params.set/get to be protected instead of private[ml]
    • This was needed for the example of defining a class from outside of the MLlib namespace.
  • VectorUDT: Will later change from private[spark] to public.
    • This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
    • Also, added equals() method.f
  • SPARK-4942 : ML Transformers should allow output cols to be turned on,off
    • Update validateAndTransformSchema
    • Update transform
  • (Updated examples, test suites according to other changes)

New examples:

  • DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
    • Added Java version too

Test Suites:

  • LinearRegressionSuite
  • LogisticRegressionSuite
  • + Java versions of above suites

CC: @mengxr @etrain @shivaram

@SparkQA
Copy link

SparkQA commented Dec 8, 2014

Test build #24226 has started for PR 3637 at commit 1e46094.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 8, 2014

Test build #24226 has finished for PR 3637 at commit 1e46094.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LabeledPoint(label: Double, features: Vector, weight: Double)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24226/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 8, 2014

Test build #24227 has started for PR 3637 at commit 83109eb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 8, 2014

Test build #24227 has finished for PR 3637 at commit 83109eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LabeledPoint(label: Double, features: Vector, weight: Double)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24227/
Test FAILed.

@jkbradley
Copy link
Member Author

The test failure reveals an issue in Spark SQL (ScalaReflection.scala:121 in schemaFor) where it gets confused if the case class includes multiple constructors. The default behavior should probably be to take the constructor with the most arguments, but I'll consult others about this. This PR may be on temporary hold...but feel free to comment since the Spark SQL fix should not require changing this PR!

*/
@AlphaComponent
@BeanInfo
case class LabeledPoint(label: Double, features: Vector, weight: Double) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a label of LabeledPoint assumed as only Double? I think there are some cases where label is not Double such as one-of-k encoding. It seems better not to restrict to Double type. If I missed some alternatives, sorry for that and please let me know.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some discussion of this in the design doc linked from the JIRA. Basically, there could be a whole range of types, and it's a question of simplicity of the API vs. strong typing. I thought about templatizing LabeledPoint by LabelType and FeaturesType, but it makes developers & users have to write a bunch more boilerplate whenever they specify types. The current plan is to use Double for single labels and eventually some other type (Vector?) for multiple labels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also more sympathetic with a strongly-typed API here rather than overload floating-point values to represent unordered categories. Are there really so many possibilities? Any continuous or ordinal value really does naturally translate to a double. Categoricals are the only other type of value that needs a separate representation.

I feel like this misses some opportunities to optimize the internal representation (e.g. a Dataset whose feature is known to be one of N values doesn't need a double, but potentially just N bits) and avoid ambiguities of representation (is negative -1? 0?) This is one of those areas where the 'simple' API just seems to push complexity elsewhere or ignore it. An algorithm either has to have its own checks for whether 1.0 is a category or not, or, overlooks the distinction. Same with the caller.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you about strong types being nice. I think optimizations with using fewer bits should remain internal. For the user-facing API, it's really a question about whether we want to incur some extra overhead with strong types:

  • Users & developers have to write LabeledPoint[Double, Vector] or LabeledPoint[Int, Vector] instead of LabeledPoint.
    • As far as I know, we can't have default type parameters, so users could never write LabeledPoint. (a rare time I miss C++)
  • Algorithm APIs get messier to look at (passing about LabelType and FeaturesType). This is especially annoying with meta-algorithms (boosting & bagging).

Personally, I'd be OK with heavy typing (coming from C++ land), but it might offend some Scala users.

CC: @mengxr since I know you have opinions about this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it need not be designed with generic types. In fact it can't really since there are N features of different types. But you can have a Feature class with subclasses for ordered and categorical types. That too has its own tradeoffs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I think I'd be OK with the 2nd option above as long as we can come up with simple APIs for users who don't want to think about types. That might involve default conversions between types (like one-hot encoding, and binning).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the former. Yeah, that's probably the downside. Each data element is at least an object, and you can't have it reduce to a double[] under the hood.

In the second example, I think you'd only ever really want MixedFeatures as an abstraction. There's no need to think of all CategoricalFeatures as a special case deserving a unique abstraction.

I suppose if you abstract the entire training example as an object, and allow accessors like getNumericFeature(index: Int), getCategoricalFeature(index: Int) you can still internally optimize the storage while exposing a richer object representation. You get the type safety and optimization opportunity.

Sure, an Array[Double] could easily be translated into one of the more elaborate representations above. I suppose I myself wouldn't want to make it too easy to not think about types!

Anyway, up to your judgment really. There are arguments several ways here. Worth a thought to see if the idea of a bit more abstraction appeals to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main pros & cons I see for having continuous & categorical types for labels & features are:

Pros:

  • Type safety.

Cons:

  • Algorithms may need to make copies of the data, depending on how much we expose internals of a features type.
  • Users may have to worry more about types.
    • For labels, if we load data from a file without metadata (like libsvm), we may need to assume that everything is continuous. Users will have to explicitly cast labels to categorical for classification.
    • For features, strong typing implies a stronger contract, where the assumption is that users specify the correct types. I've been wondering about having more "best effort" APIs, where we take suggestions from users (like DecisionTree's categoricalFeaturesInfo) but otherwise try to infer the best types to use under the hood.

These lists started out much more balanced, but I guess I'm voting for the old system where everything is a Double.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good points. I think there's a "pro" for optimizing storage but that's a bit secondary.

I don't think a caller has to 'translate' 0/1 labels to categorical. These can be labels called 0 and 1. Given schema information, all of this is stuff frameworks can do for you. Is there really a case where the user doesn't know schema types, suggests a type, and lets the framework override it?

So, let's say my feature takes on "foo", "bar", "baz" as values. Doesn't the caller always have to translate to/from numbers? no big deal but is that simpler? I think the schema abstraction is going to help with this, I imagine.

Anyway, not really strongly arguing with you here, just completing the discussion. An array of doubles is kind of raw but certainly does the job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For optimizing storage, I think we can do it in either case (Vector for weakly typing, or Array[Double] for strong typing).

I too agree both options are good & just wanted to hash out the pros & cons.

Is there really a case where the user doesn't know schema types, suggests a type, and lets the framework override it?

That's not quite what I meant. I was thinking of 2 use cases:

  1. An expert user specifies types. The algorithm should use those types.
  2. A beginner user does not explicitly specify types, but uses the typed API so that loaded data are given default types (probably continuous). Following the first case, the algorithm should use the given types, but following intuition, the algorithm should infer types.
    This is less of an issue if the typed interface is designated as an expert API.

I'm starting to wonder if the typed interface should be completely public. I'll move this to a non-inline comment so the rest of our discussion does not get hidden in the future.

@jkbradley
Copy link
Member Author

Question: Do people have preferences for the name of what is currently "predictRaw?" Possibilities are:

predictRaw()
predictConfidence()
confidences()

(I did not include predictScore since "score" is already taken by logreg to mean the probability estimate.)

@jkbradley
Copy link
Member Author

@srowen @Lewuathe Continuing the above inline discussion...

Question: Should the typed interface be public?

New proposal: Hide the typed interface of Estimators. Leave the typed interface of Transformers exposed.

Argument:

  • The typed interface loses metadata which SchemaRDD can (but does not yet) store.
    • E.g., for Classifiers, it is good to know the number of classes to predict, which features are categorical, and the number of categories for each categorical feature. The current typed train() methods do not have this info; to pass in this info, we'll need either (a) extra parameters in train() which would make Classifiers have a different signature than other Estimators' train() methods or (b) extra embedded parameters in Classifiers which would be ignored when using the fit(SchemaRDD) interface. Neither option sounds good to me.
    • We could use a typed interface with stronger typing for features, but that would still not cover metadata like # classes / categories.
    • This metadata is important for training, but it is not important for testing. We would just need to make sure that Vectors passed predict() methods had the same feature order as used for training.
  • I would guess the typed interface would be most useful for Models. This is based on me assuming that:
    • Models will be kept for longer and might have predict() methods called multiple times, including on individual instances, and
    • Models might need typed APIs for efficiency if used in production.

What do you think?

@srowen
Copy link
Member

srowen commented Dec 9, 2014

Pardon, which part are you referring to when you say the typed interface? The metadata about what is a categorical feature, what the values are, etc? I assumed that this information is in part derived from what's in the SchemaRDD -- a string must be categorical, for example. The information would live in a Model, either implicitly explicitly: logistic regression always knows its target is categorical; may need to remember how to treat the other features when it receives new data. And to some degree this info would be injected as parameters to an Estimator somehow.

@jkbradley
Copy link
Member Author

Oh, apologies for being unclear. I meant this division:

  • Typed interface: train(RDD[LabeledPoint]), predict(Vector)
  • SchemaRDD interface: fit(SchemaRDD), transform(SchemaRDD)

I agree metadata is less of an issue for Models since they can get metadata from their parent Estimator. The main issue is specifying metadata for Estimators.

@shivaram
Copy link
Contributor

@jkbradley Apologies for the delay - I just read your design doc and am catching up on this discussion.
Sorry if I missed something, but could you clarify the use case here ? I can see two kinds of scenarios

  1. Cases where we just want to use existing classifier like LogisticRegression in a pipeline. I guess the train() interface shouldn't really matter here as they we will be passing around SchemaRDDs in a pipeline and calling fit (thus going through the untyped API ?).
  2. Cases where developers want to implement a new Classification or Regression method. For these cases I think the strongly typed API would help in reducing the amount of cruft code and possible bugs in extracting features, labels etc.

FWIW I agree with the conclusion of keeping LabeledPoint simple as (Double, Array[Double]). And I think predictRaw is also probably fine as meaning of the values returned may vary (as noted in your comment).

@Lewuathe
Copy link
Contributor

My initial question was based on the viewpoint of developer api.
Simple api seems to restrict the possibility of implementation of new algorithm. As @shivaram mentioned, developer cannot use optimization api because these datatypes does not support simply multi classification etc. So I think typed interface does not necessarily be public in this context. If you have different viewpoint such as end user of this api, that opinion should be included in this discussion.

@srowen
Copy link
Member

srowen commented Dec 10, 2014

So, I may not be 100% up to speed with the new API and these changes, so my comments may be a bit off, but:

An Estimator makes a Model. To make a model, you need "raw data" and its interpretation, if you will. a LabeledPoint is "raw data". That alone is not sufficient to train a Classifier (Estimator). Yes, this extra info has to come from somewhere.

I agree that SchemaRDD contains, or could contain, or could be made to deduce, this extra interpretation, so the SchemaRDD API makes sense to me.

If LabeledPoint is to remain the "raw data", given the conversation here, then it has to be parameters or something. I think you still need these for testing, right? you still need to know what the raw data means. Or is it assumed that the built Classifier / Model stores this info?

This is sort of a rehash of the same exchange we just had, in that the question is caused by the input data abstraction not really containing all the input -- the metadata comes along separately. Which could be OK but yes it means this question pops up somewhere else in the API.

Yes, a Model may be able to remember the metadata and accept raw LabeledPoints in the future. You just have to make sure you are feeding raw LabeledPoints that use the same metadata, but that's a given no matter how you design this.

To answer the question: given the question, I'd hide the typed API, I suppose. I think the typed API has to take some other values to contain metadata like the type of features, etc. These could be more parameters, then? it kind of overloads the meaning, since the parameters look like they are intended to be hyper parameters. But it's not crazy.

Transformations: these feel like these could meaningfully operate on raw data, so, typed API makes sense to me and could be public now.

@jkbradley
Copy link
Member Author

Thanks everyone for all of the comments!

@shivaram No problem, thanks for checking out the design doc! The 2 main use cases you listed are correct. There is a remaining question about whether the strongly typed API should remain public; my initial plan was to make it public, but the above discussion is making me wonder if part should be private.

@Lewauthe I agree about discussing whether the typed API should be public or private. (But I'm not quite sure what you meant about the optimization API; please clarify.) I hope the updates to the design doc below help.

@srowen It sounds like we're converging to a solution, and I hope some use cases in the design doc will help.

I've updated the design doc for [https://issues.apache.org/jira/browse/SPARK-3702], which is the parent task of [https://issues.apache.org/jira/browse/SPARK-4789]. (I'm asking someone to add the subtask connection.) The design doc is linked from the main JIRA and also here:
[https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/edit?usp=sharing]

Perhaps we can discuss more on the JIRA rather than here.

asfgit pushed a commit that referenced this pull request Dec 11, 2014
…ctors

Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors.  Added test to suite which failed before but works now.

Needed for [#3637]

CC: marmbrus

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3646 from jkbradley/sql-reflection and squashes the following commits:

796b2e4 [Joseph K. Bradley] Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors.  Added test to suite which failed before but works now.
@jkbradley
Copy link
Member Author

This PR is paused until next week, pending some discussion.

@jkbradley
Copy link
Member Author

I just pushed an update based on major design refactoring, but I still need to add 1 item (DeveloperApiExample for Java) and update the PR description. Will do soon...

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24905 has started for PR 3637 at commit 1be9892.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24905 has finished for PR 3637 at commit 1be9892.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // where index i corresponds to class i (i = 0, 1).
    • new Param(this, "probabilityCol", "column name for predicted class conditional probabilities",
    • class VectorUDT extends UserDefinedType[Vector]
    • case class Sort(

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24905/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24912 has started for PR 3637 at commit bb02582.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24913 has started for PR 3637 at commit 1bb315f.

  • This patch merges cleanly.

1.0 / (1.0 + math.exp(-margin))

// Output selected columns only.
// This is a bit complicated since it tries to avoid repeated computation.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented transform() here to avoid repeated computation. This improves upon the default implementation in ProbabilisticClassificationModel. However, it’s a lot of code, so I would be fine with removing it. There is also a question of whether all algorithms should implement a method which would allow the ProbabilisticClassificationModel.transform implementation to avoid repeated computation:

  • protected def raw2prob(rawPredictions: Vector): Vector = // compute probabilities from raw predictions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is duplicated and what duplicate computation is being avoided? Can some refactoring of both be done to make them more modular?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data flow goes:
features --> rawPrediction --> probabilities --> prediction

If we want all of these values, then it is fastest to compute them in this sequence. If we only want some of these values, then we do not want to pollute the Schema namespace (names of columns) by computing all 4 values. Therefore, we need UDFs to handle all possible downstream links here.

We could abstract the key links: features2raw, raw2prob, prob2pred, raw2pred, and then compose those as needed. Does that sound best?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this, I think abstracting the key links might be best. It will certainly make LogisticRegression much shorter since prediction takes up most of the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be a part of this PR ? I'm also okay with adding a TODO here to add the functions later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a TODO for now...but hope I have time to make it part of this PR.

@jkbradley jkbradley changed the title [SPARK-4789] [mllib] Standardize ML Prediction APIs [SPARK-4789] [SPARK-4942] [mllib] Standardize ML Prediction APIs Dec 30, 2014
@SparkQA
Copy link

SparkQA commented Feb 6, 2015

Test build #26875 has started for PR 3637 at commit fec348a.

  • This patch merges cleanly.

@shivaram
Copy link
Contributor

shivaram commented Feb 6, 2015

Thanks @jkbradley . Could you summarize what (if any) public APIs we are releasing as a part of this change ?

@jkbradley
Copy link
Member Author

@shivaram I just updated the description at the top with the list of public changes.

@jkbradley
Copy link
Member Author

I believe that last commit covers everything. My only question now is whether I should remove all of the MimaExcludes lines for spark.ml since I added a blanket exception (since spark.ml is an alpha component). Or are they nice for the record?
CC: @mengxr

@SparkQA
Copy link

SparkQA commented Feb 6, 2015

Test build #26875 has finished for PR 3637 at commit fec348a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaDeveloperApiExample
    • // where index i corresponds to class i (i = 0, 1).
    • // where index i corresponds to class i (i = 0, 1).
    • class DoubleParam(parent: Params, name: String, doc: String, defaultValue: Option[Double])
    • class IntParam(parent: Params, name: String, doc: String, defaultValue: Option[Int])
    • class FloatParam(parent: Params, name: String, doc: String, defaultValue: Option[Float])
    • class LongParam(parent: Params, name: String, doc: String, defaultValue: Option[Long])
    • class BooleanParam(parent: Params, name: String, doc: String, defaultValue: Option[Boolean])
    • new Param(this, "probabilityCol", "column name for predicted class conditional probabilities",
    • class VectorUDT extends UserDefinedType[Vector]

@SparkQA
Copy link

SparkQA commented Feb 6, 2015

Test build #26885 has started for PR 3637 at commit 405bfb8.

  • This patch merges cleanly.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26875/
Test PASSed.

@shivaram
Copy link
Contributor

shivaram commented Feb 6, 2015

Sounds good. Thanks @jkbradley

@SparkQA
Copy link

SparkQA commented Feb 6, 2015

Test build #26885 has finished for PR 3637 at commit 405bfb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaDeveloperApiExample
    • // where index i corresponds to class i (i = 0, 1).
    • * Here, we have a trait to be mixed in with the Estimator and Model (MyLogisticRegression
    • * class since the maxIter parameter is only used during training (not in the Model).
    • // where index i corresponds to class i (i = 0, 1).
    • class DoubleParam(parent: Params, name: String, doc: String, defaultValue: Option[Double])
    • class IntParam(parent: Params, name: String, doc: String, defaultValue: Option[Int])
    • class FloatParam(parent: Params, name: String, doc: String, defaultValue: Option[Float])
    • class LongParam(parent: Params, name: String, doc: String, defaultValue: Option[Long])
    • class BooleanParam(parent: Params, name: String, doc: String, defaultValue: Option[Boolean])
    • new Param(this, "probabilityCol", "column name for predicted class conditional probabilities",

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26885/
Test PASSed.

asfgit pushed a commit that referenced this pull request Feb 6, 2015
…ion APIs

This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]

**UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion.  Here is a list of changes which are public:
* new output columns: rawPrediction, probabilities
  * The “score” column is now called “rawPrediction”
* Classifiers now provide numClasses
* Params.get and .set are now protected instead of private[ml].
* ParamMap now has a size method.
* new classes: LinearRegression, LinearRegressionModel
* LogisticRegression now has an intercept.

### Sketch of APIs (most of which are private[spark] for now)

Abstract classes for learning algorithms (+ corresponding Model abstractions):
* Classifier (+ ClassificationModel)
* ProbabilisticClassifier (+ ProbabilisticClassificationModel)
* Regressor (+ RegressionModel)
* Predictor (+ PredictionModel)
* *For all of these*:
 * There is no strongly typed training-time API.
 * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.

Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)
 * Also, removed "score" in favor of "probability" output column.  Changed BinaryClassificationEvaluator to match. (SPARK-5031)

Other updates:
* params.scala: Changed Params.set/get to be protected instead of private[ml]
 * This was needed for the example of defining a class from outside of the MLlib namespace.
* VectorUDT: Will later change from private[spark] to public.
 * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
 * Also, added equals() method.f
* SPARK-4942 : ML Transformers should allow output cols to be turned on,off
 * Update validateAndTransformSchema
 * Update transform
* (Updated examples, test suites according to other changes)

New examples:
* DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
 * Added Java version too

Test Suites:
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites

CC: mengxr  etrain  shivaram

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:

405bfb8 [Joseph K. Bradley] Last edits based on code review.  Small cleanups
fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
8316d5e [Joseph K. Bradley] fixes after rebasing on master
fc62406 [Joseph K. Bradley] fixed test suites after last commit
bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
f549e34 [Joseph K. Bradley] Updates based on code review.  Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT.   * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR).  Fixed Java suites
0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
c3c8da5 [Joseph K. Bradley] small cleanup
934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel.  Also introduced ProbabilisticClassifier.  * This was to support output column “probabilityCol” in transform().
4e2f711 [Joseph K. Bradley] rat fix
bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
e433872 [Joseph K. Bradley] Updated docs.  Added LabeledPointSuite to spark.ml
54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString).  Fixed bug in persisting logreg data.  Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString.  Cleaned up classes in class hierarchy, before implementing tests and examples.
d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:

(cherry picked from commit dc0c449)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in dc0c449 Feb 6, 2015
@mengxr
Copy link
Contributor

mengxr commented Feb 6, 2015

LGTM. Merged into master and branch-1.3. Thanks everyone for the discussion! @jkbradley We can remove mima excludes in another PR.

@pgirolami
Copy link

After doing a clean pull of master from the repo a couple minutes ago, it fails to build due to commit dc0c449 for this PR .
Build was done using mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -Phive -Phive-thriftserver -Pbigtop-dist -DskipTests package install
If I should report this as a JIRA, please let me know.

[ERROR] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[ERROR] /home/sensorly/spark/spark/mllib/src/test/java/org/apache/spark/ml/classification/JavaLogisticRegressionSuite.java:87: error: incomparable types: Object and int
[ERROR]     assert(model.fittingParamMap().apply(lr.maxIter()) == 10);
[ERROR]                                                        ^
[ERROR] /home/sensorly/spark/spark/mllib/src/test/java/org/apache/spark/ml/classification/JavaLogisticRegressionSuite.java:112: error: incomparable types: Object and int
[ERROR]     assert(model2.fittingParamMap().apply(lr.maxIter()) == 5);
[ERROR]                                                         ^
[ERROR] /home/sensorly/spark/spark/mllib/src/test/java/org/apache/spark/ml/regression/JavaLinearRegressionSuite.java:79: error: incomparable types: Object and int
[ERROR]     assert(model.fittingParamMap().apply(lr.maxIter()) == 10);
[ERROR]                                                        ^
[ERROR] /home/sensorly/spark/spark/mllib/src/test/java/org/apache/spark/ml/regression/JavaLinearRegressionSuite.java:85: error: incomparable types: Object and int
[ERROR]     assert(model2.fittingParamMap().apply(lr.maxIter()) == 5);

@jkbradley
Copy link
Member Author

@pgirolami What version of Java are you using? Those work for me. However, I should have used assertEquals(), so I'll submit a patch for that which should fix the compilation problems you're encountering.

@srowen
Copy link
Member

srowen commented Feb 6, 2015

@jkbradley I am seeing the same compilation errors. I'm compiling with Java 8 FWIW.

@pgirolami
Copy link

@jkbradley Maven says it's using JDK 6

Philippes-MacBook-Air-3:~ Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.6.0_65, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac"

Philippes-MacBook-Air-3:~ Philippe$ uname -a
Darwin Philippes-MacBook-Air-3.local 13.4.0 Darwin Kernel Version 13.4.0: Sun Aug 17 19:50:11 PDT 2014; root:xnu-2422.115.4~1/RELEASE_X86_64 x86_64

@pgirolami
Copy link

Same problem using JDK 7 on my Mac

Philippes-MacBook-Air-3:spark Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.7.0_75, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_75.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac"

@srowen
Copy link
Member

srowen commented Feb 7, 2015

BTW this has been fixed already in master. But it looks like it is a small difference in Java 6 vs 7:
http://stackoverflow.com/questions/16119638/differences-in-auto-unboxing-between-java-6-vs-java-7
EDIT: wait, that's the other way around, isn't it? Here, Java 6 works. Color me confused.

@pgirolami
Copy link

For me it works with neither Java 6 nor Java 7.
Unless, of course, maven -version is lying...

Just pulled master and ran the same maven command (no clean) and indeed, it now compiles mllib (using Java 7). Thanks !

@petro-rudenko
Copy link
Contributor

One more issue. In LogisticRegressionWithLBFGS class there's a line:

this.setFeatureScaling(true)

I have feature scaling as a part of pipeline to produce new columns based on scaled columns. But i can't tell to the LogisticRegression class from the new API to set feature scaling to false. Also there's no actual way to set validateData = false in GeneralizedLinearAlgorithm.

@jkbradley
Copy link
Member Author

@petro-rudenko (Apologies for the slow response; I've been without Internet for a week.) About feature scaling, I don't think it's a problem in terms of correctness for LogisticRegression since scaling twice should produce the same result as scaling once (since it handles scaling at prediction time too). It is wasted computation though.

For GLMs, you should be able to call setValidateData: [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L156]

@petro-rudenko
Copy link
Contributor

@jkbradley i can setValidateData in GLM, but not in the LogisticRegression class from the new API. For my case found a trick to customize anything i want (add org.apache.spark.ml package to my project and extends any class). When this API would be public it would be easier to customize (e.g. use LogisticRegressionWithSGD except for LRWithLBFGS) in user's namespace.

@jkbradley
Copy link
Member Author

Oh, I see. It's true there are missing features in spark.ml currently. Please feel free to make a JIRA to prioritize it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.