-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs #3637
Conversation
Test build #24226 has started for PR 3637 at commit
|
Test build #24226 has finished for PR 3637 at commit
|
Test FAILed. |
Test build #24227 has started for PR 3637 at commit
|
Test build #24227 has finished for PR 3637 at commit
|
Test FAILed. |
The test failure reveals an issue in Spark SQL (ScalaReflection.scala:121 in schemaFor) where it gets confused if the case class includes multiple constructors. The default behavior should probably be to take the constructor with the most arguments, but I'll consult others about this. This PR may be on temporary hold...but feel free to comment since the Spark SQL fix should not require changing this PR! |
*/ | ||
@AlphaComponent | ||
@BeanInfo | ||
case class LabeledPoint(label: Double, features: Vector, weight: Double) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is a label of LabeledPoint
assumed as only Double
? I think there are some cases where label is not Double
such as one-of-k encoding. It seems better not to restrict to Double
type. If I missed some alternatives, sorry for that and please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some discussion of this in the design doc linked from the JIRA. Basically, there could be a whole range of types, and it's a question of simplicity of the API vs. strong typing. I thought about templatizing LabeledPoint by LabelType and FeaturesType, but it makes developers & users have to write a bunch more boilerplate whenever they specify types. The current plan is to use Double for single labels and eventually some other type (Vector?) for multiple labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also more sympathetic with a strongly-typed API here rather than overload floating-point values to represent unordered categories. Are there really so many possibilities? Any continuous or ordinal value really does naturally translate to a double. Categoricals are the only other type of value that needs a separate representation.
I feel like this misses some opportunities to optimize the internal representation (e.g. a Dataset whose feature is known to be one of N values doesn't need a double, but potentially just N bits) and avoid ambiguities of representation (is negative -1? 0?) This is one of those areas where the 'simple' API just seems to push complexity elsewhere or ignore it. An algorithm either has to have its own checks for whether 1.0 is a category or not, or, overlooks the distinction. Same with the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you about strong types being nice. I think optimizations with using fewer bits should remain internal. For the user-facing API, it's really a question about whether we want to incur some extra overhead with strong types:
- Users & developers have to write LabeledPoint[Double, Vector] or LabeledPoint[Int, Vector] instead of LabeledPoint.
- As far as I know, we can't have default type parameters, so users could never write LabeledPoint. (a rare time I miss C++)
- Algorithm APIs get messier to look at (passing about LabelType and FeaturesType). This is especially annoying with meta-algorithms (boosting & bagging).
Personally, I'd be OK with heavy typing (coming from C++ land), but it might offend some Scala users.
CC: @mengxr since I know you have opinions about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it need not be designed with generic types. In fact it can't really since there are N features of different types. But you can have a Feature class with subclasses for ordered and categorical types. That too has its own tradeoffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I think I'd be OK with the 2nd option above as long as we can come up with simple APIs for users who don't want to think about types. That might involve default conversions between types (like one-hot encoding, and binning).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the former. Yeah, that's probably the downside. Each data element is at least an object, and you can't have it reduce to a double[]
under the hood.
In the second example, I think you'd only ever really want MixedFeatures
as an abstraction. There's no need to think of all CategoricalFeatures
as a special case deserving a unique abstraction.
I suppose if you abstract the entire training example as an object, and allow accessors like getNumericFeature(index: Int)
, getCategoricalFeature(index: Int)
you can still internally optimize the storage while exposing a richer object representation. You get the type safety and optimization opportunity.
Sure, an Array[Double]
could easily be translated into one of the more elaborate representations above. I suppose I myself wouldn't want to make it too easy to not think about types!
Anyway, up to your judgment really. There are arguments several ways here. Worth a thought to see if the idea of a bit more abstraction appeals to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main pros & cons I see for having continuous & categorical types for labels & features are:
Pros:
- Type safety.
Cons:
- Algorithms may need to make copies of the data, depending on how much we expose internals of a features type.
- Users may have to worry more about types.
- For labels, if we load data from a file without metadata (like libsvm), we may need to assume that everything is continuous. Users will have to explicitly cast labels to categorical for classification.
- For features, strong typing implies a stronger contract, where the assumption is that users specify the correct types. I've been wondering about having more "best effort" APIs, where we take suggestions from users (like DecisionTree's categoricalFeaturesInfo) but otherwise try to infer the best types to use under the hood.
These lists started out much more balanced, but I guess I'm voting for the old system where everything is a Double.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good points. I think there's a "pro" for optimizing storage but that's a bit secondary.
I don't think a caller has to 'translate' 0/1 labels to categorical. These can be labels called 0 and 1. Given schema information, all of this is stuff frameworks can do for you. Is there really a case where the user doesn't know schema types, suggests a type, and lets the framework override it?
So, let's say my feature takes on "foo", "bar", "baz" as values. Doesn't the caller always have to translate to/from numbers? no big deal but is that simpler? I think the schema abstraction is going to help with this, I imagine.
Anyway, not really strongly arguing with you here, just completing the discussion. An array of doubles is kind of raw but certainly does the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For optimizing storage, I think we can do it in either case (Vector for weakly typing, or Array[Double] for strong typing).
I too agree both options are good & just wanted to hash out the pros & cons.
Is there really a case where the user doesn't know schema types, suggests a type, and lets the framework override it?
That's not quite what I meant. I was thinking of 2 use cases:
- An expert user specifies types. The algorithm should use those types.
- A beginner user does not explicitly specify types, but uses the typed API so that loaded data are given default types (probably continuous). Following the first case, the algorithm should use the given types, but following intuition, the algorithm should infer types.
This is less of an issue if the typed interface is designated as an expert API.
I'm starting to wonder if the typed interface should be completely public. I'll move this to a non-inline comment so the rest of our discussion does not get hidden in the future.
Question: Do people have preferences for the name of what is currently "predictRaw?" Possibilities are:
(I did not include predictScore since "score" is already taken by logreg to mean the probability estimate.) |
@srowen @Lewuathe Continuing the above inline discussion... Question: Should the typed interface be public? New proposal: Hide the typed interface of Estimators. Leave the typed interface of Transformers exposed. Argument:
What do you think? |
Pardon, which part are you referring to when you say the typed interface? The metadata about what is a categorical feature, what the values are, etc? I assumed that this information is in part derived from what's in the |
Oh, apologies for being unclear. I meant this division:
I agree metadata is less of an issue for Models since they can get metadata from their parent Estimator. The main issue is specifying metadata for Estimators. |
@jkbradley Apologies for the delay - I just read your design doc and am catching up on this discussion.
FWIW I agree with the conclusion of keeping LabeledPoint simple as (Double, Array[Double]). And I think |
My initial question was based on the viewpoint of developer api. |
So, I may not be 100% up to speed with the new API and these changes, so my comments may be a bit off, but: An Estimator makes a Model. To make a model, you need "raw data" and its interpretation, if you will. a LabeledPoint is "raw data". That alone is not sufficient to train a Classifier (Estimator). Yes, this extra info has to come from somewhere. I agree that SchemaRDD contains, or could contain, or could be made to deduce, this extra interpretation, so the SchemaRDD API makes sense to me. If LabeledPoint is to remain the "raw data", given the conversation here, then it has to be parameters or something. I think you still need these for testing, right? you still need to know what the raw data means. Or is it assumed that the built Classifier / Model stores this info? This is sort of a rehash of the same exchange we just had, in that the question is caused by the input data abstraction not really containing all the input -- the metadata comes along separately. Which could be OK but yes it means this question pops up somewhere else in the API. Yes, a Model may be able to remember the metadata and accept raw LabeledPoints in the future. You just have to make sure you are feeding raw LabeledPoints that use the same metadata, but that's a given no matter how you design this. To answer the question: given the question, I'd hide the typed API, I suppose. I think the typed API has to take some other values to contain metadata like the type of features, etc. These could be more parameters, then? it kind of overloads the meaning, since the parameters look like they are intended to be hyper parameters. But it's not crazy. Transformations: these feel like these could meaningfully operate on raw data, so, typed API makes sense to me and could be public now. |
Thanks everyone for all of the comments! @shivaram No problem, thanks for checking out the design doc! The 2 main use cases you listed are correct. There is a remaining question about whether the strongly typed API should remain public; my initial plan was to make it public, but the above discussion is making me wonder if part should be private. @Lewauthe I agree about discussing whether the typed API should be public or private. (But I'm not quite sure what you meant about the optimization API; please clarify.) I hope the updates to the design doc below help. @srowen It sounds like we're converging to a solution, and I hope some use cases in the design doc will help. I've updated the design doc for [https://issues.apache.org/jira/browse/SPARK-3702], which is the parent task of [https://issues.apache.org/jira/browse/SPARK-4789]. (I'm asking someone to add the subtask connection.) The design doc is linked from the main JIRA and also here: Perhaps we can discuss more on the JIRA rather than here. |
…ctors Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now. Needed for [#3637] CC: marmbrus Author: Joseph K. Bradley <joseph@databricks.com> Closes #3646 from jkbradley/sql-reflection and squashes the following commits: 796b2e4 [Joseph K. Bradley] Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.
This PR is paused until next week, pending some discussion. |
83109eb
to
1be9892
Compare
I just pushed an update based on major design refactoring, but I still need to add 1 item (DeveloperApiExample for Java) and update the PR description. Will do soon... |
Test build #24905 has started for PR 3637 at commit
|
Test build #24905 has finished for PR 3637 at commit
|
Test FAILed. |
Test build #24912 has started for PR 3637 at commit
|
Test build #24913 has started for PR 3637 at commit
|
1.0 / (1.0 + math.exp(-margin)) | ||
|
||
// Output selected columns only. | ||
// This is a bit complicated since it tries to avoid repeated computation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented transform() here to avoid repeated computation. This improves upon the default implementation in ProbabilisticClassificationModel. However, it’s a lot of code, so I would be fine with removing it. There is also a question of whether all algorithms should implement a method which would allow the ProbabilisticClassificationModel.transform implementation to avoid repeated computation:
- protected def raw2prob(rawPredictions: Vector): Vector = // compute probabilities from raw predictions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is duplicated and what duplicate computation is being avoided? Can some refactoring of both be done to make them more modular?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data flow goes:
features --> rawPrediction --> probabilities --> prediction
If we want all of these values, then it is fastest to compute them in this sequence. If we only want some of these values, then we do not want to pollute the Schema namespace (names of columns) by computing all 4 values. Therefore, we need UDFs to handle all possible downstream links here.
We could abstract the key links: features2raw, raw2prob, prob2pred, raw2pred, and then compose those as needed. Does that sound best?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this, I think abstracting the key links might be best. It will certainly make LogisticRegression much shorter since prediction takes up most of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be a part of this PR ? I'm also okay with adding a TODO
here to add the functions later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a TODO for now...but hope I have time to make it part of this PR.
Test build #26875 has started for PR 3637 at commit
|
Thanks @jkbradley . Could you summarize what (if any) public APIs we are releasing as a part of this change ? |
@shivaram I just updated the description at the top with the list of public changes. |
I believe that last commit covers everything. My only question now is whether I should remove all of the MimaExcludes lines for spark.ml since I added a blanket exception (since spark.ml is an alpha component). Or are they nice for the record? |
Test build #26875 has finished for PR 3637 at commit
|
Test build #26885 has started for PR 3637 at commit
|
Test PASSed. |
Sounds good. Thanks @jkbradley |
Test build #26885 has finished for PR 3637 at commit
|
Test PASSed. |
…ion APIs This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] **UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public: * new output columns: rawPrediction, probabilities * The “score” column is now called “rawPrediction” * Classifiers now provide numClasses * Params.get and .set are now protected instead of private[ml]. * ParamMap now has a size method. * new classes: LinearRegression, LinearRegressionModel * LogisticRegression now has an intercept. ### Sketch of APIs (most of which are private[spark] for now) Abstract classes for learning algorithms (+ corresponding Model abstractions): * Classifier (+ ClassificationModel) * ProbabilisticClassifier (+ ProbabilisticClassificationModel) * Regressor (+ RegressionModel) * Predictor (+ PredictionModel) * *For all of these*: * There is no strongly typed training-time API. * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms. Concrete classes: learning algorithms * LinearRegression * LogisticRegression (updated to use new abstract classes) * Also, removed "score" in favor of "probability" output column. Changed BinaryClassificationEvaluator to match. (SPARK-5031) Other updates: * params.scala: Changed Params.set/get to be protected instead of private[ml] * This was needed for the example of defining a class from outside of the MLlib namespace. * VectorUDT: Will later change from private[spark] to public. * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors. * Also, added equals() method.f * SPARK-4942 : ML Transformers should allow output cols to be turned on,off * Update validateAndTransformSchema * Update transform * (Updated examples, test suites according to other changes) New examples: * DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace) * Added Java version too Test Suites: * LinearRegressionSuite * LogisticRegressionSuite * + Java versions of above suites CC: mengxr etrain shivaram Author: Joseph K. Bradley <joseph@databricks.com> Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits: 405bfb8 [Joseph K. Bradley] Last edits based on code review. Small cleanups fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types. 8316d5e [Joseph K. Bradley] fixes after rebasing on master fc62406 [Joseph K. Bradley] fixed test suites after last commit bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame) 9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it 216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged f549e34 [Joseph K. Bradley] Updates based on code review. Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT. * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value. 343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package 82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR). Fixed Java suites 0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites c3c8da5 [Joseph K. Bradley] small cleanup 934f97b [Joseph K. Bradley] Fixed bugs from previous commit. 1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel. Also introduced ProbabilisticClassifier. * This was to support output column “probabilityCol” in transform(). 4e2f711 [Joseph K. Bradley] rat fix bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite 8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others 1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0 adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites 58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap. 57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite e433872 [Joseph K. Bradley] Updated docs. Added LabeledPointSuite to spark.ml 54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly 0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString). Fixed bug in persisting logreg data. Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup). 601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString. Cleaned up classes in class hierarchy, before implementing tests and examples. d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch 52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet bfade12 [Joseph K. Bradley] Added lots of classes for new ML API: (cherry picked from commit dc0c449) Signed-off-by: Xiangrui Meng <meng@databricks.com>
LGTM. Merged into master and branch-1.3. Thanks everyone for the discussion! @jkbradley We can remove mima excludes in another PR. |
After doing a clean pull of master from the repo a couple minutes ago, it fails to build due to commit dc0c449 for this PR .
|
@pgirolami What version of Java are you using? Those work for me. However, I should have used assertEquals(), so I'll submit a patch for that which should fix the compilation problems you're encountering. |
@jkbradley I am seeing the same compilation errors. I'm compiling with Java 8 FWIW. |
@jkbradley Maven says it's using JDK 6 Philippes-MacBook-Air-3:~ Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.6.0_65, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac"
Philippes-MacBook-Air-3:~ Philippe$ uname -a
Darwin Philippes-MacBook-Air-3.local 13.4.0 Darwin Kernel Version 13.4.0: Sun Aug 17 19:50:11 PDT 2014; root:xnu-2422.115.4~1/RELEASE_X86_64 x86_64 |
Same problem using JDK 7 on my Mac Philippes-MacBook-Air-3:spark Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.7.0_75, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_75.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac" |
BTW this has been fixed already in master. But it looks like it is a small difference in Java 6 vs 7: |
For me it works with neither Java 6 nor Java 7. Just pulled master and ran the same maven command (no clean) and indeed, it now compiles mllib (using Java 7). Thanks ! |
One more issue. In LogisticRegressionWithLBFGS class there's a line: this.setFeatureScaling(true) I have feature scaling as a part of pipeline to produce new columns based on scaled columns. But i can't tell to the LogisticRegression class from the new API to set feature scaling to false. Also there's no actual way to set validateData = false in GeneralizedLinearAlgorithm. |
@petro-rudenko (Apologies for the slow response; I've been without Internet for a week.) About feature scaling, I don't think it's a problem in terms of correctness for LogisticRegression since scaling twice should produce the same result as scaling once (since it handles scaling at prediction time too). It is wasted computation though. For GLMs, you should be able to call setValidateData: [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L156] |
@jkbradley i can setValidateData in GLM, but not in the LogisticRegression class from the new API. For my case found a trick to customize anything i want (add org.apache.spark.ml package to my project and extends any class). When this API would be public it would be easier to customize (e.g. use LogisticRegressionWithSGD except for LRWithLBFGS) in user's namespace. |
Oh, I see. It's true there are missing features in spark.ml currently. Please feel free to make a JIRA to prioritize it. |
This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
UPDATE: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public:
Sketch of APIs (most of which are private[spark] for now)
Abstract classes for learning algorithms (+ corresponding Model abstractions):
Concrete classes: learning algorithms
Other updates:
New examples:
Test Suites:
CC: @mengxr @etrain @shivaram