[SPARK-5253] [ML] LinearRegression with L1/L2 (ElasticNet) using OWLQN #4259
Conversation
Test build #26284 has finished for PR 4259 at commit
Test build #26283 has finished for PR 4259 at commit
Test build #26291 has finished for PR 4259 at commit
Test build #26292 has finished for PR 4259 at commit
Test build #26296 has started for PR 4259 at commit
I will kick this off again once I restart Jenkins.
Jenkins, test this please.
Test build #26298 has finished for PR 4259 at commit
@@ -23,6 +23,27 @@ private[ml] trait HasRegParam extends Params {
  def getRegParam: Double = get(regParam)
}

private[ml] trait HasElasticNetParam extends HasRegParam {
  /** param for elastic net regularization parameter */
  val alphaParam: DoubleParam = new DoubleParam(
I would move it to `LinearRegression` and rename it to `alpha`. This is not a shared param.
LOR will have it eventually as well. Do you mean we should use `with HasAlpha with HasRegParam` instead of `with HasElasticNetParam`?
I'd vote for not using HasRegParam at all and instead using elastic-net-specific terminology such as l1RegParam, l2RegParam, or something else that makes it clear which param applies to which part of the regularization.
@dbtsai Thanks for contributing elastic-net! It may be hard to merge this into 1.3 now. There are two pending 1.3 features that are relevant: the DataFrame API and the Prediction APIs, both of which will conflict with this PR. We may want to wait for them to be merged first and then update this PR, to avoid resolving conflicts repeatedly.
I just rebased and addressed the API changes. I was also thinking that when we standardize the features, we're rescaling the data, which is rescaling the covariant side of the equation. However, we can achieve the same effect by rescaling the gradientSum, which sits on the contravariant side. Thus, we don't need to apply the scaler to the data, which is much cheaper. This works for logistic regression as well. For the intercept in LOR, we can deal with it in the gradient function instead of using applyBias, so combining both techniques means we don't have to create a new dataset.
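To make the "rescale the gradient instead of the data" idea concrete, here is a minimal standalone sketch (plain Scala, no Spark types; all names like `featuresStd` and `gradOnRawData` are illustrative, not the ones in this PR). It shows that, for a least-squares loss with coefficients defined in the standardized feature space, accumulating the gradient on the raw rows and dividing by the per-feature standard deviations at the end gives the same result as first materializing standardized rows:

```scala
import scala.math.{pow, sqrt}

// Toy least-squares example with two features and three rows.
object GradientRescalingSketch extends App {
  val xs: Array[Array[Double]] = Array(Array(1.0, 200.0), Array(2.0, 100.0), Array(3.0, 300.0))
  val ys: Array[Double] = Array(3.0, 2.5, 4.0)
  val w: Array[Double] = Array(0.1, -0.2) // coefficients defined in the standardized space

  val n = xs.length.toDouble
  val mean = Array.tabulate(2)(j => xs.map(_(j)).sum / n)
  val featuresStd = Array.tabulate(2)(j => sqrt(xs.map(x => pow(x(j) - mean(j), 2)).sum / n))

  val gradOnScaledData = Array.fill(2)(0.0) // (a) rescale every row, then accumulate
  val gradOnRawData = Array.fill(2)(0.0)    // (b) accumulate on raw rows, rescale once at the end

  for ((x, y) <- xs.zip(ys)) {
    // The margin only needs a division per feature; no standardized copy of the row is stored.
    val err = (0 until 2).map(j => w(j) * x(j) / featuresStd(j)).sum - y
    for (j <- 0 until 2) {
      gradOnScaledData(j) += err * (x(j) / featuresStd(j)) // needs the standardized value
      gradOnRawData(j) += err * x(j)                       // only needs the raw value
    }
  }
  // Dividing the accumulated gradient by the per-feature std gives the same result as (a).
  val rescaledAtEnd = Array.tabulate(2)(j => gradOnRawData(j) / featuresStd(j))

  println(gradOnScaledData.mkString(", "))
  println(rescaledAtEnd.mkString(", ")) // identical up to floating point
}
```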
Test build #26812 has finished for PR 4259 at commit
@dbtsai I'd like to make a pass over this, but I realized that it has conflicts because of the Developer API PR committed last week: [https://github.com//pull/3637]. Could you please rebase? I don't think there are any more big PRs coming up that will make you rebase again. Thank you!
@dbtsai we are close to merging this PR, which brings OWLQN and PQN under the umbrella of proximal algorithms to support most of the interesting ML-related constraints: scalanlp/breeze#364. I also have another Breeze PR that merges OWLQN and PQN through a proximal quasi-Newton method, but that will take time to stabilize. I would like to add more tests for OWLQN, specifically comparisons with liblinear. Most likely, if I find some interesting test cases in your PR, I will move them to NonlinearMinimizerTest. I will open up an MLlib PR that uses NonlinearMinimizer and tests a family of consensus algorithms (Block Coordinate Descent and ADMM). Our main focus is ElasticNet as well, so it would be great if this PR is merged so that I can build on it.
This is linear regression... what happened to the logistic regression elastic net? We are more interested in that one...
@jkbradley I will rebase soon. @debasish83 I'll add MLOR with elastic-net when we stabilize the new ML API. Doing this in the old codebase would be a huge effort, and I would like to keep the implementations as self-contained as possible instead of sharing the same generalized linear algorithm base class. We already have enough if-else statements dealing with the differences in the base class.
Test build #29042 has finished for PR 4259 at commit
@jkbradley and @mengxr I just rebased it. I will do a couple of optimizations to avoid scaling the datasets, which can be done in the optimization process instead. You can start giving me feedback so we have ample time to address issues before 1.4. Thanks.
Test build #29043 has finished for PR 4259 at commit
Test build #29044 has finished for PR 4259 at commit
@@ -34,6 +34,43 @@ private[ml] trait HasRegParam extends Params {
  def getRegParam: Double = get(regParam)
}

private[ml] trait HasElasticNetParam extends HasRegParam {
Do we want to call this "alphaParam"? That name assumes people have read elastic net papers. What about something like:
- "elasticNetParam" (very explicit) (I assume this was your original name.)
- "regMixing" (also explicit)
- "regAlpha" (at least it will be grouped next to "regParam")
I voted for `regAlphaParam: Double` for the variable name, and keeping `HasElasticNetParam` as the trait name. What do you think?
I think the names should definitely match, so we should pick one. I'm OK with "elasticNetParam" or "regAlpha." (I think "regAlpha" doesn't need "Param" attached to it since "alpha" is the name of the parameter.) Since "elastic net" is more easily recognized than "alpha," I vote for "elasticNetParam."
Sounds great! I'm convinced. I'll do something like this:

private[ml] trait HasElasticNetParam extends HasRegParam {
  /**
   * param for elastic net regularization parameter
   * @group param
   */
  val elasticNetParam: DoubleParam =
    new DoubleParam(this, "elasticNetParam", "the ElasticNet mixing parameter")

  /** @group getParam */
  def getElasticNetParam: Double = get(elasticNetParam)
}
Nice, I like that it extends HasRegParam. When you update it, can you please make the doc specify more about the parameter (range, plus which end of the range corresponds to L1 vs. L2)?
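For that doc, the usual glmnet-style convention is that the mixing parameter α lies in [0, 1], with α = 1 giving pure L1 (lasso) and α = 0 giving pure L2 (ridge). A small sketch of that penalty is below, using the parameter names from this thread (regParam = λ, elasticNetParam = α); this is the standard convention rather than a quote of this PR's code, so it should be checked against what the optimizer actually implements:

```scala
// Sketch of the glmnet-style elastic-net penalty; names follow the discussion above,
// not necessarily the final code: regParam = lambda, elasticNetParam = alpha in [0, 1].
def elasticNetPenalty(weights: Array[Double], regParam: Double, elasticNetParam: Double): Double = {
  require(elasticNetParam >= 0.0 && elasticNetParam <= 1.0, "elasticNetParam must be in [0, 1]")
  val l1 = weights.map(math.abs).sum      // sum_j |w_j|
  val l2 = weights.map(wj => wj * wj).sum // sum_j w_j^2
  regParam * (elasticNetParam * l1 + (1.0 - elasticNetParam) / 2.0 * l2)
}
```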
Definitely. It seems that there is no easy way to specify a valid parameter range or requirement in this framework. Do you think it's a good idea to add a check like the following?

def getElasticNetParam: Double = {
  val value = get(elasticNetParam)
  require(value >= 0.0 && value <= 1.0, "elasticNetParam must be in [0, 1]")
  value
}
@dbtsai Thanks for the update! I'll make a more detailed pass soon. One bigger question: what is the plan for extensions? Based on this PR, it sounds like we expect LinearRegression to support only no regularization, L1, L2, or elastic net. I assume that, in the future, users who want other regularization can write their own extensions of GeneralizedLinearAlgorithm (once that is in spark.ml). But would we ever want to add more built-in regularizers? Or are we deciding now that we never will?
@jkbradley I think we should only support basic regularization in spark.ml first, which is what Python's scikit-learn does. If users need a different type of regularization, they can implement it based on the code we have. It would be hard to implement GeneralizedLinearAlgorithm with regularization without a lot of if-else statements to handle the special cases. I implemented logistic regression, linear regression, and Cox proportional-hazards regression with elastic-net regularization at Alpine, and our customers are asking for precise accuracy compared with R's glmnet package. As a result, I spent some time researching the original R glmnet code, and I found that there is no generic way to handle the different linear models; there are special cases here and there. For example, in logistic regression the intercept is computed by adding an extra dimension to the data with constant one, but in linear regression the intercept can be computed in closed form. As a result, I would like to implement them separately and make sure we have the same accuracy as R with proper tests first, and then we can abstract out the common parts. I have another PR trying to do this, #1518, and I will continue on that after this PR is merged.
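As a concrete illustration of the linear-regression special case mentioned above (a hypothetical sketch with made-up data and coefficient values, not code from this PR): once the slope coefficients are known, the intercept falls out in closed form as mean(y) − w · mean(x), whereas logistic regression has to learn the intercept as an extra coefficient on a constant-one column.

```scala
// Hypothetical sketch: recover the linear-regression intercept in closed form from the
// label mean and feature means, instead of learning it as an extra coefficient.
object ClosedFormInterceptSketch extends App {
  val xs = Array(Array(1.0, 4.0), Array(2.0, 5.0), Array(3.0, 7.0))
  val ys = Array(10.0, 12.0, 15.0)
  val w = Array(1.5, 0.8) // illustrative fitted coefficients

  val n = xs.length.toDouble
  val xMean = Array.tabulate(2)(j => xs.map(_(j)).sum / n)
  val yMean = ys.sum / n

  // intercept = mean(y) - w . mean(x): the fitted hyperplane passes through the means.
  val intercept = yMean - (0 until 2).map(j => w(j) * xMean(j)).sum
  println(s"intercept = $intercept")
}
```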
@dbtsai That sounds reasonable. I think we could do a general implementation based on abstractions for regularization and loss functions, but it might not be as efficient as specializations (such as closed-form computation of the intercept for linear regression). I'm OK with this for now, and I definitely support comparisons with R for correctness. Let me know when it's all ready so I can make a detailed pass!
Test build #29327 has finished for PR 4259 at commit
Test build #29990 has finished for PR 4259 at commit
Test build #30508 has finished for PR 4259 at commit
Test build #30510 has finished for PR 4259 at commit
 * Trait for shared param elasticNetParam.
 */
@DeveloperApi
trait HasElasticNetParam extends Params {
This shouldn't be a shared param.
Do you suggest moving it to LinearRegression.scala? We will use it in LOR as well.
okay
Test build #31090 has finished for PR 4259 at commit
}

axpy(-diffSum, correction, result)
scal(1.0 / totalCnt, result)
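For readers less familiar with the BLAS-style helpers in this diff: axpy(a, x, y) computes y += a * x and scal(a, x) rescales x in place, so these two lines subtract diffSum times the correction vector from the accumulated gradient and then divide the result by the record count. (This is a reading of the snippet above, not an addition to it.)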
Okay, I finally found why the `correction` effect is zero. It's because `diffSum` is zero in our test dataset. `diffSum` is the sum of `diff`, and for a synthetic dataset generated from a linear equation with noise, the average of `diff` will be zero. As a result, for a real non-linear dataset `diffSum` will not be zero, so we need some non-linear dataset to test correctness. I'll add the famous prostate cancer dataset used in the linear regression lasso paper to the unit test.
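To make that observation concrete, here is a tiny standalone sketch (the names are illustrative, not taken from this PR): for data generated as y = w · x + b plus zero-mean noise, the residuals of the generating model average out to roughly zero, so a correction term scaled by their sum contributes almost nothing and a bug in that term is invisible to the test.

```scala
import scala.util.Random

// Illustrative sketch: on a synthetic linear dataset with zero-mean noise, the sum of the
// residuals (diff) of the generating model is close to zero, so a correction term scaled
// by that sum is effectively invisible in tests.
object DiffSumSketch extends App {
  val rng = new Random(42)
  val trueW = Array(2.0, -1.0)
  val trueB = 0.5

  val diffs = (1 to 100000).map { _ =>
    val x = Array(rng.nextGaussian(), rng.nextGaussian())
    val y = trueW(0) * x(0) + trueW(1) * x(1) + trueB + 0.1 * rng.nextGaussian()
    val prediction = trueW(0) * x(0) + trueW(1) * x(1) + trueB
    prediction - y // diff is just the (negated) noise, which has zero mean
  }

  println(f"average diff = ${diffs.sum / diffs.size}%.6f") // ~ 0
}
```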
…ther synthetic dataset which can catch the bug fixed in this commit.
Test build #31130 has finished for PR 4259 at commit
Test build #31133 has finished for PR 4259 at commit
LGTM. Merged into master. Thanks!!
Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbtsai@alpinenow.com>

Closes apache#4259 from dbtsai/lir and squashes the following commits:

a81c201 [DB Tsai] add import org.apache.spark.util.Utils back
9fc48ed [DB Tsai] rebase
2178b63 [DB Tsai] add comments
9988ca8 [DB Tsai] addressed feedback and fixed a bug. TODO: documentation and build another synthetic dataset which can catch the bug fixed in this commit.
fcbaefe [DB Tsai] Refactoring
4eb078d [DB Tsai] first commit
No description provided.