[MLLib] SPARK-6348: Enable useFeatureScaling in SVMWithSGD #5055
Conversation
ok to test
It looks reasonable, but this then forces feature scaling, which sort of changes behavior.
Test build #28716 has finished for PR 5055 at commit
The test failed. Is there something wrong with the test case? Setting feature scaling to true changes the prediction result. @srowen @AmplabJenkins File "pyspark/mllib/classification.py", line 232, in __main__.SVMModel
No, it's almost certainly a result of your change. The prediction of the SVM changes if you force it to scale features. This is why I'm not sure this is the right thing to do; it swaps one hard-coded behavior for the other, and changes behavior. I'd prefer to simply make this selectable for all models.
Test build #28757 has finished for PR 5055 at commit
Provide an interface in object SVMWithSGD to set useFeatureScaling
Agreed. I committed a version making the SVMWithSGD class public, but it fails the Spark unit tests and I don't know why. Maybe providing an interface in object SVMWithSGD would also be OK? @srowen
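For context, a minimal sketch of the kind of change being debated (editor's illustration; the names mirror MLlib's `GeneralizedLinearAlgorithm` of that era, but the body here is illustrative, not the actual source):

```scala
// Illustrative sketch only -- not the actual MLlib source. A flag and setter
// like these existed on GeneralizedLinearAlgorithm; the debate is about
// whether the setter should be public (it was private[mllib] at the time).
abstract class GeneralizedLinearAlgorithm[M] {
  /** Whether to standardize features before training (off by default). */
  protected var useFeatureScaling: Boolean = false

  /** Proposed: make this public instead of private[mllib]. */
  def setFeatureScaling(useFeatureScaling: Boolean): this.type = {
    this.useFeatureScaling = useFeatureScaling
    this
  }
}
```

A caller could then opt in with something like `new SVMWithSGD(...).setFeatureScaling(true)` before calling `run()`, rather than going through a new `train()` overload on the companion object.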
Test build #28780 has finished for PR 5055 at commit
Test build #28782 has finished for PR 5055 at commit
Hm, maybe. I am sort of reluctant to add yet another utility method overload to expose this when we could just add it as a setter. @mengxr did you say you also favored not adding more of these methods in the companion object?
Test build #28850 has finished for PR 5055 at commit
Make setFeatureScaling public
Test build #28854 has finished for PR 5055 at commit
Yes, I have made this constructor and setter public. @srowen
OK. I want to see if @mengxr is OK with this, as it adds a new bit of API, technically. I think we'd want to document the params and perhaps mark this as ::Experimental::.
Document the params of the SVMWithSGD constructor and mark it as ::Experimental::
Test build #28902 has finished for PR 5055 at commit
I have documented the params of the SVMWithSGD constructor and marked it as ::Experimental::
Sorry for my late response! I lean toward not exposing this method, to keep the API lightweight. If we feel feature scaling is good, we can enable it by default. For example, LIBLINEAR uses feature scaling but doesn't expose it to users.
Apologies for my late response too. I feel like what we really need to do is clarify the intended behavior of feature scaling. Currently, feature scaling changes the optimal solution vector for regularized models, since it changes each feature's relative amount of regularization. I see 2 options:
1. Expose feature scaling as a user-settable option (what this PR currently does).
2. Always scale features internally to improve optimization, but adjust the regularization so that the optimal solution is unchanged.
I strongly vote for the 2nd option: It has the intended benefit of improving optimization behavior, and it is better for users since it gives them the solution they expect.
@tanyinyan are you in a position to implement @jkbradley's suggestion? It's more involved for sure, but it would indeed not change the solution, which sounds nice. I apologize for taking this down the wrong path of "option 1", exposing it as an option.
Hi @jkbradley, @srowen, I've been considering "option 2" these days and looked up what libsvm and liblinear do. I find that in both libsvm and liblinear, scaling changes the performance. From the libsvm guide (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf): "We recommend linearly scaling each attribute to the range [-1, +1] or [0, 1]." The same goes for liblinear (http://cran.r-project.org/web/packages/LiblineaR/LiblineaR.pdf): "Classification models usually perform better if each dimension of the data is first centered and scaled." The goal of scaling is to improve accuracy and the convergence rate; if the optimal solution vector were identical before and after scaling, then scaling would be pointless. So I suggest hard-coding setFeatureScaling(true) (not exposing it as an option), as is done in LogisticRegressionWithLBFGS.
@tanyinyan I think what you're arguing for is actually option (1). I propose this combination of the solutions:
- Expose setFeatureScaling() as an option, defaulting to true.
- If featureScaling is true, then we scale features and do not adjust regularization. This changes the optimal solution, but as in your references, it is generally better to do anyway. (My experience is the same.)
- If featureScaling is false, then we scale features internally but also adjust regularization. This improves optimization behavior but does not change the optimal solution.
Defaulting to true means the algorithm will probably do the best thing by default, while still letting informed users get exactly what they want. This proposal also avoids an API change, since the meaning of featureScaling stays the same.
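A hedged sketch of this combined proposal (editor's illustration; `trainSketch` and `optimize` are placeholders standing in for the SGD solver, and the data layout is simplified to plain Scala collections rather than RDDs):

```scala
// Sketch of the combined proposal above. Everything here is illustrative:
// `optimize` stands in for the SGD solver and is passed in as a function;
// `sigma` holds per-feature standard deviations; features in `scaledData`
// are assumed already divided by sigma.
def trainSketch(
    scaledData: Seq[(Double, Array[Double])],
    sigma: Array[Double],
    optimize: (Seq[(Double, Array[Double])], Array[Double] => Double) => Array[Double],
    featureScaling: Boolean): Array[Double] = {
  // Plain L2 on the scaled weights: changes the optimum (each feature's
  // relative regularization changes) but usually helps, per the references.
  val l2 = (w: Array[Double]) => w.map(wi => wi * wi).sum
  // Penalize (w'_j / sigma_j)^2 instead: identical to L2 on the unscaled
  // weights, so the optimum matches training without scaling.
  val adjustedL2 = (w: Array[Double]) =>
    w.zip(sigma).map { case (wi, s) =>
      val u = if (s > 0) wi / s else wi
      u * u
    }.sum
  val wScaled =
    if (featureScaling) optimize(scaledData, l2)
    else optimize(scaledData, adjustedL2)
  // Map weights back to the original feature space: w_j = w'_j / sigma_j.
  wScaled.zip(sigma).map { case (wi, s) => if (s > 0) wi / s else wi }
}
```

Either way the returned weights live in the original feature space; the two modes differ only in which penalty the optimizer sees.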
" If featureScaling is false, then we scale features internally but also adjust regularization. This will improve optimization behavior but will not change the optimal solution." @jkbradley , I'm not understanding the meaning of "optimization behavior" here, does it means convergence rate? If we scale features internally and also adjust regularization, then we will get the same gradient as not scale features for every labeled point, so I think if the optimal solution is not changed , so does the optimization behavior. Have I understood correctly? |
@tanyinyan "Optimization behavior" means convergence rate, yes. If we scale feature internally and also adjust regularization, then:
Does that make sense? We could actually adjust the step size for each feature, rather than scaling the data. Come to think of it, that might be a more efficient solution since it's cheaper than creating a new copy of the data. I'll make a JIRA for that since it belongs in another PR. |
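For concreteness, a minimal sketch of that per-feature step-size idea (editor's illustration, not the JIRA implementation; `sgdStep` and its arguments are hypothetical):

```scala
// Sketch of the per-feature step-size idea (illustrative, not the JIRA
// implementation). SGD on features scaled by 1/sigma_j is equivalent to
// SGD on the raw data with the step for coordinate j divided by sigma_j^2,
// so no scaled copy of the dataset is needed.
def sgdStep(
    w: Array[Double],
    gradient: Array[Double],   // gradient w.r.t. w on the unscaled data
    sigma: Array[Double],      // per-feature stdevs, computed once up front
    stepSize: Double): Array[Double] =
  w.indices.map { j =>
    val s = if (sigma(j) > 0) sigma(j) else 1.0
    w(j) - stepSize / (s * s) * gradient(j)  // per-coordinate effective step
  }.toArray
```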
After speaking with @mengxr we might want to make some hard decisions about changing behaviors and hiding feature scaling. I noted them here: https://issues.apache.org/jira/browse/SPARK-6683
Can one of the admins verify this patch? |
I think this is WontFix in favor of SPARK-6683? |
Yes. @tanyinyan Do you mind closing this for now? Thanks everyone for the discussion! |
OK. @mengxr, @jkbradley, how is SPARK-6683 going? I'd be very glad to join and contribute to it.
SPARK-6683 will be more like doing the feature scaling within the objective function. LinearRegression with L1/L2 (ElasticNet) using OWLQN (#4259) does the scaling internally, so there is no need to create a new dataset.
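A rough sketch of what "scaling within the objective function" looks like (editor's illustration; `marginScaled` is a hypothetical helper, not the #4259 code):

```scala
// Sketch of scaling inside the objective (illustrative; the real change
// lives in the ElasticNet/OWLQN work of PR #4259). Rather than
// materializing a scaled dataset, fold 1/sigma_j into the margin
// computation so every pass sees standardized features at no extra
// storage cost.
def marginScaled(w: Array[Double], x: Array[Double], sigma: Array[Double]): Double = {
  var m = 0.0
  var j = 0
  while (j < x.length) {
    val s = if (sigma(j) > 0) sigma(j) else 1.0
    m += w(j) * (x(j) / s)  // scale on the fly, never copy the data
    j += 1
  }
  m
}
```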
Set useFeatureScaling to true in SVMWithSGD; the problem is described in JIRA: https://issues.apache.org/jira/browse/SPARK-6348