Adding the squared L2 norm operator for L2 regularization #5030
Conversation
PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) should be not null.");
ctx->SetOutputDim("Out", {1});
ctx->ShareLoD("X", /*->*/ "Out");
Since the output is a scalar, there is no need to pass the LoD from input to output, so this line can be removed.
Fixed in 1ff3d8d
framework::OpAttrChecker* op_checker)
    : framework::OpProtoAndCheckerMaker(proto, op_checker) {
  AddInput("X", "(Tensor) The input of squared_l2_norm op.");
  AddOutput("Out", "(Tensor) The output of squared_l2_norm op.");
"(Tensor) The output of squared_l2_norm op is a scalar."
Thank you for the feedback. Fixed in 1ff3d8d
auto out = framework::EigenVector<T>::Flatten(*Out);
auto place = context.GetEigenDevice<Place>();
out.device(place) = x.square().sum();
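(The kernel quoted above computes the scalar $\text{Out} = \sum_i x_i^2 = \lVert x \rVert_2^2$, i.e. the squared L2 norm of the flattened input.)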
- `SquaredL2Norm` is a good and clear name: the square of the L2 norm.
- But for L2 regularization, the overall cost usually does not contain the regularization term (also called a weight decay term) in most frameworks, including the old PaddlePaddle framework, so there is no need to use this op in the forward pass, even though the cost function contains this term (see the formula at http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm). In the backward propagation, the derivative of the overall cost function J(W, b) (see the formula at the same link) shows that the weight decay term is a linear operation on W (discussed with @lcy-seso several days ago). So only a scale op is needed for L2 regularization in the parameter-updating process. The momentum update formula in the paper also shows that the weight decay is a linear operation on W.

So I'm not sure whether this op is needed and whether there are other use scenarios.
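(For reference, a sketch of the standard form given on the linked UFLDL page, with λ as the regularization coefficient; this is not a formula taken from the PR itself:

$$J(W, b) = \frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)}; W, b),\, y^{(i)}\big) + \frac{\lambda}{2}\lVert W \rVert_2^2$$

$$\frac{\partial J}{\partial W} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial L^{(i)}}{\partial W} + \lambda W$$

so the regularization term contributes just λW to the gradient, a linear scale operation on W, which is the point made above.)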
- Regularization loss is a term of the overall loss equation. If we want to plot an overall loss graph for every iteration of a classification task, the loss should equal `classification loss + regularization loss`.
- `weight decay` only fits L2-norm regularization. We are trying to implement a common way of doing regularization that fits both L1 and L2.
- We will provide `weight decay` in future PRs. There is no conflict in providing both a `weight decay` and an `L2 Norm` operator.
@qingqing01 @lcy-seso Both of these regularization approaches have their advantages and disadvantages. Let me summarize them as follows:

| Weight Decay in Optimizer | Separate Operators for Regularization |
|---|---|
| No forward prop op, hence faster | There will be a forward prop op |
| Does not support plotting total loss vs. epoch/iteration | Supports those plots |
| Does not generalize well beyond L2 and L1 regularization; use of batch norm and layer norm might invalidate this approach | A very general approach that can be applied to any kind of network |
| It is not easy for researchers to add new regularizers because regularization is tightly coupled with the optimizers; they might have to change all optimizers | Adding new regularization schemes is easy because the regularization code is independent of the optimizers |
| Frameworks that support this: PyTorch, Caffe | Frameworks that support this: TensorFlow, Theano, Lasagne |
I think that a forward prop op might be necessary, because while training a model it is very common practice to check convergence by plotting the total loss function vs. time. Also, during inference an intelligent executor can easily prune the graph to remove the regularization nodes.
I also agree with @reyoung that we can implement the weight decay separately. That can be done only for the case of the L2 and L1 penalty losses.
Please let me know what you think of this plan.
- For the plotting: In the old PaddlePaddle framework and in Caffe, I think we usually plot the loss without the regularization term vs. time.
- For the regularization: I fully agree with separate operators for regularization rather than implementing it in the optimizer. But I think whether to add the regularization loss to the overall loss should depend on the user, not be the default, since it adds more computation during training. The default regularization would only use the `L2LrRegularizer` (a linear operator) / `L1LrRegularizer` operators for the parameter update, like the implementation in https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/parameter/Regularizer.h.
- About this PR: I agree to merge this op. But if regularization is used in training, whether to use it in the forward pass and add the regularization loss to the overall loss should be up to the user, not the framework default.
@qingqing01 @lcy-seso Thank you for going through the PR. I discussed this today with @reyoung and we feel that your point is valid. We can add these ops, but whether to use them in the forward pass and add them to the loss will be the user's choice. In another PR, I will add separate operators for regularization, which will be used only in the backward pass and not implemented in the optimizer.
In the case of L2, this will be a simple scale op, and in the case of L1 regularization it will be a combination of a scale op and a sign op.
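A minimal sketch of that idea (hypothetical code for illustration only, not the operators that will actually be added in the follow-up PR):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch only (not Paddle's operator code): add the
// regularization term's gradient to an existing parameter gradient d_w.
//   L2: d_w += lambda * w        -> a single scale op on W
//   L1: d_w += lambda * sign(w)  -> a sign op followed by a scale op
void AddRegularizationGrad(const std::vector<float>& w,
                           std::vector<float>* d_w,
                           float lambda, bool use_l1) {
  for (std::size_t i = 0; i < w.size(); ++i) {
    if (use_l1) {
      const float s = (w[i] > 0.f) ? 1.f : ((w[i] < 0.f) ? -1.f : 0.f);
      (*d_w)[i] += lambda * s;
    } else {
      (*d_w)[i] += lambda * w[i];
    }
  }
}
```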
Thank you so much for your feedback on this.
LGTM. @qingqing01, in any case we can merge this PR, right?
@abhinavarora @reyoung I agree to merge this PR.