Adding the squared L2 norm operator for L2 regularization #5030

Merged
abhinavarora merged 5 commits into PaddlePaddle:develop on Oct 26, 2017

Conversation

abhinavarora
Copy link
Contributor

No description provided.

@abhinavarora abhinavarora self-assigned this Oct 24, 2017
@abhinavarora abhinavarora changed the title from "Adding the L2 loss operator for L2 regularization" to "Adding the squared L2 norm operator for L2 regularization" on Oct 24, 2017
PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) should be not null.");

ctx->SetOutputDim("Out", {1});
ctx->ShareLoD("X", /*->*/ "Out");
Contributor

Since the output is a scalar, there is no need to pass the LoD from the input to the output, so this line can be removed.

Contributor Author

Fixed in 1ff3d8d
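For reference, a minimal sketch of the shape inference with the ShareLoD call removed (illustrative only, not the literal contents of 1ff3d8d; the HasInput check is assumed to mirror the HasOutput check above):

```cpp
// Shape inference for squared_l2_norm: the output is a single scalar,
// so only its dimension is set and no LoD is propagated from X to Out.
PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) should be not null.");
ctx->SetOutputDim("Out", {1});
```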

framework::OpAttrChecker* op_checker)
: framework::OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor) The input of squared_l2_norm op.");
AddOutput("Out", "(Tensor) The output of squared_l2_norm op.");
Contributor

@qingqing01 qingqing01 Oct 24, 2017

"(Tensor) The output of squared_l2_norm op is a scalar."

Contributor Author

Thank you for the feedback. Fixed in 1ff3d8d

auto out = framework::EigenVector<T>::Flatten(*Out);  // view the scalar output as a 1-D Eigen vector
auto place = context.GetEigenDevice<Place>();

// Squared L2 norm: the sum of squared elements of the flattened input x.
out.device(place) = x.square().sum();
Contributor

@qingqing01 qingqing01 Oct 24, 2017

  1. SquaredL2Norm is a good and clear name: the square of the L2 norm.

  2. But for L2 regularization, the overall cost usually does not contain the regularization term (also called the weight decay term) in most frameworks, including the old PaddlePaddle framework, so there is no need to use this op in the forward pass, even though the cost function itself contains this term.

    See the formula in this link:
    http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm

    In backpropagation, the derivative of the overall cost function J(W, b) (see the formula at the same link) turns the weight decay term into a linear operation on W: the gradient of the (λ/2)·||W||² term with respect to W is simply λ·W (discussed with @lcy-seso several days ago). So only a scale op is needed for L2 regularization in the parameter updating process. The momentum update formula in the paper likewise shows that weight decay is a linear operation on W.

So I'm not sure whether this op is needed and whether there are other use scenarios.
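For concreteness, a plain C++ sketch (not PaddlePaddle API; the function name is hypothetical) of how the weight decay reduces to a scale of W inside the parameter update:

```cpp
#include <cstddef>
#include <vector>

// SGD step with L2 weight decay folded in: the gradient of (lambda/2)*||w||^2 is
// lambda*w, so the regularization only contributes a scaled copy of w to the update
// and no forward-pass op is required.
void SgdWithWeightDecay(std::vector<float>* w, const std::vector<float>& grad,
                        float lr, float lambda) {
  for (std::size_t i = 0; i < w->size(); ++i) {
    (*w)[i] -= lr * (grad[i] + lambda * (*w)[i]);
  }
}
```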

Collaborator

@reyoung reyoung Oct 24, 2017

  1. The regularization loss is a term in the overall loss equation. If we want to plot the overall loss at every iteration for a classification task, the loss should equal classification loss + regularization loss.

  2. Weight decay only fits L2-norm regularization. We are trying to implement a common approach to regularization that fits both L1 and L2.

  3. We will provide weight decay in future PRs. There is no conflict in providing both weight decay and an L2 norm operator.

Contributor Author

@qingqing01 @lcy-seso Both these regularization techniques have their advantages and disadvantages. Let me summarize them as follows:

| Weight decay in optimizer | Separate operators for regularization |
| --- | --- |
| No forward-prop op, hence faster | There will be a forward-prop op |
| Will not support making plots of total loss vs epoch/iteration | Will support those plots |
| Does not generalize well beyond L2 and L1 regularization; use of batch norm and layer norm might invalidate this approach | A very general approach that can be applied to any kind of network |
| It is not easy for researchers to add new regularizers in this framework because the regularization is tightly coupled with optimizers; they might have to change all optimizers | Adding new regularization schemes is easy, as the code for regularization is independent of optimization |
| Frameworks that support this: PyTorch, Caffe | Frameworks that support this: TensorFlow, Theano, Lasagne |
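As a concrete illustration of the right-hand column, a plain C++ sketch (function names are hypothetical, not framework API) of computing an explicit penalty term with a squared-L2-norm op and adding it to the loss that gets plotted:

```cpp
#include <numeric>
#include <vector>

// Squared L2 norm of a flattened parameter tensor, i.e. what the new op computes.
float SquaredL2Norm(const std::vector<float>& w) {
  return std::inner_product(w.begin(), w.end(), w.begin(), 0.0f);
}

// Total loss per iteration (the quantity one would plot against epoch/iteration).
float TotalLoss(float data_loss, const std::vector<float>& w, float lambda) {
  return data_loss + lambda * SquaredL2Norm(w);
}
```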

Contributor Author

I think the forward-prop op might be necessary, because while training a model it is very common practice to check convergence by plotting the total loss function over time. Also, during inference an intelligent executor can easily prune the graph to remove the regularization nodes.

I also agree with @reyoung that we can implement weight decay separately. That can be done just for the case of the L2 and L1 penalty losses.

Please let me know what you think about this plan.

Contributor

@qingqing01 qingqing01 Oct 25, 2017

  • For the plotting
    In the old PaddlePaddle framework and in Caffe, I think we usually plot the loss without the regularization term vs time.

  • For the regularization
    I fully agree with separate operators for regularization rather than implementing it in the optimizer. But I think whether to add the regularization loss to the overall loss should be up to the users, not done by default, since it adds extra computation during training. And the default regularization would only use the L2LrRegularizer (a linear operator) / L1LrRegularizer operators, like the implementation in https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/parameter/Regularizer.h, for parameter updating.

  • About this PR
    I agree to merge this op. But whether to use it in the forward pass and add the regularization loss to the overall loss should depend on the users, not be the framework's default, when regularization is used in training.

Contributor Author

@qingqing01 @lcy-seso Thank you for going through the PR. I discussed this today with @reyoung and we feel that your point is valid. We can add these ops, but whether to use them in the forward pass and add them to the loss will be the user's choice. In another PR, I will add separate operators for regularization that will be used only in the backward pass and not implemented in the optimizer.
In the case of L2, this will be a simple scale op, and in the case of L1 regularization it will be a combination of scale and sign ops.
Thank you so much for your feedback on this.
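A plain C++ sketch (illustrative only, not the PaddlePaddle operators themselves) of those backward-only contributions: L2 adds lambda * w to the parameter gradient (the scale op sketched earlier), while L1 adds lambda * sign(w), hence the extra sign op:

```cpp
#include <cstddef>
#include <vector>

// L1 penalty gradient: d/dw (lambda * |w|) = lambda * sign(w).
void AddL1PenaltyGrad(const std::vector<float>& w, std::vector<float>* grad, float lambda) {
  for (std::size_t i = 0; i < w.size(); ++i) {
    float sign = (w[i] > 0.0f) ? 1.0f : (w[i] < 0.0f ? -1.0f : 0.0f);
    (*grad)[i] += lambda * sign;  // scale applied to sign(w)
  }
}

// L2 penalty gradient: d/dw (lambda/2 * ||w||^2) = lambda * w, a pure scale of w.
void AddL2PenaltyGrad(const std::vector<float>& w, std::vector<float>* grad, float lambda) {
  for (std::size_t i = 0; i < w.size(); ++i) (*grad)[i] += lambda * w[i];
}
```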

Collaborator

@reyoung reyoung left a comment

LGTM. @qingqing01, we can merge this PR anyway, right?

@qingqing01
Contributor

@abhinavarora @reyoung I agree to merge this PR.

@abhinavarora abhinavarora merged commit b0a267c into PaddlePaddle:develop Oct 26, 2017
@abhinavarora abhinavarora deleted the l2_loss branch October 26, 2017 02:03