Adding the squared L2 norm operator for L2 regularization #5030

Merged
abhinavarora merged 5 commits into PaddlePaddle:develop on Oct 26, 2017

Conversation

abhinavarora
Copy link
Contributor

No description provided.

@abhinavarora abhinavarora self-assigned this Oct 24, 2017
@abhinavarora abhinavarora changed the title from "Adding the L2 loss operator for L2 regularization" to "Adding the squared L2 norm operator for L2 regularization" on Oct 24, 2017
PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) should be not null.");

ctx->SetOutputDim("Out", {1});
ctx->ShareLoD("X", /*->*/ "Out");
Contributor

Since the output is a scalar, there is no need to pass the LoD from the input to the output, so this line can be removed.

Contributor Author

Fixed in 1ff3d8d
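For reference, a minimal sketch of the shape inference with the ShareLoD call removed (illustrative only, not the literal contents of 1ff3d8d; the HasInput check is assumed to mirror the HasOutput check above):

```cpp
// Shape inference for squared_l2_norm: the output is a single scalar,
// so only its dimension is set and no LoD is propagated from X to Out.
PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) should be not null.");
ctx->SetOutputDim("Out", {1});
```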

framework::OpAttrChecker* op_checker)
: framework::OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor) The input of squared_l2_norm op.");
AddOutput("Out", "(Tensor) The output of squared_l2_norm op.");
Contributor

@qingqing01 qingqing01 Oct 24, 2017

"(Tensor) The output of squared_l2_norm op is a scalar."

Contributor Author

Thank you for the feedback. Fixed in 1ff3d8d

auto out = framework::EigenVector<T>::Flatten(*Out);  // view the scalar output as a 1-D Eigen vector
auto place = context.GetEigenDevice<Place>();

// Squared L2 norm: the sum of squared elements of the flattened input x.
out.device(place) = x.square().sum();
Contributor

@qingqing01 qingqing01 Oct 24, 2017

  1. SquaredL2Norm is a good and clear name: the square of the L2 norm.

  2. But for L2 regularization, the overall cost usually does not contain the regularization term (also called the weight decay term) in most frameworks, including the old PaddlePaddle framework, so there is no need to use this op in the forward pass, even though the cost function itself contains this term.

    See the formula in this link:
    http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm

    In backpropagation, the derivative of the overall cost function J(W, b) (see the formula at the same link) turns the weight decay term into a linear operation on W: the gradient of the (λ/2)·||W||² term with respect to W is simply λ·W (discussed with @lcy-seso several days ago). So only a scale op is needed for L2 regularization in the parameter updating process. The momentum update formula in the paper likewise shows that weight decay is a linear operation on W.

So I'm not sure whether this op is needed and whether there are other use scenarios.
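For concreteness, a plain C++ sketch (not PaddlePaddle API; the function name is hypothetical) of how the weight decay reduces to a scale of W inside the parameter update:

```cpp
#include <cstddef>
#include <vector>

// SGD step with L2 weight decay folded in: the gradient of (lambda/2)*||w||^2 is
// lambda*w, so the regularization only contributes a scaled copy of w to the update
// and no forward-pass op is required.
void SgdWithWeightDecay(std::vector<float>* w, const std::vector<float>& grad,
                        float lr, float lambda) {
  for (std::size_t i = 0; i < w->size(); ++i) {
    (*w)[i] -= lr * (grad[i] + lambda * (*w)[i]);
  }
}
```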

Collaborator

@reyoung reyoung Oct 24, 2017

  1. The regularization loss is a term in the overall loss equation. If we want to plot the overall loss at every iteration for a classification task, the loss should equal classification loss + regularization loss.

  2. Weight decay only fits L2-norm regularization. We are trying to implement a common approach to regularization that fits both L1 and L2.

  3. We will provide weight decay in future PRs. There is no conflict in providing both weight decay and an L2 norm operator.

Contributor Author

@qingqing01 @lcy-seso Both these regularization techniques have their advantages and disadvantages. Let me summarize them as follows:

| Weight decay in optimizer | Separate operators for regularization |
| --- | --- |
| No forward-prop op, hence faster | There will be a forward-prop op |
| Will not support making plots of total loss vs epoch/iteration | Will support those plots |
| Does not generalize well beyond L2 and L1 regularization; use of batch norm and layer norm might invalidate this approach | A very general approach that can be applied to any kind of network |
| It is not easy for researchers to add new regularizers in this framework because the regularization is tightly coupled with optimizers; they might have to change all optimizers | Adding new regularization schemes is easy, as the code for regularization is independent of optimization |
| Frameworks that support this: PyTorch, Caffe | Frameworks that support this: TensorFlow, Theano, Lasagne |
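As a concrete illustration of the right-hand column, a plain C++ sketch (function names are hypothetical, not framework API) of computing an explicit penalty term with a squared-L2-norm op and adding it to the loss that gets plotted:

```cpp
#include <numeric>
#include <vector>

// Squared L2 norm of a flattened parameter tensor, i.e. what the new op computes.
float SquaredL2Norm(const std::vector<float>& w) {
  return std::inner_product(w.begin(), w.end(), w.begin(), 0.0f);
}

// Total loss per iteration (the quantity one would plot against epoch/iteration).
float TotalLoss(float data_loss, const std::vector<float>& w, float lambda) {
  return data_loss + lambda * SquaredL2Norm(w);
}
```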

Contributor Author

I think the forward-prop op might be necessary, because while training a model it is very common practice to check convergence by plotting the total loss function over time. Also, during inference an intelligent executor can easily prune the graph to remove the regularization nodes.

I also agree with @reyoung that we can implement weight decay separately. That can be done just for the case of the L2 and L1 penalty losses.

Please let me know what you think about this plan.

Contributor

@qingqing01 qingqing01 Oct 25, 2017

  • For the plotting
    In the old PaddlePaddle framework and in Caffe, I think we usually plot the loss without the regularization term vs time.

  • For the regularization
    I fully agree with separate operators for regularization rather than implementing it in the optimizer. But I think whether to add the regularization loss to the overall loss should be up to the users, not done by default, since it adds extra computation during training. And the default regularization would only use the L2LrRegularizer (a linear operator) / L1LrRegularizer operators, like the implementation in https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/parameter/Regularizer.h, for parameter updating.

  • About this PR
    I agree to merge this op. But whether to use it in the forward pass and add the regularization loss to the overall loss should depend on the users, not be the framework's default, when regularization is used in training.

Contributor Author

@qingqing01 @lcy-seso Thank you for going through the PR. I discussed this today with @reyoung and we feel that your point is valid. We can add these ops, but whether to use them in the forward pass and add them to the loss will be the user's choice. In another PR, I will add separate operators for regularization that will be used only in the backward pass and not implemented in the optimizer.
In the case of L2, this will be a simple scale op, and in the case of L1 regularization it will be a combination of scale and sign ops.
Thank you so much for your feedback on this.
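A plain C++ sketch (illustrative only, not the PaddlePaddle operators themselves) of those backward-only contributions: L2 adds lambda * w to the parameter gradient (the scale op sketched earlier), while L1 adds lambda * sign(w), hence the extra sign op:

```cpp
#include <cstddef>
#include <vector>

// L1 penalty gradient: d/dw (lambda * |w|) = lambda * sign(w).
void AddL1PenaltyGrad(const std::vector<float>& w, std::vector<float>* grad, float lambda) {
  for (std::size_t i = 0; i < w.size(); ++i) {
    float sign = (w[i] > 0.0f) ? 1.0f : (w[i] < 0.0f ? -1.0f : 0.0f);
    (*grad)[i] += lambda * sign;  // scale applied to sign(w)
  }
}

// L2 penalty gradient: d/dw (lambda/2 * ||w||^2) = lambda * w, a pure scale of w.
void AddL2PenaltyGrad(const std::vector<float>& w, std::vector<float>* grad, float lambda) {
  for (std::size_t i = 0; i < w.size(); ++i) (*grad)[i] += lambda * w[i];
}
```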

Collaborator

@reyoung reyoung left a comment

LGTM. @qingqing01, we can merge this PR anyway, right?

@qingqing01
Contributor

@abhinavarora @reyoung I agree to merge this PR.

@abhinavarora abhinavarora merged commit b0a267c into PaddlePaddle:develop Oct 26, 2017
@abhinavarora abhinavarora deleted the l2_loss branch October 26, 2017 02:03