MKLDNN implementation of batch normalization #9904
Conversation
@luotao1 I'm having some trouble with TeamCity builds. The build finishes with the following error message:
Do you know what the issue might be?
@luotao1 I'm having some trouble with unit tests in TeamCity. The test that is failing is …. The output for the test is as follows:
Some of these tests seem to be failing because of incorrect memory accesses, while others seem to calculate results incorrectly. Some of them appear to be related to batch normalization. I made some changes to batch norm, but I haven't touched the plain CPU or GPU implementations of batch normalization. Do you know why these tests are failing? PS: I noticed there is an issue with …
Yes, our …
@luotao1 Thanks for your comment. One more question: I can see two different types of failures, like this one:
…
or this one:
…
@luotao1 Could you have a look at the code, or point me to someone who could review this PR?
@tensor-tang Could you help review batch_norm_mkldnn_op.cc?
from test_batch_norm_op import TestBatchNormOpInference, TestBatchNormOpTraining, _reference_training, _reference_grad

class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):
Could the batch_norm test be like this:
Paddle/python/paddle/fluid/tests/unittests/test_conv2d_mkldnn_op.py
Lines 20 to 22 in 5e13314
class TestMKLDNN(TestConv2dOp):
    def init_kernel_type(self):
        self.use_mkldnn = True
Only set init_kernel_type, which will run the mkldnn batch_norm kernel.
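For illustration, the analogous batch norm test might look like this (a minimal sketch, not code from this PR; it assumes the base classes imported above from test_batch_norm_op):

class TestMKLDNNBatchNormOpInference(TestBatchNormOpInference):
    def init_kernel_type(self):
        # run the MKLDNN kernel instead of the plain CPU one
        self.use_mkldnn = True

class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):
    def init_kernel_type(self):
        self.use_mkldnn = True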
I corrected this part. I introduced an init_kernel_type method in the batch norm test cases that sets the use_mkldnn variable.
If you have the init_kernel_type method, could lines 29-148 be removed?
@luotao1 Unfortunately, the test_with_place function in the test_batch_norm_op.py file inverts the saved_variance variable, because the GPU and plain CPU implementations of batch norm do that. The batch norm operation in the MKLDNN library does not, so I had to reimplement the test_with_place function in order to compare the saved_variance from the reference batch norm implementation with the one used in MKLDNN.
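A hypothetical sketch of the mismatch described above (the function and parameter names are illustrative, not the actual test code):

import numpy as np

def reference_variance(batch_var, for_mkldnn):
    # GPU and plain CPU kernels report the batch variance already
    # inverted, so the reference value is inverted to match them;
    # the MKLDNN kernel reports the raw batch variance instead
    return batch_var if for_mkldnn else 1. / batch_var

print(reference_variance(np.array([0.5, 2.0]), for_mkldnn=False))  # [2.  0.5]
print(reference_variance(np.array([0.5, 2.0]), for_mkldnn=True))   # [0.5 2. ]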
@tensor-tang Could you have a look at this PR?
@luotao1 Do you have any further remarks regarding this PR?
What do you think about #9904 (comment)?
@luotao1 I'm sorry for the late response. I've just seen your comment regarding the batch norm unit tests.

place = core.CPUPlace()
data_format = "NCHW"
test_with_place(place, data_format, [2, 3, 4, 5])
@tpatejko I see the only difference between test_with_place of test_batch_norm_mkldnn_op.py and test_batch_norm_op.py is lines 146-148:
+ place = core.CPUPlace()
+ data_format = "NCHW"
+ test_with_place(place, data_format, [2, 3, 4, 5])
and
Paddle/python/paddle/fluid/tests/unittests/test_batch_norm_op.py
Lines 392 to 398 in ad91bfe
places = [core.CPUPlace()]
if core.is_compiled_with_cuda() and core.op_support_gpu("batch_norm"):
    places.append(core.CUDAPlace(0))
for place in places:
    for data_format in ["NCHW", "NHWC"]:
        test_with_place(place, data_format, [2, 3, 4, 5])
Thus, how about moving test_with_place out of test_forward_backward in test_batch_norm_op.py, like:

def test_with_place(self, place, data_layout, shape):
    ....

def test_forward_backward(self):
    places = [core.CPUPlace()]
    if core.is_compiled_with_cuda() and core.op_support_gpu("batch_norm"):
        places.append(core.CUDAPlace(0))
    for place in places:
        for data_format in ["NCHW", "NHWC"]:
            self.test_with_place(place, data_format, [2, 3, 4, 5])

Then, you can only rewrite test_forward_backward in test_batch_norm_mkldnn_op.py, like:

class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):
    def init_kernel_type(self):
        self.use_mkldnn = True

    def test_forward_backward(self):
        place = core.CPUPlace()
        data_format = "NCHW"
        self.test_with_place(place, data_format, [2, 3, 4, 5])
The difference between test_with_place in TestMKLDNNBatchNormOpTraining and test_with_place in TestBatchNormOpTraining is the following: the reference training test case computes an inverse of saved_variance:
https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_batch_norm_op.py#L301-L307
This is done because the GPU operator returns the saved/batch variance already inverted. The MKLDNN implementation of batch normalization does not invert the batch variance (saved variance), so in order to compare the results of the _reference_training function used in test_with_place, I had to reuse the code of test_with_place with the aforementioned lines omitted:
https://github.com/tpatejko/Paddle/blob/474fa48b38b4148c5573a8811186e122532af490/python/paddle/fluid/tests/unittests/test_batch_norm_mkldnn_op.py#L53-L59
@luotao1 One more thing: you are right about moving test_with_place out of test_forward_backward and adding the data format as a parameter. I did that, but the test cases for MKLDNN were failing because of the inverted batch/saved variance.
@tpatejko Thanks for your explanation, I see it. But how about adding a function _reference_training_and_grad in TestBatchNormOpTraining, like:

def _reference_training_and_grad(x, scale, bias, epsilon, data_layout):
    y, saved_mean, saved_variance = _reference_training(
        x, scale, bias, epsilon, data_layout)
    mean_out = saved_mean * (1. - momentum) + momentum * mean
    variance_out = saved_variance * (1. - momentum) + momentum * variance
    # run backward
    y_grad = np.random.random_sample(shape).astype(np.float32)
    x_grad, scale_grad, bias_grad = _reference_grad(
        x, y_grad, scale, saved_mean, saved_variance, epsilon,
        data_layout)
    return x_grad, scale_grad, bias_grad

Then, you can only rewrite _reference_training_and_grad in test_batch_norm_mkldnn_op.py (a sketch follows the list below).
The reasons are:
- There is a lot of similar code in test_batch_norm_mkldnn_op.py and test_batch_norm_op.py.
- It's hard to see that the actual difference is in the MKLDNN implementation of batch normalization, which does not invert the batch variance (saved variance).
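A possible sketch of that MKLDNN-side override (hypothetical, not code from this PR; it assumes _reference_training_and_grad is factored out as proposed above, and that _reference_training and _reference_grad are imported as shown earlier in the thread):

class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):
    def init_kernel_type(self):
        self.use_mkldnn = True

    def _reference_training_and_grad(self, x, scale, bias, epsilon, data_layout):
        # unlike the base version, saved_variance is used as-is here:
        # the MKLDNN kernel returns the raw (non-inverted) batch variance
        y, saved_mean, saved_variance = _reference_training(
            x, scale, bias, epsilon, data_layout)
        y_grad = np.random.random_sample(x.shape).astype(np.float32)
        x_grad, scale_grad, bias_grad = _reference_grad(
            x, y_grad, scale, saved_mean, saved_variance, epsilon, data_layout)
        return x_grad, scale_grad, bias_grad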
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luotao1 I think doing reference training in a separate function is a good idea. I will try to implement it.
Commits:
- …ct execution of MKLDNN unit tests
- …ges for saved variance not being inverted
@luotao1 Could you have a look at the changes in the unit tests? I refactored them the way you requested. I also added …
LGTM! Thanks very much!
This PR implements MKLDNN batch normalization. It contains: