
MKLDNN implementation of batch normalization #9904

Merged
merged 26 commits into from
May 3, 2018
Conversation

tpatejko

This PR implements MKLDNN batch normalization. It contains:

  • MKLDNN batch norm backward and forward passes;
  • support for the NCHW data layout;
  • unit tests for training and inference.
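For reference, the computation this PR maps onto MKLDNN can be sketched in NumPy. This is a hedged illustration of the batch norm training-time math for the NCHW layout only, not the actual operator code; the function name is ours:

```python
import numpy as np

def batch_norm_forward_nchw(x, scale, bias, epsilon=1e-5):
    # Per-channel statistics are reduced over the N, H, and W axes,
    # which is what NCHW-layout batch norm implies.
    mean = x.mean(axis=(0, 2, 3))
    var = x.var(axis=(0, 2, 3))
    # Broadcast the per-channel statistics back over the NCHW tensor.
    x_hat = (x - mean[None, :, None, None]) / np.sqrt(
        var[None, :, None, None] + epsilon)
    y = scale[None, :, None, None] * x_hat + bias[None, :, None, None]
    return y, mean, var

# With scale=1 and bias=0, each channel of y is normalized
# to roughly zero mean and unit variance.
x = np.random.rand(2, 3, 4, 5).astype(np.float32)
y, mean, var = batch_norm_forward_nchw(x, np.ones(3), np.zeros(3))
```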

@tpatejko
Author

@luotao1 I'm having some trouble with the TeamCity builds. The build fails with the following error message:

[15:41:08][Step 1/1] Traceback (most recent call last):
[15:41:08][Step 1/1]   File "/usr/bin/pip", line 9, in <module>
[15:41:08][Step 1/1]     from pip import main
[15:41:08][Step 1/1] ImportError: cannot import name main
[15:41:08][Step 1/1] The command '/bin/sh -c pip install --upgrade pip &&     pip install -U wheel &&     pip install -U docopt PyYAML sphinx==1.5.6 &&     pip install sphinx-rtd-theme==0.1.9 recommonmark' returned a non-zero code: 1
[15:41:08][Step 1/1] 
[15:41:08][Step 1/1] Process exited with code 1
[15:41:08][Step 1/1] Process exited with code 1
[15:41:08][Step 1/1] Step Build and test (Command Line) failed

Do you know what the issue might be?

@luotao1
Contributor

luotao1 commented Apr 16, 2018

@tpatejko This bug is a duplicate of #9927 and was fixed in #9926. You can merge the latest code.

@tpatejko
Author

@luotao1 I'm having some trouble with the unit tests in TeamCity. The failing test is test_parallel_executor.

The output for the test is as follows:

[11:40:51][Step 1/1]  94/125 Test  #91: test_parallel_executor ..........................***Exception: Other 44.62 sec
[11:40:51][Step 1/1] test_parallel_testing (test_parallel_executor.ParallelExecutorTestingDuringTraining) ... FAIL
[11:40:51][Step 1/1] test_all (test_parallel_executor.TestCRFModel) ... [171.97684 167.20038]
[11:40:51][Step 1/1] [82.76634 93.07925]
[11:40:51][Step 1/1] [87.40605 84.28883]
[11:40:51][Step 1/1] [81.66299 78.8318 ]
[11:40:51][Step 1/1] [62.7163  97.20565]
[11:40:51][Step 1/1] [84.70265  85.544266]
[11:40:51][Step 1/1] [67.59907 85.60291]
[11:40:51][Step 1/1] [72.08023 69.33337]
[11:40:51][Step 1/1] [63.721405 74.92147 ]
[11:40:51][Step 1/1] [57.358616 63.71672 ]
[11:40:51][Step 1/1] ok
[11:40:51][Step 1/1] test_batchnorm_fc (test_parallel_executor.TestMNIST) ... [2.755311  2.6013417] [0.57221764 0.8664746 ]
[11:40:51][Step 1/1] ERROR
[11:40:51][Step 1/1] test_simple_fc (test_parallel_executor.TestMNIST) ... ERROR
[11:40:51][Step 1/1] test_resnet (test_parallel_executor.TestResnet) ... ERROR
[11:40:51][Step 1/1] test_main (test_parallel_executor.TestTransformer) ... skipped 'transformer is buggy in multi gpu'
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ======================================================================
[11:40:51][Step 1/1] ERROR: test_batchnorm_fc (test_parallel_executor.TestMNIST)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 276, in test_batchnorm_fc
[11:40:51][Step 1/1]     "label": label})
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 228, in check_network_convergence
[11:40:51][Step 1/1]     exe.run([], feed_dict=feed_dict)
[11:40:51][Step 1/1]   File "/paddle/build/python/paddle/fluid/parallel_executor.py", line 145, in run
[11:40:51][Step 1/1]     self.executor.run(fetch_list, fetch_var_name, feed_tensor_dict)
[11:40:51][Step 1/1] EnforceNotMet: an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:179]
[11:40:51][Step 1/1] PaddlePaddle Call Stacks: 
[11:40:51][Step 1/1] 0       0x7f5acc347d3cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
[11:40:51][Step 1/1] 1       0x7f5acd234e33p paddle::platform::CUDADeviceContext::Wait() const + 515
[11:40:51][Step 1/1] 2       0x7f5acc40f20ep paddle::framework::ParallelExecutor::Run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&) + 766
[11:40:51][Step 1/1] 3       0x7f5acc39c6b3p void pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, paddle::framework::ParallelExecutor, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::ParallelExecutor::*)(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::ParallelExecutor*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&)#1}, void, paddle::framework::ParallelExecutor*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, paddle::framework::ParallelExecutor, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::ParallelExecutor::*)(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::ParallelExecutor*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&)#1}&&, void (*)(paddle::framework::ParallelExecutor*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) + 451
[11:40:51][Step 1/1] 4       0x7f5acc362234p pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 1236
[11:40:51][Step 1/1] 5             0x4c37edp PyEval_EvalFrameEx + 31165
[11:40:51][Step 1/1] 6             0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 7             0x4c16e7p PyEval_EvalFrameEx + 22711
[11:40:51][Step 1/1] 8             0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 9             0x4c16e7p PyEval_EvalFrameEx + 22711
[11:40:51][Step 1/1] 10            0x4c136fp PyEval_EvalFrameEx + 21823
[11:40:51][Step 1/1] 11            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 12            0x4d55f3p
[11:40:51][Step 1/1] 13            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 14            0x4bed3dp PyEval_EvalFrameEx + 12045
[11:40:51][Step 1/1] 15            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 16            0x4d54b9p
[11:40:51][Step 1/1] 17            0x4eebeep
[11:40:51][Step 1/1] 18            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 19            0x548253p
[11:40:51][Step 1/1] 20            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 21            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 22            0x4d55f3p
[11:40:51][Step 1/1] 23            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 24            0x4bed3dp PyEval_EvalFrameEx + 12045
[11:40:51][Step 1/1] 25            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 26            0x4d54b9p
[11:40:51][Step 1/1] 27            0x4eebeep
[11:40:51][Step 1/1] 28            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 29            0x548253p
[11:40:51][Step 1/1] 30            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 31            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 32            0x4d55f3p
[11:40:51][Step 1/1] 33            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 34            0x4bed3dp PyEval_EvalFrameEx + 12045
[11:40:51][Step 1/1] 35            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 36            0x4d54b9p
[11:40:51][Step 1/1] 37            0x4eebeep
[11:40:51][Step 1/1] 38            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 39            0x548253p
[11:40:51][Step 1/1] 40            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 41            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 42            0x4d55f3p
[11:40:51][Step 1/1] 43            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 44            0x4bed3dp PyEval_EvalFrameEx + 12045
[11:40:51][Step 1/1] 45            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 46            0x4d54b9p
[11:40:51][Step 1/1] 47            0x4eebeep
[11:40:51][Step 1/1] 48            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 49            0x548253p
[11:40:51][Step 1/1] 50            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 51            0x4c136fp PyEval_EvalFrameEx + 21823
[11:40:51][Step 1/1] 52            0x4c136fp PyEval_EvalFrameEx + 21823
[11:40:51][Step 1/1] 53            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 54            0x4d55f3p
[11:40:51][Step 1/1] 55            0x4eebeep
[11:40:51][Step 1/1] 56            0x4ee7f6p
[11:40:51][Step 1/1] 57            0x4aa9abp
[11:40:51][Step 1/1] 58            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 59            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 60            0x4bfa8dp PyEval_EvalFrameEx + 15453
[11:40:51][Step 1/1] 61            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 62            0x4c16e7p PyEval_EvalFrameEx + 22711
[11:40:51][Step 1/1] 63            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 64            0x4d54b9p
[11:40:51][Step 1/1] 65            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 66            0x519a46p
[11:40:51][Step 1/1] 67            0x493b06p Py_Main + 1590
[11:40:51][Step 1/1] 68      0x7f5b0153e830p __libc_start_main + 240
[11:40:51][Step 1/1] 69            0x4933e9p _start + 41
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ======================================================================
[11:40:51][Step 1/1] ERROR: test_simple_fc (test_parallel_executor.TestMNIST)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 261, in test_simple_fc
[11:40:51][Step 1/1]     self.check_network_convergence(simple_fc_net)
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 218, in check_network_convergence
[11:40:51][Step 1/1]     startup_exe.run(startup)
[11:40:51][Step 1/1]   File "/paddle/build/python/paddle/fluid/executor.py", line 336, in run
[11:40:51][Step 1/1]     self.executor.run(program.desc, scope, 0, True, True)
[11:40:51][Step 1/1] RuntimeError: function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ======================================================================
[11:40:51][Step 1/1] ERROR: test_resnet (test_parallel_executor.TestResnet)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 305, in test_resnet
[11:40:51][Step 1/1]     batch_size=batch_size)
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 218, in check_network_convergence
[11:40:51][Step 1/1]     startup_exe.run(startup)
[11:40:51][Step 1/1]   File "/paddle/build/python/paddle/fluid/executor.py", line 336, in run
[11:40:51][Step 1/1]     self.executor.run(program.desc, scope, 0, True, True)
[11:40:51][Step 1/1] RuntimeError: function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ======================================================================
[11:40:51][Step 1/1] FAIL: test_parallel_testing (test_parallel_executor.ParallelExecutorTestingDuringTraining)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 507, in test_parallel_testing
[11:40:51][Step 1/1]     str(test_loss))
[11:40:51][Step 1/1] AssertionError: Train loss: [2.8382177 2.20449  ]
[11:40:51][Step 1/1]  Test loss:[2.8382177 2.7273917]
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Ran 6 tests in 17.689s
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] FAILED (failures=1, errors=3, skipped=1)
[11:40:51][Step 1/1] terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
[11:40:51][Step 1/1]   what():  an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:179]
[11:40:51][Step 1/1] PaddlePaddle Call Stacks: 
[11:40:51][Step 1/1] 0       0x7f5acc347d3cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
[11:40:51][Step 1/1] 1       0x7f5acd234e33p paddle::platform::CUDADeviceContext::Wait() const + 515
[11:40:51][Step 1/1] 2       0x7f5acd23507bp paddle::platform::CUDADeviceContext::~CUDADeviceContext() + 75
[11:40:51][Step 1/1] 3       0x7f5acd235881p paddle::platform::CUDADeviceContext::~CUDADeviceContext() + 17
[11:40:51][Step 1/1] 4       0x7f5acc43c50ep paddle::operators::reader::DoubleBufferReader::~DoubleBufferReader() + 62
[11:40:51][Step 1/1] 5       0x7f5acc345270p paddle::framework::Variable::PlaceholderImpl<paddle::framework::ReaderHolder>::~PlaceholderImpl() + 48
[11:40:51][Step 1/1] 6       0x7f5acd0deafcp paddle::framework::Scope::~Scope() + 188
[11:40:51][Step 1/1] 7       0x7f5acd0de9f1p paddle::framework::Scope::DropKids() + 49
[11:40:51][Step 1/1] 8       0x7f5acd0dea6ap paddle::framework::Scope::~Scope() + 42
[11:40:51][Step 1/1] 9       0x7f5acc34421ap pybind11::class_<paddle::framework::Scope>::dealloc(_object*) + 58
[11:40:51][Step 1/1] 10      0x7f5acc35bb2dp pybind11_object_dealloc + 45
[11:40:51][Step 1/1] 11            0x4fc33ap _PyModule_Clear + 1354
[11:40:51][Step 1/1] 12            0x4fbc2ep PyImport_Cleanup + 990
[11:40:51][Step 1/1] 13            0x4f8e14p Py_Finalize + 132
[11:40:51][Step 1/1] 14            0x51dc18p Py_Exit + 8
[11:40:51][Step 1/1] 15            0x51b1b7p
[11:40:51][Step 1/1] 16            0x51aaddp PyErr_PrintEx + 45
[11:40:51][Step 1/1] 17            0x519a53p
[11:40:51][Step 1/1] 18            0x493b06p Py_Main + 1590
[11:40:51][Step 1/1] 19      0x7f5b0153e830p __libc_start_main + 240
[11:40:51][Step 1/1] 20            0x4933e9p _start + 41
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] *** Aborted at 1523965227 (unix time) try "date -d @1523965227" if you are using GNU date ***
[11:40:51][Step 1/1] PC: @                0x0 (unknown)
[11:40:51][Step 1/1] *** SIGABRT (@0x610a) received by PID 24842 (TID 0x7f5b01d1e700) from PID 24842; stack trace: ***
[11:40:51][Step 1/1]     @     0x7f5b018f9390 (unknown)
[11:40:51][Step 1/1]     @     0x7f5b01553428 gsignal
[11:40:51][Step 1/1]     @     0x7f5b0155502a abort
[11:40:51][Step 1/1]     @     0x7f5af805184d __gnu_cxx::__verbose_terminate_handler()
[11:40:51][Step 1/1]     @     0x7f5af804f6b6 (unknown)
[11:40:51][Step 1/1]     @     0x7f5af804e6a9 (unknown)
[11:40:51][Step 1/1]     @     0x7f5af804f005 __gxx_personality_v0
[11:40:51][Step 1/1]     @     0x7f5af8573f83 (unknown)
[11:40:51][Step 1/1]     @     0x7f5af8574487 _Unwind_Resume
[11:40:51][Step 1/1]     @     0x7f5acd234fd6 paddle::platform::CUDADeviceContext::Wait()
[11:40:51][Step 1/1]     @     0x7f5acd23507b paddle::platform::CUDADeviceContext::~CUDADeviceContext()
[11:40:51][Step 1/1]     @     0x7f5acd235881 paddle::platform::CUDADeviceContext::~CUDADeviceContext()
[11:40:51][Step 1/1]     @     0x7f5acc43c50e paddle::operators::reader::DoubleBufferReader::~DoubleBufferReader()
[11:40:51][Step 1/1]     @     0x7f5acc345270 paddle::framework::Variable::PlaceholderImpl<>::~PlaceholderImpl()
[11:40:51][Step 1/1]     @     0x7f5acd0deafc paddle::framework::Scope::~Scope()
[11:40:51][Step 1/1]     @     0x7f5acd0de9f1 paddle::framework::Scope::DropKids()
[11:40:51][Step 1/1]     @     0x7f5acd0dea6a paddle::framework::Scope::~Scope()
[11:40:51][Step 1/1]     @     0x7f5acc34421a pybind11::class_<>::dealloc()
[11:40:51][Step 1/1]     @     0x7f5acc35bb2d pybind11_object_dealloc
[11:40:51][Step 1/1]     @           0x4fc33a _PyModule_Clear
[11:40:51][Step 1/1]     @           0x4fbc2e PyImport_Cleanup
[11:40:51][Step 1/1]     @           0x4f8e14 Py_Finalize
[11:40:51][Step 1/1]     @           0x51dc18 Py_Exit
[11:40:51][Step 1/1]     @           0x51b1b7 (unknown)
[11:40:51][Step 1/1]     @           0x51aadd PyErr_PrintEx
[11:40:51][Step 1/1]     @           0x519a53 (unknown)
[11:40:51][Step 1/1]     @           0x493b06 Py_Main
[11:40:51][Step 1/1]     @     0x7f5b0153e830 __libc_start_main
[11:40:51][Step 1/1]     @           0x4933e9 _start
[11:40:51][Step 1/1]     @                0x0 (unknown)
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1]         Start  94: test_mul_op

Some of these tests seem to fail because of illegal memory accesses, while others seem to produce incorrect results. Some appear to be related to batch normalization.

I made some changes around batch norm, but I haven't touched the plain CPU or GPU implementations of batch normalization.

Do you know why these tests are failing?

PS. I noticed there is an open issue mentioning parallel_executor in its title: #9984. Could it be related to the issues above?

@luotao1 luotao1 added the Intel label Apr 17, 2018
@luotao1
Contributor

luotao1 commented Apr 17, 2018

Do you know the reason why these tests are failing? PS. I noticed there is an issue with parallel_executor in the title #9984. Could it be related to the issues above?

Yes, our test_parallel_executor fails randomly; you can re-run your commit first.

@tpatejko
Author

@luotao1 Thanks for your comment. One more question: how do the test_parallel_executor tests fail?

I can see two different types of failures:

  1. This one seems to be related to how the computations are carried out:
[11:40:51][Step 1/1] test_batchnorm_fc (test_parallel_executor.TestMNIST) ... [2.755311  2.6013417] [0.57221764 0.8664746 ]
[11:40:51][Step 1/1] ERROR

or this one:

[11:40:51][Step 1/1] FAIL: test_parallel_testing (test_parallel_executor.ParallelExecutorTestingDuringTraining)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 507, in test_parallel_testing
[11:40:51][Step 1/1]     str(test_loss))
[11:40:51][Step 1/1] AssertionError: Train loss: [2.8382177 2.20449  ]
[11:40:51][Step 1/1]  Test loss:[2.8382177 2.7273917]
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ----------------------------------------------------------------------
  2. This one seems to be related to an incorrect memory access in the GPU device context:
[11:40:51][Step 1/1] FAILED (failures=1, errors=3, skipped=1)
[11:40:51][Step 1/1] terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
[11:40:51][Step 1/1]   what():  an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:179]

@tpatejko tpatejko requested a review from luotao1 April 17, 2018 15:56
@tpatejko
Author

@luotao1 Could you have a look at the code, or point out someone who could review this PR?

Contributor

@luotao1 luotao1 left a comment


@tensor-tang Could you help review batch_norm_mkldnn_op.cc?

from test_batch_norm_op import TestBatchNormOpInference, TestBatchNormOpTraining, _reference_training, _reference_grad


class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):
Contributor

@luotao1 luotao1 Apr 20, 2018


Could the batch_norm test be written like this:

class TestMKLDNN(TestConv2dOp):
    def init_kernel_type(self):
        self.use_mkldnn = True

Only set init_kernel_type, which will run the mkldnn batch_norm kernel.

Author


I corrected this part. I introduced an init_kernel_type method in the batch norm test cases that sets the use_mkldnn variable.

Contributor


If you have an init_kernel_type method, could lines 29-148 be removed?

Author


@luotao1 Unfortunately, the test_with_place function in the test_batch_norm_op.py file inverts the saved_variance variable, because the GPU and plain CPU implementations of batch norm do that.

The batch norm operation in the MKLDNN library does not do that, so I had to reimplement the test_with_place function to be able to compare saved_variance from the reference batch norm implementation with the one produced by MKLDNN.
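To make the mismatch concrete, here is a small NumPy sketch (ours, not the test code). The convention of storing the "inverted" variance as 1/sqrt(var + epsilon), as cuDNN does, is an assumption based on this discussion:

```python
import numpy as np

epsilon = 1e-5
var = np.random.rand(3).astype(np.float32) + 0.1  # plain per-channel variance

# CPU/GPU-style "saved variance" as the inverse standard deviation
# (assumed convention; the exact form lives in test_batch_norm_op.py).
saved_variance_inverted = 1.0 / np.sqrt(var + epsilon)

# MKLDNN returns the plain variance, so to compare against the inverted
# reference one must either skip the inversion step or undo it:
recovered_var = 1.0 / np.square(saved_variance_inverted) - epsilon
```

Skipping the inversion in the MKLDNN test (as this PR does) and undoing it afterwards are numerically equivalent ways to line the two conventions up.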

@tpatejko
Author

@tensor-tang Could you have a look at this PR?

@tensor-tang
Contributor

Hi @tpatejko and @luotao1, since I am on annual leave this week, I could only take a quick look at the file batch_norm_mkldnn_op.cc. The logic looks OK to me. @luotao1, could you please help double-check it and the other items? Thanks.

@tpatejko
Author

@luotao1 do you have any further remarks regarding this PR?

@luotao1
Contributor

luotao1 commented Apr 26, 2018

What do you think about #9904 (comment)?

@tpatejko
Author

@luotao1 I'm sorry for the late response. I've just seen your comment regarding the batch norm unit tests.


place = core.CPUPlace()
data_format = "NCHW"
test_with_place(place, data_format, [2, 3, 4, 5])
Contributor


@tpatejko I see that the only difference between test_with_place in test_batch_norm_mkldnn_op.py and test_batch_norm_op.py is lines 146-148:

+        place = core.CPUPlace()
+        data_format = "NCHW"
+        test_with_place(place, data_format, [2, 3, 4, 5])

and

places = [core.CPUPlace()]
if core.is_compiled_with_cuda() and core.op_support_gpu("batch_norm"):
    places.append(core.CUDAPlace(0))
for place in places:
    for data_format in ["NCHW", "NHWC"]:
        test_with_place(place, data_format, [2, 3, 4, 5])

Thus, how about moving test_with_place out of test_forward_backward in test_batch_norm_op.py, like:

def test_with_place(self, place, data_layout, shape):
    ....

def test_forward_backward(self):
    places = [core.CPUPlace()]
    if core.is_compiled_with_cuda() and core.op_support_gpu("batch_norm"):
        places.append(core.CUDAPlace(0))

    for place in places:
        for data_format in ["NCHW", "NHWC"]:
            self.test_with_place(place, data_format, [2, 3, 4, 5])

Then, you only need to rewrite test_forward_backward in test_batch_norm_mkldnn_op.py, like:

class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):
    def init_kernel_type(self):
        self.use_mkldnn = True

    def test_forward_backward(self):
        place = core.CPUPlace()
        data_format = "NCHW"
        self.test_with_place(place, data_format, [2, 3, 4, 5])

Author


The difference between test_with_place in TestMKLDNNBatchNormOpTraining and test_with_place in TestBatchNormOpTraining is the following: the reference training test case computes an inverse of saved_variance:
https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_batch_norm_op.py#L301-L307

This is done because the GPU operator returns the saved/batch variance already inverted.

The MKLDNN implementation of batch normalization does not invert the batch variance (saved variance), so in order to compare against the results of the _reference_training function used in test_with_place, I had to reuse the code of test_with_place with the aforementioned lines omitted:
https://github.com/tpatejko/Paddle/blob/474fa48b38b4148c5573a8811186e122532af490/python/paddle/fluid/tests/unittests/test_batch_norm_mkldnn_op.py#L53-L59

Author


@luotao1 One more thing: you are right about moving test_with_place out of test_forward_backward and adding the data format as a parameter. I did that, but the MKLDNN test cases were failing because of the inverted batch/saved variance.

Contributor


@tpatejko Thanks for your explanation; I see it. But how about adding a function _reference_training_and_grad in TestBatchNormOpTraining, like:

def _reference_training_and_grad(x, scale, bias, epsilon, data_layout):
    y, saved_mean, saved_variance = _reference_training(
        x, scale, bias, epsilon, data_layout)
    mean_out = saved_mean * (1. - momentum) + momentum * mean
    variance_out = saved_variance * (1. - momentum) + momentum * variance
    # run backward
    y_grad = np.random.random_sample(shape).astype(np.float32)
    x_grad, scale_grad, bias_grad = _reference_grad(
        x, y_grad, scale, saved_mean, saved_variance, epsilon, data_layout)
    return x_grad, scale_grad, bias_grad

Then, you only need to rewrite _reference_training_and_grad in test_batch_norm_mkldnn_op.py.
The reasons are:

  • There is a lot of similar code in test_batch_norm_mkldnn_op.py and test_batch_norm_op.py.
  • It is hard to see that the MKLDNN implementation of batch normalization does not invert the batch variance (saved variance).
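The refactor being suggested is essentially the template-method pattern: the shared driver lives in the base class and only the divergent statistic handling is overridden. A minimal hedged sketch (class and method names are illustrative, not the actual Paddle test code):

```python
import numpy as np

class BaseBatchNormTest:
    epsilon = 1e-5

    def ref_forward(self, x):
        # Shared driver: compute statistics, then apply the
        # subclass-specific variance convention.
        mean, var = x.mean(), x.var()
        return mean, self.transform_variance(var)

    def transform_variance(self, var):
        # CPU/GPU reference convention: store the inverse std deviation.
        return 1.0 / np.sqrt(var + self.epsilon)

class MKLDNNBatchNormTest(BaseBatchNormTest):
    def transform_variance(self, var):
        # MKLDNN keeps the plain variance; only this hook changes.
        return var

x = np.random.rand(100).astype(np.float32)
_, v_ref = BaseBatchNormTest().ref_forward(x)
_, v_mkldnn = MKLDNNBatchNormTest().ref_forward(x)
```

With this split, the MKLDNN test file no longer needs to duplicate the driver code, and the one real behavioral difference is visible in a single overridden method.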

Author


@luotao1 I think computing the reference training in a separate function is a good idea. I will try to implement it.

@tpatejko
Author

tpatejko commented May 2, 2018

@luotao1 Could you have a look at the changes in the unit tests? I refactored them the way you requested.

I also added use_mkldnn attribute to the batch norm's Python interface.

Contributor

@luotao1 luotao1 left a comment


LGTM! Thanks very much!

@luotao1 luotao1 merged commit 4a497b8 into PaddlePaddle:develop May 3, 2018