MKLDNN-DO Subgraph Optimization #966

sreekanth-yalachigere · 2019-05-03T17:27:36Z

This optimization creates a subgraph of mkldnn operators and uses full potential of mkldnn by using block propagation and improves performance significantly.
For example, in Resnet50, currently we have 109 data reorders.

onnxruntime_data(nchw) ->reorder(nchw8c)->conv(nchw8c) -> reorder(nchw);
onnxruntime_data(nchw) ->batchnorm(nchw);
onnxruntime_data(nchw) ->Relu(nchw);
onnxruntime_data(nchw) ->reorder(nchw8c)->pool(nchw8c) -> reorder(nchw);
onnxruntime_data(nchw) ->reorder(nchw8c)->conv(nchw8c) -> reorder(nchw);

With this optimization, we will have only one data reorder and we propage blocked mkldnn memory all the way to the end of subgraph.

onnxruntime_data(nchw) -> conv(nchw8c) -> batchnorm-relu(nchw8c)->pool(nchw8c)->conv(nchw8c)... reorder(nchw)

Subgraph optimization can be enabled by setting the following environment variable.
ORT_MKLDNN_SUBGRAPH=1

…runtime into mkl13

jywu-msft · 2019-05-03T17:28:54Z

/azp run

azure-pipelines · 2019-05-03T17:30:34Z

Azure Pipelines successfully started running 10 pipeline(s).

pranavsharma

Mkl and MKL are used inconsistently.
Put the MklKernel code in a cc file (not header).
Prefer using mkldnn instead of mkl everywhere to avoid confusion with mkl library.

onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc

onnxruntime/core/providers/mkldnn/subgraph/subgraph.h

onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc

onnxruntime/core/providers/mkldnn/subgraph/mkldnn_custom_op.h

pranavsharma · 2019-05-03T22:51:28Z

onnxruntime/core/providers/mkldnn/subgraph/mkl_kernel.h

+    ORT_UNUSED_PARAMETER(attributes_prefix);
+  }
+
+  virtual Status CreatePrimitives(const ONNXRunTimeTensor* input_tensors,


input_tensor

I think input_tensors is correct. We access shape, dim and data by an index. Can you please confirm?

pranavsharma · 2019-05-03T23:08:27Z

onnxruntime/core/providers/mkldnn/subgraph/mkldnn_custom_op.h

+      mkldnn::memory::dims src_dims(x_shape.GetDims().begin(), x_shape.GetDims().end());
+      std::string key;
+      key.reserve(128);
+      key = subgraph_key_;


Can AddDimsToKey be written such that it accepts a const string and returns the newly formed key? Moreover, we should rename this to GenerateKey() as addition is just one way to generate the key.

@pranavsharma AddDimsToKey is widely used in Vanilla mkldnn operators. Should I go ahead and modify it?

onnxruntime/core/providers/mkldnn/subgraph/mkldnn_custom_op.h

onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc

onnxruntime/core/providers/mkldnn/subgraph/MklConv.h

onnxruntime/core/providers/mkldnn/subgraph/MklSum.h

jywu-msft · 2019-05-06T20:47:23Z

can you add some high level documentation, which describes the classes, abstractions and their relationships.
this would help maintenance of this code moving forward.

This optimization creates a subgraph of mkldnn operators and uses full potential of mkldnn by using block propagation and improves performance significantly.
For example, in Resnet50, currently we have 109 data reorders.
onnxruntime_data(nchw) ->reorder(nchw8c)->conv(nchw8c) -> reorder(nchw);
onnxruntime_data(nchw) ->batchnorm(nchw);
onnxruntime_data(nchw) ->Relu(nchw);
onnxruntime_data(nchw) ->reorder(nchw8c)->pool(nchw8c) -> reorder(nchw);
onnxruntime_data(nchw) ->reorder(nchw8c)->conv(nchw8c) -> reorder(nchw);
With this optimization, we will have only one data reorder and we propage blocked mkldnn memory all the way to the end of subgraph.

onnxruntime_data(nchw) -> conv(nchw8c) -> batchnorm-relu(nchw8c)->pool(nchw8c)->conv(nchw8c)... reorder(nchw)

Subgraph optimization can be enabled by setting the following environment variable.
ORT_MKLDNN_SUBGRAPH=1

onnxruntime/core/providers/mkldnn/subgraph/subgraph.h

onnxruntime/core/providers/mkldnn/subgraph/MklBatchNorm.h

onnxruntime/core/providers/mkldnn/subgraph/mkldnn_custom_op.h

jywu-msft · 2019-05-06T22:34:32Z

onnxruntime/core/providers/mkldnn/subgraph/mkl_kernel.h

+  }
+  virtual ~MklKernel(){};
+
+  virtual void ReadAttributes(const std::unordered_map<std::string,


is it possible to leverage OpNodeProtoHelper for reading attributes?
then you wouldn't need to define all the Get*Attr() methods below.

I looked into it and it's not straight forward. Can this be done after this PR?

jywu-msft · 2019-05-07T15:16:55Z

onnxruntime/core/providers/mkldnn/subgraph/subgraph.h

+
+namespace onnxruntime {
+
+struct MklNode {


there's already onnxruntime representations of Node, subgraphs etc.
Were those too heavy to use? Would have been nice to leverage more of onnxruntime classes where possible,
rather than adding new ones to bridge onnxruntime and mkldnn library.

MklDnnNode is totally different and I am using it for MklDnn IR of a subgraph.

yufenglee · 2019-05-07T18:49:15Z

I tried this PR on a production model and got a crash. Here is the call stack:

#0 0x000000000047e487 in std::__find_if<__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >, __gnu_cxx::__ops::_Iter_equals_val<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const> > (__first="Conv",
__last=<error: Cannot access memory at address 0x8>, __pred=...) at /usr/include/c++/5/bits/stl_algo.h:118
#1 0x00000000004935ab in std::__find_if<__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >, __gnu_cxx::__ops::_Iter_equals_val<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const> > (__pred=..., __last=...,
__first=...) at /usr/include/c++/5/bits/stl_algo.h:162
#2 std::find<__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, std::char_traits, std::allocator >*, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > (__val=..., __last=..., _first=...) at /usr/include/c++/5/bits/stl_algo.h:3791
#3 onnxruntime::MKLDNNExecutionProvider::GetCapability (this=this@entry=0xe088f0, graph_viewer=..., kernel_registries=std::vector of length 1, capacity 1 = {...})
at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc:167
#4 0x0000000000788391 in onnxruntime::GraphPartitioner::Partition (this=this@entry=0x7fffffffbf50, graph=..., export_dll=false, func_mgr=...) at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/core/framework/graph_partitioner.cc:151
#5 0x00000000004663ee in onnxruntime::InferenceSession::TransformGraph (this=this@entry=0xe07150, graph=..., graph_transformer_mgr=..., providers=..., kernel_registry_manager=..., insert_cast_transformer=..., session_state=...)
at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/core/session/inference_session.cc:301
#6 0x000000000046b7ce in onnxruntime::InferenceSession::Initialize (this=this@entry=0xe07150) at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/core/session/inference_session.cc:446
#7 0x0000000000450122 in OrtCreateSession (env=env@entry=0xe03ac0, model_path=0xcd9a50 "/home/yufeng/data/models/fastrcnn/perf/model.onnx", options=options@entry=0xe06cb0, out=out@entry=0x7fffffffd830)
at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/core/session/onnxruntime_c_api.cc:390
#8 0x0000000000421903 in onnxruntime::SessionOptionsWrapper::OrtCreateSession (model_path=, this=) at /home/yufeng/project/mkldnn/onnxruntime/include/onnxruntime/core/session/onnxruntime_cxx_api.h:127
#9 onnxruntime::perftest::OnnxRuntimeTestSession::OnnxRuntimeTestSession (this=0xdfe920, env=0xe03ac0, performance_test_config=..., m=0xdfdf20) at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/test/perftest/ort_test_session.cc:83
#10 0x0000000000423c41 in onnxruntime::perftest::CreateSession (test_model_info=0xdfdf20, performance_test_config=..., env=0xe03ac0) at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/test/perftest/performance_runner.cc:188
#11 onnxruntime::perftest::PerformanceRunner::PerformanceRunner (this=0x7fffffffdf30, env=0xe03ac0, test_config=...) at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/test/perftest/performance_runner.cc:201
#12 0x0000000000420bbe in real_main (argc=, argv=, p_env=p_env@entry=0x7fffffffe160) at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/test/perftest/main.cc:33
#13 0x0000000000419615 in main (argc=, argv=) at /home/yufeng/project/mkldnn/onnxruntime/onnxruntime/test/perftest/main.cc:53

yufenglee · 2019-05-07T18:57:57Z

onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc

+  while (node_index < graph_viewer.MaxNodeIndex()) {
+    auto node = graph_viewer.GetNode(node_index);
+    std::vector<std::string>::iterator it = std::find(
+        mkl_ops.begin(), mkl_ops.end(), node->OpType());


the crash is because node is nullptr. Need to figure out why node is null.

@yufenglee is this PR model available on ONNX Model Zoo?

@sreekanth-yalachigere , no, it is a production model and I can't share it with you for confidentiality.

GetNode() may return nullptr if the node has been freed , which can happen if there are some transforms.
I don't think you can iterate through nodes by index from 0 to max.
Shouldn't you use GetNodesInTopologicalOrder() ?

…private

…nn_kernel.cpp

sreekanth-yalachigere · 2019-05-09T22:47:56Z

@pranavsharma @jywu-msft

Changed Mkl to MklDnn
Renamed files (mkl_* to mkldnn_*)
Moved code from mkldnn_kernel.h to new mkldnn_kernel.cc
Extracted method from GetCapability long code.

…runtime into mkl13

sreekanth-yalachigere · 2019-05-28T18:12:27Z

Creating new PR as many files have modified after memcpy code merge.

sreekanth-yalachigere added 19 commits April 21, 2019 21:24

PR Branch

ccc7b57

PR review changes

bdcfc9b

bug fixes

0da8a24

PR review and bug fix

c9c51c8

PR review and bug fix

77b1090

PR review and bug fix2

b2ec434

mkldnncustomop.h renamed compute to submit

0587283

mklkernel: renamed compute to Bind

0b232d2

shuffle net perf bug fix

72936a3

PR Branch

23a412f

PR review changes

1054df9

bug fixes

b83a507

PR review and bug fix

29c9dc2

PR review and bug fix

fe8e6fa

PR review and bug fix2

23ed5d8

mkldnncustomop.h renamed compute to submit

5d65d7d

mklkernel: renamed compute to Bind

7cbccfd

shuffle net perf bug fix

60d17e5

Merge branch 'mkl13' of https://github.com/sreekanth-yalachigere/onnx…

9e656c0

…runtime into mkl13

sreekanth-yalachigere requested a review from a team as a code owner May 3, 2019 17:27

pranavsharma reviewed May 3, 2019

View reviewed changes