Fix PR #5550 reverted in #5911 (performance improvement for operator Transpose) #5916
Conversation
@@ -246,6 +246,39 @@ TEST(TransposeOpTest, ThreeDimSuffix) {
  TransposeTest(input_shape, input_vals, &perm, expected_shape, expected_vals, false);  //TensorRT: illegal error
}

TEST(TransposeOpTest, TransposeReshape) {
Can we add test(s) for the kind of inputs that would have caused an issue previously, i.e. the reason for which the PR was reverted?
Done. I checked that this test fails with the previous version.
Sorry - did this specific test fail with the old version? It seems to be the same as the old checked-in test...
Yes, it fails with the old version. The test I mention is here: https://github.com/microsoft/onnxruntime/pull/5916/files#diff-fc7da1e84aadf77e977b26bbf56319b4cf1ffb9ae7fbfdc539ab84cb2e971582R517. It creates a buffer bigger than the expected output and fills it with a constant. After the function is called, it checks that the constant did not change outside the expected output.
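For readers skimming the thread, the test follows roughly this pattern (the buffer sizes, the sentinel value, and the call into the transpose code are illustrative, not the exact checked-in test):

std::vector<uint8_t> buffer(expected_output_bytes + padding_bytes, 0xAB);  // 0xAB is an arbitrary sentinel
// ... run the transpose so that it writes into buffer.data() ...
for (size_t i = expected_output_bytes; i < buffer.size(); ++i) {
  EXPECT_EQ(buffer[i], 0xAB);  // nothing may be written past the expected output
}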
size_t index;
size_t upper_bound;
int64_t stride;
MultiIndex() {}
In the default constructor, you leave all the variables uninitialised.
Yes, to save time. The default constructor is called when creating an array, and even if the members were initialized, they would be overwritten right after. This struct is not exposed in the header, so I don't expect it to be used anywhere else in the code.
Then you should have an init function instead of using operator= to fill it?
done
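For illustration, the shape this takes is roughly the sketch below (the Init signature is an assumption for this thread, not a quote of the committed code):

struct MultiIndex {
  size_t index;
  size_t upper_bound;
  int64_t stride;
  // Deliberately left uninitialized: instances are created in bulk inside a std::vector
  // and immediately filled via Init, so zero-initializing here would be wasted work.
  MultiIndex() {}
  void Init(size_t i, size_t ub, int64_t st) {
    index = i;
    upper_bound = ub;
    stride = st;
  }
};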
nit: the 'typedef' isn't needed.
also a comment capturing why the values are not explicitly initialized would be helpful, to prevent someone innocently changing that in the future and causing a perf regression.
I am good with this change. Probably needs sign-off from @skottmckay.
Was just waiting to see if there were any perf issues from the ORT_ENFORCE in the loop. For this release I'm not convinced we need that given the issue wasn't the calculations picking a bad offset, so if there's an issue we can back that out.
I could not measure the difference with/without ORT_ENFORCE in the loops. It was not significant, so I left them in.
OK. I would prefer for DoTransposeEltWise to not be public given this bounds check, but that can be a separate PR.
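For reference, the kind of bounds check being discussed is roughly of this form (the condition and message are illustrative, assuming source_end marks one past the last valid source byte):

ORT_ENFORCE(local_source >= source && local_source < source_end,
            "local_source is out of bounds while transposing.");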
const uint8_t* local_source = source;
for (size_t i = 0; i < num_blocks; ++i) {
  CopyPrim<T>(target, local_source);
  IncrementIndexAndComputeOffset(mindex.data(), naxes, local_source);
I think the most difficult part is how you would prove this code is correct. Here the local_source pointer may get moved around arbitrarily; how would you prove it won't go out of bounds?
I think the unit test did not catch this issue because it was compiled with pointer alignment, but there should be a way to test the case where one value is copied after another one but at a lower position in the transposed array.
I added a unit test to check that this function does not go outside the output vector.
for (size_t i = 0; i < num_blocks; ++i) {
  CopyPrim<T>(target, local_source);
  IncrementIndexAndComputeOffset(mindex.data(), naxes, local_source);
  target += sizeof(T);
If I understand correctly, target should be T*.
And the for loop would be like:
T* target_end = target + num_blocks;
for (T* p = target; p != target_end; ++p) {
  ...
}
Yes, we could save another variable.
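Spelled out, the suggested loop would look roughly like this (a sketch; it assumes CopyPrim keeps its uint8_t* interface and that target is suitably aligned for T):

T* p = reinterpret_cast<T*>(target);
T* const target_end = p + num_blocks;
for (; p != target_end; ++p) {
  CopyPrim<T>(reinterpret_cast<uint8_t*>(p), local_source);
  IncrementIndexAndComputeOffset(mindex.data(), naxes, local_source);
}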
index[k] = 0;

// Combines multi-index increment and corresponding pointer in the string tensor to transpose.
static void IncrementIndexAndComputeOffset(MultiIndex* mindex, size_t naxes, const std::string*& local_source) {
  MultiIndex* it = mindex + (naxes - 1);
naxes has type size_t, so naxes - 1 is dangerous unless you can guarantee that naxes won't be zero.
I added parentheses to get (mindex + naxes) - 1; this should solve the issue.
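To spell out the size_t concern with a tiny standalone example:

#include <cstddef>
#include <cstdio>

int main() {
  size_t naxes = 0;
  // With unsigned arithmetic, naxes - 1 wraps around to SIZE_MAX,
  // so mindex + (naxes - 1) would be an enormous out-of-bounds offset.
  std::printf("%zu\n", naxes - 1);  // prints 18446744073709551615 on a 64-bit build
  // (mindex + naxes) - 1 avoids the wrap-around; either way the function is
  // assumed to be called only with naxes >= 1.
  return 0;
}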
@@ -560,6 +569,20 @@ static bool IsMovingSingleAxis(const std::vector<size_t>& permutations, size_t&
  return single_axis_moved;
}

bool IsReshape(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
This function can be tested individually; it would be better if there were some unit tests for it. And what if perm contains invalid values? Can you trust the input?
Is there an existing function that can be reused here?
I would need to expose that function in the header then.
FWIW, in the internal usage the values of perm are validated earlier. Making it public changes what could in theory be passed (potentially leading to someone adding unnecessary code to re-validate perm for the theoretical but non-existent use case).
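For readers following along, here is a minimal sketch of one way such a check can be written, under the assumption that "the transpose is a reshape" means all axes with dimension greater than 1 keep their relative order (so the memory layout is unchanged). This is an illustration of the idea, not the committed implementation, and like the real code it relies on perm having been validated upstream:

// Sketch: a transpose is a reshape if the non-unit axes keep their relative order.
static bool IsReshapeSketch(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
  size_t last_non_unit_axis = 0;
  for (size_t i = 0; i < perm.size(); ++i) {
    if (input_dims[perm[i]] == 1)
      continue;                       // axes of size 1 can move freely
    if (perm[i] < last_non_unit_axis)
      return false;                   // a non-unit axis moved backwards: a real transpose
    last_non_unit_axis = perm[i];
  }
  return true;
}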
local_source += it->stride;
if (++it->index < it->upper_bound)
  return;
local_source -= it->stride * it->index;
Why does local_source still point to valid memory after these movements?
The previous code was:

for (int64_t j = 0; j < num_axes; ++j)
  offset += index[j] * stride[j];

with local_source = source + offset. The offset was recomputed after every iteration, whereas most of the time the new code just increments it (local_source += it->stride). If the transposed matrix has 4 dimensions, this loop does one increment and one copy most of the time, while the previous loop computed the offset each time (4 multiplications and additions) and then copied the value to its transposed location. offset and local_source describe the same position, but local_source is obtained with fewer operations. I could add some verifications, but I would prefer to do that only in debug mode.
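To make the trade-off concrete, here is a small standalone sketch that maintains the offset incrementally, odometer style, and checks it against the full recomputation (shapes, strides, and names are illustrative):

#include <cstdio>
#include <vector>

int main() {
  const std::vector<size_t> dims = {2, 3, 4};      // output (transposed) shape
  const std::vector<size_t> stride = {12, 4, 1};   // source stride for each output axis
  const size_t naxes = dims.size();
  const size_t num_blocks = 2 * 3 * 4;

  std::vector<size_t> index(naxes, 0);
  size_t incremental = 0;                          // maintained with += / -= only

  for (size_t i = 0; i + 1 < num_blocks; ++i) {
    // Odometer-style increment: most of the time only the innermost axis moves,
    // so a single addition replaces the full recomputation below.
    size_t k = naxes;
    while (k-- > 0) {
      incremental += stride[k];
      if (++index[k] < dims[k]) break;             // common case: one add, done
      incremental -= stride[k] * index[k];         // axis rolled over: rewind it
      index[k] = 0;
    }
    // Previous approach: recompute the offset from scratch (naxes multiply-adds per block).
    size_t recomputed = 0;
    for (size_t j = 0; j < naxes; ++j) recomputed += index[j] * stride[j];
    if (incremental != recomputed) { std::printf("mismatch\n"); return 1; }
  }
  std::printf("incremental and recomputed offsets always agree\n");
  return 0;
}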
template <class T>
static void TypedDoTransposeEltWise(int64_t num_axes, const std::vector<int64_t>& target_dims, size_t num_blocks,
                                    const std::vector<size_t>& stride, const uint8_t* source, uint8_t* target) {
  std::vector<MultiIndex> mindex(num_axes);
If this took args for the end of the source and target data, could we do an explicit check we haven't gone out of bounds? Would it be possible to just do that check once at the end of the 'for' loop that copies?
That would catch any future issues, and we wouldn't need the hack to expose DoTransposeEltWise to a unit test.
I need to measure how much time would be spent in the check before answering. I'm not sure checking at the end of the loop would work: the loop could go outside the bounds and then come back inside. In the previous version, the pointer stayed within the boundaries the whole time. The failure came from the fact that instead of writing 1 byte, it was writing 8 bytes at a time: 1 byte inside, 7 bytes outside for the last assignment. It did not fail on my machine because the compiler aligned the pointer to 8 bytes and the pointer into the target buffer always increases.
I don't quite follow that. The sizes of the blocks to write should be based on the input dimensions, so how do we get a mismatch where it's attempting to write 8 bytes for a 1-byte block? That sounds more like a bug in figuring out block sizes, and the check should be where we determine that.
I went back and looked at the old code, and it seems the bug was hardcoding a type in this call that had no relation to the data type being copied.
So this:

CopyPrim<uint64_t>(target, local_source);

is now

CopyPrim<T>(target, local_source);

Is that correct?
If so, there was never an incorrect calculation of an offset that violated the bounds, was there? It was just that the copy wasn't limited to the correct number of bytes.
That's correct. I replaced it with a template to reduce code duplication.
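For readers who have not looked at the helper, its effect is essentially a fixed-size copy, so the sketch below shows the shape of the bug (the helper body here is an assumption, not a quote of the source):

// Assumed shape of the helper: copy exactly sizeof(T) bytes from source to target.
template <class T>
inline void CopyPrim(uint8_t* target, const uint8_t* source) {
  *reinterpret_cast<T*>(target) = *reinterpret_cast<const T*>(source);
}

// Hardcoding the type meant a 1-byte element still triggered an 8-byte write:
//   CopyPrim<uint64_t>(target, local_source);  // always writes 8 bytes
// whereas the templated call writes exactly sizeof(T) bytes:
//   CopyPrim<T>(target, local_source);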
Based on that, I'm not convinced we need to go all out to validate the bounds. There was never an issue with those; it was simply the use of a wrongly sized type in the copy.
Instead, is it possible to craft a unit test that would break with the incorrect copy size without explicitly checking the bounds? You're writing 8 bytes in places where 1 should be written, so I would expect that with carefully chosen input shapes and transpose perms you could cause a later write to overwrite part of an earlier one, leading to the output not matching the expected values. That would prevent regression without needing extra ORT_ENFORCE statements in the loop or exposing internal helper methods just for a unit test.
That was because I exposed this function in the header in order to be able to write a unit test for that specific function. Should I put it back?
static void IncrementIndexAndComputeOffset(MultiIndex* mindex, size_t naxes, const T*& local_source) {
  MultiIndex* it = (mindex + naxes) - 1;
  local_source += it->stride;
  if (++it->index < it->upper_bound)
It would really help to have some comments on a number of lines here explaining the logic as there's a lot of intrinsic knowledge required to understand the steps. Would help detect potential issues as well.
I added more comments. I'll run some benchmarks today to measure the impact of the ORT_ENFORCE statements inside the loops involving indices.
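For reference, here is a commented sketch of that increment step; the comments describe the logic discussed in this thread, and the body is a sketch consistent with the snippets above rather than the committed wording:

template <typename T>
static void IncrementIndexAndComputeOffset(MultiIndex* mindex, size_t naxes, const T*& local_source) {
  // Walk from the innermost (fastest-moving) axis towards the outermost one.
  for (MultiIndex* it = (mindex + naxes) - 1; ; --it) {
    // Advance the source pointer by this axis' stride and bump its index.
    local_source += it->stride;
    if (++it->index < it->upper_bound)
      return;                                  // no carry needed: the common case
    // The axis rolled over: undo everything that was added along it, reset it,
    // and carry into the next outer axis.
    local_source -= it->stride * it->index;
    it->index = 0;
    if (it == mindex)
      return;                                  // every axis rolled over (past the last block)
  }
}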
Last change: I added two unit tests today because I noticed two functions (DoTransposeImpl) were not covered by a unit test.
Are we keeping the ORT_ENFORCEs in the loops?
I kept them because they don't harm performance.
* Fix PR #5550 reverted in #5911 (performance improvement for operator Transpose) (#5916)
  * Improves implementation of transpose operator
  * Fix issue mentioned in #5911
  * adding unit test for function DoTransposeImpl
* Make operator TreeEnsemble 5x faster for batches of size 100.000 (#5965)
  * improves processing time by 10
  * extend unit test coverage
  * better implementation for the multi regression case
  * better comment, keep parallelization by trees when not enough trees
* Initialize a structure in operator ReduceSum (#6005)
  * fix initialisation issue
* Fuse MatMulIntegerToFloat only when scales are scalar (#6008)
  MatMulIntegerToFloat fusion fuses per-row and per-column MatMulInteger, which is not supported by the MatMulIntegerToFloat kernel now. Limit the fusion to per-matrix only before we support per-channel fully.
* Disable Python 3.9 for training Python packaging build. (#6012)
  Python 3.9 is not supported by the PyTorch dependency.
* Fix bugs for 1: Calibrator should check model inputs; 2: quantize_inupts forgot to use parameter initializer_use_weight_qtyp. (#6017)
* Bump highlight.js from 10.2.1 to 10.4.1 in /nodejs
  Bumps [highlight.js](https://github.com/highlightjs/highlight.js) from 10.2.1 to 10.4.1.
  - [Release notes](https://github.com/highlightjs/highlight.js/releases)
  - [Changelog](https://github.com/highlightjs/highlight.js/blob/master/CHANGES.md)
  - [Commits](highlightjs/highlight.js@10.2.1...10.4.1)
  Signed-off-by: dependabot[bot] <support@github.com>
* Work around the build break on Mac (#6069)
  * Fix the build break in macOS release
  * revert android change
* Bump up API version for 1.6 release (#6076)
* Update version to 1.6.0 (#6041)
  * Update version to 1.6.0
  * Add v1.5.3 info
  * Updating WindowsAI and ONNX version
  Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Revert "Fuse MatMulIntegerToFloat only when scales are scalar (#6008)"
  This reverts commit beb950e.

Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Zhang Lei <zhang.huanning@hotmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Description:
See #5550, #5911.
BEFORE / AFTER: benchmark results (not reproduced here).
In this example, the ratio new implementation / previous implementation is ~0.6 (for all N). The comparison with torch and numpy is difficult, as their transpose only permutes dimensions and strides without changing the data.
Measurements on model bert_en_cased_L-12_H-768_A-12_2:
BEFORE / AFTER: benchmark results (not reproduced here).
Motivation and Context
See #5550.