Fix PR #5550 reverted in #5911 (performance improvement for operator Transpose) #5916
Conversation
@@ -246,6 +246,39 @@ TEST(TransposeOpTest, ThreeDimSuffix) {
  TransposeTest(input_shape, input_vals, &perm, expected_shape, expected_vals, false);  //TensorRT: illegal error
}

TEST(TransposeOpTest, TransposeReshape) {
Can we add test(s) for the kind of inputs that would have caused an issue previously, i.e. the reason for which the PR was reverted?
Done. I checked that this test fails with the previous version.
Sorry - did this specific test fail with the old version? It seems to be the same as the old checked-in test...
Yes, it fails with the old version. The test I mention is here: https://github.com/microsoft/onnxruntime/pull/5916/files#diff-fc7da1e84aadf77e977b26bbf56319b4cf1ffb9ae7fbfdc539ab84cb2e971582R517. It creates a buffer bigger than the expected output and fills it with a constant. After the function is called, it checks that the constant did not change outside the expected output.
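For readers skimming the thread, the test follows roughly this pattern (the buffer sizes, the sentinel value, and the call into the transpose code are illustrative, not the exact checked-in test):

std::vector<uint8_t> buffer(expected_output_bytes + padding_bytes, 0xAB);  // 0xAB is an arbitrary sentinel
// ... run the transpose so that it writes into buffer.data() ...
for (size_t i = expected_output_bytes; i < buffer.size(); ++i) {
  EXPECT_EQ(buffer[i], 0xAB);  // nothing may be written past the expected output
}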
size_t index;
size_t upper_bound;
int64_t stride;
MultiIndex() {}
In the default constructor, you leave all the variables uninitialised.
Yes, to save time. The default constructor is called when creating an array, and even if the members were initialized, they would be overwritten right after. This struct is not exposed in the header, so I don't expect it to be used anywhere else in the code.
Then you should have an init function instead of using operator= to fill it?
done
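For illustration, the shape this takes is roughly the sketch below (the Init signature is an assumption for this thread, not a quote of the committed code):

struct MultiIndex {
  size_t index;
  size_t upper_bound;
  int64_t stride;
  // Deliberately left uninitialized: instances are created in bulk inside a std::vector
  // and immediately filled via Init, so zero-initializing here would be wasted work.
  MultiIndex() {}
  void Init(size_t i, size_t ub, int64_t st) {
    index = i;
    upper_bound = ub;
    stride = st;
  }
};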
nit: the 'typedef' isn't needed.
also a comment capturing why the values are not explicitly initialized would be helpful, to prevent someone innocently changing that in the future and causing a perf regression.
I am good with this change. Probably needs sign-off from @skottmckay.
Was just waiting to see if there were any perf issues from the ORT_ENFORCE in the loop. For this release I'm not convinced we need that given the issue wasn't the calculations picking a bad offset, so if there's an issue we can back that out.
I could not measure the difference with/without ORT_ENFORCE in the loops. It was not significant, so I left them in.
OK. I would prefer for DoTransposeEltWise to not be public given this bounds check, but that can be a separate PR.
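For reference, the kind of bounds check being discussed is roughly of this form (the condition and message are illustrative, assuming source_end marks one past the last valid source byte):

ORT_ENFORCE(local_source >= source && local_source < source_end,
            "local_source is out of bounds while transposing.");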
const uint8_t* local_source = source;
for (size_t i = 0; i < num_blocks; ++i) {
  CopyPrim<T>(target, local_source);
  IncrementIndexAndComputeOffset(mindex.data(), naxes, local_source);
I think the most difficult part is how you would prove this code is correct. Here the local_source pointer may get moved around arbitrarily; how would you prove it won't go out of bounds?
I think the unit test did not catch this issue because it was compiled with pointer alignment, but there should be a way to test the case where one value is copied after another one but at a lower position in the transposed array.
I added a unit test to check that this function does not go outside the output vector.
for (size_t i = 0; i < num_blocks; ++i) {
  CopyPrim<T>(target, local_source);
  IncrementIndexAndComputeOffset(mindex.data(), naxes, local_source);
  target += sizeof(T);
If I understand correctly, target should be T*.
And the for loop would be like:
T* target_end = target + num_blocks;
for (T* p = target; p != target_end; ++p) {
  ...
}
Yes, we could save another variable.
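Spelled out, the suggested loop would look roughly like this (a sketch; it assumes CopyPrim keeps its uint8_t* interface and that target is suitably aligned for T):

T* p = reinterpret_cast<T*>(target);
T* const target_end = p + num_blocks;
for (; p != target_end; ++p) {
  CopyPrim<T>(reinterpret_cast<uint8_t*>(p), local_source);
  IncrementIndexAndComputeOffset(mindex.data(), naxes, local_source);
}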
index[k] = 0;

// Combines multi-index increment and corresponding pointer in the string tensor to transpose.
static void IncrementIndexAndComputeOffset(MultiIndex* mindex, size_t naxes, const std::string*& local_source) {
  MultiIndex* it = mindex + (naxes - 1);
naxes has type size_t, so naxes - 1 is dangerous unless you can guarantee that naxes won't be zero.
I added parentheses to get (mindex + naxes) - 1; this should solve the issue.
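To spell out the size_t concern with a tiny standalone example:

#include <cstddef>
#include <cstdio>

int main() {
  size_t naxes = 0;
  // With unsigned arithmetic, naxes - 1 wraps around to SIZE_MAX,
  // so mindex + (naxes - 1) would be an enormous out-of-bounds offset.
  std::printf("%zu\n", naxes - 1);  // prints 18446744073709551615 on a 64-bit build
  // (mindex + naxes) - 1 avoids the wrap-around; either way the function is
  // assumed to be called only with naxes >= 1.
  return 0;
}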
@@ -560,6 +569,20 @@ static bool IsMovingSingleAxis(const std::vector<size_t>& permutations, size_t&
  return single_axis_moved;
}

bool IsReshape(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
This function can be tested individually; it would be better if there were some unit tests for it. And what if perm contains invalid values? Can you trust the input?
Is there an existing function that can be reused here?
I would need to expose that function in the header then.
FWIW, in the internal usage the values of perm are validated earlier. Making it public changes what could in theory be passed (potentially leading to someone adding unnecessary code to re-validate perm for the theoretical but non-existent use case).
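For readers following along, here is a minimal sketch of one way such a check can be written, under the assumption that "the transpose is a reshape" means all axes with dimension greater than 1 keep their relative order (so the memory layout is unchanged). This is an illustration of the idea, not the committed implementation, and like the real code it relies on perm having been validated upstream:

// Sketch: a transpose is a reshape if the non-unit axes keep their relative order.
static bool IsReshapeSketch(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
  size_t last_non_unit_axis = 0;
  for (size_t i = 0; i < perm.size(); ++i) {
    if (input_dims[perm[i]] == 1)
      continue;                       // axes of size 1 can move freely
    if (perm[i] < last_non_unit_axis)
      return false;                   // a non-unit axis moved backwards: a real transpose
    last_non_unit_axis = perm[i];
  }
  return true;
}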
local_source += it->stride;
if (++it->index < it->upper_bound)
  return;
local_source -= it->stride * it->index;
Why does local_source still point to valid memory after these movements?
The previous code was:

for (int64_t j = 0; j < num_axes; ++j)
  offset += index[j] * stride[j];

with local_source = source + offset. The offset was recomputed after every iteration, whereas most of the time the new code just increments it (local_source += it->stride). If the transposed matrix has 4 dimensions, this loop does one increment and one copy most of the time, while the previous loop computed the offset each time (4 multiplications and additions) and then copied the value to its transposed location. offset and local_source describe the same position, but local_source is obtained with fewer operations. I could add some verifications, but I would prefer to do that only in debug mode.
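To make the trade-off concrete, here is a small standalone sketch that maintains the offset incrementally, odometer style, and checks it against the full recomputation (shapes, strides, and names are illustrative):

#include <cstdio>
#include <vector>

int main() {
  const std::vector<size_t> dims = {2, 3, 4};      // output (transposed) shape
  const std::vector<size_t> stride = {12, 4, 1};   // source stride for each output axis
  const size_t naxes = dims.size();
  const size_t num_blocks = 2 * 3 * 4;

  std::vector<size_t> index(naxes, 0);
  size_t incremental = 0;                          // maintained with += / -= only

  for (size_t i = 0; i + 1 < num_blocks; ++i) {
    // Odometer-style increment: most of the time only the innermost axis moves,
    // so a single addition replaces the full recomputation below.
    size_t k = naxes;
    while (k-- > 0) {
      incremental += stride[k];
      if (++index[k] < dims[k]) break;             // common case: one add, done
      incremental -= stride[k] * index[k];         // axis rolled over: rewind it
      index[k] = 0;
    }
    // Previous approach: recompute the offset from scratch (naxes multiply-adds per block).
    size_t recomputed = 0;
    for (size_t j = 0; j < naxes; ++j) recomputed += index[j] * stride[j];
    if (incremental != recomputed) { std::printf("mismatch\n"); return 1; }
  }
  std::printf("incremental and recomputed offsets always agree\n");
  return 0;
}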
template <class T>
static void TypedDoTransposeEltWise(int64_t num_axes, const std::vector<int64_t>& target_dims, size_t num_blocks,
                                    const std::vector<size_t>& stride, const uint8_t* source, uint8_t* target) {
  std::vector<MultiIndex> mindex(num_axes);
If this took args for the end of the source and target data, could we do an explicit check we haven't gone out of bounds? Would it be possible to just do that check once at the end of the 'for' loop that copies?
That would catch any future issues, and we wouldn't need the hack to expose DoTransposeEltWise to a unit test.
I need to measure how much time would be spent in the check before answering. I'm not sure checking at the end of the loop would work: the loop could go outside the bounds and then come back inside. In the previous version, the pointer stayed within the boundaries the whole time. The failure came from the fact that instead of writing 1 byte, it was writing 8 bytes at a time: 1 byte inside, 7 bytes outside for the last assignment. It did not fail on my machine because the compiler aligned the pointer to 8 bytes and the pointer into the target buffer always increases.
I don't quite follow that. The sizes of the blocks to write should be based on the input dimensions, so how do we get a mismatch where it's attempting to write 8 bytes for a 1-byte block? That sounds more like a bug in figuring out block sizes, and the check should be where we determine that.
I went back and looked at the old code, and it seems the bug was hardcoding a type in this call that had no relation to the data type being copied.
So this:

CopyPrim<uint64_t>(target, local_source);

is now

CopyPrim<T>(target, local_source);

Is that correct?
If so, there was never an incorrect calculation of an offset that violated the bounds, was there? It was just that the copy wasn't limited to the correct number of bytes.
That's correct. I replaced it with a template to reduce code duplication.
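For readers who have not looked at the helper, its effect is essentially a fixed-size copy, so the sketch below shows the shape of the bug (the helper body here is an assumption, not a quote of the source):

// Assumed shape of the helper: copy exactly sizeof(T) bytes from source to target.
template <class T>
inline void CopyPrim(uint8_t* target, const uint8_t* source) {
  *reinterpret_cast<T*>(target) = *reinterpret_cast<const T*>(source);
}

// Hardcoding the type meant a 1-byte element still triggered an 8-byte write:
//   CopyPrim<uint64_t>(target, local_source);  // always writes 8 bytes
// whereas the templated call writes exactly sizeof(T) bytes:
//   CopyPrim<T>(target, local_source);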
Based on that, I'm not convinced we need to go all out to validate the bounds. There was never an issue with those; it was simply the use of a wrongly sized type in the copy.
Instead, is it possible to craft a unit test that would break with the incorrect copy size without explicitly checking the bounds? You're writing 8 bytes in places where 1 should be written, so I would expect that with carefully chosen input shapes and transpose perms you could cause a later write to overwrite part of an earlier one, leading to the output not matching the expected values. That would prevent regression without needing extra ORT_ENFORCE statements in the loop or exposing internal helper methods just for a unit test.
That was because I exposed this function in the header in order to be able to write a unit test for that specific function. Should I put it back?
static void IncrementIndexAndComputeOffset(MultiIndex* mindex, size_t naxes, const T*& local_source) {
  MultiIndex* it = (mindex + naxes) - 1;
  local_source += it->stride;
  if (++it->index < it->upper_bound)
It would really help to have some comments on a number of lines here explaining the logic as there's a lot of intrinsic knowledge required to understand the steps. Would help detect potential issues as well.
I added more comments. I'll run some benchmarks today to measure the impact of the ORT_ENFORCE statements inside the loops involving indices.
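For reference, here is a commented sketch of that increment step; the comments describe the logic discussed in this thread, and the body is a sketch consistent with the snippets above rather than the committed wording:

template <typename T>
static void IncrementIndexAndComputeOffset(MultiIndex* mindex, size_t naxes, const T*& local_source) {
  // Walk from the innermost (fastest-moving) axis towards the outermost one.
  for (MultiIndex* it = (mindex + naxes) - 1; ; --it) {
    // Advance the source pointer by this axis' stride and bump its index.
    local_source += it->stride;
    if (++it->index < it->upper_bound)
      return;                                  // no carry needed: the common case
    // The axis rolled over: undo everything that was added along it, reset it,
    // and carry into the next outer axis.
    local_source -= it->stride * it->index;
    it->index = 0;
    if (it == mindex)
      return;                                  // every axis rolled over (past the last block)
  }
}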
Last change: I added two unit tests today because I noticed two functions (DoTransposeImpl) were not covered by a unit test.
Are we keeping the ORT_ENFORCEs in the loops?
I kept them because they don't harm performance.
* Fix PR #5550 reverted in #5911 (performance improvement for operator Transpose) (#5916)
  * Improves implementation of transpose operator
  * Fix issue mentioned in #5911
  * adding unit test for function DoTransposeImpl
* Make operator TreeEnsemble 5x faster for batches of size 100.000 (#5965)
  * improves processing time by 10
  * extend unit test coverage
  * better implementation for the multi regression case
  * better comment, keep parallelization by trees when not enough trees
* Initialize a structure in operator ReduceSum (#6005)
  * fix initialisation issue
* Fuse MatMulIntegerToFloat only when scales are scalar (#6008)
  MatMulIntegerToFloat fusion fuses per-row and per-column MatMulInteger, which is not supported by the MatMulIntegerToFloat kernel now. Limit the fusion to per-matrix only before we support per-channel fully.
* Disable Python 3.9 for training Python packaging build. (#6012)
  Python 3.9 is not supported by the PyTorch dependency.
* Fix bugs for 1: Calibrator should check model inputs; 2: quantize_inupts forgot to use parameter initializer_use_weight_qtyp. (#6017)
* Bump highlight.js from 10.2.1 to 10.4.1 in /nodejs
  Bumps [highlight.js](https://github.com/highlightjs/highlight.js) from 10.2.1 to 10.4.1.
  - [Release notes](https://github.com/highlightjs/highlight.js/releases)
  - [Changelog](https://github.com/highlightjs/highlight.js/blob/master/CHANGES.md)
  - [Commits](highlightjs/highlight.js@10.2.1...10.4.1)
  Signed-off-by: dependabot[bot] <support@github.com>
* Work around the build break on Mac (#6069)
  * Fix the build break in macOS release
  * revert android change
* Bump up API version for 1.6 release (#6076)
* Update version to 1.6.0 (#6041)
  * Update version to 1.6.0
  * Add v1.5.3 info
  * Updating WindowsAI and ONNX version
  Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Revert "Fuse MatMulIntegerToFloat only when scales are scalar (#6008)"
  This reverts commit beb950e.

Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Zhang Lei <zhang.huanning@hotmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Description:
See #5550, #5911.
BEFORE / AFTER: benchmark results (not reproduced here).
In this example, the ratio new implementation / previous implementation is ~0.6 (for all N). The comparison with torch and numpy is difficult, as their transpose only permutes dimensions and strides without changing the data.
Measurements on model bert_en_cased_L-12_H-768_A-12_2:
BEFORE / AFTER: benchmark results (not reproduced here).
Motivation and Context
See #5550.