Data tiling: transpose narrow-N into narrow-M #17446
Conversation
Force-pushed from 6127e79 to 488f5f3.
Abbreviated Benchmark Summary @ commit 8016c0ff2060c5f5f224560f1ca6fbc90d749a9f (vs. base 46c6bf5660022068cb1078bd7eeae527b13b30c2)

Improved Latencies 🎉 (top 3 out of 22 results shown)
Improved Total Dispatch Sizes 🎉
Improved Stream IR Dispatch Count (# of `cmd.dispatch` ops) 🎉
@lialan, this diff fixes the issue we were seeing on riscv CI, which is really an issue about properly handling the case where encoding materialization fails. The issue was that we meant to return the original tensor type in that case, but we had already mutated `tensorType` by overwriting it with its maybe-transposed form, so the failure path returned the wrong type.

```diff
diff --git a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
index 769a9c9d15..83874316f8 100644
--- a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
@@ -97,18 +97,19 @@ static RankedTensorType transposeIfNarrowNResult(RankedTensorType tensorType) {
 static RankedTensorType
 getMaterializedType(RankedTensorType tensorType,
                     MaterializeEncodingFn materializeEncodingFn) {
-  tensorType = transposeIfNarrowNResult(tensorType);
+  RankedTensorType maybeTransposedTensorType =
+      transposeIfNarrowNResult(tensorType);
   FailureOr<MaterializeEncodingInfo> materializeEncodingInfo =
-      materializeEncodingFn(tensorType);
+      materializeEncodingFn(maybeTransposedTensorType);
   if (failed(materializeEncodingInfo)) {
     return dropEncoding(tensorType);
   }
-  return cast<RankedTensorType>(
-      tensor::PackOp::inferPackedType(getOriginalTypeWithEncoding(tensorType)
-                                          .clone(tensorType.getElementType()),
-                                      materializeEncodingInfo->innerTileSizes,
-                                      materializeEncodingInfo->innerDimsPos,
-                                      materializeEncodingInfo->outerDimsPerm));
+  return cast<RankedTensorType>(tensor::PackOp::inferPackedType(
+      getOriginalTypeWithEncoding(maybeTransposedTensorType)
+          .clone(tensorType.getElementType()),
+      materializeEncodingInfo->innerTileSizes,
+      materializeEncodingInfo->innerDimsPos,
+      materializeEncodingInfo->outerDimsPerm));
 }
```
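To make the pitfall concrete, here is a minimal standalone sketch of the bug class and its fix, using hypothetical names (`Type`, `transform`, `tryMaterialize` are stand-ins, not IREE APIs): once the input variable is overwritten, the failure path can no longer return the original value.

```cpp
// Minimal sketch of the bug class fixed in the diff above. All names here
// are hypothetical stand-ins, not IREE APIs.
#include <cassert>
#include <optional>

struct Type { bool transposed = false; };

static Type transform(Type t) { t.transposed = true; return t; }

// Always fails here, to exercise the failure path.
static std::optional<Type> tryMaterialize(const Type &) { return std::nullopt; }

// Buggy: mutates the parameter, so the failure path returns the wrong value.
static Type getTypeBuggy(Type tensorType) {
  tensorType = transform(tensorType);       // original is overwritten here
  if (auto materialized = tryMaterialize(tensorType))
    return *materialized;
  return tensorType;                        // BUG: meant to be the original
}

// Fixed: gives the transformed value its own variable, mirroring the
// `maybeTransposedTensorType` rename in the diff.
static Type getTypeFixed(Type tensorType) {
  Type maybeTransposed = transform(tensorType);
  if (auto materialized = tryMaterialize(maybeTransposed))
    return *materialized;
  return tensorType;                        // correct: untouched original
}

int main() {
  assert(getTypeBuggy(Type{}).transposed == true);   // mutation leaked
  assert(getTypeFixed(Type{}).transposed == false);  // original preserved
  return 0;
}
```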
This is a generic idea in the design of matrix multiplication implementations: the M and N dimensions play symmetrical roles, so there is an opportunity to halve the problem space by transposition. The immediate motivation is ukernels: we have chosen to implement narrow ukernels only for the narrow-M cases, not narrow-N, in preparation for this. This is why this PR yields a 5%-9% e2e speedup on multiple ML models with ukernels (and a > 2x speedup on matvec microbenchmarks).

The idea should be beneficial beyond ukernels though:

* With codegen (outside of ukernels), inner unit dimensions have often caused codegen to fall off of good vectorization cases. This transposition moves unit or generally smaller static dimensions to the outer dimensions, which will help with that.
* When we get to serious distribution tiling (iree-org#16410), the reduction of generality will again greatly help.

This transposition is made easier by (and was all along part of the idea in) the RHS-transposition in `mmt4d` (the `t` in `mmt4d`), as generally with matrix multiplication

```
B * Transpose(A) == Transpose( A * Transpose(B) )
```

so in `mmt4d` terms

```
mmt4d(B, A) == Transpose(mmt4d(A, B))
```

(this identity is checked numerically in the sketch below).

As `pack` and `unpack` already have enough generality to perform these transpositions, we just directly generate the right transposing `pack` and `unpack` ops. An earlier plan was to generate `linalg.transpose` and rely on a later folding pattern, but it turned out to be simpler to directly generate the already-transposed `pack` and `unpack`.

A legitimate question was: should this transposition be implemented at `SetEncoding` instead of at `MaterializeEncoding`? That would have been simpler in some ways, but:

* The benefit of the transposition depends on the backend, so it doesn't belong in Flow.
* `SetEncoding` must be reversible in case the back-end doesn't want to do data-tiling. The transposition would be difficult to revert, and generally confusing in settings where it's not wanted.
* The above mmt4d-specific trait simplifying the transposition only helps because at `MaterializeEncoding` we know we are generating a `mmt4d`. We couldn't so easily rely on that in `SetEncoding`.
* Before `MaterializeEncoding` we would have to handle `linalg.generic`, not just named matmul ops.

Co-authored-by: Benoit Jacob <jacob.benoit.1@gmail.com>

benchmark-extra: x86_64-dt-only, android-cpu-dt-only

Signed-off-by: Alan Li <me@alanli.org>
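A quick standalone sanity check of the transposition identity above, sketched in plain C++ loops (illustrative only, not IREE code): `mmt(X, Y)` below computes `X * Transpose(Y)` (the "mmt" in `mmt4d`), so swapping the operands must yield the transposed result.

```cpp
// Numerical check of mmt4d(B, A) == Transpose(mmt4d(A, B)) on a small
// narrow-N shape, using a row-major matmul-with-transposed-RHS.
#include <cassert>
#include <cstdlib>
#include <vector>

// C[m x n] = A[m x k] * Transpose(B[n x k]), row-major.
static std::vector<float> mmt(const std::vector<float> &a,
                              const std::vector<float> &b, int m, int n,
                              int k) {
  std::vector<float> c(m * n, 0.0f);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      for (int p = 0; p < k; ++p)
        c[i * n + j] += a[i * k + p] * b[j * k + p];
  return c;
}

int main() {
  const int m = 8, n = 2, k = 5; // narrow-N shape: N << M.
  std::vector<float> a(m * k), b(n * k);
  for (auto &x : a) x = static_cast<float>(std::rand() % 16);
  for (auto &x : b) x = static_cast<float>(std::rand() % 16);
  auto ab = mmt(a, b, m, n, k); // m x n result
  auto ba = mmt(b, a, n, m, k); // n x m result, operands swapped
  // mmt(B, A) must equal Transpose(mmt(A, B)) element-wise.
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      assert(ab[i * n + j] == ba[j * m + i]);
  return 0;
}
```

This is the reason a narrow-N matmul can be rewritten as a narrow-M one: swap the operands and transpose the result, with the transposing `pack`/`unpack` ops absorbing the transpositions.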
I see some matmul test failures on arm64 (only tested on postsubmit by default, opt-in on presubmit) that started after this was merged: https://github.com/iree-org/iree/actions/runs/9273840024/job/25514820313#step:5:2008
Do those look related to this PR?
Also Android test failure: https://github.com/iree-org/iree/actions/runs/9273840024/job/25515648799#step:6:579
Yes, looks related. Oh well, let's revert.
@lialan, can you reproduce locally on your Mac (Arm-architecture) with AddressSanitizer?
The 2nd test pasted above by Scott looks like the best one to start debugging from.
Reverts #17446. Reason: postsubmit failures on arm64, #17446 (comment).
(EDIT: see 2 comments down; I didn't realize that this is intermittent. I did reproduce on Mac!) This does not reproduce on Mac.
(EDIT: see next comment; I didn't realize that this is intermittent. I did reproduce on Mac!) Given that, I am starting to think that this looks like some kind of platform bug on linux/arm64 and android/arm64, even though I can't tell where. => @lialan, I think we can go forward with disabling this test or marking it as failing on linux/arm64 and android/arm64. @ScottTodd FYI / WDYT.
OK, thanks @lialan for noticing that this is intermittent. Here on Mac/ASan/arm64 it happens about once every 4 attempts on average. It is not flagged as an ASan error; the failure mode is exactly as in #17446 (comment).
Update (thanks @lialan for the suggestion): we tried single-threaded and it doesn't reproduce in that case.
Discussion continuing on #17525.