
Data tiling: transpose narrow-N into narrow-M #17446

Merged
merged 1 commit into iree-org:main from the transpose_rebase branch
May 28, 2024

Conversation

lialan
Contributor

@lialan lialan commented May 20, 2024

(This is a rebasing PR for #16890 )

This is a generic idea in the design of matrix multiplication implementations: the M and N dimensions play symmetrical roles, so there is an opportunity to halve the problem space by transposition. The immediate motivation is ukernels: in preparation for this, we have chosen to implement narrow ukernels only for the narrow-M cases, not narrow-N. This is why this PR yields a 5%-9% e2e speedup on multiple ML models with ukernels (and a > 2x speedup on matvec microbenchmarks).

The idea should be beneficial beyond ukernels though:

  • With codegen (outside of ukernels), inner unit dimensions have often caused codegen to fall off of good vectorization cases. This transposition moves unit or generally smaller static dimensions to the outer dimensions, which will help with that.
  • When we get to serious distribution tiling (Proper distribution-tiling for mmt4d #16410), the reduction of generality will again greatly help.

This transposition is made easier by (and was all along part of the idea in) the RHS-transposition in `mmt4d` (the `t` in `mmt4d`), as generally with matrix multiplication

```
B * Transpose(A) == Transpose( A * Transpose(B) )
```

so in `mmt4d` terms

```
mmt4d(B, A) == Transpose(mmt4d(A, B))
```
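
For reference, this identity is ordinary matrix algebra and can be checked entrywise (this is not part of the PR itself):

```
(B * Transpose(A))[i][j]          == sum_k B[i][k] * A[j][k]
Transpose(A * Transpose(B))[i][j] == (A * Transpose(B))[j][i] == sum_k A[j][k] * B[i][k]
```

Both right-hand sides are the same sum, so the two expressions are equal.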

As `pack` and `unpack` already have enough generality to perform these transpositions, we just directly generate the right transposing `pack` and `unpack` ops. An earlier plan was to generate `linalg.transpose` and rely on a later folding pattern, but it turned out to just be simpler to directly generate the already-transposed `pack`, `unpack`.
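
To make the narrow-N to narrow-M rewrite concrete at the plain-matrix level, here is a minimal NumPy sketch. It is an illustration only, not IREE code: it ignores the mmt4d tiling and the pack/unpack ops entirely, and the shapes are made-up values.

```python
# Plain-NumPy illustration: a narrow-N matmul C = A @ B can be computed as a
# narrow-M matmul on the swapped, transposed operands, then transposing the
# small result back.
import numpy as np

M, K, N = 1024, 512, 2  # narrow-N: N is tiny relative to M
A = np.random.rand(M, K)
B = np.random.rand(K, N)

C_direct = A @ B               # narrow-N matmul: result is 1024x2
C_swapped = (B.T @ A.T).T      # narrow-M matmul (2x1024 result), transposed back

np.testing.assert_allclose(C_direct, C_swapped)
```

In the actual pipeline, the operand swap and the final transpose are not materialized as separate ops; they are folded into the transposing pack/unpack ops described above.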

A legitimate question was: should this transposition be implemented at `SetEncoding` instead of at `MaterializeEncoding`? That would have been simpler in some ways, but:

  • The benefit of the transposition depends on the backend, so it doesn't belong in Flow.
  • SetEncoding must be reversible in case the back-end doesn't want to do data-tiling. The transposition would be difficult to revert, and generally confusing in settings where it's not wanted.
  • The above mmt4d-specific trait simplifying the transposition only helps since at MaterializeEncoding we know we are generating a mmt4d. We couldn't so easily rely on that in SetEncoding.
  • Before MaterializeEncoding we would have to handle `linalg.generic`, not just named matmul ops.

Co-authored-by: Benoit Jacob <jacob.benoit.1@gmail.com>

benchmark-extra: x86_64-dt-only, android-cpu-dt-only

@lialan lialan requested a review from hanhanW as a code owner May 20, 2024 21:13

google-cla bot commented May 20, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@lialan
Contributor Author

lialan commented May 20, 2024

This is a rebase PR for #16890

Credit goes to @bjacob

@hanhanW hanhanW requested review from bjacob and Max191 May 20, 2024 23:41
@lialan lialan force-pushed the transpose_rebase branch from 940fbc2 to 6ae5e61 on May 21, 2024 15:45
@lialan lialan requested a review from pashu123 May 22, 2024 15:17
@lialan lialan force-pushed the transpose_rebase branch 2 times, most recently from 6127e79 to 488f5f3 on May 27, 2024 20:44
@lialan lialan added the benchmarks:x86_64 (Run default x86_64 benchmarks) and benchmarks:android-cpu (Run default Android CPU benchmarks) labels May 27, 2024

github-actions bot commented May 27, 2024

Abbreviated Benchmark Summary

@ commit 8016c0ff2060c5f5f224560f1ca6fbc90d749a9f (vs. base 46c6bf5660022068cb1078bd7eeae527b13b30c2)

Data-Tiling Comparison Table

Name No-DT (baseline) DT-Only DT-UK
BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 229.197 (1.0X) 133.463 (1.7X) 105.904 (2.2X)
BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 758.062 (1.0X) 268.158 (2.8X) 219.101 (3.5X)
DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 32.032 (1.0X) 37.059 (0.9X) 29.964 (1.1X)
DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 6.949 (1.0X) 9.307 (0.7X) 8.528 (0.8X)
EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 275.107 (1.0X) 261.033 (1.1X) 229.364 (1.2X)
EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 36.064 (1.0X) 37.462 (1.0X) 34.588 (1.0X)
EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 27.017 (1.0X) 51.712 (0.5X) 13.083 (2.1X)
EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 5.798 (1.0X) 10.967 (0.5X) 5.005 (1.2X)
GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 69.875 (1.0X) 38.005 (1.8X) 39.913 (1.8X)
GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 9.150 (1.0X) 9.443 (1.0X) 8.817 (1.0X)
GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 89.836 (1.0X) 42.272 (2.1X) 41.720 (2.2X)
GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.163 (1.0X) 8.469 (1.3X) 8.627 (1.3X)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 79.132 (1.0X) 78.715 (1.0X) 56.941 (1.4X)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 12.369 (1.0X) 14.716 (0.8X) 13.078 (0.9X)
MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 179.797 (1.0X) 254.408 (0.7X) 185.833 (1.0X)
MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.292 (1.0X) 66.351 (0.5X) 62.180 (0.6X)
MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 181.140 (1.0X) 254.875 (0.7X) 189.859 (1.0X)
MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.524 (1.0X) 66.259 (0.5X) 62.310 (0.6X)
MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 482.240 (1.0X) 1063.694 (0.5X) 206.554 (2.3X)
MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 66.341 (1.0X) 132.840 (0.5X) 62.761 (1.1X)
MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 25.373 (1.0X) 23.201 (1.1X) 17.853 (1.4X)
MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 4.819 (1.0X) 5.394 (0.9X) 4.542 (1.1X)
MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.945 (1.0X) 14.682 (0.8X) 11.304 (1.1X)
MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 3.805 (1.0X) 5.287 (0.7X) 4.937 (0.8X)
MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 21.627 (1.0X) 42.370 (0.5X) 11.798 (1.8X)
MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 5.904 (1.0X) 9.577 (0.6X) 5.422 (1.1X)
MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 2.774 (1.0X) 3.328 (0.8X) 2.689 (1.0X)
MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 2.885 (1.0X) 3.471 (0.8X) 2.847 (1.0X)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.155 (1.0X) 40.624 (0.8X) 31.326 (1.1X)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 8.636 (1.0X) 11.055 (0.8X) 9.867 (0.9X)
PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 0.747 (1.0X) 1.314 (0.6X) 0.539 (1.4X)
PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 0.819 (1.0X) 1.402 (0.6X) 0.603 (1.4X)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 17.648 (1.0X) 24.436 (0.7X) 18.891 (0.9X)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 4.145 (1.0X) 5.982 (0.7X) 5.145 (0.8X)
matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 7.580 (1.0X) 7.603 (1.0X) 7.620 (1.0X)
DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 48.345 (1.0X) 83.322 (0.6X) 42.908 (1.1X)
DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 50.409 (1.0X) 83.768 (0.6X) 44.500 (1.1X)
DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 30.307 (1.0X) 48.640 (0.6X) 27.585 (1.1X)
GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 93.054 (1.0X) 21.963 (4.2X) 21.390 (4.4X)
GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 93.998 (1.0X) 21.724 (4.3X) 21.796 (4.3X)
GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 52.521 (1.0X) 21.781 (2.4X) 21.871 (2.4X)
GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 133.775 (1.0X) 27.670 (4.8X) 27.923 (4.8X)
GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 136.103 (1.0X) 29.233 (4.7X) 29.409 (4.6X)
GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 74.467 (1.0X) 26.616 (2.8X) 26.739 (2.8X)
MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 706.440 (1.0X) 450.671 (1.6X) 352.901 (2.0X)
MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 706.218 (1.0X) 447.171 (1.6X) 356.122 (2.0X)
MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 400.103 (1.0X) 271.364 (1.5X) 216.024 (1.9X)
MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 1044.682 (1.0X) 628.401 (1.7X) 239.490 (4.4X)
MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 1044.733 (1.0X) 622.467 (1.7X) 241.420 (4.3X)
MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 541.713 (1.0X) 340.489 (1.6X) 143.666 (3.8X)
Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 2096.865 (1.0X) 1084.293 (1.9X) 299.727 (7.0X)
Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 2097.375 (1.0X) 1084.823 (1.9X) 303.380 (6.9X)
Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 1122.986 (1.0X) 601.168 (1.9X) 181.771 (6.2X)
matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 12.052 (1.0X) 10.053 (1.2X) 1.319 (9.1X)

Improved Latencies 🎉

Benchmark Name Average Latency (ms) Median Latency (ms) Latency Standard Deviation (ms)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 12.369 (vs. 13.820, 10.50%↓) 12.198 0.299
GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,dt-only] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 29.233 (vs. 31.890, 8.33%↓) 29.432 0.729
DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,dt-uk] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 42.908 (vs. 46.790, 8.30%↓) 42.996 0.304

[Top 3 out of 22 results shown]

Improved Total Dispatch Sizes 🎉

Benchmark Name Total Dispatch Size (bytes)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,dt-uk,compile-stats] 47104 (vs. 63264, 25.54%↓)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only,compile-stats] 77824 (vs. 83712, 7.03%↓)
BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only,compile-stats] 73728 (vs. 77776, 5.20%↓)

Improved Stream IR Dispatch Count (# of cmd.dispatch ops) 🎉

Benchmark Name Stream IR Dispatch Count (# of cmd.dispatch ops)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only,compile-stats] 222 (vs. 223, 0.45%↓)

For more information:

Source Workflow Run

@bjacob
Contributor

bjacob commented May 28, 2024

@lialan , this diff fixes the issue we were seeing on riscv CI, really an issue about properly handling the case where encoding materialization fails. The issue was that we meant to return the original tensor type in that case, but we had already mutated tensorType in-place overwriting it with the transposed type.

```diff
diff --git a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
index 769a9c9d15..83874316f8 100644
--- a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
@@ -97,18 +97,19 @@ static RankedTensorType transposeIfNarrowNResult(RankedTensorType tensorType) {
 static RankedTensorType
 getMaterializedType(RankedTensorType tensorType,
                     MaterializeEncodingFn materializeEncodingFn) {
-  tensorType = transposeIfNarrowNResult(tensorType);
+  RankedTensorType maybeTransposedTensorType =
+      transposeIfNarrowNResult(tensorType);
   FailureOr<MaterializeEncodingInfo> materializeEncodingInfo =
-      materializeEncodingFn(tensorType);
+      materializeEncodingFn(maybeTransposedTensorType);
   if (failed(materializeEncodingInfo)) {
     return dropEncoding(tensorType);
   }
-  return cast<RankedTensorType>(
-      tensor::PackOp::inferPackedType(getOriginalTypeWithEncoding(tensorType)
-                                          .clone(tensorType.getElementType()),
-                                      materializeEncodingInfo->innerTileSizes,
-                                      materializeEncodingInfo->innerDimsPos,
-                                      materializeEncodingInfo->outerDimsPerm));
+  return cast<RankedTensorType>(tensor::PackOp::inferPackedType(
+      getOriginalTypeWithEncoding(maybeTransposedTensorType)
+          .clone(tensorType.getElementType()),
+      materializeEncodingInfo->innerTileSizes,
+      materializeEncodingInfo->innerDimsPos,
+      materializeEncodingInfo->outerDimsPerm));
 }
```

@lialan lialan force-pushed the transpose_rebase branch from 34ac239 to 4a7254b on May 28, 2024 16:40
@lialan lialan force-pushed the transpose_rebase branch from 4a7254b to 95a138e on May 28, 2024 16:45
@ScottTodd ScottTodd merged commit 16bdaa9 into iree-org:main May 28, 2024
65 of 67 checks passed
@lialan lialan deleted the transpose_rebase branch May 28, 2024 17:48
@ScottTodd
Member

I see some matmul test failures on arm64 (only tested on postsubmit by default, opt-in on presubmit) that started after this was merged: https://github.com/iree-org/iree/actions/runs/9273840024/job/25514820313#step:5:2008

```
--- TEST[matmul_accumulate_DYNxDYNxbf16_times_DYNxDYNxbf16_into_DYNxDYNxf32_6_13_3_acc_12] ---
Matmul shape (MxKxN): 6x13x3
--- TEST[matmul_accumulate_6x13xbf16_times_13x3xbf16_into_6x3xf32_6_13_3_acc_13] ---
Matmul shape (MxKxN): 6x13x3
free(): invalid pointer

1015/1472 Test #1054: iree/tests/e2e/matmul/e2e_matmul_cpu_experimental_dt_bf16_f32_small_llvm-cpu_local-task ......................Subprocess aborted***Exception:   0.15 sec
--- TEST[matmul_accumulate_DYNxDYNxbf16_times_DYNxDYNxbf16_into_DYNxDYNxf32_1_1_1_acc_0] ---
Matmul shape (MxKxN): 1x1x1
...
--- TEST[matmul_accumulate_DYNxDYNxbf16_times_DYNxDYNxbf16_into_DYNxDYNxf32_6_13_3_acc_12] ---
Matmul shape (MxKxN): 6x13x3
--- TEST[matmul_accumulate_6x13xbf16_times_13x3xbf16_into_6x3xf32_6_13_3_acc_13] ---
Matmul shape (MxKxN): 6x13x3
corrupted size vs. prev_size
```

Do those look related to this PR?

@ScottTodd
Member

Also Android test failure: https://github.com/iree-org/iree/actions/runs/9273840024/job/25515648799#step:6:579

```
+ adb shell 'cd /data/local/tmp && LD_LIBRARY_PATH=/data/local/tmp TEST_TMPDIR=/data/local/tmp/iree/tests/e2e/linalg/check_llvm-cpu_local-task_narrow_n_matmuls.mlir/test_tmpdir /data/local/tmp/iree/tests/e2e/linalg/check_llvm-cpu_local-task_narrow_n_matmuls.mlir/iree-check-module --module=/data/local/tmp/iree/tests/e2e/linalg/check_llvm-cpu_local-task_narrow_n_matmuls.mlir/check_llvm-cpu_local-task_narrow_n_matmuls.mlir_module.vmfb --device=local-task'
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from module
[ RUN      ] module.matvec
[       OK ] module.matvec (4 ms)
[ RUN      ] module.batch_matvec
[       OK ] module.batch_matvec (3 ms)
[ RUN      ] module.matmul_narrow_n_1
[       OK ] module.matmul_narrow_n_1 (3 ms)
[ RUN      ] module.batch_matmul_narrow_n_1
[       OK ] module.batch_matmul_narrow_n_1 (2 ms)
[ RUN      ] module.matmul_narrow_n_2
[       OK ] module.matmul_narrow_n_2 (3 ms)
[ RUN      ] module.batch_matmul_narrow_n_2
work/runtime/src/iree/modules/check/module.cc:372: Failure
Failed
Expected equality of these values. Contents does not match.
  lhs:
    2x3x2xi32=[[0 11][7 -1][-17 14]][[-11 -1][-29 -15][-24 -18]]
  rhs:
    2x3x2xi32=[[0 0][0 0][0 0]][[-11 -1][-29 -15][-24 -18]]

[  FAILED  ] module.batch_matmul_narrow_n_2, where GetParam() = 5 (2 ms)
[----------] 6 tests from module (21 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (21 ms total)
[  PASSED  ] 5 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] module.batch_matmul_narrow_n_2, where GetParam() = 5

 1 FAILED TEST
Test failed
OK
```

@bjacob
Contributor

bjacob commented May 28, 2024

Yes, looks related. Oh well, let's revert.

@bjacob
Contributor

bjacob commented May 28, 2024

@lialan, can you reproduce locally on your Mac (Arm architecture) with AddressSanitizer?

```
cmake -DIREE_ENABLE_ASAN=ON
```

@bjacob
Contributor

bjacob commented May 28, 2024

The 2nd test pasted above by Scott looks like the best one to start debugging from.

```
ctest -R narrow_n_matmul
```

bjacob added a commit that referenced this pull request May 28, 2024
@bjacob
Contributor

bjacob commented May 30, 2024

(EDIT: see 2 comments down; I didn't realize that this is intermittent. I did reproduce on Mac!)

This does not reproduce on Mac.
@lialan was able to reproduce on arm64 linux; seems Linux-specific (including Android per above).

@bjacob
Contributor

bjacob commented May 30, 2024

(EDIT: see next comment; I didn't realize that this is intermittent. I did reproduce on Mac!)

Given that

  • This is Linux-specific and does not reproduce on Mac.
  • @lialan reproduced on Linux, including in an ASan build, where ASan did not flag an error.
  • The failure above (#17446 (comment)) seems to be about a double free, but when @lialan ran this with AddressSanitizer it didn't report one, and since this is ordinary cross-platform, cross-architecture code, a double free should have been caught elsewhere.
  • The failure above (#17446 (comment)) shows a numerical failure pattern only in batch_matmul_narrow_n_2, where the first batch is incorrectly all zeros and the second batch is correct, which really doesn't seem to square with what this PR is doing. A bug in this PR should have produced all-wrong values, not all-correct values except for zeros in one batch.

I am starting to think that this looks like some kind of platform bug in linux/arm64, android/arm64, even though I can't tell where.

=> @lialan , I think we can go forward with disabling this test or marking it as failing on linux/arm64, android/arm64.

@ScottTodd FYI / WDYT.

@bjacob
Contributor

bjacob commented May 30, 2024

OK, thanks @lialan for noticing that this is intermittent. Here on Mac/ASan/arm64 it happens about once every 4 attempts on average. It is not flagged as an ASan error. The failure mode is exactly as in #17446 (comment).

@bjacob
Contributor

bjacob commented May 30, 2024

Update (thanks @lialan for the suggestion): we tried single-threaded and it doesn't reproduce in that case (`--task_topology_max_group_count=1`). So possibly a threading bug, which would line up well with it being architecture-dependent (arm vs x86). Next, trying TSan.

@bjacob
Contributor

bjacob commented May 30, 2024

Discussion continuing on #17525.

gglangg pushed a commit to gglangg/iree that referenced this pull request Jun 4, 2024
gglangg pushed a commit to gglangg/iree that referenced this pull request Jun 4, 2024
gglangg pushed a commit to gglangg/iree that referenced this pull request Jun 4, 2024
gglangg pushed a commit to gglangg/iree that referenced this pull request Jun 4, 2024
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Jun 5, 2024
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Jun 5, 2024
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
Reverts iree-org#17446

Reason: postsubmit failures on arm64,
iree-org#17446 (comment)
Signed-off-by: Lubo Litchev <lubol@google.com>