
Data tiling: transpose narrow-N into narrow-M #17446

Merged
merged 1 commit into iree-org:main from the transpose_rebase branch
May 28, 2024

Conversation

lialan
Contributor

@lialan lialan commented May 20, 2024

(This is a rebasing PR for #16890 )

This is a generic idea in the design of matrix multiplication implementations: the M and N dimensions play symmetrical roles, so there is an opportunity to halve the problem space by transposition. The immediate motivation is ukernels: in preparation for this, we have chosen to implement narrow ukernels only for the narrow-M cases, not narrow-N. This is why this PR yields a 5%-9% e2e speedup on multiple ML models with ukernels (and a > 2x speedup on matvec microbenchmarks).

The idea should be beneficial beyond ukernels though:

  • With codegen (outside of ukernels), inner unit dimensions have often caused codegen to fall off of good vectorization cases. This transposition moves unit or generally smaller static dimensions to the outer dimensions, which will help with that.
  • When we get to serious distribution tiling (Proper distribution-tiling for mmt4d #16410), the reduction of generality will again greatly help.

This transposition is made easier by (and was all along part of the idea in) the RHS-transposition in `mmt4d` (the `t` in `mmt4d`), as generally with matrix multiplication

```
B * Transpose(A) == Transpose( A * Transpose(B) )
```

so in `mmt4d` terms

```
mmt4d(B, A) == Transpose(mmt4d(A, B))
```
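
For reference, this identity is ordinary matrix algebra and can be checked entrywise (this is not part of the PR itself):

```
(B * Transpose(A))[i][j]          == sum_k B[i][k] * A[j][k]
Transpose(A * Transpose(B))[i][j] == (A * Transpose(B))[j][i] == sum_k A[j][k] * B[i][k]
```

Both right-hand sides are the same sum, so the two expressions are equal.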

As `pack` and `unpack` already have enough generality to perform these transpositions, we just directly generate the right transposing `pack` and `unpack` ops. An earlier plan was to generate `linalg.transpose` and rely on a later folding pattern, but it turned out to just be simpler to directly generate the already-transposed `pack`, `unpack`.
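
To make the narrow-N to narrow-M rewrite concrete at the plain-matrix level, here is a minimal NumPy sketch. It is an illustration only, not IREE code: it ignores the mmt4d tiling and the pack/unpack ops entirely, and the shapes are made-up values.

```python
# Plain-NumPy illustration: a narrow-N matmul C = A @ B can be computed as a
# narrow-M matmul on the swapped, transposed operands, then transposing the
# small result back.
import numpy as np

M, K, N = 1024, 512, 2  # narrow-N: N is tiny relative to M
A = np.random.rand(M, K)
B = np.random.rand(K, N)

C_direct = A @ B               # narrow-N matmul: result is 1024x2
C_swapped = (B.T @ A.T).T      # narrow-M matmul (2x1024 result), transposed back

np.testing.assert_allclose(C_direct, C_swapped)
```

In the actual pipeline, the operand swap and the final transpose are not materialized as separate ops; they are folded into the transposing pack/unpack ops described above.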

A legitimate question was: should this transposition be implemented at `SetEncoding` instead of at `MaterializeEncoding`? That would have been simpler in some ways, but:

  • The benefit of the transposition depends on the backend, so it doesn't belong in Flow.
  • SetEncoding must be reversible in case the back-end doesn't want to do data-tiling. The transposition would be difficult to revert, and generally confusing in settings where it's not wanted.
  • The above mmt4d-specific trait simplifying the transposition only helps since at MaterializeEncoding we know we are generating a mmt4d. We couldn't so easily rely on that in SetEncoding.
  • Before MaterializeEncoding we would have to handle `linalg.generic`, not just named matmul ops.

Co-authored-by: Benoit Jacob <jacob.benoit.1@gmail.com>

benchmark-extra: x86_64-dt-only, android-cpu-dt-only

@lialan lialan requested a review from hanhanW as a code owner May 20, 2024 21:13

google-cla bot commented May 20, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@lialan
Contributor Author

lialan commented May 20, 2024

This is a rebase PR for #16890

Credit goes to @bjacob

@hanhanW hanhanW requested review from bjacob and Max191 May 20, 2024 23:41
@lialan lialan force-pushed the transpose_rebase branch from 940fbc2 to 6ae5e61 on May 21, 2024 15:45
@lialan lialan requested a review from pashu123 May 22, 2024 15:17
@lialan lialan force-pushed the transpose_rebase branch 2 times, most recently from 6127e79 to 488f5f3 on May 27, 2024 20:44
@lialan lialan added the benchmarks:x86_64 (Run default x86_64 benchmarks) and benchmarks:android-cpu (Run default Android CPU benchmarks) labels May 27, 2024

github-actions bot commented May 27, 2024

Abbreviated Benchmark Summary

@ commit 8016c0ff2060c5f5f224560f1ca6fbc90d749a9f (vs. base 46c6bf5660022068cb1078bd7eeae527b13b30c2)

Data-Tiling Comparison Table

Name No-DT (baseline) DT-Only DT-UK
BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 229.197 (1.0X) 133.463 (1.7X) 105.904 (2.2X)
BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 758.062 (1.0X) 268.158 (2.8X) 219.101 (3.5X)
DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 32.032 (1.0X) 37.059 (0.9X) 29.964 (1.1X)
DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 6.949 (1.0X) 9.307 (0.7X) 8.528 (0.8X)
EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 275.107 (1.0X) 261.033 (1.1X) 229.364 (1.2X)
EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 36.064 (1.0X) 37.462 (1.0X) 34.588 (1.0X)
EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 27.017 (1.0X) 51.712 (0.5X) 13.083 (2.1X)
EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 5.798 (1.0X) 10.967 (0.5X) 5.005 (1.2X)
GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 69.875 (1.0X) 38.005 (1.8X) 39.913 (1.8X)
GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 9.150 (1.0X) 9.443 (1.0X) 8.817 (1.0X)
GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 89.836 (1.0X) 42.272 (2.1X) 41.720 (2.2X)
GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.163 (1.0X) 8.469 (1.3X) 8.627 (1.3X)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 79.132 (1.0X) 78.715 (1.0X) 56.941 (1.4X)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 12.369 (1.0X) 14.716 (0.8X) 13.078 (0.9X)
MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 179.797 (1.0X) 254.408 (0.7X) 185.833 (1.0X)
MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.292 (1.0X) 66.351 (0.5X) 62.180 (0.6X)
MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 181.140 (1.0X) 254.875 (0.7X) 189.859 (1.0X)
MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.524 (1.0X) 66.259 (0.5X) 62.310 (0.6X)
MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 482.240 (1.0X) 1063.694 (0.5X) 206.554 (2.3X)
MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 66.341 (1.0X) 132.840 (0.5X) 62.761 (1.1X)
MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 25.373 (1.0X) 23.201 (1.1X) 17.853 (1.4X)
MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 4.819 (1.0X) 5.394 (0.9X) 4.542 (1.1X)
MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.945 (1.0X) 14.682 (0.8X) 11.304 (1.1X)
MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 3.805 (1.0X) 5.287 (0.7X) 4.937 (0.8X)
MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 21.627 (1.0X) 42.370 (0.5X) 11.798 (1.8X)
MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 5.904 (1.0X) 9.577 (0.6X) 5.422 (1.1X)
MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 2.774 (1.0X) 3.328 (0.8X) 2.689 (1.0X)
MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 2.885 (1.0X) 3.471 (0.8X) 2.847 (1.0X)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.155 (1.0X) 40.624 (0.8X) 31.326 (1.1X)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 8.636 (1.0X) 11.055 (0.8X) 9.867 (0.9X)
PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 0.747 (1.0X) 1.314 (0.6X) 0.539 (1.4X)
PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 0.819 (1.0X) 1.402 (0.6X) 0.603 (1.4X)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 17.648 (1.0X) 24.436 (0.7X) 18.891 (0.9X)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 4.145 (1.0X) 5.982 (0.7X) 5.145 (0.8X)
matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 7.580 (1.0X) 7.603 (1.0X) 7.620 (1.0X)
DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 48.345 (1.0X) 83.322 (0.6X) 42.908 (1.1X)
DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 50.409 (1.0X) 83.768 (0.6X) 44.500 (1.1X)
DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 30.307 (1.0X) 48.640 (0.6X) 27.585 (1.1X)
GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 93.054 (1.0X) 21.963 (4.2X) 21.390 (4.4X)
GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 93.998 (1.0X) 21.724 (4.3X) 21.796 (4.3X)
GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 52.521 (1.0X) 21.781 (2.4X) 21.871 (2.4X)
GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 133.775 (1.0X) 27.670 (4.8X) 27.923 (4.8X)
GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 136.103 (1.0X) 29.233 (4.7X) 29.409 (4.6X)
GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 74.467 (1.0X) 26.616 (2.8X) 26.739 (2.8X)
MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 706.440 (1.0X) 450.671 (1.6X) 352.901 (2.0X)
MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 706.218 (1.0X) 447.171 (1.6X) 356.122 (2.0X)
MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 400.103 (1.0X) 271.364 (1.5X) 216.024 (1.9X)
MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 1044.682 (1.0X) 628.401 (1.7X) 239.490 (4.4X)
MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 1044.733 (1.0X) 622.467 (1.7X) 241.420 (4.3X)
MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 541.713 (1.0X) 340.489 (1.6X) 143.666 (3.8X)
Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 2096.865 (1.0X) 1084.293 (1.9X) 299.727 (7.0X)
Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 2097.375 (1.0X) 1084.823 (1.9X) 303.380 (6.9X)
Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 1122.986 (1.0X) 601.168 (1.9X) 181.771 (6.2X)
matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 12.052 (1.0X) 10.053 (1.2X) 1.319 (9.1X)

Improved Latencies 🎉

Benchmark Name Average Latency (ms) Median Latency (ms) Latency Standard Deviation (ms)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 12.369 (vs. 13.820, 10.50%↓) 12.198 0.299
GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,dt-only] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] 29.233 (vs. 31.890, 8.33%↓) 29.432 0.729
DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,dt-uk] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] 42.908 (vs. 46.790, 8.30%↓) 42.996 0.304

[Top 3 out of 22 results shown]

Improved Total Dispatch Sizes 🎉

Benchmark Name Total Dispatch Size (bytes)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,dt-uk,compile-stats] 47104 (vs. 63264, 25.54%↓)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only,compile-stats] 77824 (vs. 83712, 7.03%↓)
BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only,compile-stats] 73728 (vs. 77776, 5.20%↓)

Improved Stream IR Dispatch Count (# of cmd.dispatch ops) 🎉

Benchmark Name Stream IR Dispatch Count (# of cmd.dispatch ops)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only,compile-stats] 222 (vs. 223, 0.45%↓)

For more information:

Source Workflow Run

@bjacob
Contributor

bjacob commented May 28, 2024

@lialan , this diff fixes the issue we were seeing on riscv CI, really an issue about properly handling the case where encoding materialization fails. The issue was that we meant to return the original tensor type in that case, but we had already mutated tensorType in-place overwriting it with the transposed type.

```diff
diff --git a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
index 769a9c9d15..83874316f8 100644
--- a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
@@ -97,18 +97,19 @@ static RankedTensorType transposeIfNarrowNResult(RankedTensorType tensorType) {
 static RankedTensorType
 getMaterializedType(RankedTensorType tensorType,
                     MaterializeEncodingFn materializeEncodingFn) {
-  tensorType = transposeIfNarrowNResult(tensorType);
+  RankedTensorType maybeTransposedTensorType =
+      transposeIfNarrowNResult(tensorType);
   FailureOr<MaterializeEncodingInfo> materializeEncodingInfo =
-      materializeEncodingFn(tensorType);
+      materializeEncodingFn(maybeTransposedTensorType);
   if (failed(materializeEncodingInfo)) {
     return dropEncoding(tensorType);
   }
-  return cast<RankedTensorType>(
-      tensor::PackOp::inferPackedType(getOriginalTypeWithEncoding(tensorType)
-                                          .clone(tensorType.getElementType()),
-                                      materializeEncodingInfo->innerTileSizes,
-                                      materializeEncodingInfo->innerDimsPos,
-                                      materializeEncodingInfo->outerDimsPerm));
+  return cast<RankedTensorType>(tensor::PackOp::inferPackedType(
+      getOriginalTypeWithEncoding(maybeTransposedTensorType)
+          .clone(tensorType.getElementType()),
+      materializeEncodingInfo->innerTileSizes,
+      materializeEncodingInfo->innerDimsPos,
+      materializeEncodingInfo->outerDimsPerm));
 }
```

@lialan lialan force-pushed the transpose_rebase branch from 34ac239 to 4a7254b on May 28, 2024 16:40
@lialan lialan force-pushed the transpose_rebase branch from 4a7254b to 95a138e on May 28, 2024 16:45
@ScottTodd ScottTodd merged commit 16bdaa9 into iree-org:main May 28, 2024
65 of 67 checks passed
@lialan lialan deleted the transpose_rebase branch May 28, 2024 17:48
@ScottTodd
Member

I see some matmul test failures on arm64 (only tested on postsubmit by default, opt-in on presubmit) that started after this was merged: https://github.com/iree-org/iree/actions/runs/9273840024/job/25514820313#step:5:2008

```
--- TEST[matmul_accumulate_DYNxDYNxbf16_times_DYNxDYNxbf16_into_DYNxDYNxf32_6_13_3_acc_12] ---
Matmul shape (MxKxN): 6x13x3
--- TEST[matmul_accumulate_6x13xbf16_times_13x3xbf16_into_6x3xf32_6_13_3_acc_13] ---
Matmul shape (MxKxN): 6x13x3
free(): invalid pointer

1015/1472 Test #1054: iree/tests/e2e/matmul/e2e_matmul_cpu_experimental_dt_bf16_f32_small_llvm-cpu_local-task ......................Subprocess aborted***Exception:   0.15 sec
--- TEST[matmul_accumulate_DYNxDYNxbf16_times_DYNxDYNxbf16_into_DYNxDYNxf32_1_1_1_acc_0] ---
Matmul shape (MxKxN): 1x1x1
...
--- TEST[matmul_accumulate_DYNxDYNxbf16_times_DYNxDYNxbf16_into_DYNxDYNxf32_6_13_3_acc_12] ---
Matmul shape (MxKxN): 6x13x3
--- TEST[matmul_accumulate_6x13xbf16_times_13x3xbf16_into_6x3xf32_6_13_3_acc_13] ---
Matmul shape (MxKxN): 6x13x3
corrupted size vs. prev_size
```

Do those look related to this PR?

@ScottTodd
Member

Also Android test failure: https://github.com/iree-org/iree/actions/runs/9273840024/job/25515648799#step:6:579

```
+ adb shell 'cd /data/local/tmp && LD_LIBRARY_PATH=/data/local/tmp TEST_TMPDIR=/data/local/tmp/iree/tests/e2e/linalg/check_llvm-cpu_local-task_narrow_n_matmuls.mlir/test_tmpdir /data/local/tmp/iree/tests/e2e/linalg/check_llvm-cpu_local-task_narrow_n_matmuls.mlir/iree-check-module --module=/data/local/tmp/iree/tests/e2e/linalg/check_llvm-cpu_local-task_narrow_n_matmuls.mlir/check_llvm-cpu_local-task_narrow_n_matmuls.mlir_module.vmfb --device=local-task'
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from module
[ RUN      ] module.matvec
[       OK ] module.matvec (4 ms)
[ RUN      ] module.batch_matvec
[       OK ] module.batch_matvec (3 ms)
[ RUN      ] module.matmul_narrow_n_1
[       OK ] module.matmul_narrow_n_1 (3 ms)
[ RUN      ] module.batch_matmul_narrow_n_1
[       OK ] module.batch_matmul_narrow_n_1 (2 ms)
[ RUN      ] module.matmul_narrow_n_2
[       OK ] module.matmul_narrow_n_2 (3 ms)
[ RUN      ] module.batch_matmul_narrow_n_2
work/runtime/src/iree/modules/check/module.cc:372: Failure
Failed
Expected equality of these values. Contents does not match.
  lhs:
    2x3x2xi32=[[0 11][7 -1][-17 14]][[-11 -1][-29 -15][-24 -18]]
  rhs:
    2x3x2xi32=[[0 0][0 0][0 0]][[-11 -1][-29 -15][-24 -18]]

[  FAILED  ] module.batch_matmul_narrow_n_2, where GetParam() = 5 (2 ms)
[----------] 6 tests from module (21 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (21 ms total)
[  PASSED  ] 5 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] module.batch_matmul_narrow_n_2, where GetParam() = 5

 1 FAILED TEST
Test failed
OK
```

@bjacob
Contributor

bjacob commented May 28, 2024

Yes, looks related. Oh well, let's revert.

@bjacob
Contributor

bjacob commented May 28, 2024

@lialan, can you reproduce locally on your Mac (Arm architecture) with AddressSanitizer?

```
cmake -DIREE_ENABLE_ASAN=ON
```

@bjacob
Contributor

bjacob commented May 28, 2024

The 2nd test pasted above by Scott looks like the best one to start debugging from.

```
ctest -R narrow_n_matmul
```

bjacob added a commit that referenced this pull request May 28, 2024
@bjacob
Contributor

bjacob commented May 30, 2024

(EDIT: see 2 comments down; I didn't realize that this is intermittent. I did reproduce on Mac!)

This does not reproduce on Mac.
@lialan was able to reproduce on arm64 linux; seems Linux-specific (including Android per above).

@bjacob
Contributor

bjacob commented May 30, 2024

(EDIT: see next comment; I didn't realize that this is intermittent. I did reproduce on Mac!)

Given that

  • This is Linux-specific and does not reproduce on Mac.
  • @lialan reproduced on Linux, including in an ASan build, where ASan did not flag an error.
  • The failure above (#17446 (comment)) seems to be about a double free, but when @lialan ran this with AddressSanitizer it didn't report one, and since this is ordinary cross-platform, cross-architecture code, a double free should have been caught elsewhere.
  • The failure above (#17446 (comment)) shows a numerical failure pattern only in batch_matmul_narrow_n_2, where the first batch is incorrectly all zeros and the second batch is correct, which really doesn't seem to square with what this PR is doing. A bug in this PR should have produced all-wrong values, not all-correct values except for zeros in one batch.

I am starting to think that this looks like some kind of platform bug in linux/arm64, android/arm64, even though I can't tell where.

=> @lialan , I think we can go forward with disabling this test or marking it as failing on linux/arm64, android/arm64.

@ScottTodd FYI / WDYT.

@bjacob
Contributor

bjacob commented May 30, 2024

OK, thanks @lialan for noticing that this is intermittent. Here on Mac/ASan/arm64 it happens about once every 4 attempts on average. It is not flagged as an ASan error. The failure mode is exactly as in #17446 (comment).

@bjacob
Contributor

bjacob commented May 30, 2024

Update (thanks @lialan for the suggestion): we tried single-threaded and it doesn't reproduce in that case (`--task_topology_max_group_count=1`). So possibly a threading bug, which would line up well with it being architecture-dependent (arm vs x86). Next, trying TSan.

@bjacob
Contributor

bjacob commented May 30, 2024

Discussion continuing on #17525.

gglangg pushed a commit to gglangg/iree that referenced this pull request Jun 4, 2024
gglangg pushed a commit to gglangg/iree that referenced this pull request Jun 4, 2024
gglangg pushed a commit to gglangg/iree that referenced this pull request Jun 4, 2024
gglangg pushed a commit to gglangg/iree that referenced this pull request Jun 4, 2024
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Jun 5, 2024
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Jun 5, 2024
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
Reverts iree-org#17446

Reason: postsubmit failures on arm64,
iree-org#17446 (comment)
Signed-off-by: Lubo Litchev <lubol@google.com>