Data race between worker threads sharing a buffer #17525
Interpreting the TSan log: two worker threads, Thread T57 'iree-worker-0' and Thread T64 'iree-worker-7', are accessing a shared memory buffer without adequate ordering. T57 is trying to map the buffer, which, thanks to the IREE_HAL_MEMORY_ACCESS_DISCARD flag plus assertions being enabled, is a memory write, which helps make this reproduce consistently in TSan:
Note: the buffer_heap.c code in the above frame is iree/runtime/src/iree/hal/buffer_heap.c, lines 268 to 272 (at 4fddac0).
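(For context on why the map counts as a write: a minimal sketch, not the actual buffer_heap.c code, of the kind of debug-only fill a DISCARD mapping can trigger when assertions are enabled; the fill value and function name are assumptions for illustration.)

#include <cstddef>
#include <cstring>

// Illustration only: mapping a range with IREE_HAL_MEMORY_ACCESS_DISCARD while
// assertions are enabled lets the runtime scribble over the discarded bytes so
// stale contents are never relied upon. That scribble is an ordinary memory
// write (a memset), which is what TSan pairs against the dispatch's earlier
// write to the same bytes.
static void debug_fill_discarded_range(void* mapped_ptr, std::size_t length) {
  std::memset(mapped_ptr, 0xCD, length);  // poison value chosen arbitrarily here
}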
T64 had previously written to the buffer inside of dispatch function code:
A question for @benvanik - when two worker threads cooperate on a buffer, where is the synchronization code that is meant to prevent a data race? Is there code meant to ensure that no two tasks sharing a buffer can be scheduled concurrently?
There's nothing that implicitly synchronizes access. If two concurrently executing operations access the same locations in memory they must use atomics to do their work. I don't know what the above is, but that looks like a miscompile in codegen or a TSAN misidentification: a command buffer copy and a dispatch should not be accessing the same memory concurrently. (or benoit will school me on some fun memory system detail again and we'll find something juicy to fix ;)
That's what I'm asking about: when you say "a command buffer copy and a dispatch should not be accessing the same memory concurrently", what specific piece of code implements that "should not"?
The compiler must insert barriers - the lack of barriers implies no synchronization is required. The barriers show as structured statements in the stream IR, with any sibling operations within a concurrent region allowed to execute in parallel.
Oh I see, it's in the compiler-generated code. I was mistakenly looking for barriers in the runtime. OK, so that means we should be looking at IR. Can you suggest where to look? @lialan can you edit the source of that
(but an ir-after-all dump is best - then you can search from the bottom up to the last |
(message crossed your edit --- OK, let's do both @lialan ) . ah ok. that's going to be much lighter than |
@benvanik @bjacob Attaching the dumps: This one is just
lol your print ir after all dump is truncated because TSAN is catching a signal:
that may be the root cause of the issue here and you should check on that. The IR as it pertains to runtime scheduling seems fine to me.
Since there's no reuse here, there's really not anything within the two concurrent regions that could have TSAN conflicts unless the dispatches are incorrectly touching memory. For example, if the unset_encoding is touching %arg4 outside of the range of [0,48) bytes then it will stomp on the copy touching [64,112). The address at fault in the TSAN report is 256B into the allocation, but we slab allocate, so that doesn't mean 256B into the buffer (as there's the buffer header in there). The HAL IR or a runtime
(I do suspect that assertion is guarding correct behavior, and we really should not be relying on assertions for things like that - if analysis fails then we need to error out properly) /cc @hanhanW
Darn, thanks, that is indeed related to the PR that triggers this. Looking into this.
it's just a misnomer - we're encoding the input accumulator here; encodings don't care about input vs output though, so it's really just the accumulator encoding, which got named the result encoding, which is fair when we're talking about the output accumulator, which is the result, but is confusing when talking about the input accumulator.
I see, the assert failure reproduces only with
It's actually a red herring / unrelated issue. Fix:
diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp b/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
index c1948b9418..64e9ad3216 100644
--- a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
@@ -334,7 +334,7 @@ static TileMxNxK
chooseMatmulTile(ArrayRef<TileMxNxK> enumeratedTiles, int64_t matmulNarrowM,
int64_t matmulNarrowN,
ArrayRef<int64_t> hostDefinedUpperBound = {}) {
- assert((hostDefinedUpperBound.empty() || hostDefinedUpperBound.size() == 3) &&
+ assert((hostDefinedUpperBound.empty() || hostDefinedUpperBound.size() >= 3) &&
"expected hostDefinedUpperBound is empty or has upper bound for {M, "
"N, K}");
// Handle narrow-N by transposing to reduce to narrow-M. Note: the
diff --git a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
index 83874316f8..cfb11c36f0 100644
--- a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
@@ -65,10 +65,10 @@ static RankedTensorType transposeIfNarrowNResult(RankedTensorType tensorType) {
// auto newRoundDimsTo = encoding.getRoundDimsToArray();
SmallVector<int64_t> newRoundDimsTo(encoding.getRoundDimsToArray());
- assert(newRoundDimsTo.size() == 0 || newRoundDimsTo.size() == 3);
- if (newRoundDimsTo.size() != 0)
- std::swap(newRoundDimsTo[0], newRoundDimsTo[1]);
-
+ assert(newRoundDimsTo.size() == 0 || newRoundDimsTo.size() >= 3);
+ if (newRoundDimsTo.size() != 0) {
+ std::swap(newRoundDimsTo[newRoundDimsTo.size() - 3], newRoundDimsTo[newRoundDimsTo.size() - 2]);
+ }
auto context = tensorType.getContext();
AffineMap permutation = AffineMap::getPermutationMap(permIndices, context);
  for (auto &map : maps) {
However: that is a lot of gratuitous complexity here. Here is the print-ir-after-all regenerated with that fix so that we get past this assertion:
@benvanik, for this to manifest only on arm64, this must be almost correct and only incorrect in some lower-level detail. So it's not surprising that looking at Stream above, everything was OK. I'm curious how these things get lowered down to synchronization primitives later on (HAL?) - hopefully something we can see now with the print-ir-after-all in the previous comment (getting past that assertion). Can you pinpoint the ops implementing both sides of the synchronization that should be happening, and point to their C implementations, so we can start to reason about what they are doing?
There's no difference across targets on the runtime side. If this is only showing with arm64 you'll definitely need to look at your dispatches. Perhaps it's something that TSan on other targets is not catching but arm64 is, though it's doubtful there's target-specific functionality in TSan - maybe it's target-specific rules that only apply there. AFAIK the only variances in sanitizers have been around aligned/unaligned accesses and such, but sanitizers work in mysterious ways (sometimes AFAICT not even known to the creators :). Nearly every time there's been an issue with clobbered data or races, it's been in the dispatches. The only thing the host-side compiler/runtime does is provide appropriate buffer ranges to the dispatches, and here they look correct. Assuming it's the dispatch_4 unset_encoding overlapping with the copy, the dispatch_4 should only be writing bytes 0 to 48.
(if you can tell me 100% that the dispatch is never possibly accessing a range the copy is touching, and that the copy is writing to the correct range, we can talk about how tasks are scheduled - I don't really think it's useful to go down that route until we're confident, though)
Well, it is true that this issue occurred on the ARM64 CI. However, given that this is intermittent and the recurrence rate is not 100%, we cannot rule out the possibility that it can be reproduced on other platforms as well. @bjacob, given the same configuration, will we reach the same or a similar execution path on x86? Worth trying out?
Sure, worth retrying. The last time I tried on x86, I didn't retry enough times to be 100% sure. I have pushed my fixes for the above assertion failure on top of your branch: https://github.com/iree-org/iree/tree/bjacob-transpose-rebase-repro Also sharing the small reproducer with just the failing testcase: https://gist.github.com/bjacob/f248a132c0a9af929047b898936fa8c0 And my command line to build and run this with retrying until failure:
Here are my observations on Mac/arm64:
Also, since we are considering the possibility of dispatches writing out of bounds, we should check ASan. |
Here is the trace from |
Tried >10000 runs with @bjacob's branch and the script; 0 failures on x86.
From the trace, we see that the data race is between a dispatch (the unset_encoding, i.e. dispatch 4) and the buffer copy. Notice that in the error message, the erroneous side is the RHS --- which is the golden results:
So dispatch 4, which does the unset_encoding, is the prime suspect. Looking at the trace, this makes sense, as both dispatch 4 and the copy_buffer are scheduled concurrently, writing to the same destination buffer. They are writing at different offsets that should be safe, but everything is consistent with a bug in dispatch 4 writing zeros past the end of its normal output span.
Here we see that both are writing to the same buffer:
So this looks like dispatch 4 is writing zeros past the end of its normal output span.
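To put numbers on that: the copy writes [64, 112) and dispatch 4's nominal output is [0, 48) (ranges quoted earlier in the thread), which are disjoint; a quick sketch of the arithmetic (helper names are mine) shows the dispatch has to overshoot by at least 16 bytes for the two to collide:

#include <cstdint>
#include <cstdio>

struct ByteRange { int64_t begin, end; };  // half-open [begin, end), in bytes

int main() {
  ByteRange copy_target = {64, 112};  // buffer copy destination (48 bytes)
  ByteRange dispatch_out = {0, 48};   // dispatch 4 (unset_encoding) nominal output
  // Disjoint as scheduled, so no synchronization between the two is required.
  // For the reported race to exist, dispatch 4 must write at least this far
  // past the end of its own output:
  int64_t min_overrun = copy_target.begin - dispatch_out.end;  // 64 - 48 = 16 bytes
  std::printf("minimum overrun needed for a collision: %lld bytes\n",
              (long long)min_overrun);
  return 0;
}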
(So this is exactly what Ben was saying, and our next step is to look closely at arm64 codegen.)
This is the matmul dispatch, the barrier, and the concurrent unset_encoding dispatch + copy:
It reads like the HAL IR just with the values available, so we can see for example the unset_encoding takes buffer %r9 range [0, 128) as its binding and the copy buffer target is %r9 range [64, (64+48)). In this case we had all the values in the IR and there wasn't a risk of that being wrong, but it's a useful debugging tool when dynamic values are in play :) ASAN won't show issues by default because it's likely not writing out of bounds of the buffer but instead over a slice of the buffer it shouldn't be. I'll make it a flag (like I've meant to for years!) today, but swapping packStaticSlicesGreedily to packSlicesWithNoAliasing here is one useful knob: iree/compiler/src/iree/compiler/Dialect/Stream/Transforms/LayoutSlices.cpp lines 260 to 266 (at 5145f7e)
actually, @bjacob you deleted that function :P (d991dc9) - well, I'll deal with that later I guess. In this case I doubt it will change things as nothing is aliasing, but it's another useful tool for explorations like this. While writing this I saw your posts above, so none of this is likely useful, but thought I'd get it out of my head :)
Here is the disassembly of the actual arm64 module code when it reproduced, for that dispatch. Assuming that nothing outside of the target library call function would be writing zeros to buffers, the faulty zero writes have to be done by this code?
0000000000003898 <__batch_matmul_narrow_n_2_dispatch_4_unpack_i32>:
3898: b9400048 ldr w8, [x2]
389c: 7100051f cmp w8, #0x1
38a0: 540003a8 b.hi 0x3914 <__batch_matmul_narrow_n_2_dispatch_4_unpack_i32+0x7c>
38a4: a9bf7bfd stp x29, x30, [sp, #-0x10]!
38a8: 910003fd mov x29, sp
38ac: f940102a ldr x10, [x1, #0x20]
38b0: b9400c29 ldr w9, [x1, #0xc]
38b4: 5280030b mov w11, #0x18 ; =24
38b8: a940294c ldp x12, x10, [x10]
38bc: 9bab290d umaddl x13, w8, w11, x10
38c0: 8b09052b add x11, x9, x9, lsl #1
38c4: d37ae52a lsl x10, x9, #6
38c8: 8b08198c add x12, x12, x8, lsl #6
38cc: d37df16b lsl x11, x11, #3
38d0: 910081ad add x13, x13, #0x20
38d4: 9102818c add x12, x12, #0xa0
38d8: ad7f8182 ldp q2, q0, [x12, #-0x10]
38dc: 8b090108 add x8, x8, x9
38e0: 3cde0181 ldur q1, [x12, #-0x20]
38e4: 3dc00583 ldr q3, [x12, #0x10]
38e8: f100091f cmp x8, #0x2
38ec: 8b0a018c add x12, x12, x10
38f0: 4e803824 zip1.4s v4, v1, v0
38f4: 4e807820 zip2.4s v0, v1, v0
38f8: 4e833841 zip1.4s v1, v2, v3
38fc: 4e837842 zip2.4s v2, v2, v3
3900: ad3f01a4 stp q4, q0, [x13, #-0x20]
3904: ad0009a1 stp q1, q2, [x13]
3908: 8b0b01ad add x13, x13, x11
390c: 54fffe63 b.lo 0x38d8 <__batch_matmul_narrow_n_2_dispatch_4_unpack_i32+0x40>
3910: a8c17bfd ldp x29, x30, [sp], #0x10
3914: 2a1f03e0 mov w0, wzr
3918: d65f03c0 ret
Given the dispatch is 2x1x1 you know it's called twice and there should be two iree_hal_cmd_dispatch_tile calls. The dispatch functions take 3 args: environment, dispatch state, and workgroup state:
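Roughly, the entry point has this shape (the struct names match the LLVM IR dump later in the thread; the exact field layouts live in the runtime headers and are omitted here, so treat this as an approximation rather than the real declaration):

// Opaque here; the real structs carry push constants, binding_ptrs /
// binding_lengths, workgroup counts, etc.
struct iree_hal_executable_environment_v0_t;
struct iree_hal_executable_dispatch_state_v0_t;
struct iree_hal_executable_workgroup_state_v0_t;

// One call per workgroup invocation, so a 2x1x1 dispatch means the same
// function is entered twice with different workgroup state.
using dispatch_fn_t =
    int (*)(const iree_hal_executable_environment_v0_t* environment,
            const iree_hal_executable_dispatch_state_v0_t* dispatch_state,
            const iree_hal_executable_workgroup_state_v0_t* workgroup_state);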
... I can't immediately find where binding_ptrs[1] gets retrieved (hooray compilers)?
Looks a bit suspicious though for a function expected to write only 48 bytes? I don't know arm64, but that reads like two 256-bit stores at -32 and 0, meaning 64 bytes per loop iteration, and we know the function is dispatched twice? (maybe only one loop each, but that's always more than 48?) (haven't had coffee yet, hopefully that's not all wrong :)
(EDIT - In fact my local patch was not taking effect at all so disregard this) Yes, sorry, that is caused by a local patch that I have in the above-referenced branch that sets the TileMxNxK tile size to 8, 8, 4 to remove one of the ways that things differ based on target details. Funny, this doesn't really change the symptoms, but I'll now try without it / with other values.
So, here is where the specific padding sizes are coming from. This is an i8 x i8 -> i32 matmul on baseline arm64, so our logic in CPUMaterializeEncodingPass returns these possible tile sizes:
Now, the result shape in the source program is 2x3x2 (batch 2, M 3, N 2), so this is the narrow-N==2 case.
Looking at the above TileMxNxK, we have one for M==2, it's TileMxNxK{2, 16, 2}, meaning M=2, N=16, K=2, so we select that one, and in terms of padding our result matrix, that becomes:
So each tile is 2x16xi32 = 128 bytes.
This reads 64 contiguous bytes, so that's half a tile. Our loop should run for 2 iterations to complete the tile.
That is storing 64 contiguous bytes: that is wrong. The destination shape is an incomplete tile, so there needs to be some extract_slice codegen.
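Putting the byte counts from this analysis in one place (values as quoted above and visible in the IR dumps below; the snippet only redoes the arithmetic):

#include <cstdint>
#include <cstdio>

int main() {
  const int64_t elem_bytes = 4;                        // i32
  const int64_t tile_bytes = 2 * 16 * elem_bytes;      // 2x16xi32 result tile = 128 bytes
  const int64_t read_bytes = 64;                       // each read is 64 bytes = half a tile
  const int64_t loop_iters = tile_bytes / read_bytes;  // 2 iterations to consume a tile
  const int64_t dest_bytes = 3 * 2 * elem_bytes;       // unpadded 3x2xi32 output per workgroup = 24 bytes
  const int64_t stored_bytes = 64;                     // what each iteration actually stores
  std::printf("tile=%lldB iters=%lld dest=%lldB stored=%lldB overrun=%lldB\n",
              (long long)tile_bytes, (long long)loop_iters, (long long)dest_bytes,
              (long long)stored_bytes, (long long)(stored_bytes - dest_bytes));
  return 0;
}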
; Function Attrs: <Unknown attribute>
define internal i32 @_batch_matmul_narrow_n_2_dispatch_4_unpack_i32(ptr noalias nonnull align 16 %0, ptr noalias nonnull align 16 %1, ptr noalias nonnull align 16 %2) #0 !dbg !111 {
%4 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8, !dbg !112
%5 = extractvalue %iree_hal_executable_dispatch_state_v0_t %4, 10, !dbg !112
%6 = load ptr, ptr %5, align 8, !dbg !112
%7 = getelementptr i32, ptr %6, i32 32, !dbg !112
%8 = ptrtoint ptr %7 to i64, !dbg !112
%9 = and i64 %8, 63, !dbg !112
%10 = icmp eq i64 %9, 0, !dbg !112
call void @llvm.assume(i1 %10), !dbg !112
%11 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8, !dbg !113
%12 = extractvalue %iree_hal_executable_dispatch_state_v0_t %11, 10, !dbg !113
%13 = getelementptr ptr, ptr %12, i32 1, !dbg !113
%14 = load ptr, ptr %13, align 8, !dbg !113
%15 = ptrtoint ptr %14 to i64, !dbg !113
%16 = and i64 %15, 63, !dbg !113
%17 = icmp eq i64 %16, 0, !dbg !113
call void @llvm.assume(i1 %17), !dbg !113
%18 = load %iree_hal_executable_workgroup_state_v0_t, ptr %2, align 8, !dbg !114
%19 = extractvalue %iree_hal_executable_workgroup_state_v0_t %18, 0, !dbg !114
%20 = zext i32 %19 to i64, !dbg !114
%21 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8, !dbg !114
%22 = extractvalue %iree_hal_executable_dispatch_state_v0_t %21, 4, !dbg !114
%23 = zext i32 %22 to i64, !dbg !114
br label %24, !dbg !115
24: ; preds = %27, %3
%25 = phi i64 [ %77, %27 ], [ %20, %3 ]
%26 = icmp slt i64 %25, 2, !dbg !115
br i1 %26, label %27, label %78, !dbg !115
27: ; preds = %24
%28 = mul i64 %25, 16, !dbg !115
%29 = add i64 %28, 0, !dbg !115
%30 = add i64 %29, 0, !dbg !115
%31 = add i64 %30, 0, !dbg !115
%32 = add i64 %31, 0, !dbg !115
%33 = getelementptr i32, ptr %7, i64 %32, !dbg !115
%34 = load <8 x i32>, ptr %33, align 4, !dbg !115
%35 = add i64 %30, 8, !dbg !115
%36 = add i64 %35, 0, !dbg !115
%37 = getelementptr i32, ptr %7, i64 %36, !dbg !115
%38 = load <8 x i32>, ptr %37, align 4, !dbg !115
%39 = shufflevector <8 x i32> %34, <8 x i32> %34, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !115
%40 = shufflevector <16 x i32> %39, <16 x i32> zeroinitializer, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>, !dbg !115
%41 = shufflevector <8 x i32> %38, <8 x i32> %38, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !115
%42 = shufflevector <16 x i32> %41, <16 x i32> %40, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !115
%43 = shufflevector <16 x i32> %42, <16 x i32> %42, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>, !dbg !115
%44 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 0, i32 1>, !dbg !115
%45 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 2, i32 3>, !dbg !115
%46 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 4, i32 5>, !dbg !115
%47 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 6, i32 7>, !dbg !115
%48 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 8, i32 9>, !dbg !115
%49 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 10, i32 11>, !dbg !115
%50 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 12, i32 13>, !dbg !115
%51 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 14, i32 15>, !dbg !115
%52 = mul i64 %25, 6, !dbg !115
%53 = add i64 %52, 0, !dbg !115
%54 = add i64 %53, 0, !dbg !115
%55 = getelementptr i32, ptr %14, i64 %54, !dbg !115
store <2 x i32> %44, ptr %55, align 4, !dbg !115
%56 = add i64 %52, 2, !dbg !115
%57 = add i64 %56, 0, !dbg !115
%58 = getelementptr i32, ptr %14, i64 %57, !dbg !115
store <2 x i32> %45, ptr %58, align 4, !dbg !115
%59 = add i64 %52, 4, !dbg !115
%60 = add i64 %59, 0, !dbg !115
%61 = getelementptr i32, ptr %14, i64 %60, !dbg !115
store <2 x i32> %46, ptr %61, align 4, !dbg !115
%62 = add i64 %52, 6, !dbg !115
%63 = add i64 %62, 0, !dbg !115
%64 = getelementptr i32, ptr %14, i64 %63, !dbg !115
store <2 x i32> %47, ptr %64, align 4, !dbg !115
%65 = add i64 %52, 8, !dbg !115
%66 = add i64 %65, 0, !dbg !115
%67 = getelementptr i32, ptr %14, i64 %66, !dbg !115
store <2 x i32> %48, ptr %67, align 4, !dbg !115
%68 = add i64 %52, 10, !dbg !115
%69 = add i64 %68, 0, !dbg !115
%70 = getelementptr i32, ptr %14, i64 %69, !dbg !115
store <2 x i32> %49, ptr %70, align 4, !dbg !115
%71 = add i64 %52, 12, !dbg !115
%72 = add i64 %71, 0, !dbg !115
%73 = getelementptr i32, ptr %14, i64 %72, !dbg !115
store <2 x i32> %50, ptr %73, align 4, !dbg !115
%74 = add i64 %52, 14, !dbg !115
%75 = add i64 %74, 0, !dbg !115
%76 = getelementptr i32, ptr %14, i64 %75, !dbg !115
store <2 x i32> %51, ptr %76, align 4, !dbg !115
%77 = add i64 %25, %23, !dbg !115
br label %24, !dbg !115
78: ; preds = %24
ret i32 0, !dbg !116
}
We know the 64 bytes come from the padding. So, in the case of padding up, the allocation of the buffer should take the padding into account as well.
Some key steps from print-ir-after-all. @MaheshRavishankar, @hanhanW, at which step is this going wrong? The wrong behavior we are seeing here is in the
%7 = vector.transfer_write %5, %6[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<3x2xi32>
that is introduced at OptimizeTensorInsertExtractSlices. Does it have extract-slice semantics to correctly avoid writing more than 3x2, or is that inherently ill-formed? Next, same question about the
vector.store %14, %subview_2[%c7, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
that is created by LLVMCPUVectorTransferLowering. If that is still correct (and should be compiled as a no-op because 7 is out of bounds in 3x2), then it's down to an LLVM vector lowering bug, and that might explain why we see this only on arm64?

// -----// IR Dump After GenericVectorization (iree-codegen-generic-vectorization) //----- //
func.func @_batch_matmul_narrow_n_2_dispatch_4_unpack_i32() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%c0_i32 = arith.constant 0 : i32
%c2 = arith.constant 2 : index
%c128 = arith.constant 128 : index
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2x1x1x2x8xi32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
scf.for %arg0 = %workgroup_id_x to %c2 step %workgroup_count_x {
%2 = flow.dispatch.tensor.load %1, offsets = [%arg0, 0, 0], sizes = [1, 3, 2], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>> -> tensor<1x3x2xi32>
%3 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0, 0, 0, 0], sizes = [1, 1, 1, 2, 8], strides = [1, 1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x1x1x2x8xi32>> -> tensor<1x1x1x2x8xi32>
%extracted_slice = tensor.extract_slice %3[0, 0, 0, 0, 0] [1, 1, 1, 2, 8] [1, 1, 1, 1, 1] : tensor<1x1x1x2x8xi32> to tensor<2x8xi32>
%4 = tensor.empty() : tensor<8x2xi32>
%5 = vector.transfer_read %extracted_slice[%c0, %c0], %c0_i32 {in_bounds = [true, true]} : tensor<2x8xi32>, vector<2x8xi32>
%6 = vector.transpose %5, [1, 0] : vector<2x8xi32> to vector<8x2xi32>
%7 = vector.transfer_write %6, %4[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<8x2xi32>
%extracted_slice_0 = tensor.extract_slice %7[0, 0] [3, 2] [1, 1] : tensor<8x2xi32> to tensor<3x2xi32>
%inserted_slice = tensor.insert_slice %extracted_slice_0 into %2[0, 0, 0] [1, 3, 2] [1, 1, 1] : tensor<3x2xi32> into tensor<1x3x2xi32>
flow.dispatch.tensor.store %inserted_slice, %1, offsets = [%arg0, 0, 0], sizes = [1, 3, 2], strides = [1, 1, 1] : tensor<1x3x2xi32> -> !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>
}
return
}
// -----// IR Dump After OptimizeTensorInsertExtractSlices (iree-codegen-optimize-tensor-insert-extract-slices) //----- //
func.func @_batch_matmul_narrow_n_2_dispatch_4_unpack_i32() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%c0_i32 = arith.constant 0 : i32
%c2 = arith.constant 2 : index
%c128 = arith.constant 128 : index
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2x1x1x2x8xi32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
scf.for %arg0 = %workgroup_id_x to %c2 step %workgroup_count_x {
%2 = flow.dispatch.tensor.load %1, offsets = [%arg0, 0, 0], sizes = [1, 3, 2], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>> -> tensor<1x3x2xi32>
%3 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0, 0, 0, 0], sizes = [1, 1, 1, 2, 8], strides = [1, 1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x1x1x2x8xi32>> -> tensor<1x1x1x2x8xi32>
%4 = vector.transfer_read %3[%c0, %c0, %c0, %c0, %c0], %c0_i32 {in_bounds = [true, true]} : tensor<1x1x1x2x8xi32>, vector<2x8xi32>
%5 = vector.transpose %4, [1, 0] : vector<2x8xi32> to vector<8x2xi32>
%6 = tensor.empty() : tensor<3x2xi32>
%7 = vector.transfer_write %5, %6[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<3x2xi32>
%inserted_slice = tensor.insert_slice %7 into %2[0, 0, 0] [1, 3, 2] [1, 1, 1] : tensor<3x2xi32> into tensor<1x3x2xi32>
flow.dispatch.tensor.store %inserted_slice, %1, offsets = [%arg0, 0, 0], sizes = [1, 3, 2], strides = [1, 1, 1] : tensor<1x3x2xi32> -> !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>
}
return
}
// -----// IR Dump After LLVMCPUVectorTransferLowering (iree-llvmcpu-vector-transfer-lowering) //----- //
func.func @_batch_matmul_narrow_n_2_dispatch_4_unpack_i32() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%c7 = arith.constant 7 : index
%c6 = arith.constant 6 : index
%c5 = arith.constant 5 : index
%c4 = arith.constant 4 : index
%c3 = arith.constant 3 : index
%c1 = arith.constant 1 : index
%cst = arith.constant dense<0> : vector<2x8xi32>
%c2 = arith.constant 2 : index
%c128 = arith.constant 128 : index
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : memref<2x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: 32>, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %0, 64 : memref<2x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: 32>, #hal.descriptor_type<storage_buffer>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : memref<2x3x2xi32, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %1, 64 : memref<2x3x2xi32, #hal.descriptor_type<storage_buffer>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
scf.for %arg0 = %workgroup_id_x to %c2 step %workgroup_count_x {
%subview = memref.subview %1[%arg0, 0, 0] [1, 3, 2] [1, 1, 1] : memref<2x3x2xi32, #hal.descriptor_type<storage_buffer>> to memref<1x3x2xi32, strided<[6, 2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_0 = memref.subview %0[%arg0, 0, 0, 0, 0] [1, 1, 1, 2, 8] [1, 1, 1, 1, 1] : memref<2x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: 32>, #hal.descriptor_type<storage_buffer>> to memref<1x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_1 = memref.subview %subview_0[0, 0, 0, 0, 0] [1, 1, 1, 2, 8] [1, 1, 1, 1, 1] : memref<1x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<2x8xi32, affine_map<(d0, d1)[s0] -> (d0 * 8 + d1 + s0)>, #hal.descriptor_type<storage_buffer>>
%2 = vector.load %subview_1[%c0, %c0] : memref<2x8xi32, affine_map<(d0, d1)[s0] -> (d0 * 8 + d1 + s0)>, #hal.descriptor_type<storage_buffer>>, vector<8xi32>
%3 = vector.insert %2, %cst [0] : vector<8xi32> into vector<2x8xi32>
%4 = vector.load %subview_1[%c1, %c0] : memref<2x8xi32, affine_map<(d0, d1)[s0] -> (d0 * 8 + d1 + s0)>, #hal.descriptor_type<storage_buffer>>, vector<8xi32>
%5 = vector.insert %4, %3 [1] : vector<8xi32> into vector<2x8xi32>
%6 = vector.transpose %5, [1, 0] : vector<2x8xi32> to vector<8x2xi32>
%subview_2 = memref.subview %subview[0, 0, 0] [1, 3, 2] [1, 1, 1] : memref<1x3x2xi32, strided<[6, 2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%7 = vector.extract %6[0] : vector<2xi32> from vector<8x2xi32>
vector.store %7, %subview_2[%c0, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
%8 = vector.extract %6[1] : vector<2xi32> from vector<8x2xi32>
vector.store %8, %subview_2[%c1, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
%9 = vector.extract %6[2] : vector<2xi32> from vector<8x2xi32>
vector.store %9, %subview_2[%c2, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
%10 = vector.extract %6[3] : vector<2xi32> from vector<8x2xi32>
vector.store %10, %subview_2[%c3, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
%11 = vector.extract %6[4] : vector<2xi32> from vector<8x2xi32>
vector.store %11, %subview_2[%c4, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
%12 = vector.extract %6[5] : vector<2xi32> from vector<8x2xi32>
vector.store %12, %subview_2[%c5, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
%13 = vector.extract %6[6] : vector<2xi32> from vector<8x2xi32>
vector.store %13, %subview_2[%c6, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
%14 = vector.extract %6[7] : vector<2xi32> from vector<8x2xi32>
vector.store %14, %subview_2[%c7, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
}
return
}
Note: if I drop
nice investigation benoit!
this is the unset_encoding, which should be going from padded to unpadded, so it's the opposite case (IIUC)
(unrelated to this issue but it does highlight it: dispatching two workgroups to handle 64 bytes each is absurd, so this may be something to keep in mind as we try to make these more efficient - there's a lot of instructions between workgroup invocations including potential thread hops, so we want to be shooting for processing 1000-100000 elements and rarely ever <1000 - 128k-256k for parallel memcpy is kind of the minimum)
The change from writing the vector<8x2xi32> into a tensor<8x2xi32> (and then extracting a slice) to writing it directly into a tensor<3x2xi32> should at the very least change the in_bounds attribute.
It seems to me like the issue is that the transfer_write is still marked in-bounds after the rewrite. This write is clearly the problem:
I guess we can make that a masked write, but it should be out of bounds at least.
Looks like the pattern responsible for this rewrite is conservatively updating the in_bounds attribute. Based on the comment there, I guess this is an issue with the transfer_write folder:
Looking at the IR more, fixing the in_bounds attribute should fix the error, but it needs care to not pessimize too much.
The additional action item could be improving transfer_read/write's verifier. If there are no masks and the shape is static, we should check the size and in_bounds attributes.
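A standalone sketch of the kind of check that verifier improvement could make, written over plain shapes rather than the actual MLIR API (it assumes a minor-identity permutation map and ignores indices, so it is only an illustration of the idea):

#include <cstddef>
#include <cstdint>
#include <vector>

// For an unmasked transfer with static shapes, a dimension marked in_bounds
// must satisfy vector_dim <= source_dim. The write above violates this:
// vector<8x2xi32> into tensor<3x2xi32> with in_bounds = [true, true] fails on
// dim 0 because 8 > 3.
bool in_bounds_claims_are_consistent(const std::vector<int64_t>& vector_shape,
                                     const std::vector<int64_t>& source_shape,
                                     const std::vector<bool>& in_bounds,
                                     bool has_mask) {
  if (has_mask) return true;  // masked transfers need a different check
  for (std::size_t d = 0; d < vector_shape.size(); ++d) {
    if (in_bounds[d] && vector_shape[d] > source_shape[d]) return false;
  }
  return true;
}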
As pointed out by @MaheshRavishankar and @hanhanW, I can confirm the problem is caused by this line: iree/compiler/src/iree/compiler/Codegen/Common/FoldTensorSubsetIntoVectorTransferOps.cpp line 284 (at e52b8bf)
Bug fixed in #17563 |
What happened?
Recently a patch was reverted (ab8f668) as it caused intermittent test failures on ARM64 builds.
Further investigation shows that on arm64, ThreadSanitizer complains about a data race in memset; however, the TSan stack trace does not give much insight into which part exactly caused it. ASan did not complain on failed test runs.
The reproduction rate is <10%; it reproduces on both macOS and Arm64 Linux. The TSan log is attached.
LastTest.log
Steps to reproduce your issue
Build with -DIREE_ENABLE_TSAN=ON on an ARM machine.
Run ctest -R narrow_n_matmul and see the error.
What component(s) does this issue relate to?
Runtime
Version information
commit ab8f668
Additional context
No response