[AMDGPU] Add intrinsic for raw atomic buffer loads #97707
Conversation
This adds the llvm.amdgcn.raw.atomic.buffer.load intrinsic to support OpAtomicLoad lowering on AMDGPU. Previously this was lowered to llvm.amdgcn.raw.buffer.load, which caused the load in some cases to be marked as invariant and hoisted out of loops by LICM.

Co-authored-by: Jay Foad <jay.foad@amd.com>
Co-authored-by: Mariusz Sikora <mariusz.sikora@amd.com>
Co-authored-by: Jessica Del <50999226+OutOfCache@users.noreply.github.com>
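To make the motivation concrete, here is a minimal IR sketch of the pattern this is for (the kernel and value names are hypothetical; the intrinsic signature matches the declarations in the tests below). A wave spins on a buffer word until another agent updates it, so the load has to stay inside the loop and be re-issued every iteration:

```llvm
; Minimal sketch with assumed names. With the plain llvm.amdgcn.raw.buffer.load,
; the load could be treated as invariant and hoisted by LICM, turning this into
; an infinite loop; the atomic variant is not marked IntrReadMem, so it stays put.
declare i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32>, i32, i32, i32 immarg)
declare i32 @llvm.amdgcn.workitem.id.x()

define amdgpu_kernel void @spin_until_ready(<4 x i32> %rsrc) {
entry:
  %id = call i32 @llvm.amdgcn.workitem.id.x()
  br label %loop
loop:
  ; offset 0, soffset 0, aux = 1 (glc); re-loaded on every iteration
  %val = call i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32> %rsrc, i32 0, i32 0, i32 1)
  %ready = icmp eq i32 %val, %id
  br i1 %ready, label %exit, label %loop
exit:
  ret void
}
```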
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-amdgpu

Author: Jessica Del (OutOfCache)

Changes: Upstream the intrinsics. These intrinsics prevent LICM for atomic loads. Continuation of https://reviews.llvm.org/D138786.

Patch is 31.11 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/97707.diff

7 Files Affected:
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 71b1e832bde3c..dc53e9e778a4f 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -1116,6 +1116,19 @@ class AMDGPURawBufferLoad<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIntrinsi
def int_amdgcn_raw_buffer_load_format : AMDGPURawBufferLoad<llvm_anyfloat_ty>;
def int_amdgcn_raw_buffer_load : AMDGPURawBufferLoad;
+class AMDGPURawAtomicBufferLoad<LLVMType data_ty = llvm_any_ty> : Intrinsic <
+ [data_ty],
+ [llvm_v4i32_ty, // rsrc(SGPR)
+ llvm_i32_ty, // offset(VGPR/imm, included in bounds checking and swizzling)
+ llvm_i32_ty, // soffset(SGPR/imm, excluded from bounds checking and swizzling)
+ llvm_i32_ty], // auxiliary data (imm, cachepolicy (bit 0 = glc,
+ // bit 1 = slc,
+ // bit 2 = dlc on gfx10+),
+ // swizzled buffer (bit 3 = swz))
+ [ImmArg<ArgIndex<3>>, IntrWillReturn, IntrNoCallback, IntrNoFree], "", [SDNPMemOperand]>,
+ AMDGPURsrcIntrinsic<0>;
+def int_amdgcn_raw_atomic_buffer_load : AMDGPURawAtomicBufferLoad;
+
class AMDGPURawPtrBufferLoad<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIntrinsic <
[data_ty],
[AMDGPUBufferRsrcTy, // rsrc(SGPR)
@@ -1134,6 +1147,19 @@ class AMDGPURawPtrBufferLoad<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIntri
def int_amdgcn_raw_ptr_buffer_load_format : AMDGPURawPtrBufferLoad<llvm_anyfloat_ty>;
def int_amdgcn_raw_ptr_buffer_load : AMDGPURawPtrBufferLoad;
+class AMDGPURawAtomicPtrBufferLoad<LLVMType data_ty = llvm_any_ty> : Intrinsic <
+ [data_ty],
+ [AMDGPUBufferRsrcTy,// rsrc(SGPR)
+ llvm_i32_ty, // offset(VGPR/imm, included in bounds checking and swizzling)
+ llvm_i32_ty, // soffset(SGPR/imm, excluded from bounds checking and swizzling)
+ llvm_i32_ty], // auxiliary data (imm, cachepolicy (bit 0 = glc,
+ // bit 1 = slc,
+ // bit 2 = dlc on gfx10+),
+ // swizzled buffer (bit 3 = swz))
+ [IntrArgMemOnly, NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<3>>, IntrWillReturn, IntrNoCallback, IntrNoFree], "", [SDNPMemOperand]>,
+ AMDGPURsrcIntrinsic<0>;
+def int_amdgcn_raw_atomic_ptr_buffer_load : AMDGPURawAtomicPtrBufferLoad;
+
class AMDGPUStructBufferLoad<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIntrinsic <
[data_ty],
[llvm_v4i32_ty, // rsrc(SGPR)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 93bca4402ed23..11aac1f22524d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -1245,7 +1245,9 @@ static Value *simplifyAMDGCNMemoryIntrinsicDemanded(InstCombiner &IC,
unsigned OffsetIdx;
switch (II.getIntrinsicID()) {
case Intrinsic::amdgcn_raw_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_buffer_load:
case Intrinsic::amdgcn_raw_ptr_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_ptr_buffer_load:
OffsetIdx = 1;
break;
case Intrinsic::amdgcn_s_buffer_load:
@@ -1378,6 +1380,8 @@ std::optional<Value *> GCNTTIImpl::simplifyDemandedVectorEltsIntrinsic(
case Intrinsic::amdgcn_raw_ptr_buffer_load:
case Intrinsic::amdgcn_raw_buffer_load_format:
case Intrinsic::amdgcn_raw_ptr_buffer_load_format:
+ case Intrinsic::amdgcn_raw_atomic_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_ptr_buffer_load:
case Intrinsic::amdgcn_raw_tbuffer_load:
case Intrinsic::amdgcn_raw_ptr_tbuffer_load:
case Intrinsic::amdgcn_s_buffer_load:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index a219d01518458..9411689d2d858 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7345,6 +7345,8 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
return legalizeBufferStore(MI, MRI, B, true, true);
case Intrinsic::amdgcn_raw_buffer_load:
case Intrinsic::amdgcn_raw_ptr_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_ptr_buffer_load:
case Intrinsic::amdgcn_struct_buffer_load:
case Intrinsic::amdgcn_struct_ptr_buffer_load:
return legalizeBufferLoad(MI, MRI, B, false, false);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 9e7694f41d6b8..f497295d8b5e5 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4984,6 +4984,8 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
}
case Intrinsic::amdgcn_raw_buffer_load:
case Intrinsic::amdgcn_raw_ptr_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_ptr_buffer_load:
case Intrinsic::amdgcn_raw_tbuffer_load:
case Intrinsic::amdgcn_raw_ptr_tbuffer_load: {
// FIXME: Should make intrinsic ID the last operand of the instruction,
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 12977af0d7e85..c1c320173ae9a 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1272,6 +1272,14 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
Info.ptrVal = CI.getArgOperand(1);
return true;
}
+ case Intrinsic::amdgcn_raw_atomic_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_ptr_buffer_load: {
+ Info.memVT =
+ memVTFromLoadIntrReturn(*this, MF.getDataLayout(), CI.getType(),
+ std::numeric_limits<unsigned>::max());
+ Info.flags &= ~MachineMemOperand::MOStore;
+ return true;
+ }
}
}
return true;
@@ -8897,6 +8905,8 @@ SDValue SITargetLowering::LowerINTRINSIC_W_CHAIN(SDValue Op,
}
case Intrinsic::amdgcn_raw_buffer_load:
case Intrinsic::amdgcn_raw_ptr_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_buffer_load:
+ case Intrinsic::amdgcn_raw_atomic_ptr_buffer_load:
case Intrinsic::amdgcn_raw_buffer_load_format:
case Intrinsic::amdgcn_raw_ptr_buffer_load_format: {
const bool IsFormat =
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll
new file mode 100644
index 0000000000000..44f1089e4554b
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll
@@ -0,0 +1,304 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -global-isel=0 | FileCheck %s -check-prefix=CHECK
+; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -global-isel=1 | FileCheck %s -check-prefix=CHECK
+
+define amdgpu_kernel void @raw_atomic_buffer_load_i32(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_i32:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB0_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b32 v1, off, s[0:3], 0 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB0_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32> %addr, i32 0, i32 0, i32 1)
+ %cmp = icmp eq i32 %load, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+
+define amdgpu_kernel void @raw_atomic_buffer_load_i32_off(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_i32_off:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB1_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b32 v1, off, s[0:3], 0 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB1_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32> %addr, i32 0, i32 0, i32 1)
+ %cmp = icmp eq i32 %load, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+define amdgpu_kernel void @raw_atomic_buffer_load_i32_soff(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_i32_soff:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB2_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b32 v1, off, s[0:3], 4 offset:4 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB2_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32> %addr, i32 4, i32 4, i32 1)
+ %cmp = icmp eq i32 %load, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+define amdgpu_kernel void @raw_atomic_buffer_load_i32_dlc(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_i32_dlc:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB3_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b32 v1, off, s[0:3], 0 offset:4 dlc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB3_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32> %addr, i32 4, i32 0, i32 4)
+ %cmp = icmp eq i32 %load, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+
+define amdgpu_kernel void @raw_nonatomic_buffer_load_i32(<4 x i32> %addr) {
+; CHECK-LABEL: raw_nonatomic_buffer_load_i32:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b32 v1, off, s[0:3], 0 offset:4 glc
+; CHECK-NEXT: s_mov_b32 s0, 0
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: .LBB4_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_and_b32 s1, exec_lo, vcc_lo
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT: s_or_b32 s0, s1, s0
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
+; CHECK-NEXT: s_cbranch_execnz .LBB4_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %addr, i32 4, i32 0, i32 1)
+ %cmp = icmp eq i32 %load, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+
+define amdgpu_kernel void @raw_atomic_buffer_load_i64(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_i64:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: v_mov_b32_e32 v1, 0
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB5_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b64 v[2:3], off, s[0:3], 0 offset:4 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u64_e32 vcc_lo, v[2:3], v[0:1]
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB5_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ %id.zext = zext i32 %id to i64
+ br label %bb1
+bb1:
+ %load = call i64 @llvm.amdgcn.raw.atomic.buffer.load.i64(<4 x i32> %addr, i32 4, i32 0, i32 1)
+ %cmp = icmp eq i64 %load, %id.zext
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+
+define amdgpu_kernel void @raw_atomic_buffer_load_v2i16(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_v2i16:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB6_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b32 v1, off, s[0:3], 0 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB6_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call <2 x i16> @llvm.amdgcn.raw.atomic.buffer.load.v2i16(<4 x i32> %addr, i32 0, i32 0, i32 1)
+ %bitcast = bitcast <2 x i16> %load to i32
+ %cmp = icmp eq i32 %bitcast, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+
+define amdgpu_kernel void @raw_atomic_buffer_load_v4i16(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_v4i16:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB7_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b64 v[1:2], off, s[0:3], 0 offset:4 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; CHECK-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; CHECK-NEXT: v_lshl_or_b32 v1, v2, 16, v1
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB7_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call <4 x i16> @llvm.amdgcn.raw.atomic.buffer.load.v4i16(<4 x i32> %addr, i32 4, i32 0, i32 1)
+ %shortened = shufflevector <4 x i16> %load, <4 x i16> poison, <2 x i32> <i32 0, i32 2>
+ %bitcast = bitcast <2 x i16> %shortened to i32
+ %cmp = icmp eq i32 %bitcast, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+
+define amdgpu_kernel void @raw_atomic_buffer_load_v4i32(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_v4i32:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB8_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b128 v[1:4], off, s[0:3], 0 offset:4 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v4, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB8_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call <4 x i32> @llvm.amdgcn.raw.atomic.buffer.load.v4i32(<4 x i32> %addr, i32 4, i32 0, i32 1)
+ %extracted = extractelement <4 x i32> %load, i32 3
+ %cmp = icmp eq i32 %extracted, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+
+define amdgpu_kernel void @raw_atomic_buffer_load_ptr(<4 x i32> %addr) {
+; CHECK-LABEL: raw_atomic_buffer_load_ptr:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB9_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b64 v[1:2], off, s[0:3], 0 offset:4 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: flat_load_b32 v1, v[1:2]
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB9_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-NEXT: s_endpgm
+bb:
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ br label %bb1
+bb1:
+ %load = call ptr @llvm.amdgcn.raw.atomic.buffer.load.ptr(<4 x i32> %addr, i32 4, i32 0, i32 1)
+ %elem = load i32, ptr %load
+ %cmp = icmp eq i32 %elem, %id
+ br i1 %cmp, label %bb1, label %bb2
+bb2:
+ ret void
+}
+
+; Function Attrs: nounwind readonly
+declare i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32>, i32, i32, i32 immarg)
+declare i64 @llvm.amdgcn.raw.atomic.buffer.load.i64(<4 x i32>, i32, i32, i32 immarg)
+declare <2 x i16> @llvm.amdgcn.raw.atomic.buffer.load.v2i16(<4 x i32>, i32, i32, i32 immarg)
+declare <4 x i16> @llvm.amdgcn.raw.atomic.buffer.load.v4i16(<4 x i32>, i32, i32, i32 immarg)
+declare <4 x i32> @llvm.amdgcn.raw.atomic.buffer.load.v4i32(<4 x i32>, i32, i32, i32 immarg)
+declare ptr @llvm.amdgcn.raw.atomic.buffer.load.ptr(<4 x i32>, i32, i32, i32 immarg)
+declare i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32>, i32, i32, i32 immarg)
+declare i32 @llvm.amdgcn.workitem.id.x()
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.ptr.buffer.load.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.ptr.buffer.load.ll
new file mode 100644
index 0000000000000..987aa43aa4242
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.ptr.buffer.load.ll
@@ -0,0 +1,304 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -global-isel=0 | FileCheck %s -check-prefix=CHECK
+; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -global-isel=1 | FileCheck %s -check-prefix=CHECK
+
+define amdgpu_kernel void @raw_atomic_ptr_buffer_ptr_load_i32(ptr addrspace(8) %ptr) {
+; CHECK-LABEL: raw_atomic_ptr_buffer_ptr_load_i32:
+; CHECK: ; %bb.0: ; %bb
+; CHECK-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: .LBB0_1: ; %bb1
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: buffer_load_b32 v1, off, s[0:3], 0 glc
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT: s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execnz .LBB0_1
+; CHECK-NEXT: ; %bb.2: ; %bb2
+; CHECK-...
[truncated]
It's not immediately obvious (to me) how it prevents it, and why it should be prevented.
case Intrinsic::amdgcn_raw_atomic_buffer_load:
case Intrinsic::amdgcn_raw_atomic_ptr_buffer_load: {
  Info.memVT =
      memVTFromLoadIntrReturn(*this, MF.getDataLayout(), CI.getType(),
                              std::numeric_limits<unsigned>::max());
  Info.flags &= ~MachineMemOperand::MOStore;
  return true;
}
This was my attempt to address the comment here about fixing the intrinsics bypassing the atomic memory semantics. CodeGen failed because of the MOStore flag set for other atomics. If there is a better way to do this, I'd love to hear some feedback.
You might want to update the fat buffer pointer lowering as well?
Could you go ahead and address the TODO at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp#L1095-L1096? That's left over from this intrinsic not having been upstreamed.
llvm_i32_ty], // auxiliary data (imm, cachepolicy (bit 0 = glc,
              //                                   bit 1 = slc,
              //                                   bit 2 = dlc on gfx10+),
              //                 swizzled buffer (bit 3 = swz))
So, weird ask ... could we get a way to put the syncscope (maybe as a metadata arg) on the intrinsic? That feels like it'd be useful.
If so, it seems like that should be done for all of the buffer intrinsics, and would be orthogonal to this patch.
I was wondering if we should carry this as an operand bundle
Yeah, that'd be a change to all the buffer atomics ... but that's orthogonal to this patch, agreed
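Purely as an illustration of the operand-bundle idea above — nothing like this exists in LLVM today, and both the bundle tag and its acceptance on buffer intrinsics are invented for this sketch — a syncscope carried on the call site might look like:

```llvm
; Hypothetical syntax only: "syncscope" is not a recognized operand bundle tag,
; and current buffer intrinsics do not take one.
%v = call i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32> %rsrc, i32 0, i32 0, i32 1)
        [ "syncscope"(metadata !"workgroup") ]
```

Either way, as noted above, that would be a follow-up touching all buffer atomics rather than part of this patch.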
I believe I did in the last fixup commit. I removed the addition in InstCombine, so we can add them in a separate PR with an …
Sorry, didn't notice the change.
This looks fine to me but I don't think I can sign off on landing this
Add a "ptr" form of the downstream-only llvm.amdgcn.raw.atomic.buffer.load intrinsic. For upstream intrinsics this was done by: https://reviews.llvm.org/D147547 "[AMDGPU] Add buffer intrinsics that take resources as pointers" Co-authored-by: Jessica Del <50999226+OutOfCache@users.noreply.github.com>
Force-pushed from 93b6b5b to 551a6f8
@@ -1116,6 +1116,19 @@ class AMDGPURawBufferLoad<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIntrinsi
def int_amdgcn_raw_buffer_load_format : AMDGPURawBufferLoad<llvm_anyfloat_ty>;
def int_amdgcn_raw_buffer_load : AMDGPURawBufferLoad;

class AMDGPURawAtomicBufferLoad<LLVMType data_ty = llvm_any_ty> : Intrinsic <
  [data_ty],
  [llvm_v4i32_ty, // rsrc(SGPR)
Why introduce a new buffer-as-vector intrinsic? Should use the new pointer approach?
I understand the reasoning; however, in llpc we still use the vector variants, which is why we would like to keep both forms.
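For reference, a short sketch of the two call forms side by side (the value and function names are assumed; the signatures follow the declarations added in this PR): the legacy form takes the resource as a <4 x i32> descriptor, while the ptr form takes a buffer resource in addrspace(8).

```llvm
declare i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32>, i32, i32, i32 immarg)
declare i32 @llvm.amdgcn.raw.atomic.ptr.buffer.load.i32(ptr addrspace(8), i32, i32, i32 immarg)

define amdgpu_kernel void @both_forms(<4 x i32> %rsrc.vec, ptr addrspace(8) %rsrc.ptr) {
  ; vector-descriptor form (the one llpc still emits)
  %a = call i32 @llvm.amdgcn.raw.atomic.buffer.load.i32(<4 x i32> %rsrc.vec, i32 0, i32 0, i32 1)
  ; pointer-resource form
  %b = call i32 @llvm.amdgcn.raw.atomic.ptr.buffer.load.i32(ptr addrspace(8) %rsrc.ptr, i32 0, i32 0, i32 1)
  ret void
}
```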
; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -global-isel=0 | FileCheck %s -check-prefix=CHECK
; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -global-isel=1 | FileCheck %s -check-prefix=CHECK
Don't need -verify-machineinstrs
Two tests added in this commit seem to be failing on a build bot https://lab.llvm.org/buildbot/#/builders/174/builds/2102. Can you take a look and revert if you need time to investigate?

The same thing is happening on AArch64 buildbots: https://lab.llvm.org/buildbot/#/builders/125/builds/867

Thanks for letting me know. Seems like mostly the register numbers changed in the check lines. Here is a PR for the fix: #99912
Upstream the intrinsics `llvm.amdgcn.raw.atomic.buffer.load` and `llvm.amdgcn.raw.atomic.ptr.buffer.load`. These additional intrinsics mark atomic buffer loads as atomic to LLVM by removing the `IntrReadMem` attribute. Otherwise, it could hoist these intrinsics out of loops in cases where LLVM marks them as invariant. That can cause issues such as infinite loops. Continuation of https://reviews.llvm.org/D138786 with the additional use in the fat buffer lowering, more test cases and the additional ptr versions of these intrinsics. --------- Co-authored-by: rtayl <> Co-authored-by: Jay Foad <jay.foad@amd.com> Co-authored-by: Mariusz Sikora <mariusz.sikora@amd.com>
Mark these intrinsics as atomic loads within LLVM to prevent hoisting out of loops in cases where the load is considered invariant. Similar to #97707, but for struct buffer loads.
Upstream the intrinsics `llvm.amdgcn.raw.atomic.buffer.load` and `llvm.amdgcn.raw.atomic.ptr.buffer.load`. These additional intrinsics mark atomic buffer loads as atomic to LLVM by removing the `IntrReadMem` attribute. Otherwise, it could hoist these intrinsics out of loops in cases where LLVM marks them as invariant. That can cause issues such as infinite loops.

Continuation of https://reviews.llvm.org/D138786 with the additional use in the fat buffer lowering, more test cases and the additional ptr versions of these intrinsics.