Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMDGPU: Legalize atomicrmw fadd for v2f16/v2bf16 for local memory #95393

Merged

Conversation

arsenm
Copy link
Contributor

@arsenm arsenm commented Jun 13, 2024

Make this legal for gfx940 and gfx12

Copy link
Contributor Author

arsenm commented Jun 13, 2024

@llvmbot
Copy link
Collaborator

llvmbot commented Jun 13, 2024

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Make this legal for gfx940 and gfx12


Patch is 49.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/95393.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+4-1)
  • (modified) llvm/lib/Target/AMDGPU/DSInstructions.td (+6)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+13-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll (+2-29)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll (+2-25)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll (+8-198)
  • (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd.ll (+332-60)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index ee7fb20c23aa7..0234cd9088ae7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -301,6 +301,9 @@ static const LLT V10S16 = LLT::fixed_vector(10, 16);
 static const LLT V12S16 = LLT::fixed_vector(12, 16);
 static const LLT V16S16 = LLT::fixed_vector(16, 16);
 
+static const LLT V2F16 = LLT::fixed_vector(2, LLT::float16());
+static const LLT V2BF16 = V2F16; // FIXME
+
 static const LLT V2S32 = LLT::fixed_vector(2, 32);
 static const LLT V3S32 = LLT::fixed_vector(3, 32);
 static const LLT V4S32 = LLT::fixed_vector(4, 32);
@@ -1638,7 +1641,7 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
     if (ST.hasLdsAtomicAddF64())
       Atomic.legalFor({{S64, LocalPtr}});
     if (ST.hasAtomicDsPkAdd16Insts())
-      Atomic.legalFor({{V2S16, LocalPtr}});
+      Atomic.legalFor({{V2F16, LocalPtr}, {V2BF16, LocalPtr}});
   }
   if (ST.hasAtomicFaddInsts())
     Atomic.legalFor({{S32, GlobalPtr}});
diff --git a/llvm/lib/Target/AMDGPU/DSInstructions.td b/llvm/lib/Target/AMDGPU/DSInstructions.td
index 5150641c50ea2..e22f8b7d46171 100644
--- a/llvm/lib/Target/AMDGPU/DSInstructions.td
+++ b/llvm/lib/Target/AMDGPU/DSInstructions.td
@@ -1082,6 +1082,12 @@ defm : DSAtomicRetNoRetPat_mc<DS_MAX_RTN_U32, DS_MAX_U32, i32, "atomic_load_umax
 defm : DSAtomicRetNoRetPat_mc<DS_MIN_RTN_F32, DS_MIN_F32, f32, "atomic_load_fmin">;
 defm : DSAtomicRetNoRetPat_mc<DS_MAX_RTN_F32, DS_MAX_F32, f32, "atomic_load_fmax">;
 
+
+let SubtargetPredicate = HasAtomicDsPkAdd16Insts in {
+defm : DSAtomicRetNoRetPat_mc<DS_PK_ADD_RTN_F16, DS_PK_ADD_F16, v2f16, "atomic_load_fadd">;
+defm : DSAtomicRetNoRetPat_mc<DS_PK_ADD_RTN_BF16, DS_PK_ADD_BF16, v2bf16, "atomic_load_fadd">;
+}
+
 let SubtargetPredicate = isGFX6GFX7GFX8GFX9GFX10 in {
 defm : DSAtomicCmpXChgSwapped_mc<DS_CMPST_RTN_B32, DS_CMPST_B32, i32, "atomic_cmp_swap">;
 }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 81098201e9c0f..6325e75aa8f96 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -15931,6 +15931,16 @@ static OptimizationRemark emitAtomicRMWLegalRemark(const AtomicRMWInst *RMW) {
          << " operation at memory scope " << MemScope;
 }
 
+static bool isHalf2OrBFloat2(Type *Ty) {
+  if (FixedVectorType *VT = dyn_cast<FixedVectorType>(Ty)) {
+    Type *EltTy = VT->getElementType();
+    return VT->getNumElements() == 2 &&
+           (EltTy->isHalfTy() || EltTy->isBFloatTy());
+  }
+
+  return false;
+}
+
 TargetLowering::AtomicExpansionKind
 SITargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
   unsigned AS = RMW->getPointerAddressSpace();
@@ -15989,7 +15999,9 @@ SITargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
                                                  : AtomicExpansionKind::CmpXChg;
       }
 
-      // TODO: Handle v2f16/v2bf16 cases for gfx940
+      if (Subtarget->hasAtomicDsPkAdd16Insts() && isHalf2OrBFloat2(Ty))
+        return AtomicExpansionKind::None;
+
       return AtomicExpansionKind::CmpXChg;
     }
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll
index 93c30a6a01e00..4e21ef8379342 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll
@@ -213,22 +213,8 @@ define <2 x half> @local_atomic_fadd_ret_v2f16_offset(ptr addrspace(3) %ptr, <2
 ; GFX940-LABEL: local_atomic_fadd_ret_v2f16_offset:
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX940-NEXT:    ds_read_b32 v2, v0 offset:65532
-; GFX940-NEXT:    s_mov_b64 s[0:1], 0
-; GFX940-NEXT:  .LBB15_1: ; %atomicrmw.start
-; GFX940-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX940-NEXT:    ds_pk_add_rtn_f16 v0, v0, v1 offset:65532
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b32_e32 v3, v2
-; GFX940-NEXT:    v_pk_add_f16 v2, v3, v1
-; GFX940-NEXT:    ds_cmpst_rtn_b32 v2, v0, v3, v2 offset:65532
-; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_cmp_eq_u32_e32 vcc, v2, v3
-; GFX940-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
-; GFX940-NEXT:    s_andn2_b64 exec, exec, s[0:1]
-; GFX940-NEXT:    s_cbranch_execnz .LBB15_1
-; GFX940-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX940-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX940-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %gep = getelementptr <2 x half>, ptr addrspace(3) %ptr, i32 16383
   %result = atomicrmw fadd ptr addrspace(3) %gep, <2 x half> %val seq_cst
@@ -239,21 +225,8 @@ define void @local_atomic_fadd_noret_v2f16_offset(ptr addrspace(3) %ptr, <2 x ha
 ; GFX940-LABEL: local_atomic_fadd_noret_v2f16_offset:
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX940-NEXT:    ds_read_b32 v2, v0 offset:65532
-; GFX940-NEXT:    s_mov_b64 s[0:1], 0
-; GFX940-NEXT:  .LBB16_1: ; %atomicrmw.start
-; GFX940-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX940-NEXT:    ds_pk_add_f16 v0, v1 offset:65532
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_pk_add_f16 v3, v2, v1
-; GFX940-NEXT:    ds_cmpst_rtn_b32 v3, v0, v2, v3 offset:65532
-; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_cmp_eq_u32_e32 vcc, v3, v2
-; GFX940-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
-; GFX940-NEXT:    v_mov_b32_e32 v2, v3
-; GFX940-NEXT:    s_andn2_b64 exec, exec, s[0:1]
-; GFX940-NEXT:    s_cbranch_execnz .LBB16_1
-; GFX940-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX940-NEXT:    s_or_b64 exec, exec, s[0:1]
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %gep = getelementptr <2 x half>, ptr addrspace(3) %ptr, i32 16383
   %unused = atomicrmw fadd ptr addrspace(3) %gep, <2 x half> %val seq_cst
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll
index 5724cf471bae3..d4a7f3b2d387d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll
@@ -52,36 +52,13 @@ define float @test_atomicrmw_fsub(ptr addrspace(3) %addr) {
 define <2 x half> @test_atomicrmw_fadd_vector(ptr addrspace(3) %addr) {
   ; CHECK-LABEL: name: test_atomicrmw_fadd_vector
   ; CHECK: bb.1 (%ir-block.0):
-  ; CHECK-NEXT:   successors: %bb.2(0x80000000)
   ; CHECK-NEXT:   liveins: $vgpr0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(p3) = COPY $vgpr0
   ; CHECK-NEXT:   [[C:%[0-9]+]]:_(s16) = G_FCONSTANT half 0xH3C00
   ; CHECK-NEXT:   [[BUILD_VECTOR:%[0-9]+]]:_(<2 x s16>) = G_BUILD_VECTOR [[C]](s16), [[C]](s16)
-  ; CHECK-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
-  ; CHECK-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x s16>) = G_LOAD [[COPY]](p3) :: (load (<2 x s16>) from %ir.addr, addrspace 3)
-  ; CHECK-NEXT:   G_BR %bb.2
-  ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.atomicrmw.start:
-  ; CHECK-NEXT:   successors: %bb.3(0x40000000), %bb.2(0x40000000)
-  ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:_(s64) = G_PHI %19(s64), %bb.2, [[C1]](s64), %bb.1
-  ; CHECK-NEXT:   [[PHI1:%[0-9]+]]:_(<2 x s16>) = G_PHI [[LOAD]](<2 x s16>), %bb.1, %18(<2 x s16>), %bb.2
-  ; CHECK-NEXT:   [[FADD:%[0-9]+]]:_(<2 x s16>) = G_FADD [[PHI1]], [[BUILD_VECTOR]]
-  ; CHECK-NEXT:   [[BITCAST:%[0-9]+]]:_(s32) = G_BITCAST [[FADD]](<2 x s16>)
-  ; CHECK-NEXT:   [[BITCAST1:%[0-9]+]]:_(s32) = G_BITCAST [[PHI1]](<2 x s16>)
-  ; CHECK-NEXT:   [[ATOMIC_CMPXCHG_WITH_SUCCESS:%[0-9]+]]:_(s32), [[ATOMIC_CMPXCHG_WITH_SUCCESS1:%[0-9]+]]:_(s1) = G_ATOMIC_CMPXCHG_WITH_SUCCESS [[COPY]](p3), [[BITCAST1]], [[BITCAST]] :: (load store seq_cst seq_cst (s32) on %ir.addr, addrspace 3)
-  ; CHECK-NEXT:   [[BITCAST2:%[0-9]+]]:_(<2 x s16>) = G_BITCAST [[ATOMIC_CMPXCHG_WITH_SUCCESS]](s32)
-  ; CHECK-NEXT:   [[INT:%[0-9]+]]:_(s64) = G_INTRINSIC intrinsic(@llvm.amdgcn.if.break), [[ATOMIC_CMPXCHG_WITH_SUCCESS1]](s1), [[PHI]](s64)
-  ; CHECK-NEXT:   [[INT1:%[0-9]+]]:_(s1) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.loop), [[INT]](s64)
-  ; CHECK-NEXT:   G_BRCOND [[INT1]](s1), %bb.3
-  ; CHECK-NEXT:   G_BR %bb.2
-  ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.3.atomicrmw.end:
-  ; CHECK-NEXT:   [[PHI2:%[0-9]+]]:_(<2 x s16>) = G_PHI [[BITCAST2]](<2 x s16>), %bb.2
-  ; CHECK-NEXT:   [[PHI3:%[0-9]+]]:_(s64) = G_PHI [[INT]](s64), %bb.2
-  ; CHECK-NEXT:   G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.end.cf), [[PHI3]](s64)
-  ; CHECK-NEXT:   $vgpr0 = COPY [[PHI2]](<2 x s16>)
+  ; CHECK-NEXT:   [[ATOMICRMW_FADD:%[0-9]+]]:_(<2 x s16>) = G_ATOMICRMW_FADD [[COPY]](p3), [[BUILD_VECTOR]] :: (load store seq_cst (<2 x s16>) on %ir.addr, addrspace 3)
+  ; CHECK-NEXT:   $vgpr0 = COPY [[ATOMICRMW_FADD]](<2 x s16>)
   ; CHECK-NEXT:   SI_RETURN implicit $vgpr0
   %oldval = atomicrmw fadd ptr addrspace(3) %addr, <2 x half> <half 1.0, half 1.0> seq_cst
   ret <2 x half> %oldval
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll b/llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll
index 4373b76070e32..e34a33ebdbb04 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll
@@ -3030,22 +3030,8 @@ define <2 x half> @lds_atomic_fadd_ret_v2f16(ptr addrspace(3) %ptr, <2 x half> %
 ; GFX940-LABEL: lds_atomic_fadd_ret_v2f16:
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX940-NEXT:    ds_read_b32 v2, v0
-; GFX940-NEXT:    s_mov_b64 s[0:1], 0
-; GFX940-NEXT:  .LBB14_1: ; %atomicrmw.start
-; GFX940-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b32_e32 v3, v2
-; GFX940-NEXT:    v_pk_add_f16 v2, v3, v1
-; GFX940-NEXT:    ds_cmpst_rtn_b32 v2, v0, v3, v2
+; GFX940-NEXT:    ds_pk_add_rtn_f16 v0, v0, v1
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_cmp_eq_u32_e32 vcc, v2, v3
-; GFX940-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
-; GFX940-NEXT:    s_andn2_b64 exec, exec, s[0:1]
-; GFX940-NEXT:    s_cbranch_execnz .LBB14_1
-; GFX940-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX940-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX940-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX12-LABEL: lds_atomic_fadd_ret_v2f16:
@@ -3055,26 +3041,10 @@ define <2 x half> @lds_atomic_fadd_ret_v2f16(ptr addrspace(3) %ptr, <2 x half> %
 ; GFX12-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    ds_load_b32 v2, v0
-; GFX12-NEXT:    s_mov_b32 s0, 0
-; GFX12-NEXT:  .LBB14_1: ; %atomicrmw.start
-; GFX12-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX12-NEXT:    s_wait_dscnt 0x0
-; GFX12-NEXT:    v_mov_b32_e32 v3, v2
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12-NEXT:    v_pk_add_f16 v2, v3, v1
 ; GFX12-NEXT:    s_wait_storecnt 0x0
-; GFX12-NEXT:    ds_cmpstore_rtn_b32 v2, v0, v2, v3
+; GFX12-NEXT:    ds_pk_add_rtn_f16 v0, v0, v1
 ; GFX12-NEXT:    s_wait_dscnt 0x0
 ; GFX12-NEXT:    global_inv scope:SCOPE_SE
-; GFX12-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v2, v3
-; GFX12-NEXT:    s_or_b32 s0, vcc_lo, s0
-; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX12-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s0
-; GFX12-NEXT:    s_cbranch_execnz .LBB14_1
-; GFX12-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX12-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX12-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX7-LABEL: lds_atomic_fadd_ret_v2f16:
@@ -3231,21 +3201,8 @@ define void @lds_atomic_fadd_noret_v2f16(ptr addrspace(3) %ptr, <2 x half> %val)
 ; GFX940-LABEL: lds_atomic_fadd_noret_v2f16:
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX940-NEXT:    ds_read_b32 v2, v0
-; GFX940-NEXT:    s_mov_b64 s[0:1], 0
-; GFX940-NEXT:  .LBB15_1: ; %atomicrmw.start
-; GFX940-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX940-NEXT:    ds_pk_add_f16 v0, v1
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_pk_add_f16 v3, v2, v1
-; GFX940-NEXT:    ds_cmpst_rtn_b32 v3, v0, v2, v3
-; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_cmp_eq_u32_e32 vcc, v3, v2
-; GFX940-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
-; GFX940-NEXT:    v_mov_b32_e32 v2, v3
-; GFX940-NEXT:    s_andn2_b64 exec, exec, s[0:1]
-; GFX940-NEXT:    s_cbranch_execnz .LBB15_1
-; GFX940-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX940-NEXT:    s_or_b64 exec, exec, s[0:1]
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX12-LABEL: lds_atomic_fadd_noret_v2f16:
@@ -3255,24 +3212,10 @@ define void @lds_atomic_fadd_noret_v2f16(ptr addrspace(3) %ptr, <2 x half> %val)
 ; GFX12-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    ds_load_b32 v2, v0
-; GFX12-NEXT:    s_mov_b32 s0, 0
-; GFX12-NEXT:  .LBB15_1: ; %atomicrmw.start
-; GFX12-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX12-NEXT:    s_wait_dscnt 0x0
-; GFX12-NEXT:    v_pk_add_f16 v3, v2, v1
 ; GFX12-NEXT:    s_wait_storecnt 0x0
-; GFX12-NEXT:    ds_cmpstore_rtn_b32 v3, v0, v3, v2
+; GFX12-NEXT:    ds_pk_add_f16 v0, v1
 ; GFX12-NEXT:    s_wait_dscnt 0x0
 ; GFX12-NEXT:    global_inv scope:SCOPE_SE
-; GFX12-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v3, v2
-; GFX12-NEXT:    v_mov_b32_e32 v2, v3
-; GFX12-NEXT:    s_or_b32 s0, vcc_lo, s0
-; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX12-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s0
-; GFX12-NEXT:    s_cbranch_execnz .LBB15_1
-; GFX12-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX12-NEXT:    s_or_b32 exec_lo, exec_lo, s0
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX7-LABEL: lds_atomic_fadd_noret_v2f16:
@@ -3483,41 +3426,8 @@ define <2 x bfloat> @lds_atomic_fadd_ret_v2bf16(ptr addrspace(3) %ptr, <2 x bflo
 ; GFX940-LABEL: lds_atomic_fadd_ret_v2bf16:
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX940-NEXT:    ds_read_b32 v2, v0
-; GFX940-NEXT:    s_mov_b64 s[2:3], 0
-; GFX940-NEXT:    v_lshlrev_b32_e32 v3, 16, v1
-; GFX940-NEXT:    s_movk_i32 s4, 0x7fff
-; GFX940-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX940-NEXT:    s_mov_b32 s5, 0x7060302
-; GFX940-NEXT:  .LBB16_1: ; %atomicrmw.start
-; GFX940-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b32_e32 v4, v2
-; GFX940-NEXT:    v_lshlrev_b32_e32 v2, 16, v4
-; GFX940-NEXT:    v_and_b32_e32 v5, 0xffff0000, v4
-; GFX940-NEXT:    v_add_f32_e32 v2, v2, v3
-; GFX940-NEXT:    v_add_f32_e32 v5, v5, v1
-; GFX940-NEXT:    v_bfe_u32 v6, v2, 16, 1
-; GFX940-NEXT:    v_bfe_u32 v8, v5, 16, 1
-; GFX940-NEXT:    v_or_b32_e32 v7, 0x400000, v2
-; GFX940-NEXT:    v_or_b32_e32 v9, 0x400000, v5
-; GFX940-NEXT:    v_add3_u32 v6, v6, v2, s4
-; GFX940-NEXT:    v_add3_u32 v8, v8, v5, s4
-; GFX940-NEXT:    v_cmp_u_f32_e32 vcc, v5, v5
-; GFX940-NEXT:    v_cmp_u_f32_e64 s[0:1], v2, v2
-; GFX940-NEXT:    s_nop 0
-; GFX940-NEXT:    v_cndmask_b32_e32 v5, v8, v9, vcc
-; GFX940-NEXT:    v_cndmask_b32_e64 v2, v6, v7, s[0:1]
-; GFX940-NEXT:    v_perm_b32 v2, v5, v2, s5
-; GFX940-NEXT:    ds_cmpst_rtn_b32 v2, v0, v4, v2
+; GFX940-NEXT:    ds_pk_add_rtn_bf16 v0, v0, v1
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_cmp_eq_u32_e32 vcc, v2, v4
-; GFX940-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
-; GFX940-NEXT:    s_andn2_b64 exec, exec, s[2:3]
-; GFX940-NEXT:    s_cbranch_execnz .LBB16_1
-; GFX940-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX940-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX940-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX12-LABEL: lds_atomic_fadd_ret_v2bf16:
@@ -3527,45 +3437,10 @@ define <2 x bfloat> @lds_atomic_fadd_ret_v2bf16(ptr addrspace(3) %ptr, <2 x bflo
 ; GFX12-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    ds_load_b32 v2, v0
-; GFX12-NEXT:    v_lshlrev_b32_e32 v3, 16, v1
-; GFX12-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX12-NEXT:    s_mov_b32 s1, 0
-; GFX12-NEXT:  .LBB16_1: ; %atomicrmw.start
-; GFX12-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX12-NEXT:    s_wait_dscnt 0x0
-; GFX12-NEXT:    v_mov_b32_e32 v4, v2
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-NEXT:    v_and_b32_e32 v5, 0xffff0000, v4
-; GFX12-NEXT:    v_add_f32_e32 v5, v5, v1
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX12-NEXT:    v_bfe_u32 v7, v5, 16, 1
-; GFX12-NEXT:    v_or_b32_e32 v9, 0x400000, v5
-; GFX12-NEXT:    v_cmp_u_f32_e32 vcc_lo, v5, v5
-; GFX12-NEXT:    v_add3_u32 v7, v7, v5, 0x7fff
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-NEXT:    v_dual_cndmask_b32 v5, v7, v9 :: v_dual_lshlrev_b32 v2, 16, v4
-; GFX12-NEXT:    v_add_f32_e32 v2, v2, v3
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX12-NEXT:    v_bfe_u32 v6, v2, 16, 1
-; GFX12-NEXT:    v_or_b32_e32 v8, 0x400000, v2
-; GFX12-NEXT:    v_cmp_u_f32_e64 s0, v2, v2
-; GFX12-NEXT:    v_add3_u32 v6, v6, v2, 0x7fff
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-NEXT:    v_cndmask_b32_e64 v2, v6, v8, s0
-; GFX12-NEXT:    v_perm_b32 v2, v5, v2, 0x7060302
 ; GFX12-NEXT:    s_wait_storecnt 0x0
-; GFX12-NEXT:    ds_cmpstore_rtn_b32 v2, v0, v2, v4
+; GFX12-NEXT:    ds_pk_add_rtn_bf16 v0, v0, v1
 ; GFX12-NEXT:    s_wait_dscnt 0x0
 ; GFX12-NEXT:    global_inv scope:SCOPE_SE
-; GFX12-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v2, v4
-; GFX12-NEXT:    s_or_b32 s1, vcc_lo, s1
-; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX12-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s1
-; GFX12-NEXT:    s_cbranch_execnz .LBB16_1
-; GFX12-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX12-NEXT:    s_or_b32 exec_lo, exec_lo, s1
-; GFX12-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX7-LABEL: lds_atomic_fadd_ret_v2bf16:
@@ -3769,40 +3644,8 @@ define void @lds_atomic_fadd_noret_v2bf16(ptr addrspace(3) %ptr, <2 x bfloat> %v
 ; GFX940-LABEL: lds_atomic_fadd_noret_v2bf16:
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX940-NEXT:    ds_read_b32 v3, v0
-; GFX940-NEXT:    s_mov_b64 s[2:3], 0
-; GFX940-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
-; GFX940-NEXT:    s_movk_i32 s4, 0x7fff
-; GFX940-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX940-NEXT:    s_mov_b32 s5, 0x7060302
-; GFX940-NEXT:  .LBB17_1: ; %atomicrmw.start
-; GFX940-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_lshlrev_b32_e32 v4, 16, v3
-; GFX940-NEXT:    v_and_b32_e32 v5, 0xffff0000, v3
-; GFX940-NEXT:    v_add_f32_e32 v4, v4, v2
-; GFX940-NEXT:    v_add_f32_e32 v5, v5, v1
-; GFX940-NEXT:    v_bfe_u32 v6, v4, 16, 1
-; GFX940-NEXT:    v_bfe_u32 v8, v5, 16, 1
-; GFX940-NEXT:    v_or_b32_e32 v7, 0x400000, v4
-; GFX940-NEXT:    v_or_b32_e32 v9, 0x400000, v5
-; GFX940-NEXT:    v_add3_u32 v6, v6, v4, s4
-; GFX940-NEXT:    v_add3_u32 v8, v8, v5, s4
-; GFX940-NEXT:    v_cmp_u_f32_e32 vcc, v5, v5
-; GFX940-NEXT:    v_cmp_u_f32_e64 s[0:1], v4, v4
-; GFX940-NEXT:    s_nop 0
-; GFX940-NEXT:    v_cndmask_b32_e32 v5, v8, v9, vcc
-; GFX940-NEXT:    v_cndmask_b32_e64 v4, v6, v7, s[0:1]
-; GFX940-NEXT:    v_perm_b32 v4, v5, v4, s5
-; GFX940-NEXT:    ds_cmpst_rtn_b32 v4, v0, v3, v4
+; GFX940-NEXT:    ds_pk_add_bf16 v0, v1
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_cmp_eq_u32_e32 vcc, v4, v3
-; GFX940-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
-; GFX940-NEXT:    v_mov_b32_e32 v3, v4
-; GFX940-NEXT:    s_andn2_b64 exec, exec, s[2:3]
-; GFX940-NEXT:    s_cbranch_execnz .LBB17_1
-; GFX940-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX940-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX12-LABEL: lds_atomic_fadd_noret_v2bf16:
@@ -3812,43 +3655,10 @@ define void @lds_atomic_fadd_noret_v2bf16(ptr addrspace(3) %ptr, <2 x bfloat> %v
 ; GFX12-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    ds_load_b32 v3, v0
-; GFX12-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
-; GFX12-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX12-NEXT:    s_mov_b32 s1, 0
-; GFX12-NEXT:  .LBB17_1: ; %atomicrmw.start
-;...
[truncated]

Copy link
Contributor

@shiltian shiltian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits

@arsenm arsenm force-pushed the users/arsenm/amdgpu-legalize-vector-atomicrmw-fadd-local branch 2 times, most recently from ba80b77 to 76cbecc Compare June 14, 2024 19:42
Copy link
Contributor Author

arsenm commented Jun 15, 2024

Merge activity

  • Jun 15, 3:51 AM EDT: @arsenm started a stack merge that includes this pull request via Graphite.
  • Jun 15, 3:53 AM EDT: Graphite rebased this pull request as part of a merge.
  • Jun 15, 3:55 AM EDT: @arsenm merged this pull request with Graphite.

@arsenm arsenm force-pushed the users/arsenm/amdgpu-legalize-vector-atomicrmw-fadd-local branch from 76cbecc to 122eec1 Compare June 15, 2024 07:53
@arsenm arsenm merged commit 0a9a5f9 into main Jun 15, 2024
3 of 5 checks passed
@arsenm arsenm deleted the users/arsenm/amdgpu-legalize-vector-atomicrmw-fadd-local branch June 15, 2024 07:55
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 27, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 27, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 27, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 27, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Sep 30, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.
This patch enables building of LLVM_AtomicRMWOp with fixed vectors of
16 bit fp values as operands.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Oct 1, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.
This patch enables building of LLVM_AtomicRMWOp with fixed vectors of
16 bit fp values as operands.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Oct 1, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.
This patch enables building of LLVM_AtomicRMWOp with fixed vectors of
16 bit fp values as operands.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Oct 2, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.
This patch enables building of LLVM_AtomicRMWOp with fixed vectors of
16 bit fp values as operands.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Oct 2, 2024
As far as AMDGPU target supports vectorization for atomic_rmw operation,
allow construction of LLVM_AtomicRMWOp with 16 bit floating point values.
This patch enables building of LLVM_AtomicRMWOp with fixed vectors of
16 bit fp values as operands.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
joviliast added a commit to joviliast/llvm-project that referenced this pull request Oct 2, 2024
As far as AMDGPU target supports vectorization for `atomic_rmw fadd` operation,
enable building of `LLVM_AtomicRMWOp fadd` with fixed vectors of 16 bit fp values
as operands.

See also: llvm#94845, llvm#95393, llvm#95394

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants