
[AArch64] Avoid NEON ORR when NEON and SVE are unavailable #93940

Merged

Conversation

sdesmalen-arm (Collaborator)

For streaming-compatible functions with only +sme, we can't use
a NEON ORR (aliased as 'mov') for copies of Q-registers, so
we need to use a spill/fill instead.

This also fixes the fill, which should use the post-incrementing
addressing mode.
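
For illustration, a minimal sketch of the two addressing modes involved (registers and offsets chosen to match the copy sequence): a pre-indexed access updates the base register before the memory access, while a post-indexed access updates it after, which is what a fill that pops its own spill slot requires.

    str q1, [sp, #-16]!  // pre-index: sp -= 16, then store q1 to [sp]
    ldr q0, [sp], #16    // post-index: load q0 from [sp], then sp += 16

The previous pre-indexed fill, ldr q0, [sp, #16]!, would instead read from sp+16 (one slot above the spilled value) before writing back sp.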


llvmbot commented May 31, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Sander de Smalen (sdesmalen-arm)

Changes

For streaming-compatible functions with only +sme, we can't use
a NEON ORR (aliased as 'mov') for copies of Q-registers, so
we need to use a spill/fill instead.

This also fixes the fill, which should use the post-incrementing
addressing mode.


Full diff: https://github.com/llvm/llvm-project/pull/93940.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.cpp (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/arm64-reg-copy-noneon.ll (+16-9)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-load.ll (+8-4)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-test-register-mov.ll (+2-1)
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
index aa0b7c93f8661..ce4b7e68fd625 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
@@ -4659,7 +4659,7 @@ void AArch64InstrInfo::copyPhysReg(MachineBasicBlock &MBB,
           .addReg(AArch64::Z0 + (DestReg - AArch64::Q0), RegState::Define)
           .addReg(AArch64::Z0 + (SrcReg - AArch64::Q0))
           .addReg(AArch64::Z0 + (SrcReg - AArch64::Q0));
-    else if (Subtarget.hasNEON())
+    else if (Subtarget.isNeonAvailable())
       BuildMI(MBB, I, DL, get(AArch64::ORRv16i8), DestReg)
           .addReg(SrcReg)
           .addReg(SrcReg, getKillRegState(KillSrc));
@@ -4669,7 +4669,7 @@ void AArch64InstrInfo::copyPhysReg(MachineBasicBlock &MBB,
           .addReg(SrcReg, getKillRegState(KillSrc))
           .addReg(AArch64::SP)
           .addImm(-16);
-      BuildMI(MBB, I, DL, get(AArch64::LDRQpre))
+      BuildMI(MBB, I, DL, get(AArch64::LDRQpost))
           .addReg(AArch64::SP, RegState::Define)
           .addReg(DestReg, RegState::Define)
           .addReg(AArch64::SP)
diff --git a/llvm/test/CodeGen/AArch64/arm64-reg-copy-noneon.ll b/llvm/test/CodeGen/AArch64/arm64-reg-copy-noneon.ll
index 29255ef187c1c..69cd295c309d1 100644
--- a/llvm/test/CodeGen/AArch64/arm64-reg-copy-noneon.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-reg-copy-noneon.ll
@@ -1,20 +1,27 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -mtriple=arm64-none-linux-gnu -mattr=-neon < %s | FileCheck %s
 
 define float @copy_FPR32(float %a, float %b) {
-;CHECK-LABEL: copy_FPR32:
-;CHECK: fmov s0, s1
+; CHECK-LABEL: copy_FPR32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fmov s0, s1
+; CHECK-NEXT:    ret
   ret float %b;
 }
-  
+
 define double @copy_FPR64(double %a, double %b) {
-;CHECK-LABEL: copy_FPR64:
-;CHECK: fmov d0, d1
+; CHECK-LABEL: copy_FPR64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fmov d0, d1
+; CHECK-NEXT:    ret
   ret double %b;
 }
-  
+
 define fp128 @copy_FPR128(fp128 %a, fp128 %b) {
-;CHECK-LABEL: copy_FPR128:
-;CHECK: str	q1, [sp, #-16]!
-;CHECK-NEXT: ldr	q0, [sp, #16]!
+; CHECK-LABEL: copy_FPR128:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str q1, [sp, #-16]!
+; CHECK-NEXT:    ldr q0, [sp], #16
+; CHECK-NEXT:    ret
   ret fp128 %b;
 }
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-load.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-load.ll
index be335c697707d..a689a539b0082 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-load.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-load.ll
@@ -980,7 +980,8 @@ define <32 x i8> @masked_load_v32i8(ptr %src, <32 x i1> %mask) {
 ; NONEON-NOSVE-NEXT:    tbnz w8, #1, .LBB3_3
 ; NONEON-NOSVE-NEXT:    b .LBB3_4
 ; NONEON-NOSVE-NEXT:  .LBB3_2:
-; NONEON-NOSVE-NEXT:    mov v0.16b, v1.16b
+; NONEON-NOSVE-NEXT:    str q1, [sp, #-16]!
+; NONEON-NOSVE-NEXT:    ldr q0, [sp], #16
 ; NONEON-NOSVE-NEXT:    tbz w8, #1, .LBB3_4
 ; NONEON-NOSVE-NEXT:  .LBB3_3: // %cond.load1
 ; NONEON-NOSVE-NEXT:    ldrb w10, [x0, #1]
@@ -2095,7 +2096,8 @@ define <16 x half> @masked_load_v16f16(ptr %src, <16 x i1> %mask) {
 ; NONEON-NOSVE-NEXT:    tbnz w8, #1, .LBB7_3
 ; NONEON-NOSVE-NEXT:    b .LBB7_4
 ; NONEON-NOSVE-NEXT:  .LBB7_2:
-; NONEON-NOSVE-NEXT:    mov v0.16b, v1.16b
+; NONEON-NOSVE-NEXT:    str q1, [sp, #-16]!
+; NONEON-NOSVE-NEXT:    ldr q0, [sp], #16
 ; NONEON-NOSVE-NEXT:    tbz w8, #1, .LBB7_4
 ; NONEON-NOSVE-NEXT:  .LBB7_3: // %cond.load1
 ; NONEON-NOSVE-NEXT:    ldr h2, [x0, #2]
@@ -2616,7 +2618,8 @@ define <8 x float> @masked_load_v8f32(ptr %src, <8 x i1> %mask) {
 ; NONEON-NOSVE-NEXT:    tbnz w8, #1, .LBB10_3
 ; NONEON-NOSVE-NEXT:    b .LBB10_4
 ; NONEON-NOSVE-NEXT:  .LBB10_2:
-; NONEON-NOSVE-NEXT:    mov v0.16b, v1.16b
+; NONEON-NOSVE-NEXT:    str q1, [sp, #-16]!
+; NONEON-NOSVE-NEXT:    ldr q0, [sp], #16
 ; NONEON-NOSVE-NEXT:    tbz w8, #1, .LBB10_4
 ; NONEON-NOSVE-NEXT:  .LBB10_3: // %cond.load1
 ; NONEON-NOSVE-NEXT:    ldr s2, [x0, #4]
@@ -2839,7 +2842,8 @@ define <4 x double> @masked_load_v4f64(ptr %src, <4 x i1> %mask) {
 ; NONEON-NOSVE-NEXT:    tbnz w8, #1, .LBB12_3
 ; NONEON-NOSVE-NEXT:    b .LBB12_4
 ; NONEON-NOSVE-NEXT:  .LBB12_2:
-; NONEON-NOSVE-NEXT:    mov v0.16b, v1.16b
+; NONEON-NOSVE-NEXT:    str q1, [sp, #-16]!
+; NONEON-NOSVE-NEXT:    ldr q0, [sp], #16
 ; NONEON-NOSVE-NEXT:    tbz w8, #1, .LBB12_4
 ; NONEON-NOSVE-NEXT:  .LBB12_3: // %cond.load1
 ; NONEON-NOSVE-NEXT:    ldr d2, [x0, #8]
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-test-register-mov.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-test-register-mov.ll
index 67cdde718e391..23adb1a4bc092 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-test-register-mov.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-test-register-mov.ll
@@ -15,7 +15,8 @@ define fp128 @test_streaming_compatible_register_mov(fp128 %q0, fp128 %q1) {
 ;
 ; NONEON-NOSVE-LABEL: test_streaming_compatible_register_mov:
 ; NONEON-NOSVE:       // %bb.0:
-; NONEON-NOSVE-NEXT:    mov v0.16b, v1.16b
+; NONEON-NOSVE-NEXT:    str q1, [sp, #-16]!
+; NONEON-NOSVE-NEXT:    ldr q0, [sp], #16
 ; NONEON-NOSVE-NEXT:    ret
   ret fp128 %q1
 }

@sdesmalen-arm sdesmalen-arm merged commit b71434f into llvm:main Jun 3, 2024
9 checks passed
@efriedma-quic (Collaborator) left a comment

The code sequence here assumes there's no red zone; is that actually enforced somewhere? Can we allocate a stack slot instead of modifying the stack pointer?

As an alternative, it might be possible to use an integer register here, but I guess allocating that might be a bit complicated.
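
For context on the concern, a minimal sketch of the hazard, with a hypothetical leaf function that keeps a live value in the red zone (the value and offsets are invented for illustration):

    stur x19, [sp, #-8]   // leaf function parks a live value in the red zone
    str  q1,  [sp, #-16]! // Q-register copy: spill overwrites [sp-16, sp-1]
    ldr  q0,  [sp], #16   // fill restores sp, but the red-zone slot is gone
    ldur x19, [sp, #-8]   // reloads bytes of q1, not the original value

If nothing guarantees the red zone is unused (or disabled) in functions that may need this copy sequence, the spill silently clobbers it.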

@efriedma-quic (Collaborator)

Ping

@sdesmalen-arm (Collaborator, Author)

> The code sequence here assumes there's no red zone; is that actually enforced somewhere? Can we allocate a stack slot instead of modifying the stack pointer?
>
> As an alternative, it might be possible to use an integer register here, but I guess allocating that might be a bit complicated.

Good spot!

From what I can see, it seems better to just disable the Red Zone entirely for now.

The issue is that, the way things currently work, the COPY is only expanded after register allocation and after prologue/epilogue insertion (PEI). If we told PEI to allocate a stack slot, then copyPhysReg would need to emit the base + offset addressing manually, and if the slot ends up too far away from the FP/SP, we'd need the register scavenger to either give us an extra register or spill one in order to access it; however, the register scavenger is not passed to copyPhysReg. We could implement this mechanism manually using the emergency spill slot (or allocate a larger emergency slot for the spill/fill), or, as you suggest, use a GPR (for which we'd still need some sort of scavenging for free registers). Alternatively, we could create another post-RA expansion pass that runs before PEI, to let PEI do the heavy lifting.

Fixing the problem with a stop-gap (disabling red zone) could be a good first step. And if this turns out to be a genuine performance issue, we can come up with a better mechanism later. Does that make sense?
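
As a sketch of what the stop-gap means for codegen (assuming it simply prevents red-zone usage in functions that may need this copy; the exact predicate lives in the follow-up commits referenced below): with no red zone, all live locals sit at or above sp, so the copy's transient slot below sp cannot overlap anything live.

    sub sp, sp, #16       // frame allocated explicitly; nothing live below sp
    ...                   // locals at [sp, sp+15]
    str q1, [sp, #-16]!   // transient spill slot is below all live data
    ldr q0, [sp], #16     // sp restored; no state was clobbered
    add sp, sp, #16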

sdesmalen-arm added a commit to sdesmalen-arm/llvm-project that referenced this pull request Jun 10, 2024
@efriedma-quic (Collaborator)

That makes sense, sure.

sdesmalen-arm added a commit that referenced this pull request Jun 11, 2024
Lukacma pushed a commit to Lukacma/llvm-project that referenced this pull request Jun 12, 2024