[TailDup] Allow large number of predecessors/successors without phis. #116072
Conversation
This adjusts the threshold logic added in llvm#78582 to only trigger for cases where there are actually phis to duplicate, either in TailBB or in one of its successors. In cases where there are no phis, we only pay the cost of extra edges, with no explosion in PHI-related instructions. This improves the performance of Python on some inputs by 2-3% on Apple Silicon CPUs.
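For readers skimming the diff below, here is a minimal standalone sketch of the adjusted gating logic. The types, field names, and limit values are illustrative only, not LLVM's actual API (the real code walks MachineBasicBlock/MachineInstr and uses the TailDupPredSize/TailDupSuccSize options):

```cpp
#include <cstddef>

// Illustrative stand-in for the CFG shape TailDuplicator inspects.
struct BlockShape {
  std::size_t NumPreds = 0;    // predecessors of the tail block
  std::size_t NumSuccs = 0;    // successors of the tail block
  std::size_t NumPhis = 0;     // PHIs in the tail block itself
  std::size_t NumSuccPhis = 0; // PHIs across all successor blocks
};

// Hypothetical limits standing in for TailDupPredSize/TailDupSuccSize.
constexpr std::size_t PredLimit = 16;
constexpr std::size_t SuccLimit = 16;

// Returns true if tail duplication should be rejected by the threshold.
bool rejectedByThreshold(const BlockShape &Tail) {
  // After this patch, the pred/succ threshold only fires when there are
  // PHIs whose incoming-value lists would multiply after duplication.
  // With no PHIs anywhere, we only pay for extra CFG edges, so allow it.
  if (Tail.NumPreds > PredLimit && Tail.NumSuccs > SuccLimit)
    return Tail.NumPhis != 0 || Tail.NumSuccPhis != 0;
  return false;
}
```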
@llvm/pr-subscribers-backend-x86 Author: Florian Hahn (fhahn). Full diff: https://github.com/llvm/llvm-project/pull/116072.diff 2 Files Affected:
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index d0013923838b35..a333122b2ba035 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -573,14 +573,6 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
if (TailBB.isSuccessor(&TailBB))
return false;
- // Duplicating a BB which has both multiple predecessors and successors will
- // result in a complex CFG and also may cause huge amount of PHI nodes. If we
- // want to remove this limitation, we have to address
- // https://github.com/llvm/llvm-project/issues/78578.
- if (TailBB.pred_size() > TailDupPredSize &&
- TailBB.succ_size() > TailDupSuccSize)
- return false;
-
// Set the limit on the cost to duplicate. When optimizing for size,
// duplicate only one, because one branch instruction can be eliminated to
// compensate for the duplication.
@@ -618,6 +610,7 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
// Check the instructions in the block to determine whether tail-duplication
// is invalid or unlikely to be profitable.
unsigned InstrCount = 0;
+ unsigned NumPhis = 0;
for (MachineInstr &MI : TailBB) {
// Non-duplicable things shouldn't be tail-duplicated.
// CFI instructions are marked as non-duplicable, because Darwin compact
@@ -661,6 +654,20 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
if (InstrCount > MaxDuplicateCount)
return false;
+ NumPhis += MI.isPHI();
+ }
+
+ // Duplicating a BB which has both multiple predecessors and successors may
+ // cause a huge amount of PHI nodes. If we want to remove this limitation,
+ // we have to address https://github.com/llvm/llvm-project/issues/78578.
+ if (TailBB.pred_size() > TailDupPredSize &&
+ TailBB.succ_size() > TailDupSuccSize) {
+ // If TailBB or any of its successors contains a phi, we may have to add a
+ // large number of additional phis with additional incoming values.
+ if (NumPhis != 0 || any_of(TailBB.successors(), [](MachineBasicBlock *MBB) {
+ return any_of(*MBB, [](MachineInstr &MI) { return MI.isPHI(); });
+ }))
+ return false;
}
// Check if any of the successors of TailBB has a PHI node in which the
diff --git a/llvm/test/CodeGen/X86/tail-dup-pred-succ-size.mir b/llvm/test/CodeGen/X86/tail-dup-pred-succ-size.mir
index 1d17672e2c6bd0..2f1ff76fda76c9 100644
--- a/llvm/test/CodeGen/X86/tail-dup-pred-succ-size.mir
+++ b/llvm/test/CodeGen/X86/tail-dup-pred-succ-size.mir
@@ -538,43 +538,47 @@ body: |
; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG]], %jump-table.0, $noreg
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.2:
- ; LIMIT-NEXT: successors: %bb.7(0x80000000)
+ ; LIMIT-NEXT: successors: %bb.9(0x20000000), %bb.10(0x20000000), %bb.11(0x20000000), %bb.12(0x20000000)
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: [[MOV32rm:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg
- ; LIMIT-NEXT: JMP_1 %bb.7
+ ; LIMIT-NEXT: [[SHR32ri1:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[AND32ri1:%[0-9]+]]:gr32 = AND32ri [[SHR32ri1]], 7, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SUBREG_TO_REG1:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, [[AND32ri1]], %subreg.sub_32bit
+ ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG1]], %jump-table.1, $noreg
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.3:
- ; LIMIT-NEXT: successors: %bb.7(0x80000000)
+ ; LIMIT-NEXT: successors: %bb.9(0x20000000), %bb.10(0x20000000), %bb.11(0x20000000), %bb.12(0x20000000)
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: [[MOV32rm1:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg
- ; LIMIT-NEXT: [[SHR32ri1:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm1]], 1, implicit-def dead $eflags
- ; LIMIT-NEXT: JMP_1 %bb.7
+ ; LIMIT-NEXT: [[SHR32ri2:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm1]], 1, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SHR32ri3:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[AND32ri2:%[0-9]+]]:gr32 = AND32ri [[SHR32ri3]], 7, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SUBREG_TO_REG2:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, [[AND32ri2]], %subreg.sub_32bit
+ ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG2]], %jump-table.1, $noreg
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.4:
- ; LIMIT-NEXT: successors: %bb.7(0x80000000)
+ ; LIMIT-NEXT: successors: %bb.9(0x20000000), %bb.10(0x20000000), %bb.11(0x20000000), %bb.12(0x20000000)
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: [[MOV32rm2:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg
- ; LIMIT-NEXT: [[SHR32ri2:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm2]], 2, implicit-def dead $eflags
- ; LIMIT-NEXT: JMP_1 %bb.7
+ ; LIMIT-NEXT: [[SHR32ri4:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm2]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SHR32ri5:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[AND32ri3:%[0-9]+]]:gr32 = AND32ri [[SHR32ri5]], 7, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SUBREG_TO_REG3:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, [[AND32ri3]], %subreg.sub_32bit
+ ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG3]], %jump-table.1, $noreg
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.5:
- ; LIMIT-NEXT: successors: %bb.7(0x80000000)
+ ; LIMIT-NEXT: successors: %bb.9(0x20000000), %bb.10(0x20000000), %bb.11(0x20000000), %bb.12(0x20000000)
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: [[MOV32rm3:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg
- ; LIMIT-NEXT: [[SHR32ri3:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm3]], 3, implicit-def dead $eflags
- ; LIMIT-NEXT: JMP_1 %bb.7
+ ; LIMIT-NEXT: [[SHR32ri6:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm3]], 3, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SHR32ri7:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[AND32ri4:%[0-9]+]]:gr32 = AND32ri [[SHR32ri7]], 7, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SUBREG_TO_REG4:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, [[AND32ri4]], %subreg.sub_32bit
+ ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG4]], %jump-table.1, $noreg
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.6:
; LIMIT-NEXT: successors:
; LIMIT-NEXT: {{ $}}
- ; LIMIT-NEXT: bb.7:
- ; LIMIT-NEXT: successors: %bb.9(0x20000000), %bb.10(0x20000000), %bb.11(0x20000000), %bb.12(0x20000000)
- ; LIMIT-NEXT: {{ $}}
- ; LIMIT-NEXT: [[SHR32ri4:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
- ; LIMIT-NEXT: [[AND32ri1:%[0-9]+]]:gr32 = AND32ri [[SHR32ri4]], 7, implicit-def dead $eflags
- ; LIMIT-NEXT: [[SUBREG_TO_REG1:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri1]], %subreg.sub_32bit
- ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG1]], %jump-table.1, $noreg
- ; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.9:
; LIMIT-NEXT: [[MOV32rm4:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg
; LIMIT-NEXT: MOV32mr [[COPY1]], 1, $noreg, 0, $noreg, [[MOV32rm4]] :: (store (s32))
@@ -583,23 +587,23 @@ body: |
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.10:
; LIMIT-NEXT: [[MOV32rm5:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg
- ; LIMIT-NEXT: [[SHR32ri5:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm5]], 1, implicit-def dead $eflags
- ; LIMIT-NEXT: MOV32mr [[COPY1]], 1, $noreg, 0, $noreg, [[SHR32ri5]] :: (store (s32))
- ; LIMIT-NEXT: $eax = COPY [[SHR32ri5]]
+ ; LIMIT-NEXT: [[SHR32ri8:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm5]], 1, implicit-def dead $eflags
+ ; LIMIT-NEXT: MOV32mr [[COPY1]], 1, $noreg, 0, $noreg, [[SHR32ri8]] :: (store (s32))
+ ; LIMIT-NEXT: $eax = COPY [[SHR32ri8]]
; LIMIT-NEXT: RET 0, $eax
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.11:
; LIMIT-NEXT: [[MOV32rm6:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg
- ; LIMIT-NEXT: [[SHR32ri6:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm6]], 2, implicit-def dead $eflags
- ; LIMIT-NEXT: MOV32mr [[COPY1]], 1, $noreg, 0, $noreg, [[SHR32ri6]] :: (store (s32))
- ; LIMIT-NEXT: $eax = COPY [[SHR32ri6]]
+ ; LIMIT-NEXT: [[SHR32ri9:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm6]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: MOV32mr [[COPY1]], 1, $noreg, 0, $noreg, [[SHR32ri9]] :: (store (s32))
+ ; LIMIT-NEXT: $eax = COPY [[SHR32ri9]]
; LIMIT-NEXT: RET 0, $eax
; LIMIT-NEXT: {{ $}}
; LIMIT-NEXT: bb.12:
; LIMIT-NEXT: [[MOV32rm7:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg
- ; LIMIT-NEXT: [[SHR32ri7:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm7]], 6, implicit-def dead $eflags
- ; LIMIT-NEXT: MOV32mr [[COPY1]], 1, $noreg, 0, $noreg, [[SHR32ri7]] :: (store (s32))
- ; LIMIT-NEXT: $eax = COPY [[SHR32ri7]]
+ ; LIMIT-NEXT: [[SHR32ri10:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm7]], 6, implicit-def dead $eflags
+ ; LIMIT-NEXT: MOV32mr [[COPY1]], 1, $noreg, 0, $noreg, [[SHR32ri10]] :: (store (s32))
+ ; LIMIT-NEXT: $eax = COPY [[SHR32ri10]]
; LIMIT-NEXT: RET 0, $eax
;
; NOLIMIT-LABEL: name: foo_no_phis
What's the relationship between this and #114990?
Are these inputs computed gotos or jump tables? If they are computed gotos, I think we can land #114990. In fact, I don't see any fundamental difference between computed gotos and jump tables: both can include or exclude many PHIs. To me, the only difference is that a jump table has one extra jump at the source level compared to a computed goto. So I believe that when users use computed gotos, they expect longer compile times and larger code size.
Ah, I missed the other PR, thanks! This particular case is a computed GOTO, but there are other computed GOTOs where completely removing the cutoff increases the number of instructions by 10% without any gain. In general, I think we should try to avoid cut-offs when there are cases we can handle with reasonable compile times. I am not sure completely ignoring the cutoff for computed GOTOs is the best way forward, as I don't see a fundamental difference between cases coming from jump tables and those from computed GOTOs. Ideally we would allow the cases we can reasonably handle (e.g. because there are no phis to add) and/or address the extra complexity of adding the additional edges to the CFG. For the test case from #106846 (http://www.jikos.cz/~mikulas/testcases/clang/computed-goto.c), tail duplication is applied with the current patch as well, but I think only at the inner level. I don't have a suitable X86 machine to test whether it restores the original performance.
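For context, the computed-goto case discussed here has the shape of an interpreter dispatch loop. A hypothetical sketch (names and opcodes invented, using the GNU labels-as-values extension supported by GCC and Clang): the shared dispatch block has one predecessor per handler and one successor per opcode label, i.e. many preds and many succs but no PHIs to multiply, and tail duplication clones the indirect jump into each handler so the CPU can predict each handler's successor separately.

```cpp
// Hypothetical interpreter core using the GNU &&label / goto* extension.
// The "dispatch" block has many predecessors (every handler jumps back)
// and many successors (every opcode label); tail duplication copies the
// indirect jump into each handler, removing the shared dispatch block.
int run(const unsigned char *code) {
  static void *labels[] = {&&op_add, &&op_sub, &&op_halt};
  int acc = 0;
dispatch:
  goto *labels[*code++]; // jump to the handler for the next opcode
op_add:
  acc += 1;
  goto dispatch;
op_sub:
  acc -= 1;
  goto dispatch;
op_halt:
  return acc;
}
```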
I think this is related to the specifics of CPU branch prediction, which is why we sometimes cannot observe performance improvements.
For me, we can view this transformation as restoring the CFG of the code as the user wrote it. That is the only difference I can see between jump tables (switch) and computed GOTOs. What I'm saying is: when users use computed GOTOs, they accept longer compile times and larger code size.
For me, it makes sense to apply this to jump tables only under some limits.
IMO, the current patch addresses some specific scenarios. In fact, I believe most of the compile time is spent on the PHI nodes added after duplication. Since there is no fundamental difference between jump tables and computed GOTOs, I find it odd that this patch still prevents duplication for some computed GOTOs.
IIUC, #114990 in its current version only allows duplication for computed GOTOs? With this patch, there are no extra restrictions on the constructs we can duplicate; we just skip/relax the aggressive cut-offs when it won't hurt compile time (much). It would be great if we could get this resolved one way or another for the Clang 20 release, as it currently causes a 2-3% performance loss for Python workloads on ARM64 :) I rebased and tested #114990, and unfortunately it doesn't yield the same perf gain (only in the noise, < 0.5%).
LGTM. I believe these two PRs address different issues. I can verify that the PR doesn't resolve #106846 due to phis.
Thanks for double-checking, that's my understanding as well!