Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CodeGen][OOM] Switch statements without default branching result in long compile times or OOM #78578

Closed
DianQK opened this issue Jan 18, 2024 · 1 comment · Fixed by #78582
Closed

Comments

@DianQK
Copy link
Member

DianQK commented Jan 18, 2024

I tried the following code, which takes more than 3s to compile using clang 17.0.6. (The code size is also a bit hard to accept.)

oom_manual2.c

int src();
int f1(unsigned int *b) {
#define SWITCH(in1, out1) \
  unsigned int out1;\
  switch (in1) {           \
    case 0: out1 = b[0] >> 0; break; \
    case 1: out1 = b[0] >> 1; break; \
    case 2: out1 = b[0] >> 2; break; \
    case 3: out1 = b[0] >> 3; break; \
    case 4: out1 = b[0] >> 4; break; \
    case 5: out1 = b[0] >> 5; break; \
    case 6: out1 = b[0] >> 6; break; \
    case 7: out1 = b[0] >> 7; break; \
    case 8: out1 = b[0] >> 8; break; \
    case 9: out1 = b[0] >> 9; break; \
    case 10: out1 = b[0] >> 10; break; \
    case 11: out1 = b[0] >> 11; break; \
    case 12: out1 = b[0] >> 12; break; \
    case 13: out1 = b[0] >> 13; break; \
    case 14: out1 = b[0] >> 14; break; \
    case 15: out1 = b[0] >> 15; break; \
    case 16: out1 = b[0] >> 16; break; \
    case 17: out1 = b[0] >> 17; break; \
    case 18: out1 = b[0] >> 18; break; \
    case 19: out1 = b[0] >> 19; break; \
    case 20: out1 = b[0] >> 20; break; \
    case 21: out1 = b[0] >> 21; break; \
    case 22: out1 = b[0] >> 22; break; \
    case 23: out1 = b[0] >> 23; break; \
    case 24: out1 = b[0] >> 24; break; \
    case 25: out1 = b[0] >> 25; break; \
    case 26: out1 = b[0] >> 26; break; \
    case 27: out1 = b[0] >> 27; break; \
    case 28: out1 = b[0] >> 28; break; \
    case 29: out1 = b[0] >> 29; break; \
    case 30: out1 = b[0] >> 30; break; \
    case 31: out1 = b[0] >> 31; break; \
    case 32: out1 = b[1] >> 0; break; \
    case 33: out1 = b[1] >> 1; break; \
    case 34: out1 = b[1] >> 2; break; \
    case 35: out1 = b[1] >> 3; break; \
    case 36: out1 = b[1] >> 4; break; \
    case 37: out1 = b[1] >> 5; break; \
    case 38: out1 = b[1] >> 6; break; \
    case 39: out1 = b[1] >> 7; break; \
    case 40: out1 = b[1] >> 8; break; \
    case 41: out1 = b[1] >> 9; break; \
    case 42: out1 = b[1] >> 10; break; \
    case 43: out1 = b[1] >> 11; break; \
    case 44: out1 = b[1] >> 12; break; \
    case 45: out1 = b[1] >> 13; break; \
    case 46: out1 = b[1] >> 14; break; \
    case 47: out1 = b[1] >> 15; break; \
    case 48: out1 = b[1] >> 16; break; \
    case 49: out1 = b[1] >> 17; break; \
    case 50: out1 = b[1] >> 18; break; \
    case 51: out1 = b[1] >> 19; break; \
    case 52: out1 = b[1] >> 20; break; \
    case 53: out1 = b[1] >> 21; break; \
    case 54: out1 = b[1] >> 22; break; \
    case 55: out1 = b[1] >> 23; break; \
    case 56: out1 = b[1] >> 24; break; \
    case 57: out1 = b[1] >> 25; break; \
    case 58: out1 = b[1] >> 26; break; \
    case 59: out1 = b[1] >> 27; break; \
    case 60: out1 = b[1] >> 28; break; \
    case 61: out1 = b[1] >> 29; break; \
    case 62: out1 = b[1] >> 30; break; \
    case 63: out1 = b[1] >> 31; break; \
    case 64: out1 = b[2] >> 0; break; \
    case 65: out1 = b[2] >> 1; break; \
    case 66: out1 = b[2] >> 2; break; \
    case 67: out1 = b[2] >> 3; break; \
    case 68: out1 = b[2] >> 4; break; \
    case 69: out1 = b[2] >> 5; break; \
    case 70: out1 = b[2] >> 6; break; \
    case 71: out1 = b[2] >> 7; break; \
    case 72: out1 = b[2] >> 8; break; \
    case 73: out1 = b[2] >> 9; break; \
    case 74: out1 = b[2] >> 10; break; \
    case 75: out1 = b[2] >> 11; break; \
    case 76: out1 = b[2] >> 12; break; \
    case 77: out1 = b[2] >> 13; break; \
    case 78: out1 = b[2] >> 14; break; \
    case 79: out1 = b[2] >> 15; break; \
    case 80: out1 = b[2] >> 16; break; \
    case 81: out1 = b[2] >> 17; break; \
    case 82: out1 = b[2] >> 18; break; \
    case 83: out1 = b[2] >> 19; break; \
    case 84: out1 = b[2] >> 20; break; \
    case 85: out1 = b[2] >> 21; break; \
    case 86: out1 = b[2] >> 22; break; \
    case 87: out1 = b[2] >> 23; break; \
    case 88: out1 = b[2] >> 24; break; \
    case 89: out1 = b[2] >> 25; break; \
    case 90: out1 = b[2] >> 26; break; \
    case 91: out1 = b[2] >> 27; break; \
    case 92: out1 = b[2] >> 28; break; \
    case 93: out1 = b[2] >> 29; break; \
    case 94: out1 = b[2] >> 30; break; \
    case 95: out1 = b[2] >> 31; break; \
    case 96: out1 = b[2] >> 0; break; \
    case 97: out1 = b[2] >> 1; break; \
    case 98: out1 = b[2] >> 2; break; \
    case 99: out1 = b[2] >> 3; break; \
    case 100: out1 = b[2] >> 4; break; \
    case 101: out1 = b[2] >> 5; break; \
    case 102: out1 = b[2] >> 6; break; \
    case 103: out1 = b[2] >> 7; break; \
    case 104: out1 = b[2] >> 8; break; \
    case 105: out1 = b[2] >> 9; break; \
    case 106: out1 = b[2] >> 10; break; \
    case 107: out1 = b[2] >> 11; break; \
    case 108: out1 = b[2] >> 12; break; \
    case 109: out1 = b[2] >> 13; break; \
    case 110: out1 = b[2] >> 14; break; \
    case 111: out1 = b[2] >> 15; break; \
    case 112: out1 = b[2] >> 16; break; \
    case 113: out1 = b[2] >> 17; break; \
    case 114: out1 = b[2] >> 18; break; \
    case 115: out1 = b[2] >> 19; break; \
    case 116: out1 = b[2] >> 20; break; \
    case 117: out1 = b[2] >> 21; break; \
    case 118: out1 = b[2] >> 22; break; \
    case 119: out1 = b[2] >> 23; break; \
    case 120: out1 = b[2] >> 24; break; \
    case 121: out1 = b[2] >> 25; break; \
    case 122: out1 = b[2] >> 26; break; \
    case 123: out1 = b[2] >> 27; break; \
    case 124: out1 = b[2] >> 28; break; \
    case 125: out1 = b[2] >> 29; break; \
    case 126: out1 = b[2] >> 30; break; \
    case 127:  out1 = b[2] >> 31; break; \
    default:  __builtin_unreachable(); break; \
  }
  unsigned int r = src();
  SWITCH(r >> 1 & 0x7fU, v1);
  SWITCH(r >> 27 | (r << 5 & 0x60U), v2);
  SWITCH(r >> 2 & 0x7fu, v3);
  SWITCH(r >> 9 & 0x7fu, v4);
  SWITCH(r >> 16 & 0x7fu, v5);
  SWITCH(r >> 23 & 0x7fu, v6);
  SWITCH(r >> 30 | (r << 2 & 0x7fu), v7);
  SWITCH(r >> 7 & 0x7fu, v8);
  SWITCH(r >> 12 & 0x7fu, v9);
  SWITCH(r << 4 & 0x7fu, v10);
  SWITCH(r >> 10 & 0x7fu, v11);
  SWITCH(r >> 24 & 0x7fu, v12);
  SWITCH(r >> 31 | (r << 1 & 0x7eu), v13);
  SWITCH(r >> 6 & 0x7fu, v14);
  SWITCH(r >> 13 & 0x7fu, v15);
  SWITCH(r >> 20 & 0x7fu, v16);
  SWITCH(r >> 27 | (r << 5 & 0x60u), v17);
  SWITCH(r >> 3 & 0x7fu, v18);
  SWITCH(r >> 9 & 0x7fu, v19);
  SWITCH(r >> 16 & 0x7fu, v20);
  SWITCH(r >> 23 & 0x7fu, v21);
  SWITCH(r >> 30 | (r << 2 & 0x7cu), v22);
  SWITCH(r >> 7 & 0x7fu, v23);
  SWITCH(r >> 12 & 0x7fu, v24);
  SWITCH(r >> 14 & 0x7fu, v25);
  SWITCH(r >> 15 & 0x7fu, v26);
  return (v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v11 | v12 | v13 |
          v14 | v15 | v16 | v17 | v18 | v19 | v20 | v21 | v22 | v23 | v24 | v25 | v26);
}

If I change the 127 branch to the default branch, the time returns to normal.

diff --git a/oom_manual2.c b/oom_manual2.c
index aab82cf..7da3f79 100644
--- a/oom_manual2.c
+++ b/oom_manual2.c
@@ -130,8 +130,7 @@ int f1(unsigned int *b) {
     case 124: out1 = b[2] >> 28; break; \
     case 125: out1 = b[2] >> 29; break; \
     case 126: out1 = b[2] >> 30; break; \
-    case 127:  out1 = b[2] >> 31; break; \
-    default:  __builtin_unreachable(); break; \
+    default:  out1 = b[2] >> 31; break; \
   }
   unsigned int r = src();
   SWITCH(r >> 1 & 0x7fU, v1);

The complete command I used is as follows:

$ time clang -O1 oom_manual.c main.c -o oom_manual
clang -O1 oom_manual.c main.c -o oom_manual  0.31s user 0.04s system 100% cpu 0.347 total
$ time clang -O1 oom_manual2.c main.c -o oom_manual2
clang -O1 oom_manual2.c main.c -o oom_manual2  3.72s user 0.24s system 100% cpu 3.962 total
$ ls -l
total 272
-rwxr-xr-x 1 dianqk users  56K Jan 21 16:22 oom_manual
-rwxr-xr-x 1 dianqk users 192K Jan 21 16:22 oom_manual2
main.c

int src(void) {
  return 0;
}
int main(int argc, char **argv) {
  extern int f1(unsigned int *b, unsigned int r);
  f1(0, 0);
  return 0;
}

I've found that the main time-consumption is Early Tail Duplication.

$ time clang -O1 -c oom_manual2.c -emit-llvm -o oom_manual2.bc
clang -O1 -c oom_manual2.c -emit-llvm -o oom_manual2.bc  0.07s user 0.01s system 99% cpu 0.086 total
$ time llc -O1 oom_manual2.bc -filetype=null
llc -O1 oom_manual2.bc -filetype=null  3.65s user 0.22s system 99% cpu 3.868 total
$ llc -O1 oom_manual2.bc -filetype=null -time-trace
$ cat out.time-trace | jq -r '.traceEvents[] | select(.name == "RunPass" or .name == "Total RunPass") | "\(.dur) \(.name) \(.args.detail)"' | sed 's/"//g' | sort -nr | head -n5

3800384 Total RunPass null
1221923 RunPass Early Tail Duplication
1100348 RunPass Eliminate PHI nodes for register allocation
473867 RunPass Register Coalescer
120407 RunPass Live Variable Analysis

By adding the new TimeTraceScope, I can see that TailDuplicator::tailDuplicateAndUpdate and TailDuplicator::shouldTailDuplicate take up most of the time.

TimeTraceScope

diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 5ed67bd0a121..80401e75c85d 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -158,6 +158,7 @@ bool TailDuplicator::tailDuplicateAndUpdate(
     SmallVectorImpl<MachineBasicBlock*> *DuplicatedPreds,
     function_ref<void(MachineBasicBlock *)> *RemovalCallback,
     SmallVectorImpl<MachineBasicBlock *> *CandidatePtr) {
+  llvm::TimeTraceScope TimeScope("TailDuplicator::tailDuplicateAndUpdate");
   // Save the successors list.
   SmallSetVector<MachineBasicBlock *, 8> Succs(MBB->succ_begin(),
                                                MBB->succ_end());
@@ -555,6 +556,7 @@ void TailDuplicator::updateSuccessorsPHIs(
 /// Determine if it is profitable to duplicate this block.
 bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
                                          MachineBasicBlock &TailBB) {
+  llvm::TimeTraceScope TimeScope("TailDuplicator::shouldTailDuplicate");
   // When doing tail-duplication during layout, the block ordering is in flux,
   // so canFallThrough returns a result based on incorrect information and
   // should just be ignored.


The original question is as follows (I've abbreviated it a bit)

int src();
int f1(unsigned int *b) {
#define SWITCH(in1, out1) \
  unsigned int out1;\
  switch (in1) {           \
    case 0: out1 = b[0] >> 0; break; \
...
    case 126: out1 = b[2] >> 30; break; \
    default:  out1 = b[2] >> 31; break; \
  }
  unsigned int r = src();
  SWITCH(r >> 1 & 0x7fU, v1);
...
  SWITCH(r >> 15 & 0x7fu, v26);
  return (v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v11 | v12 | v13 |
          v14 | v15 | v16 | v17 | v18 | v19 | v20 | v21 | v22 | v23 | v24 | v25 | v26);
}
OLD=`/usr/bin/time -f '%M' clang_before -cc1 -O1  -emit-obj  oom_manual.cc -o /dev/null 2>&1`
NEW=`/usr/bin/time -f '%M' clang_after -cc1 -O1  -emit-obj  oom_manual.cc -o /dev/null 2>&1`
echo $OLD
echo $NEW
expr $OLD "*" 10 "<" $NEW > /dev/null
$ ./test_manual.sh
119816
1248148
$ echo $?
0

oom_manual.cc.txt

Originally posted by @dwblaikie in #76669 (comment)

cc @aeubanks @joanahalili @alexfh

@DianQK DianQK self-assigned this Jan 18, 2024
@DianQK DianQK added the llvm:codesize Code size issues label Jan 21, 2024
bors added a commit to rust-lang-ci/rust that referenced this issue Jan 23, 2024
…tchs, r=<try>

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
bors added a commit to rust-lang-ci/rust that referenced this issue Jan 26, 2024
…tchs, r=oli-obk

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
Nadrieril added a commit to Nadrieril/rust that referenced this issue Jan 27, 2024
…witchs, r=oli-obk

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
bors added a commit to rust-lang-ci/rust that referenced this issue Jan 27, 2024
…tchs, r=oli-obk

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
bors added a commit to rust-lang-ci/rust that referenced this issue Feb 14, 2024
…tchs, r=<try>

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
@DianQK
Copy link
Member Author

DianQK commented Feb 24, 2024

loop.ll.txt
sequential.ll.txt

The current known IR includes two types. I conducted compile-time and runtime performance analysis based on these two IRs, without considering code size in this analysis.

First, here are the characteristics of these two IRs:

  1. The structure of the first IR consists of multiple consecutive switch statements, each switch statement having 128 branches. This one is named sequential.ll.
  2. The second IR contains a switch statement within a loop, with 1548 branches. This one is named loop.ll.

Time Analysis

Firstly, in loop.ll, the primary time consumption does not occur within early-tailduplication, but rather in subsequent passes. A log is as follows:

===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 23.9270 seconds (23.9281 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   7.4577 ( 31.5%)   0.0000 (  0.0%)   7.4577 ( 31.2%)   7.4587 ( 31.2%)  Control Flow Optimizer
   6.0207 ( 25.4%)   0.0000 (  0.0%)   6.0207 ( 25.2%)   6.0207 ( 25.2%)  Eliminate PHI nodes for register allocation
   5.1544 ( 21.7%)   0.0050 (  2.2%)   5.1594 ( 21.6%)   5.1594 ( 21.6%)  Machine Cycle Info Analysis
   1.1156 (  4.7%)   0.0899 ( 39.7%)   1.2055 (  5.0%)   1.2055 (  5.0%)  Early Tail Duplication

In sequential.ll, the main time consumption is within early-tailduplication.

===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 3.8586 seconds (3.8587 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.0729 ( 29.8%)   0.1851 ( 72.9%)   1.2580 ( 32.6%)   1.2580 ( 32.6%)  Early Tail Duplication
   1.0898 ( 30.2%)   0.0000 (  0.0%)   1.0898 ( 28.2%)   1.0898 ( 28.2%)  Eliminate PHI nodes for register allocation
   0.4562 ( 12.7%)   0.0000 (  0.0%)   0.4562 ( 11.8%)   0.4562 ( 11.8%)  Register Coalescer
   0.1098 (  3.0%)   0.0099 (  3.9%)   0.1198 (  3.1%)   0.1198 (  3.1%)  Live Variable Analysis

Due to the ease of modifying loop.ll, I analyzed the relationship between the number of cases in the switch statement and time as follows:

branch count instructions:u wall clock
100 263315652 0.0203
200 2884081031 0.1543
300 7542067632 0.3801
400 15273164392 0.7649
500 27634154267 1.3796
600 46142991771 2.1719
700 70542257837 3.4965
800 98818321269 4.7965
900 137669121804 6.7335
1000 185039464508 8.6995
1100 242238896172 11.7165
1200 313184469140 14.9512
1300 397876721715 18.9531
1400 506221001288 23.5489
1500 607635891640 28.5366

image
image

Based on the data above, the relationship between compile time and the number of branches appears to be O(n^2).

The less drastic increase in compilation time for sequential.ll compared to loop.ll may be due to differences in the number of branches.

I believe the main issue causing this increased time is the duplication of critical BBs. A critical BB refers to one with multiple predecessors and multiple successors. In sequential.ll, there are 25 critical BBs with 128 predecessors and successors each, while loop.ll has one critical BB with 1548 predecessors and successors. After early-tailduplication, this critical BB in loop.ll becomes 2953 predecessors and remains 1548 successors. I speculate that duplicating critical BBs makes CFG exceptionally complex, especially within loops, which may be the primary reason for increased time consumption by other passes.

Runtime Performance

Per this comment, I assume that early-tailduplication itself does not cause performance issues. However, I still encountered a performance problem: numerous PHI nodes may result in a significant number of stack operation instructions.

In sequential.ll, there are 15848 instructions related to rsp, reduced to 28 when using -disable-early-taildup. However, there are no rsp related instructions in loop.ll. I speculate that in sequential.ll, the consecutive switch statements make it difficult for PHI nodes to reuse a register.

What to Do?

Backend isn't my suit, and I have zero knowledge of register allocation/instruction selection algorithms. :)
However, I believe the key issue here is the duplication of critical BBs, which increases the number of critical BBs or BBs with multiple successors or predecessors.

My suggestion is to refrain from duplicating critical BBs. However, I cannot restrict the duplication to only BBs with one predecessor or one successor, as such duplication is clearly beneficial.

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 28, 2024
…tchs, r=<try>

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520. Fixes rust-lang#110097.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
bors added a commit to rust-lang-ci/rust that referenced this issue Feb 29, 2024
…tchs, r=<try>

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520. Fixes rust-lang#110097.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
bors added a commit to rust-lang-ci/rust that referenced this issue Feb 29, 2024
…tchs, r=<try>

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520. Fixes rust-lang#110097.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
bors added a commit to rust-lang-ci/rust that referenced this issue Mar 8, 2024
…tchs, r=<try>

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520. Fixes rust-lang#110097.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
bors added a commit to rust-lang-ci/rust that referenced this issue Mar 8, 2024
…tchs, r=oli-obk

Replace the default branch with an unreachable branch If it is the last variant

Fixes rust-lang#119520. Fixes rust-lang#110097.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
github-actions bot pushed a commit to rust-lang/miri that referenced this issue Mar 9, 2024
…li-obk

Replace the default branch with an unreachable branch If it is the last variant

Fixes #119520. Fixes #110097.

LLVM currently has limited ability to eliminate dead branches in switches, even with the patch of llvm/llvm-project#73446.

The main reasons are as follows:

- Additional costs are required to calculate the range of values, and there exist many scenarios that cannot be analyzed accurately.
- Matching values by bitwise calculation cannot handle odd branches, nor can it handle values like `-1, 0, 1`. See [SimplifyCFG.cpp#L5424](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5424) and https://llvm.godbolt.org/z/qYMqhvMa8
- The current range information is continuous, even if the metadata for the range is submitted. See [ConstantRange.cpp#L1869-L1870](https://github.com/llvm/llvm-project/blob/llvmorg-17.0.6/llvm/lib/IR/ConstantRange.cpp#L1869-L1870).
- The metadata of the range may be lost in passes such as SROA. See https://rust.godbolt.org/z/e7f87vKMK.

Although we can make improvements, I think it would be more appropriate to put this issue to rustc first. After all, we can easily know the possible values.

Note that we've currently found a slow compilation problem in the presence of unreachable branches. See
llvm/llvm-project#78578.

r? compiler
@DianQK DianQK removed their assignment Mar 16, 2024
DianQK added a commit that referenced this issue Apr 17, 2024
…tail duplicating blocks (#78582)

Fixes #78578.

Duplicating a BB which has both multiple predecessors and successors
will result in a complex CFG and also may cause huge amount of PHI
nodes. See
#78578 (comment)
for a detailed description of the limit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant