
performance regression in clang-19 when using computed goto #106846

Open
mikulas-patocka opened this issue Aug 31, 2024 · 20 comments · May be fixed by #114990

@mikulas-patocka

Hi

I noticed that the Ajla programming language ( https://www.ajla-lang.cz/ ) runs slower when compiled with clang-19 than with clang-18 or gcc-14. I looked at the assembly, and it turns out that the bytecode interpreter (the file "ipret.c") is not compiled as efficiently as it could be. In particular, clang-19 merges all the "goto *next_label" statements in a function into a single "jmp *" instruction. That reduces code size, but it also hurts branch prediction, because the CPU cannot learn that a single instruction jumps to multiple targets.

I created this example that shows the issue:
http://www.jikos.cz/~mikulas/testcases/clang/computed-goto.c
(use: clang-19 -O2 computed-goto.c && time ./a.out)

The results (in seconds) are here:
http://www.jikos.cz/~mikulas/testcases/clang/computed-goto.txt

We can see that the worst slowdown happens on Sandy Bridge. Zen 4 shows no slowdown, so it seems to have a smart indirect branch predictor that can predict multiple jump targets for a single instruction.

@nikic
Contributor

nikic commented Aug 31, 2024

This is caused by #78582, we get separate indirect gotos with -mllvm -tail-dup-pred-size=30. cc @DianQK Possibly we should not apply the limitation to INDIRECTBR, which is more profitable to tail-duplicate than other instructions.

@mikulas-patocka
Author

mikulas-patocka commented Aug 31, 2024

-mllvm -tail-dup-pred-size=30 helps the testcase, but not Ajla. If you compile Ajla with clang-19 -O2 -mllvm -tail-dup-pred-size=30 and do
$ objdump -d ipret.o | grep 'jmp \*'
it shows that there is just one indirect jump in the whole file.

@nikic
Contributor

nikic commented Aug 31, 2024

What about -mllvm -tail-dup-pred-size=1000? The value needs to be about as large as the number of instructions / indirect jumps.

@mikulas-patocka
Author

Yes - there are 2284 instructions in the Ajla interpreter. So, I set tail-dup-pred-size=3000 and I get the same output as with clang-18 - 2466 indirect jumps in ipret.o.

@pinskia

pinskia commented Aug 31, 2024

Note GCC (RTL) has a specific pass which duplicates the computed gotos (as GCC merges all computed goto into one BB for performance reasons) which was added in GCC 4.0 and then improved for GCC 7.1.0 (to also unduplicate some related instructions). This pass happens late in the pipeline after register allocation. This is just explaining GCC's solution to the problem and nothing more.

@DianQK
Member

DianQK commented Sep 1, 2024

This is caused by #78582, we get separate indirect gotos with -mllvm -tail-dup-pred-size=30. cc @DianQK Possibly we should not apply the limitation to INDIRECTBR, which is more profitable to tail-duplicate than other instructions.

That makes sense. Thanks!
I will check it later.

@DianQK DianQK self-assigned this Sep 1, 2024
@DianQK
Member

DianQK commented Sep 3, 2024

The switch instruction also becomes an INDIRECTBR machine instruction, and I think we should apply the limitation to PHI.
If I remember correctly, both #78578 and #79993 (comment) contain a massive PHI. Duplicating the BB that contains the massive PHI causes the CFG to become exceptionally complex, leading to issues with compile time and runtime performance.

@DianQK
Member

DianQK commented Sep 12, 2024

@mikulas-patocka Can you provide an example of Ajla? I may need it to help investigate this.

BTW, whether using INDIRECTBR directly or coming over from SWITCH, we introduce a lot of PHIs.

@mikulas-patocka
Author

The Ajla compiler is written in Ajla, you can have a look at files in newlib/compiler/. I usually use self-compilation as a benchmark.

Run './configure && make' with CFLAGS='-O2 -DDEBUG_ENV'. The DEBUG_ENV macro enables environment variables that are useful for debugging.

Set the environment variable 'export CG=none' - this will disable the machine code generator and use only the interpreter.

Run the script ./scripts/update.sh - this will compile the Ajla compiler itself and it will re-generate the file builtin.pcd.
The results (on i7-2640M, Sandy Bridge, 2 cores, 4 threads):
gcc-14: 22.8s
clang-18: 22.2s
clang-19: 26.8s
The results (on 7840U, Zen 4, 8 cores, 16 threads):
gcc-14: 4.1s
clang-18: 4.1s
clang-19: 4.4s

If you want some simple benchmark, copy this piece of code to a file loop.ajla

fn main
[
        var sum := 0;
        for i := 0 to ston(args[0]) do
                sum += 1;
        write(h[1], ntos(sum) + nl);
]

and run it with 'CG=none ./ajla --nosave loop.ajla 1000000000'
The results (on i7-2640M):
gcc-14: 17.3s
clang-18: 15.3s
clang-19: 39.0s
On 7840U there is too much jitter in this test.

@DianQK
Member

DianQK commented Oct 27, 2024

By comparing the results from gcc, I can observe the following:

  • In the initial OOM example, which creates multiple consecutive switch statements, gcc did not apply tail duplication.
  • For computed-goto.c and ipret.c, it appears that clang applied more tail duplication (perhaps unnecessarily?).
  • The compile time for ipret.c with clang -O3 is unacceptably long, mainly due to Branch Probability Basic Block Placement (which is definitely a separate issue).

Next, I'll look into some gcc code to see if I can find anything useful. Perhaps gcc is simply differentiating between goto and switch at the source level? :) Although I think they should be interchangeable in this scenario.

Detailed commands

$ time -f '%M' gcc -O3 -c oom_manual2.c -S
129608
$ cat oom_manual2.s | grep "jmp.*%" | wc -l
26
$ wc -l oom_manual2.s
18494 oom_manual2.s

$ time -f '%M' clang -O3 -c oom_manual2.c -mllvm -tail-dup-pred-size=3000 -mllvm -tail-dup-succ-size=3000 -S
836248
$ cat oom_manual2.s | grep "jmp.*%" | wc -l
3081
$ wc -l oom_manual2.s
50527 oom_manual2.s


$ time -f '%M' gcc -O3 -S -c computed-goto.c
47556
$ cat computed-goto.s | grep "jmp.*%" | wc -l
32
$ wc -l computed-goto.s
1184 computed-goto.s

$ time -f '%M' clang -O3 -mllvm -tail-dup-pred-size=3000 -mllvm -tail-dup-succ-size=3000 -S -c computed-goto.c
119084
$ cat computed-goto.s | grep "jmp.*%" | wc -l
51
$ wc -l computed-goto.s
1717 computed-goto.s

@DianQK
Member

DianQK commented Nov 3, 2024

Possibly we should not apply the limitation to INDIRECTBR, which is more profitable to tail-duplicate than other instructions.

A quick note:
Based on the comments below, this is undoubtedly what GCC does: https://github.com/gcc-mirror/gcc/blob/7f67acf60c5429895d7c9e5df81796753e2913e0/gcc/rtlanal.cc#L3615-L3621.

/* Return true if INSN is an indirect jump (aka computed jump).

   Tablejumps and casesi insns are not considered indirect jumps;
   we can recognize them by a (use (label_ref)).  */

I’ll go with this approach, but from the machine instructions: https://llvm.godbolt.org/z/jcje1xPaW, I haven’t found any fundamental difference between computed goto and table jumps. I plan to investigate further and then file this PR.

@nelhage

nelhage commented Feb 13, 2025

It looks like this issue is also impacting CPython: python/cpython#129987 (comment)

I measured a ~4% performance improvement on x86_64 by using an __asm__ volatile("") hack to force the jumps to remain deduplicated; I haven't tested with -mllvm -tail-dup-pred-size=5000 yet but I expect a similar improvement.

@DianQK
Member

DianQK commented Feb 13, 2025

It looks like this issue is also impacting CPython: python/cpython#129987 (comment)

I measured a ~4% performance improvement on x86_64 by using an __asm__ volatile("") hack to force the jumps to remain deduplicated; I haven't tested with -mllvm -tail-dup-pred-size=5000 yet but I expect a similar improvement.

Does #114990 or #116072 fix it?

@nelhage

nelhage commented Feb 13, 2025

I haven't tested, but I do notice that #116072 cites CPython as their motivation so I suspect we're looking at the same issue. Is there somewhere I can download a nightly or otherwise test without doing a source build myself?

@DianQK
Member

DianQK commented Feb 13, 2025

I haven't tested, but I do notice that #116072 cites CPython as their motivation so I suspect we're looking at the same issue. Is there somewhere I can download a nightly or otherwise test without doing a source build myself?

I'll test it. LLVM 20 might not be available in your distribution.

@DianQK
Member

DianQK commented Feb 13, 2025

I can confirm that #116072 does not resolve the issue, whereas #114990 does.

# main branch
$ objdump -S --disassemble=_PyEval_EvalFrameDefault Python/ceval.o | grep -cF 'DISPATCH()'
27
# 114990
$ objdump -S --disassemble=_PyEval_EvalFrameDefault Python/ceval.o | grep -cF 'DISPATCH()'
234

@DianQK
Member

DianQK commented Feb 13, 2025

I can confirm that #116072 does not resolve the issue, whereas #114990 does.

@nikic @fhahn I propose to merge #114990 and backport it. This issue has already affected multiple projects, and Python is a widely used one. In fact, this PR just reverts the computed goto part of #78582.

nelhage-servers pushed a commit to nelhage/cpython-interp-perf that referenced this issue Mar 5, 2025
@nelhage

nelhage commented Mar 6, 2025

I've done some more benchmarking here.

I'd previously estimated the cost here as ~4% for CPython; my current data puts the number at closer to 10% (!).

I have also benchmarked #114990, and confirmed that it fixes the performance regression, as well as fixing the tail-duplication logic (as tested by @DianQK above).

@Fidget-Spinner

Fidget-Spinner commented Mar 6, 2025

As much as I would like the tail call interpreter to beat the computed goto one, fixing the computed goto interpreter is more important as it affects more systems :). Thanks Nelson for your investigation of this, and DianQK for the patch!

@zeux
Contributor

zeux commented Mar 10, 2025

FWIW this also affects Luau interpreter (https://github.com/luau-lang/luau); on a Zen 4 system, running the Luau benchmark suite (which reports geomean deltas) with luau -O2 and using clang-18 built binary as a baseline I get:

'luau-19 -O2' change is -15.590% negative on average
'luau-19-fix -O2' change is 2.966% positive on average

luau-19 is built with clang-19, luau-19-fix is built with additional settings -mllvm -tail-dup-pred-size=256 -mllvm -tail-dup-succ-size=256.

The degradation with clang-20 is less significant but still severe; perhaps the change to the handling of blocks without PHIs helps here, but not enough:

'luau-20 -O2' change is -8.727% negative on average
'luau-20-fix -O2' change is 3.834% positive on average

8 participants