[X86] Worse runtime performance on Zen 4 CPU when optimizing for znver4 or skylake #91370
@llvm/issue-subscribers-backend-x86 Author: Chris (Systemcluster)
The following code runs around 300% slower on Zen 4 when optimized for `znver4` or `skylake` than when optimized for `znver3` or other targets.
```rust
pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}
```

<details>
<summary>Full code</summary>

```rust
pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}

fn main() {
    let nums = std::hint::black_box(generate());
    let now = std::time::Instant::now();
    let sum = sum(&nums);
    println!("{:?} / {}", now.elapsed(), sum);
}

fn generate() -> Vec<i64> {
    let mut v = Vec::new();
    for i in 0..1000000000 {
        v.push(i);
    }
    v
}
```

</details>

Running on a Ryzen 7950X:

```
> rustc.exe -Ctarget-cpu=x86-64-v4 -Copt-level=3 .\src\main.rs && ./main.exe
138.7342ms / 499999999500000000
> rustc.exe -Ctarget-cpu=x86-64-v3 -Copt-level=3 .\src\main.rs && ./main.exe
136.2689ms / 499999999500000000
> rustc.exe -Ctarget-cpu=x86-64 -Copt-level=3 .\src\main.rs && ./main.exe
136.0648ms / 499999999500000000
> rustc.exe -Ctarget-cpu=znver4 -Copt-level=3 .\src\main.rs && ./main.exe
543.1562ms / 499999999500000000
> rustc.exe -Ctarget-cpu=znver3 -Copt-level=3 .\src\main.rs && ./main.exe
137.4426ms / 499999999500000000
> rustc.exe -Ctarget-cpu=skylake -Copt-level=3 .\src\main.rs && ./main.exe
588.4743ms / 499999999500000000
> rustc.exe -Ctarget-cpu=haswell -Copt-level=3 .\src\main.rs && ./main.exe
138.5313ms / 499999999500000000
```

Disassembly here: https://godbolt.org/z/fzaGhGdWW

The tested optimization targets all generate different assembly with different levels of unrolling, but the `znver4` and `skylake` targets seem to be outliers. I don't know whether the `skylake` target has the same issue or whether it's just caused by an optimization target / CPU mismatch, but both result in the long list of constant values and show similar runtime performance. I also didn't test targets other than those listed above.

Split from #90985 (comment)
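For intuition about what the fast targets are doing, here is a sketch in plain Rust of the shape their unrolled code takes: independent accumulator lanes fed by contiguous reads. This illustrates the access pattern only; it is not the actual compiler output.

```rust
/// Sketch: the reduction the fast targets emit amounts to per-lane
/// accumulators over contiguous chunks. Every address the loop touches is
/// consecutive, so plain vector loads suffice and no gather addressing is
/// ever required.
pub fn sum_lanes(a: &[i64]) -> i64 {
    let mut acc = [0i64; 8];
    a.chunks_exact(8).for_each(|x| {
        for lane in 0..8 {
            acc[lane] += x[lane];
        }
    });
    // Like the original `sum`, the tail left over by chunks_exact is dropped.
    acc.iter().sum()
}
```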
It is needlessly using gather instructions. Gather instructions are extremely slow: 1 uop per loaded element. Someone should look into why it is using them when it does not need to.
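To make the contrast concrete, here is a minimal sketch using `std::arch` intrinsics of two ways to load four consecutive `i64`s, assuming an AVX2-capable x86-64 target. The function names are illustrative, and the throughput figures in the comments are rough numbers from uops.info:

```rust
#[cfg(target_arch = "x86_64")]
mod load_demo {
    use core::arch::x86_64::*;

    /// Contiguous form: one plain 256-bit load of four consecutive i64s,
    /// roughly two of these can issue per cycle on recent cores.
    #[target_feature(enable = "avx")]
    pub unsafe fn load_four_i64(p: *const i64) -> __m256i {
        _mm256_loadu_si256(p as *const __m256i)
    }

    /// Gather form of the *same* contiguous access: indices 0..4 scaled by
    /// 8 bytes. This decodes to many uops and sustains only about one
    /// gather per 4 cycles on Zen 4 and Skylake.
    #[target_feature(enable = "avx2")]
    pub unsafe fn gather_four_i64(p: *const i64) -> __m256i {
        let idx = _mm256_setr_epi64x(0, 1, 2, 3);
        _mm256_i64gather_epi64::<8>(p, idx)
    }
}
```

Both functions return the same four elements; the gather form buys nothing here because the addresses are already consecutive.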
It seems to generate gather instructions when enabling the `avx512f` target feature:

```
> rustc.exe -Ctarget-cpu=znver3 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
548.638ms / 499999999500000000
```

It doesn't, however, when adding it to the `x86-64-v4` target:

```
> rustc.exe -Ctarget-cpu=x86-64-v4 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
142.0749ms / 499999999500000000
```

But it does when adding it to `x86-64-v3`:

```
> rustc.exe -Ctarget-cpu=x86-64-v3 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
562.7708ms / 499999999500000000
```

I assume the issue might be here: llvm-project/llvm/lib/Target/X86/X86TargetTransformInfo.cpp, lines 5751 to 5762 (at 17daa20)

Every AVX512-supporting target seems to be treated as having a minuscule gather overhead, which must be wrong. (I'm not sure why `x86-64-v4` doesn't seem to be affected.)
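Paraphrased, the complaint is that the overhead cost is keyed on feature availability rather than on measured per-CPU behavior. A loose illustrative sketch of that shape (Rust pseudocode; the names and constants here are mine, not LLVM's actual TTI API):

```rust
// Illustrative only: the real logic lives in X86TargetTransformInfo.cpp and
// uses LLVM's cost-model types. The point is the *shape* of the decision:
// any CPU exposing AVX512 falls into the cheap branch, regardless of
// whether its hardware gather is actually fast.
fn gather_overhead_per_element(has_avx512: bool, has_avx2: bool) -> u32 {
    if has_avx512 {
        2 // tiny flat overhead: implicitly assumes a fast hardware gather
    } else if has_avx2 {
        4 // somewhat higher, still optimistic
    } else {
        u32::MAX // no usable gather: the access is scalarized instead
    }
}
```

On a model of this shape, a `znver4` part is indistinguishable from a fast-gather Intel part.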
The costs are low for CPUs we assume to have fast gather/scatter (including anything that uses avx512 instructions). The numbers are definitely closer to top-end Intel CPUs than Zen 4, but are optimistic even then. The overhead costs are just a fudge, and are likely to only help with throughput costs for vectorisation, not size/latency costs for unrolling etc.
@ganeshgit The znver4 scheduler model seems to be missing gather/scatter instruction entries. This alone won't fix the problem, but it will help with the analysis to determine optimal cost-model table numbers.
It generates gather for skylake, which does not have AVX512, but not for skylake-avx512 (which does). Zen 4's gathers are just as fast as Skylake's: https://uops.info/html-instr/VGATHERQPD_YMM_VSIB_YMM_YMM.html

At a throughput of 1/4 cycles, that is 8x slower than a plain vector load. Note that this means even the scalar loop can load elements at a higher rate. If using a gather speeds the rest of the code up by enabling vectorization, it can still be profitable, and better than not vectorizing at all. I don't know if the loop vectorizer (without VPlan) can compare different options at all, but if any vectorization options exist that avoid gather and scatter (as is the case here), those should be taken.
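Spelling out that arithmetic with the rough numbers above (one 4-element gather per 4 cycles, versus roughly two 4-element vector loads per cycle):

$$
\text{gather: } \frac{4\ \text{elements}}{4\ \text{cycles}} = 1\ \text{element/cycle}, \qquad \text{plain loads: } \frac{2 \times 4\ \text{elements}}{1\ \text{cycle}} = 8\ \text{elements/cycle}
$$

so the gathered loop moves data at one eighth the rate, before even counting the extra uops the gather occupies.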
Do you mean these:
It is a bit tricky how the latencies are measured and reported in uops.info! As you indicated, an instruction serviced from the microcode ROM/sequencer is costly and can be preferred only based on the instruction mix it operates with. With a high number of uops, the pipelines get choked heavily, so it is always good to avoid the sequence unless a user knows and wants gather instructions at any cost. So, fastGather must be disabled for zen3 & zen4.
OK, so this looks like we just need to update the TTI gather/scatter overhead costs to not implicitly assume TuningFastGather for avx512 targets (I'm assuming we can reuse TuningFastGather for scatter as well). All recent Intel CPUs (Skylake onward) already have TuningFastGather in their tuning flags so I don't think anything needs updating there. There is the question of whether x86-64-v4 should still set TuningFastGather or not. @phoebewang Do you know of any other side-effects we need to consider?
I don't know. CC @gpei-dev |
An alternative is that we just set the TuningPreferNoGather/TuningPreferNoScatter flags for znver4.
I will post the code changes for this. @phoebewang what is your take on having NoScatter and NoGather for x86-64-v4? |
Intel has optimized gather instructions for years; big cores' gather performance is better than simulated gather most of the time.
I'm seeing this also: massive slowdowns on znver5 compared with haswell (running on an AMD 370 HX). Here's my code:

```rust
pub fn strip_scalar(scratch: &mut [f32; 4096], x: usize, width: usize, alphas: &[u32], color: [f32; 4]) {
    for (z, a) in scratch[x * 4..][..4 * width].chunks_exact_mut(16).zip(alphas) {
        for j in 0..4 {
            let mask_alpha = ((*a >> (j * 8)) & 0xff) as f32 * (1.0 / 255.0);
            let one_minus_alpha = 1.0 - mask_alpha * color[3];
            for i in 0..4 {
                z[j * 4 + i] = z[j * 4 + i].mul_add(one_minus_alpha, mask_alpha * color[i]);
            }
        }
    }
}
```

With `-Ctarget-cpu=haswell` the timing is reasonable, and with `-Ctarget-cpu=znver5` it is massively slower.

In the generated asm I'm seeing a lot of extremely strange instruction choices, including lots of scatters and gathers and mask manipulation. I don't actually understand what it's doing or why. For reference, my hand-optimized intrinsic code (not yet fully validated, so this number may change) runs in 262.86 ns. It's not doing anything super clever.
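A minimal harness one might use to reproduce the comparison; everything here (buffer contents, `x`, `width`, iteration count) is an assumption for illustration, not the commenter's actual benchmark:

```rust
fn main() {
    // Hypothetical inputs: mid-gray scratch buffer, arbitrary alpha words.
    let mut scratch = std::hint::black_box([0.5f32; 4096]);
    let alphas: Vec<u32> = (0..64u32).map(|i| 0x8040_2010u32.rotate_left(i % 32)).collect();
    let color = [0.25f32, 0.5, 0.75, 1.0];

    let iters: u32 = 100_000;
    let now = std::time::Instant::now();
    for _ in 0..iters {
        // x = 0, width = 64 touches 4 * 64 = 256 floats per call.
        strip_scalar(&mut scratch, 0, 64, std::hint::black_box(&alphas), color);
    }
    let per_call = now.elapsed() / iters;
    // Print a checksum so the optimizer can't delete the work.
    println!("{:?} per call / checksum {}", per_call, scratch.iter().sum::<f32>());
}
```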
The above also generates bad code (lots of gather and scatter) with other targets.
Instead of NoScatter/NoGather, can we use a tuning flag that only prefers scatter/gather for variable pointer offsets and/or variable masks?
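To illustrate the distinction such a flag would draw (plain Rust for illustration; these functions are mine, not tied to any LLVM API): in the first loop the load addresses are an affine function of the loop counter, so a vectorizer can use ordinary consecutive loads; the second, data-dependent pattern is the case where a hardware gather can genuinely pay off.

```rust
/// Constant (affine) stride: consecutive addresses, so a vectorizer can
/// cover this with ordinary wide loads. No gather is needed.
pub fn sum_contiguous(data: &[i64]) -> i64 {
    data.iter().sum()
}

/// Variable pointer offsets: the load addresses depend on runtime data in
/// `indices`, so vectorizing the loads requires a gather (or staying scalar).
pub fn sum_indexed(data: &[i64], indices: &[usize]) -> i64 {
    indices.iter().map(|&i| data[i]).sum()
}
```

The regressions reported in this issue are all of the first kind: the addresses are consecutive, so the emitted gathers are pure overhead.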