[X86] Worse runtime performance on Zen 4 CPU when optimizing for znver4 or skylake #91370
@llvm/issue-subscribers-backend-x86 Author: Chris (Systemcluster)
The following code runs around 300% slower on Zen 4 when optimized for `znver4` or `skylake` than when optimized for `znver3` or other targets.
```rust
pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}
```

<details>
<summary>Full code</summary>

```rust
pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}

fn main() {
    let nums = std::hint::black_box(generate());
    let now = std::time::Instant::now();
    let sum = sum(&nums);
    println!("{:?} / {}", now.elapsed(), sum);
}

fn generate() -> Vec<i64> {
    let mut v = Vec::new();
    for i in 0..1000000000 {
        v.push(i);
    }
    v
}
```

</details>

Running on a Ryzen 7950X:

```
> rustc.exe -Ctarget-cpu=x86-64-v4 -Copt-level=3 .\src\main.rs && ./main.exe
138.7342ms / 499999999500000000
> rustc.exe -Ctarget-cpu=x86-64-v3 -Copt-level=3 .\src\main.rs && ./main.exe
136.2689ms / 499999999500000000
> rustc.exe -Ctarget-cpu=x86-64 -Copt-level=3 .\src\main.rs && ./main.exe
136.0648ms / 499999999500000000
> rustc.exe -Ctarget-cpu=znver4 -Copt-level=3 .\src\main.rs && ./main.exe
543.1562ms / 499999999500000000
> rustc.exe -Ctarget-cpu=znver3 -Copt-level=3 .\src\main.rs && ./main.exe
137.4426ms / 499999999500000000
> rustc.exe -Ctarget-cpu=skylake -Copt-level=3 .\src\main.rs && ./main.exe
588.4743ms / 499999999500000000
> rustc.exe -Ctarget-cpu=haswell -Copt-level=3 .\src\main.rs && ./main.exe
138.5313ms / 499999999500000000
```

Disassembly here: https://godbolt.org/z/fzaGhGdWW

The tested optimization targets all generate different assembly with different levels of unrolling, but the `znver4` and `skylake` targets seem to be outliers. I don't know whether the `skylake` target has the same issue or whether it's just caused by an optimization target / CPU mismatch, but both result in the long list of constant values and show similar runtime performance. I also didn't test targets other than those listed above.

Split from #90985 (comment)
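For intuition about what the fast targets are doing, here is a sketch in plain Rust of the shape their unrolled code takes: independent accumulator lanes fed by contiguous reads. This illustrates the access pattern only; it is not the actual compiler output.

```rust
/// Sketch: the reduction the fast targets emit amounts to per-lane
/// accumulators over contiguous chunks. Every address the loop touches is
/// consecutive, so plain vector loads suffice and no gather addressing is
/// ever required.
pub fn sum_lanes(a: &[i64]) -> i64 {
    let mut acc = [0i64; 8];
    a.chunks_exact(8).for_each(|x| {
        for lane in 0..8 {
            acc[lane] += x[lane];
        }
    });
    // Like the original `sum`, the tail left over by chunks_exact is dropped.
    acc.iter().sum()
}
```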
It is needlessly using gather instructions. Gather instructions are extremely slow: 1 uop per loaded element. Someone should look into why it is using them when it does not need to.
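To make the contrast concrete, here is a minimal sketch using `std::arch` intrinsics of two ways to load four consecutive `i64`s, assuming an AVX2-capable x86-64 target. The function names are illustrative, and the throughput figures in the comments are rough numbers from uops.info:

```rust
#[cfg(target_arch = "x86_64")]
mod load_demo {
    use core::arch::x86_64::*;

    /// Contiguous form: one plain 256-bit load of four consecutive i64s,
    /// roughly two of these can issue per cycle on recent cores.
    #[target_feature(enable = "avx")]
    pub unsafe fn load_four_i64(p: *const i64) -> __m256i {
        _mm256_loadu_si256(p as *const __m256i)
    }

    /// Gather form of the *same* contiguous access: indices 0..4 scaled by
    /// 8 bytes. This decodes to many uops and sustains only about one
    /// gather per 4 cycles on Zen 4 and Skylake.
    #[target_feature(enable = "avx2")]
    pub unsafe fn gather_four_i64(p: *const i64) -> __m256i {
        let idx = _mm256_setr_epi64x(0, 1, 2, 3);
        _mm256_i64gather_epi64::<8>(p, idx)
    }
}
```

Both functions return the same four elements; the gather form buys nothing here because the addresses are already consecutive.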
It seems to generate gather instructions when enabling the `avx512f` target feature:

```
> rustc.exe -Ctarget-cpu=znver3 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
548.638ms / 499999999500000000
```

It doesn't, however, when adding it to the `x86-64-v4` target:

```
> rustc.exe -Ctarget-cpu=x86-64-v4 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
142.0749ms / 499999999500000000
```

But it does when adding it to `x86-64-v3`:

```
> rustc.exe -Ctarget-cpu=x86-64-v3 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
562.7708ms / 499999999500000000
```

I assume the issue might be here: llvm-project/llvm/lib/Target/X86/X86TargetTransformInfo.cpp, lines 5751 to 5762 (at 17daa20)

Every AVX512-supporting target seems to be treated as having a minuscule gather overhead, which must be wrong. (I'm not sure why `x86-64-v4` doesn't seem to be affected.)
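Paraphrased, the complaint is that the overhead cost is keyed on feature availability rather than on measured per-CPU behavior. A loose illustrative sketch of that shape (Rust pseudocode; the names and constants here are mine, not LLVM's actual TTI API):

```rust
// Illustrative only: the real logic lives in X86TargetTransformInfo.cpp and
// uses LLVM's cost-model types. The point is the *shape* of the decision:
// any CPU exposing AVX512 falls into the cheap branch, regardless of
// whether its hardware gather is actually fast.
fn gather_overhead_per_element(has_avx512: bool, has_avx2: bool) -> u32 {
    if has_avx512 {
        2 // tiny flat overhead: implicitly assumes a fast hardware gather
    } else if has_avx2 {
        4 // somewhat higher, still optimistic
    } else {
        u32::MAX // no usable gather: the access is scalarized instead
    }
}
```

On a model of this shape, a `znver4` part is indistinguishable from a fast-gather Intel part.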
The costs are low for CPUs we assume to have fast gather/scatter (including anything that uses avx512 instructions). The numbers are definitely closer to top-end Intel CPUs than Zen 4, but are optimistic even then. The overhead costs are just a fudge, and are likely to only help with throughput costs for vectorisation, not size/latency costs for unrolling etc.
@ganeshgit The znver4 scheduler model seems to be missing gather/scatter instruction entries. This alone won't fix the problem, but it will help with the analysis to determine optimal cost-model table numbers.
It generates gather for skylake, which does not have AVX512, but not for skylake-avx512 (which does). Zen 4's gathers are just as fast as Skylake's: https://uops.info/html-instr/VGATHERQPD_YMM_VSIB_YMM_YMM.html

At a throughput of 1/4 cycles, that is 8x slower than a plain vector load. Note that this means even the scalar loop can load elements at a higher rate. If using a gather speeds the rest of the code up by enabling vectorization, it can still be profitable, and better than not vectorizing at all. I don't know if the loop vectorizer (without VPlan) can compare different options at all, but if any vectorization options exist that avoid gather and scatter (as is the case here), those should be taken.
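Spelling out that arithmetic with the rough numbers above (one 4-element gather per 4 cycles, versus roughly two 4-element vector loads per cycle):

$$
\text{gather: } \frac{4\ \text{elements}}{4\ \text{cycles}} = 1\ \text{element/cycle}, \qquad \text{plain loads: } \frac{2 \times 4\ \text{elements}}{1\ \text{cycle}} = 8\ \text{elements/cycle}
$$

so the gathered loop moves data at one eighth the rate, before even counting the extra uops the gather occupies.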
Do you mean these:
It is a bit tricky how the latencies are measured and reported in uops.info! As you indicated, an instruction serviced from the microcode ROM/sequencer is costly and can be preferred only based on the instruction mix it operates with. With a high number of uops, the pipelines get choked heavily, so it is always good to avoid the sequence unless a user knows and wants gather instructions at any cost. So, fastGather must be disabled for zen3 & zen4.
OK, so this looks like we just need to update the TTI gather/scatter overhead costs to not implicitly assume TuningFastGather for avx512 targets (I'm assuming we can reuse TuningFastGather for scatter as well). All recent Intel CPUs (Skylake onward) already have TuningFastGather in their tuning flags so I don't think anything needs updating there. There is the question of whether x86-64-v4 should still set TuningFastGather or not. @phoebewang Do you know of any other side-effects we need to consider?
I don't know. CC @gpei-dev |
An alternative is that we just set the TuningPreferNoGather/TuningPreferNoScatter flags for znver4.
I will post the code changes for this. @phoebewang what is your take on having NoScatter and NoGather for x86-64-v4? |
Intel has optimized gather instructions for years; big cores' gather performance is better than simulated gather most of the time.
I'm seeing this also: massive slowdowns on znver5 compared with haswell (running on an AMD 370 HX). Here's my code:

```rust
pub fn strip_scalar(scratch: &mut [f32; 4096], x: usize, width: usize, alphas: &[u32], color: [f32; 4]) {
    for (z, a) in scratch[x * 4..][..4 * width].chunks_exact_mut(16).zip(alphas) {
        for j in 0..4 {
            let mask_alpha = ((*a >> (j * 8)) & 0xff) as f32 * (1.0 / 255.0);
            let one_minus_alpha = 1.0 - mask_alpha * color[3];
            for i in 0..4 {
                z[j * 4 + i] = z[j * 4 + i].mul_add(one_minus_alpha, mask_alpha * color[i]);
            }
        }
    }
}
```

With `-Ctarget-cpu=haswell` the timing is reasonable, and with `-Ctarget-cpu=znver5` it is massively slower.

In the generated asm I'm seeing a lot of extremely strange instruction choices, including lots of scatters and gathers and mask manipulation. I don't actually understand what it's doing or why. For reference, my hand-optimized intrinsic code (not yet fully validated, so this number may change) runs in 262.86 ns. It's not doing anything super clever.
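A minimal harness one might use to reproduce the comparison; everything here (buffer contents, `x`, `width`, iteration count) is an assumption for illustration, not the commenter's actual benchmark:

```rust
fn main() {
    // Hypothetical inputs: mid-gray scratch buffer, arbitrary alpha words.
    let mut scratch = std::hint::black_box([0.5f32; 4096]);
    let alphas: Vec<u32> = (0..64u32).map(|i| 0x8040_2010u32.rotate_left(i % 32)).collect();
    let color = [0.25f32, 0.5, 0.75, 1.0];

    let iters: u32 = 100_000;
    let now = std::time::Instant::now();
    for _ in 0..iters {
        // x = 0, width = 64 touches 4 * 64 = 256 floats per call.
        strip_scalar(&mut scratch, 0, 64, std::hint::black_box(&alphas), color);
    }
    let per_call = now.elapsed() / iters;
    // Print a checksum so the optimizer can't delete the work.
    println!("{:?} per call / checksum {}", per_call, scratch.iter().sum::<f32>());
}
```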
The above also generates bad code (lots of gather and scatter) with other targets.
Instead of NoScatter/NoGather, can we use a tuning flag that only prefers scatter/gather for variable pointer offsets and/or variable masks?
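To illustrate the distinction such a flag would draw (plain Rust for illustration; these functions are mine, not tied to any LLVM API): in the first loop the load addresses are an affine function of the loop counter, so a vectorizer can use ordinary consecutive loads; the second, data-dependent pattern is the case where a hardware gather can genuinely pay off.

```rust
/// Constant (affine) stride: consecutive addresses, so a vectorizer can
/// cover this with ordinary wide loads. No gather is needed.
pub fn sum_contiguous(data: &[i64]) -> i64 {
    data.iter().sum()
}

/// Variable pointer offsets: the load addresses depend on runtime data in
/// `indices`, so vectorizing the loads requires a gather (or staying scalar).
pub fn sum_indexed(data: &[i64], indices: &[usize]) -> i64 {
    indices.iter().map(|&i| data[i]).sum()
}
```

The regressions reported in this issue are all of the first kind: the addresses are consecutive, so the emitted gathers are pure overhead.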