`<bit>`: Is the `__isa_available` check for `lzcnt` worth the cost? #2849
I've benchmarked this on a Zen 3 CPU (5950X) using google/benchmark with the following simple code:

```cpp
#include <benchmark/benchmark.h>
#include <bit>

volatile unsigned int init;

static void BM_custom(benchmark::State& state) {
    auto v = init;
    for (auto _ : state) {
        benchmark::DoNotOptimize(std::bit_ceil(v++));
    }
}
BENCHMARK(BM_custom);
BENCHMARK_MAIN();
```

Here are the results:
It appears as if such a change would have a positive effect even on the Zen 3 architecture, despite the previous investigation in #2133.

---
Ah, duh, I only searched for existing, open issues. 🤦♂️ The benchmark you posted there results in some "weird" numbers for me. This is fine so far:
It gets weird when I mix & match the functions, for instance:
The solution here appears to be to add a […]
So I plugged the test into google/benchmark: https://gist.github.com/lhecker/5ea8322a8c920312b5c2f3a7bedf06e4 This results in more stable numbers (at approx. 5GHz, edited for clarity):
(I love how […].) This makes me believe that we should make the change to […]. @AlexGuteniev What would your choice be? Given that the […]
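(One way to get stable numbers out of google/benchmark is its repetition aggregates; a hedged usage sketch, not code from the linked gist:)

```cpp
// Repetitions with aggregate-only reporting reduce the influence of frequency
// scaling and scheduler noise on any single run; mean/median/stddev are shown.
BENCHMARK(BM_custom)->Repetitions(20)->ReportAggregatesOnly(true);
```

---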
On an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
---
A colleague kindly ran the benchmark on an i7-9700k PC for me with similar results:
---
So it looks like the micro-benchmark still shows that the branch doesn't cost a lot, and that its cost is comparable to the unexplained variations, yet […]

---
Plus, speculation will eventually make that branch "essentially free", because the value of `__isa_available` will never change.
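For context, a minimal sketch of the dispatch pattern under discussion (simplified; the real check lives in the STL's `_Checked_x86_x64_countl_zero`, and the ISA-level constant here is an assumption on my part):

```cpp
// Hedged sketch of the __isa_available-style dispatch. The global is set once
// during CRT startup and never written again, so the branch is perfectly
// predictable after its first execution.
#include <intrin.h>

extern "C" int __isa_available; // provided by the MSVC CRT

inline int countl_zero_dispatch(unsigned int v) noexcept {
    if (__isa_available >= 5) { // 5 == AVX2 level, used as a proxy for lzcnt
        return static_cast<int>(__lzcnt(v)); // fast path: a single instruction
    }
    unsigned long idx; // slow path: bsr plus a zero test and a subtraction
    return _BitScanReverse(&idx, v) ? 31 - static_cast<int>(idx) : 32;
}
```

---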
It does cost a lot: in binary size. To be specific:
This is practically the same argument for me as for the branchless […]. So I ask: Is a 0.3-0.5 ns (1-2 cycle) improvement really "worth a branch"? Is it worth doubling the binary size of a function? Wouldn't a user of the STL have to be running a few million […]? Basically, to summarize, my argument is identical to […].

---
Even so, it will take a slot in the BTB, and it may be cleared or evicted frequently. We could at least annotate the branch to indicate that the CPU likely has `lzcnt`.

It's really, really, really hard to benchmark the effects of code size. As a practical matter, the additional code size could also prevent the function from being inlined at all, which would definitely be worse than the slow version had inlining happened (especially if that then goes on to prevent autovectorization).

If we get really creative, we could eliminate the BTB entry by making the check a static branch with a "JUMP_LABEL"-type construct, but that could make code size worse, since it may break page sharing for that page. We could also move the `bsr` version into a function, so it won't be inlined with the `lzcnt` code (of course, if you actually need `bsr`, that is rather painful).

llvm-mca results for CPUs that are slow with `bsr` would be interesting, since that could give some estimate of the impact of the BTB entry and the I$ difference. I'm pretty warm to this idea.
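To make two of those mitigations concrete, here is a hedged sketch combining the branch annotation with an out-of-line fallback (identifiers are illustrative, not the STL's):

```cpp
// Sketch only: [[likely]] documents that lzcnt is present on most deployed
// CPUs, and __declspec(noinline) keeps the bsr path out of every call site,
// shrinking the inlined fast path to a compare, a branch, and one lzcnt.
#include <intrin.h>

extern "C" int __isa_available; // MSVC CRT global

__declspec(noinline) int countl_zero_bsr_fallback(unsigned int v) noexcept {
    unsigned long idx;
    return _BitScanReverse(&idx, v) ? 31 - static_cast<int>(idx) : 32;
}

inline int countl_zero(unsigned int v) noexcept {
    if (__isa_available >= 5) [[likely]] {
        return static_cast<int>(__lzcnt(v));
    }
    return countl_zero_bsr_fallback(v); // one call instruction at the call site
}
```

---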
My opinion is that the size impact is worth the optimization. The default […]. Also, the main reason to prefer […]

---
Might get us something, but not too much: the call instruction itself still has a significant size (a near call with a rel32 offset is 5 bytes).

---
I have wonderful news then! 🙂 Here's how it works: […]. By the way, I just noticed that the benchmark gist I posted had a bug - it used […].

---
We talked about this at the weekly maintainer meeting. We don't yet have sufficient information to reach a conclusion - what we'd like to see is:
---
@StephanTLavavej I hope the summary below is alright. Let me know if I should expand on anything in particular. 🙂
All AMD CPUs have slow BSR (including Zen 3). All Intel CPUs have fast BSR (at least on par with `lzcnt`).
a) Real world time (inconsistent results for 9700k):
b) CPU time (consistent results for 9700k):
Binary size:
Unfortunately, I'm not familiar with how to run llvm-mca with MSVC as the compiler.
My thoughts given the above:
---
According to these numbers, on newer Intel CPUs the branchless BSR version is only 2.68% faster, while it results in a 30.33% throughput loss on new AMD CPUs. I do not really see a major win in going branchless, because it makes Intel processors only a sliver faster but makes AMD CPUs a third slower. The win in code size is nice, but if code size is that important, code should be compiled with […]

---
FYI, the 9700k isn't newer than the 8750H. They're both the same architecture, just a different stepping with the same performance characteristics (Coffee Lake H and S). My understanding is that the difference is caused by the desktop vs. mobile power envelope and frequency scaling. If we measure the actual CPU time spent running the benchmark, the improvement on Intel is consistent:
Was there ever a decision on what this STL is optimized for? Real-time throughput or CPU-time throughput (or: workstations vs. laptops, unconstrained vs. constrained thermal/power envelopes)? As seen above, they aren't always identical, and optimizing for one might sometimes be to the detriment of the other.
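(A hedged usage sketch of how google/benchmark can account for CPU time rather than wall-clock time; `BM_custom` is the benchmark from earlier in the thread:)

```cpp
// google/benchmark already reports a CPU-time column alongside wall time.
// MeasureProcessCPUTime() widens that measurement to the whole process, which
// helps when comparing machines with different power envelopes, while
// UseRealTime() would instead make wall time drive the measurement.
BENCHMARK(BM_custom)->MeasureProcessCPUTime();
```

---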
Came here to see if there was already a thread on optimising the BSR-based fallback, and I am not disappointed. I'm afraid I don't have the right tools set up to benchmark this, so apologies for not taking further steps to try to get this over the line. That said, I hope that you can use this suggestion to your advantage!

---
`std::bit_ceil` emits quite a bit of assembly on x64. This seems to occur mostly due to the branch in `_Checked_x86_x64_countl_zero`, and to a lesser extent due to the branch in `bit_ceil` itself. I've written a variant which produces a more compact result: https://godbolt.org/z/q4EEz83aW (It also removes the extra branch on ARM64 by using conditional assignments.)
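(For readers unfamiliar with the function, a minimal sketch of the standard `bit_ceil` recipe follows - not the linked compact variant and not the STL's exact code:)

```cpp
// bit_ceil rounds up to the next power of two. The countl_zero call is where
// the lzcnt-vs-bsr dispatch discussed in this thread happens.
#include <bit>
#include <cstdint>

constexpr std::uint32_t bit_ceil_sketch(std::uint32_t x) noexcept {
    if (x <= 1) {
        return 1; // bit_ceil(0) == bit_ceil(1) == 1 by definition
    }
    // Precondition (as in the standard): the result must be representable,
    // i.e. x <= 2^31; otherwise the shift below would be undefined behavior.
    return std::uint32_t{1} << (32 - std::countl_zero(x - 1u));
}
```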
I've checked the PR that introduced the code (#795), and it appears as if the cost of this if-condition wasn't discussed. The if-condition generally makes sense, though: `bsr` is costly on AMD CPUs (up to Zen 3, 4 cycles/op latency), whereas `lzcnt` is very fast on any architecture (<= 1 cycle). But the check takes up 3 slots in the CPU's branch target buffer (contemporary hardware has ~4096 slots; newly added branches incur an extra 5-20 cycle latency), generates larger binaries after inlining, and unfortunately the added instructions seem to add about ~5 cycles of latency themselves, offsetting the cost of `bsr`.

This makes me wonder: should we drop the `__isa_available` check?