<bit>: consider leaving only static CPU dispatch for lzcnt/bsr and tzcnt/bsr #2133

Closed
AlexGuteniev opened this issue Aug 18, 2021 · 8 comments
Labels
performance Must go faster

Comments

@AlexGuteniev
Contributor

AlexGuteniev commented Aug 18, 2021

Inspired by #2097.

Try this code:

STL/stl/inc/format

Lines 2496 to 2504 in 737ce4a

// Since the bit width of 0 is 0x0, special-case it instead of complicating the math even more.
int _Width = 3;
if (_Value != nullptr) {
    // Compute the bit width of the pointer (i.e. how many bits it takes to be represented).
    // Add 3 to the bit width so we always round up on the division.
    // Divide that by the amount of bits a hexit represents (log2(16) = log2(2^4) = 4).
    // Add 2 for the 0x prefix.
    _Width = static_cast<int>(2 + (_STD bit_width(reinterpret_cast<uintptr_t>(_Value)) + 3) / 4);
}

On Godbolt's compiler explorer: https://godbolt.org/z/z99vjeEhh
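For a worked example: a pointer whose value is 0xFFFF has a bit width of 16, so the computed width is 2 + (16 + 3) / 4 = 6 characters ("0xffff"). Below is a self-contained sketch of the same computation that can be pasted into Compiler Explorer; the function name is made up for illustration, but the arithmetic mirrors the snippet above.

#include <bit>
#include <cstdint>

// Illustrative stand-alone version of the STL snippet above.
int hex_width_of_pointer(const void* value) noexcept {
    int width = 3; // "0x0" for a null pointer
    if (value != nullptr) {
        // bit_width gives the number of significant bits; +3 rounds the
        // division by 4 (bits per hexit) up, and +2 accounts for "0x".
        width = static_cast<int>(
            2 + (std::bit_width(reinterpret_cast<std::uintptr_t>(value)) + 3) / 4);
    }
    return width;
}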

@AlexGuteniev
Contributor Author

Actually, it's not much better. The MSVC optimizer is far from perfect for this case.

@StephanTLavavej added the performance Must go faster label Aug 18, 2021
@StephanTLavavej
Member

@barcharcraz notes that these functions are typically called in perf-critical loops, so micro-optimizations can be important.

Additionally, I note that as long as we don't regress MSVC codegen quality, changes could be worthwhile if they improve Clang/LLVM codegen quality.

@AlexGuteniev changed the title <bit>: consider leaving only static CPU dispatch for lzcnt/bsr <bit>: consider leaving only static CPU dispatch for lzcnt/bsr and tzcnt/bsr Nov 14, 2021
@AlexGuteniev
Contributor Author

I tried to benchmark which approach is better and got conflicting results.

@AlexGuteniev
Contributor Author

The bsr countl_zero:

STL/stl/inc/bit

Lines 149 to 176 in 3c2fd04

template <class _Ty>
_NODISCARD int _Countl_zero_bsr(const _Ty _Val) noexcept {
    constexpr int _Digits = numeric_limits<_Ty>::digits;
    unsigned long _Result;
    if constexpr (_Digits <= 32) {
        if (!_BitScanReverse(&_Result, _Val)) {
            return _Digits;
        }
    } else {
#ifdef _M_IX86
        const unsigned int _High = _Val >> 32;
        if (_BitScanReverse(&_Result, _High)) {
            return static_cast<int>(31 - _Result);
        }
        const auto _Low = static_cast<unsigned int>(_Val);
        if (!_BitScanReverse(&_Result, _Low)) {
            return _Digits;
        }
#else // ^^^ _M_IX86 / !_M_IX86 vvv
        if (!_BitScanReverse64(&_Result, _Val)) {
            return _Digits;
        }
#endif // _M_IX86
    }
    return static_cast<int>(_Digits - 1 - _Result);
}

is branchy, and avoiding it seems worthwhile.
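For contrast, here is a minimal sketch of what a static, lzcnt-only path could look like (illustrative helper names, not the STL's actual code); the point is that it needs no zero check and no branch at all:

#include <intrin.h>

// Sketch assuming the target CPU always supports LZCNT. Unlike BSR, LZCNT is
// defined for a zero input (it returns the operand width). Note that on CPUs
// without LZCNT the same encoding silently executes as BSR and produces
// different results, which is why the fallback exists in the first place.
inline int countl_zero_lzcnt_32(const unsigned int val) noexcept {
    return static_cast<int>(__lzcnt(val));
}

#ifndef _M_IX86
inline int countl_zero_lzcnt_64(const unsigned long long val) noexcept {
    return static_cast<int>(__lzcnt64(val));
}
#endif // !_M_IX86 (32-bit x86 would need two 32-bit counts instead)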

The bsf countr_zero:

STL/stl/inc/limits

Lines 1092 to 1124 in 3c2fd04

template <class _Ty>
_NODISCARD int _Countr_zero_bsf(const _Ty _Val) noexcept {
    constexpr int _Digits = numeric_limits<_Ty>::digits;
    constexpr _Ty _Max    = (numeric_limits<_Ty>::max)();
    unsigned long _Result;
    if constexpr (_Digits <= 32) {
        // Intended widening to int. This operation means that a narrow 0 will widen
        // to 0xFFFF....FFFF0... instead of 0. We need this to avoid counting all the zeros
        // of the wider type.
        if (!_BitScanForward(&_Result, static_cast<unsigned int>(~_Max | _Val))) {
            return _Digits;
        }
    } else {
#ifdef _M_IX86
        const auto _Low = static_cast<unsigned int>(_Val);
        if (_BitScanForward(&_Result, _Low)) {
            return static_cast<int>(_Result);
        }
        const unsigned int _High = _Val >> 32;
        if (!_BitScanForward(&_Result, _High)) {
            return _Digits;
        } else {
            return static_cast<int>(_Result + 32);
        }
#else // ^^^ _M_IX86 / !_M_IX86 vvv
        if (!_BitScanForward64(&_Result, _Val)) {
            return _Digits;
        }
#endif // _M_IX86
    }
    return static_cast<int>(_Result);
}

is branchless (cmov is used), so avoiding it does not seem very useful.
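An analogous sketch of a static, tzcnt-only path (again with illustrative names, assuming BMI1 is always available; not the STL's actual helpers):

#include <immintrin.h>

// TZCNT returns the operand width for a zero input, so no zero check is
// needed; on pre-BMI1 CPUs the encoding silently executes as BSF instead.
inline int countr_zero_tzcnt_32(const unsigned int val) noexcept {
    return static_cast<int>(_tzcnt_u32(val));
}

#ifndef _M_IX86
inline int countr_zero_tzcnt_64(const unsigned long long val) noexcept {
    return static_cast<int>(_tzcnt_u64(val));
}
#endif // !_M_IX86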

@StephanTLavavej
Member

We'd like to see the specific perf data before making a decision here.

@AlexGuteniev
Contributor Author

@StephanTLavavej, I figured it out!

So overall, the advantage of lzcnt/tzcnt is worth a branch with a fallback, and the implementation is optimal as it is!

All anomalies I see are explained by the Intel JCC erratum. Compiling with /QIntel-jcc-erratum fixes the performance for these cases.

I've created DevCom-1603517 to suggest making /QIntel-jcc-erratum the default unless tuning for AMD.

Otherwise, the data suggests that we should keep dynamic dispatch; the occurrences where static-only dispatch wins look purely random. I'm closing this.
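For future readers, the dynamic-dispatch pattern being kept looks roughly like this. This is a sketch rather than the STL's exact code; supports_lzcnt() is a hypothetical stand-in for whatever runtime CPU-feature check the implementation uses.

#include <intrin.h>

bool supports_lzcnt() noexcept; // hypothetical, e.g. a cached CPUID result

inline int checked_countl_zero_32(const unsigned int val) noexcept {
    if (supports_lzcnt()) {
        return static_cast<int>(__lzcnt(val)); // fast, branchless path
    }
    // BSR fallback for older CPUs: BSR is undefined for a zero input, so it
    // needs an explicit check, as in the _Countl_zero_bsr snippet above.
    unsigned long result;
    if (!_BitScanReverse(&result, val)) {
        return 32;
    }
    return static_cast<int>(31 - result);
}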

@CaseyCarter
Member

Great work, @AlexGuteniev - thanks for digging into this.
