<bit>: consider leaving only static CPU dispatch for lzcnt/bsr and tzcnt/bsr #2133

Closed
AlexGuteniev opened this issue Aug 18, 2021 · 8 comments
Labels
performance Must go faster

Comments

@AlexGuteniev
Contributor

AlexGuteniev commented Aug 18, 2021

Inspired by #2097.

Try this code:

STL/stl/inc/format

Lines 2496 to 2504 in 737ce4a

// Since the bit width of 0 is 0x0, special-case it instead of complicating the math even more.
int _Width = 3;
if (_Value != nullptr) {
    // Compute the bit width of the pointer (i.e. how many bits it takes to be represented).
    // Add 3 to the bit width so we always round up on the division.
    // Divide that by the amount of bits a hexit represents (log2(16) = log2(2^4) = 4).
    // Add 2 for the 0x prefix.
    _Width = static_cast<int>(2 + (_STD bit_width(reinterpret_cast<uintptr_t>(_Value)) + 3) / 4);
}

On Godbolt's compiler explorer: https://godbolt.org/z/z99vjeEhh
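For a worked example: a pointer whose value is 0xFFFF has a bit width of 16, so the computed width is 2 + (16 + 3) / 4 = 6 characters ("0xffff"). Below is a self-contained sketch of the same computation that can be pasted into Compiler Explorer; the function name is made up for illustration, but the arithmetic mirrors the snippet above.

#include <bit>
#include <cstdint>

// Illustrative stand-alone version of the STL snippet above.
int hex_width_of_pointer(const void* value) noexcept {
    int width = 3; // "0x0" for a null pointer
    if (value != nullptr) {
        // bit_width gives the number of significant bits; +3 rounds the
        // division by 4 (bits per hexit) up, and +2 accounts for "0x".
        width = static_cast<int>(
            2 + (std::bit_width(reinterpret_cast<std::uintptr_t>(value)) + 3) / 4);
    }
    return width;
}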

@AlexGuteniev
Contributor Author

Actually, it's not much better. The MSVC optimizer is far from perfect for this case.

@StephanTLavavej added the performance Must go faster label Aug 18, 2021
@StephanTLavavej
Member

@barcharcraz notes that these functions are typically called in perf-critical loops, so micro-optimizations can be important.

Additionally, I note that as long as we don't regress MSVC codegen quality, changes could be worthwhile if they improve Clang/LLVM codegen quality.

@AlexGuteniev changed the title <bit>: consider leaving only static CPU dispatch for lzcnt/bsr <bit>: consider leaving only static CPU dispatch for lzcnt/bsr and tzcnt/bsr Nov 14, 2021
@AlexGuteniev
Contributor Author

I tried to benchmark which approach is better and got conflicting results.

@AlexGuteniev
Contributor Author

The bsr countl_zero:

STL/stl/inc/bit

Lines 149 to 176 in 3c2fd04

template <class _Ty>
_NODISCARD int _Countl_zero_bsr(const _Ty _Val) noexcept {
    constexpr int _Digits = numeric_limits<_Ty>::digits;
    unsigned long _Result;
    if constexpr (_Digits <= 32) {
        if (!_BitScanReverse(&_Result, _Val)) {
            return _Digits;
        }
    } else {
#ifdef _M_IX86
        const unsigned int _High = _Val >> 32;
        if (_BitScanReverse(&_Result, _High)) {
            return static_cast<int>(31 - _Result);
        }
        const auto _Low = static_cast<unsigned int>(_Val);
        if (!_BitScanReverse(&_Result, _Low)) {
            return _Digits;
        }
#else // ^^^ _M_IX86 / !_M_IX86 vvv
        if (!_BitScanReverse64(&_Result, _Val)) {
            return _Digits;
        }
#endif // _M_IX86
    }
    return static_cast<int>(_Digits - 1 - _Result);
}

is branchy, and avoiding it seems worthwhile.
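For contrast, here is a minimal sketch of what a static, lzcnt-only path could look like (illustrative helper names, not the STL's actual code); the point is that it needs no zero check and no branch at all:

#include <intrin.h>

// Sketch assuming the target CPU always supports LZCNT. Unlike BSR, LZCNT is
// defined for a zero input (it returns the operand width). Note that on CPUs
// without LZCNT the same encoding silently executes as BSR and produces
// different results, which is why the fallback exists in the first place.
inline int countl_zero_lzcnt_32(const unsigned int val) noexcept {
    return static_cast<int>(__lzcnt(val));
}

#ifndef _M_IX86
inline int countl_zero_lzcnt_64(const unsigned long long val) noexcept {
    return static_cast<int>(__lzcnt64(val));
}
#endif // !_M_IX86 (32-bit x86 would need two 32-bit counts instead)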

The bsf countr_zero:

STL/stl/inc/limits

Lines 1092 to 1124 in 3c2fd04

template <class _Ty>
_NODISCARD int _Countr_zero_bsf(const _Ty _Val) noexcept {
    constexpr int _Digits = numeric_limits<_Ty>::digits;
    constexpr _Ty _Max    = (numeric_limits<_Ty>::max)();
    unsigned long _Result;
    if constexpr (_Digits <= 32) {
        // Intended widening to int. This operation means that a narrow 0 will widen
        // to 0xFFFF....FFFF0... instead of 0. We need this to avoid counting all the zeros
        // of the wider type.
        if (!_BitScanForward(&_Result, static_cast<unsigned int>(~_Max | _Val))) {
            return _Digits;
        }
    } else {
#ifdef _M_IX86
        const auto _Low = static_cast<unsigned int>(_Val);
        if (_BitScanForward(&_Result, _Low)) {
            return static_cast<int>(_Result);
        }
        const unsigned int _High = _Val >> 32;
        if (!_BitScanForward(&_Result, _High)) {
            return _Digits;
        } else {
            return static_cast<int>(_Result + 32);
        }
#else // ^^^ _M_IX86 / !_M_IX86 vvv
        if (!_BitScanForward64(&_Result, _Val)) {
            return _Digits;
        }
#endif // _M_IX86
    }
    return static_cast<int>(_Result);
}

is branchless (cmov is used), so avoiding it does not seem very useful.
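An analogous sketch of a static, tzcnt-only path (again with illustrative names, assuming BMI1 is always available; not the STL's actual helpers):

#include <immintrin.h>

// TZCNT returns the operand width for a zero input, so no zero check is
// needed; on pre-BMI1 CPUs the encoding silently executes as BSF instead.
inline int countr_zero_tzcnt_32(const unsigned int val) noexcept {
    return static_cast<int>(_tzcnt_u32(val));
}

#ifndef _M_IX86
inline int countr_zero_tzcnt_64(const unsigned long long val) noexcept {
    return static_cast<int>(_tzcnt_u64(val));
}
#endif // !_M_IX86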

@StephanTLavavej
Member

We'd like to see the specific perf data before making a decision here.

@AlexGuteniev
Contributor Author

@StephanTLavavej, I figured it out!

So overall, the advantage of lzcnt/tzcnt is worth a branch with a fallback, and the implementation is optimal as it is!

All anomalies I see are explained by the Intel JCC erratum. Compiling with /QIntel-jcc-erratum fixes the performance for these cases.

I've created DevCom-1603517 to suggest making /QIntel-jcc-erratum the default unless tuning for AMD.

Otherwise, the data suggests that we should keep dynamic dispatch; the occurrences where static-only dispatch wins look purely random. I'm closing this.
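For future readers, the dynamic-dispatch pattern being kept looks roughly like this. This is a sketch rather than the STL's exact code; supports_lzcnt() is a hypothetical stand-in for whatever runtime CPU-feature check the implementation uses.

#include <intrin.h>

bool supports_lzcnt() noexcept; // hypothetical, e.g. a cached CPUID result

inline int checked_countl_zero_32(const unsigned int val) noexcept {
    if (supports_lzcnt()) {
        return static_cast<int>(__lzcnt(val)); // fast, branchless path
    }
    // BSR fallback for older CPUs: BSR is undefined for a zero input, so it
    // needs an explicit check, as in the _Countl_zero_bsr snippet above.
    unsigned long result;
    if (!_BitScanReverse(&result, val)) {
        return 32;
    }
    return static_cast<int>(31 - result);
}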

@CaseyCarter
Member

Great work, @AlexGuteniev - thanks for digging into this.
