<xcharconv_ryu.h>
: Fix __ryu_shiftright128
for ARM64
#4304
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When I ported @ulfjack's amazing Ryu and Ryu Printf to MSVC
<charconv>
(making it compatible with MSVC and C++, adding buffer bounds checking, etc.), I extensively validated correctness and tried to improve performance for both x64 and x86, on both MSVC and Clang/LLVM. x64 had convenient, fast intrinsics, while x86 relied on plain old (bitwise) arithmetic. Life was simple.Over time, our codebase was modified to support ARM, ARM64, and more exotic flavors like ARM64EC, but we've long lacked comprehensive runtime test coverage. Now that's beginning to change, and Andrew Dean from the MSVC compiler back-end team discovered runtime failures, indicated by an
_STL_INTERNAL_CHECK
in__ryu_shiftright128
. If built without internal checks, the code would fail at runtime due to incorrect results, so it wasn't just an over-eager assertion.I haven't totally tracked down the codebase archaeology, but it appears that as the code was altered to support ARM64, the original simplicity of "we have x64 intrinsics" vs. "we don't" was broken. In particular,
_HAS_CHARCONV_INTRINSICS
is enabled for both x64 and ARM64, despite ARM64 not having the full set of convenient intrinsics:STL/stl/inc/xcharconv_ryu.h
Lines 51 to 53 in 87580ca
Then we have a codepath that's guarded by
_HAS_CHARCONV_INTRINSICS
and (as the comment says) can pass__dist
in the range[0, 52]
to__ryu_shiftright128
. It's sometimes exactly0
here:STL/stl/inc/xcharconv_ryu.h
Lines 390 to 393 in 87580ca
For x64 proper,
__ryu_shiftright128
is powered by the__shiftright128
intrinsic which can handle anything in[0, 63]
for sure:STL/stl/inc/xcharconv_ryu.h
Lines 195 to 218 in 87580ca
The comment in the x64 codepath, "(The shift value is in the range [49, 58].)" appears to be bogus given the callsite. (I think the comment may have been correct for Ryu in isolation, but Ryu Printf was added later.) We were just getting away with it because the x64 intrinsic handles the full range.
Then, the ARM64 codepath below has an assumption checked by
_STL_INTERNAL_CHECK(__dist > 0)
, which was being triggered. When__dist == 0
, the following code performed a "full shift" (i.e. trying to shift all 64 bits out at once), which famously/notoriously has undefined behavior (and ARM64's physical behavior makes this non-innocuous).The "32-bit" codepath at the bottom has an even stronger assumption in the name of micro-optimization (
__dist >= 32
) that I had contributed in ulfjack/ryu#77. I think we were getting away with this one because_HAS_CHARCONV_INTRINSICS
prevents small__dist
values from being sent to it, but I am no longer confident of this given the exotic flavors we've accumulated (hybrid x86/ARM64 etc.).My fix is to blast away the complexity we accumulated here. When the x64 intrinsic is unavailable, we fall back to plain code with a special case for zero-shifting. This is unquestionably correct for everything in the range
[0, 63]
. Note that this may slightly degrade performance for x86, but I simply no longer care about that anymore (it is less relevant every year, if anyone cares about perf they should be using x64, and in absolute terms Ryu and Ryu Printf are still incredibly fast, we're talking mere nanoseconds either way).With this fix, I've verified that the following test directories (originally reported by Andrew Dean) are now passing at runtime on ARM64 (by temporarily removing
"TESTS_BUILD_ONLY": true
inCMakePresets.json
):tests/std/tests/P0067R5_charconv
tests/std/tests/P0645R10_text_formatting_formatting
llvm-project/libcxx/test/std/utilities/charconv/charconv.msvc