buffer: add SIMD Neon optimization for byteLength
#48009
Conversation
cc @nodejs/cpp-reviewers
On my ARM system this is already correctly auto-vectorized in the current node release:
$ objdump -d --disassemble-symbols='__ZN4node6Buffer12_GLOBAL__N_118FastByteLengthUtf8EN2v85LocalINS2_5ValueEEERKNS2_17FastOneByteStringE' node-v20.1.0-darwin-arm64/bin/node
...
1000abae4: 74 d5 7f ad ldp q20, q21, [x11, #-16]
1000abae8: 94 06 09 6f ushr.16b v20, v20, #7
1000abaec: b5 06 09 6f ushr.16b v21, v21, #7
1000abaf0: 96 02 01 4e tbl.16b v22, { v20 }, v1
...
Is auto-vectorization not happening for the other builds? If so, can we figure out why? I think it would be nicer than hand-coding SIMD equivalents for each system.
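For reference, the scalar logic being auto-vectorized here boils down to counting high-bit bytes; a minimal sketch under that assumption (simplified signature, not the exact node source):

#include <cstddef>
#include <cstdint>

// Latin-1 -> UTF-8 byte length: every byte >= 0x80 encodes to two UTF-8 bytes,
// so the answer is the input length plus the number of high-bit bytes.
uint64_t byte_length_utf8_scalar(const uint8_t* data, size_t length) {
  uint64_t non_ascii = 0;
  for (size_t i = 0; i < length; i++) {
    non_ascii += data[i] >> 7;  // 1 if the high bit is set, else 0
  }
  return length + non_ascii;
}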
@kvakil Auto-vectorization is happening, but it is unclear whether it produces highly optimized code. What is your sentiment?
@lemire: the produced auto-vectorized code looks largely fine to me. (edit: of course, I am sure we can do better.)
Higher values of repeat don't exercise this code path, since at repeat != 1 V8 creates cons strings, not seq strings. It would be nice to actually benchmark with larger strings by creating a flat string.
Requesting changes for the out-of-bounds read & correctness problems.
The fastest I was able to get on my system was using shrn + popcount. On my system it's ~3x faster than the compiler's version for large sizes in a tight loop.

#include <arm_neon.h>

uint64_t mine(const uint8_t *data, uint32_t length) {
  uint64_t res = 0;
  const int lanes = 16;
  uint8_t rem = length % lanes;
  const auto *simd_end = data + (length / lanes) * lanes;
  const auto threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes)
    // vcgeq_u8 sets matching bytes to 0xFF; shrn-by-4 narrows each 16-bit lane
    // to 8 bits, keeping 4 set bits per matching byte, so the popcount of the
    // resulting 64-bit value is 4x the number of non-ASCII bytes in the block.
    res += __builtin_popcountll(vget_lane_u64(
        vreinterpret_u64_u8(
            vshrn_n_u16(vreinterpretq_u16_u8(
                            vcgeq_u8(vld1q_u8(data), threshold)),
                        4)),
        0));
  res >>= 2;  // divide by 4 to recover the actual byte count
  // This unrolling is a little greedy & I would probably not do it.
  if (rem >= lanes) __builtin_unreachable();
#pragma clang loop unroll_count(16)
  for (uint8_t j = 0; j < rem; j++)
    res += simd_end[j] >> 7;
  return res + length;
}

benchmark gist with the results
I'm not sure how well this generalizes to other ARM systems. Also not sure if the loop unrolling at the end is actually useful in practice or if it just manipulates the benchmark.
@kvakil I think that @anonrig should adopt your approach. It is going to be hard to beat, at least on an Apple laptop: Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition). We will borrow this (with credit) for the simdutf library.
@kvakil Your solution is 2 times faster than mine. Amazing.
Force-pushed from 2035fa0 to d0d1eb7
src/node_buffer.cc (Outdated)

auto data = reinterpret_cast<const uint8_t*>(source.data);
auto length = source.length;

uint32_t result{0};
Because result is a 32-bit integer and we are overcounting by a factor of 4, there might be overflow if a string exceeds a gigabyte or so. If you use a 64-bit counter, then you are more likely to be fine.
V8 Fast API does not support returning uint64_t. Therefore, I don't think it will ever exceed the limit and overflow.
@anonrig I meant using a 64-bit counter internally. I did not mean that you should change the function signature. Suppose you have a 2 GB input string made of the character é. Then you'd have an overflow.
(This is nitpicking. Your function is correct, and will only fail on really huge strings where you'd have other problems anyhow...)
Maybe at least add an assertion?
No, you can’t say “This won’t overflow because we can’t use a larger type”.
What you can do is use a uint64_t for result, then check before returning that it fits into a uint32_t. If you don’t want to do that, you could at least add something like static_assert(String::kMaxLength < std::numeric_limits<uint32_t>::max() / 8); (but I’m not sure whether that assertion actually holds – the first approach here is definitely preferable).
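A minimal sketch of that first suggestion, with an illustrative function name and a plain assert standing in for whatever check the PR would actually use: accumulate in 64 bits, then verify the value fits the 32-bit return type.

#include <cassert>
#include <cstdint>
#include <limits>

// Sketch only: 64-bit internal counter, narrowed to uint32_t at the end.
uint32_t byte_length_utf8(const uint8_t* data, uint32_t length) {
  uint64_t result = length;   // 64-bit accumulator cannot overflow for any uint32_t length
  for (uint32_t i = 0; i < length; i++) {
    result += data[i] >> 7;   // +1 for every byte that needs two UTF-8 bytes
  }
  assert(result <= std::numeric_limits<uint32_t>::max());
  return static_cast<uint32_t>(result);
}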
@anonrig I think that there is a simpler approach that might be faster on some systems...

for (; data < simd_end; data += lanes) {
  // load 16 bytes
  uint8x16_t input = vld1q_u8(data);
  // compare to threshold (0x80): matching lanes become 0xFF, i.e. -1 as a signed byte
  uint8x16_t withhighbit = vcgeq_u8(input, threshold);
  // vertical addition of the signed -1s; subtracting the (negative) sum adds the count
  result -= vaddvq_s8(vreinterpretq_s8_u8(withhighbit));
}
General observation: I don't really want to see big chunks of ifdef'ed platform- or architecture-specific code in general-purpose source files. That kind of code tends to scare away new contributors. Try to abstract it away so that people who don't know or care about neon or simd don't have to look at or even think about it.
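One way to read that suggestion, as a sketch only (the "node::simd" namespace and function name are hypothetical, not what the PR actually added): expose a single architecture-neutral helper, and confine the NEON/scalar selection to its implementation so node_buffer.cc never sees intrinsics.

// Hypothetical helper; the #ifdef'ed NEON fast path would live here instead of
// in the general-purpose buffer code.
#include <cstddef>
#include <cstdint>

namespace node {
namespace simd {

// UTF-8 byte length of a Latin-1 buffer: length + number of bytes >= 0x80.
inline uint32_t utf8_byte_length(const uint8_t* data, size_t length) {
  uint64_t count = length;
#if defined(__ARM_NEON)
  // A NEON fast path (any of the loops shown in this thread) would process
  // 16-byte blocks here and leave the remainder to the scalar loop below.
#endif
  for (size_t i = 0; i < length; i++) {
    count += data[i] >> 7;  // portable scalar path / tail
  }
  return static_cast<uint32_t>(count);
}

}  // namespace simd
}  // namespace node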
Yet another way to do it...

for (; data < simd_end; data += simd_lanes) {
  uint8x16_t chunk = vld1q_u8(data);
  uint8x16_t high_bits = vshrq_n_u8(chunk, 7);  // 1 per byte >= 0x80, 0 otherwise
  result += vaddvq_u8(high_bits);               // horizontal add: count for this 16-byte block
}
@bnoordhuis I agree. I moved the implementation to
Force-pushed from 9494cb7 to a29a70d
Co-authored-by: Keyhan Vakil <kvakil@sylph.kvakil.me>
Co-authored-by: Daniel Lemire <daniel@lemire.me>
I wouldn't mind a few comments in the code :D. What's unroll and how is it different from simd?
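For context, an illustrative sketch (not from the PR): loop unrolling repeats the scalar body several times per iteration to cut loop overhead, while SIMD processes many bytes with a single instruction, as in the NEON snippets above.

#include <cstddef>
#include <cstdint>

// Plain scalar loop: one byte per iteration.
uint64_t count_high_scalar(const uint8_t* p, size_t n) {
  uint64_t c = 0;
  for (size_t i = 0; i < n; i++) c += p[i] >> 7;
  return c;
}

// Unrolled by 4: still scalar, but fewer branches per byte processed.
uint64_t count_high_unrolled(const uint8_t* p, size_t n) {
  uint64_t c = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    c += (p[i] >> 7) + (p[i + 1] >> 7) + (p[i + 2] >> 7) + (p[i + 3] >> 7);
  }
  for (; i < n; i++) c += p[i] >> 7;  // remainder
  return c;
}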
Does this need a test, or is it covered by existing ones?
Closing the pull request since @lemire is pursuing adding latin1 support to simdutf.
Ref nodejs/performance#52
Since the benchmark CI does not have an ARM processor, here are the benchmark results from my local machine: