buffer: add SIMD Neon optimization for byteLength
#48009
Conversation
cc @nodejs/cpp-reviewers
On my ARM system this is already correctly auto-vectorized in the current node release:
$ objdump -d --disassemble-symbols='__ZN4node6Buffer12_GLOBAL__N_118FastByteLengthUtf8EN2v85LocalINS2_5ValueEEERKNS2_17FastOneByteStringE' node-v20.1.0-darwin-arm64/bin/node
...
1000abae4: 74 d5 7f ad ldp q20, q21, [x11, #-16]
1000abae8: 94 06 09 6f ushr.16b v20, v20, #7
1000abaec: b5 06 09 6f ushr.16b v21, v21, #7
1000abaf0: 96 02 01 4e tbl.16b v22, { v20 }, v1
...
Is auto-vectorization not happening for the other builds? If so, can we figure out why? I think it would be nicer than hand-coding SIMD equivalents for each system.
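For reference, the scalar logic being auto-vectorized here boils down to counting high-bit bytes; a minimal sketch under that assumption (simplified signature, not the exact node source):

#include <cstddef>
#include <cstdint>

// Latin-1 -> UTF-8 byte length: every byte >= 0x80 encodes to two UTF-8 bytes,
// so the answer is the input length plus the number of high-bit bytes.
uint64_t byte_length_utf8_scalar(const uint8_t* data, size_t length) {
  uint64_t non_ascii = 0;
  for (size_t i = 0; i < length; i++) {
    non_ascii += data[i] >> 7;  // 1 if the high bit is set, else 0
  }
  return length + non_ascii;
}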
@kvakil Auto-vectorization is happening, but it is unclear whether it produces highly optimized code. What is your sentiment?
@lemire: the produced auto-vectorized code looks largely fine to me. (edit: of course, I am sure we can do better.)
Higher values of repeat don't exercise this code path, since at repeat != 1 V8 creates cons strings, not seq strings. It would be nice to actually benchmark with larger strings by creating a flat string.
Requesting changes for the out-of-bounds read & correctness problems.
The fastest I was able to get on my system was using shrn + popcount. On my system it's ~3x faster than the compiler's version for large sizes in a tight loop.

#include <arm_neon.h>

uint64_t mine(const uint8_t *data, uint32_t length) {
  uint64_t res = 0;
  const int lanes = 16;
  uint8_t rem = length % lanes;
  const auto *simd_end = data + (length / lanes) * lanes;
  const auto threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes)
    // vcgeq_u8 sets matching bytes to 0xFF; shrn-by-4 narrows each 16-bit lane
    // to 8 bits, keeping 4 set bits per matching byte, so the popcount of the
    // resulting 64-bit value is 4x the number of non-ASCII bytes in the block.
    res += __builtin_popcountll(vget_lane_u64(
        vreinterpret_u64_u8(
            vshrn_n_u16(vreinterpretq_u16_u8(
                            vcgeq_u8(vld1q_u8(data), threshold)),
                        4)),
        0));
  res >>= 2;  // divide by 4 to recover the actual byte count
  // This unrolling is a little greedy & I would probably not do it.
  if (rem >= lanes) __builtin_unreachable();
#pragma clang loop unroll_count(16)
  for (uint8_t j = 0; j < rem; j++)
    res += simd_end[j] >> 7;
  return res + length;
}

benchmark gist with the results
I'm not sure how well this generalizes to other ARM systems. Also not sure if the loop unrolling at the end is actually useful in practice or if it just manipulates the benchmark.
@kvakil I think that @anonrig should adopt your approach. It is going to be hard to beat, at least on an Apple laptop: Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition). We will borrow this (with credit) for the simdutf library.
@kvakil Your solution is 2 times faster than mine. Amazing.
Force-pushed from 2035fa0 to d0d1eb7
src/node_buffer.cc (Outdated)

auto data = reinterpret_cast<const uint8_t*>(source.data);
auto length = source.length;

uint32_t result{0};
Because result is a 32-bit integer and we are overcounting by a factor of 4, there might be overflow if a string exceeds a gigabyte or so. If you use a 64-bit counter, then you are more likely to be fine.
V8 Fast API does not support returning uint64_t. Therefore, I don't think it will ever exceed the limit and overflow.
@anonrig I meant using a 64-bit counter internally. I did not mean that you should change the function signature. Suppose you have a 2 GB input string made of the character é. Then you'd have an overflow.
(This is nitpicking. Your function is correct, and will only fail on really huge strings where you'd have other problems anyhow...)
Maybe at least add an assertion?
No, you can’t say “This won’t overflow because we can’t use a larger type”.
What you can do is use a uint64_t for result, then check before returning that it fits into a uint32_t. If you don’t want to do that, you could at least add something like static_assert(String::kMaxLength < std::numeric_limits<uint32_t>::max() / 8); (but I’m not sure whether that assertion actually holds – the first approach here is definitely preferable).
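A minimal sketch of that first suggestion, with an illustrative function name and a plain assert standing in for whatever check the PR would actually use: accumulate in 64 bits, then verify the value fits the 32-bit return type.

#include <cassert>
#include <cstdint>
#include <limits>

// Sketch only: 64-bit internal counter, narrowed to uint32_t at the end.
uint32_t byte_length_utf8(const uint8_t* data, uint32_t length) {
  uint64_t result = length;   // 64-bit accumulator cannot overflow for any uint32_t length
  for (uint32_t i = 0; i < length; i++) {
    result += data[i] >> 7;   // +1 for every byte that needs two UTF-8 bytes
  }
  assert(result <= std::numeric_limits<uint32_t>::max());
  return static_cast<uint32_t>(result);
}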
@anonrig I think that there is a simpler approach that might be faster on some systems...

for (; data < simd_end; data += lanes) {
  // load 16 bytes
  uint8x16_t input = vld1q_u8(data);
  // compare to threshold (0x80): matching lanes become 0xFF, i.e. -1 as a signed byte
  uint8x16_t withhighbit = vcgeq_u8(input, threshold);
  // vertical addition of the signed -1s; subtracting the (negative) sum adds the count
  result -= vaddvq_s8(vreinterpretq_s8_u8(withhighbit));
}
General observation: I don't really want to see big chunks of ifdef'ed platform- or architecture-specific code in general-purpose source files. That kind of code tends to scare away new contributors. Try to abstract it away so that people who don't know or care about neon or simd don't have to look at or even think about it.
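One way to read that suggestion, as a sketch only (the "node::simd" namespace and function name are hypothetical, not what the PR actually added): expose a single architecture-neutral helper, and confine the NEON/scalar selection to its implementation so node_buffer.cc never sees intrinsics.

// Hypothetical helper; the #ifdef'ed NEON fast path would live here instead of
// in the general-purpose buffer code.
#include <cstddef>
#include <cstdint>

namespace node {
namespace simd {

// UTF-8 byte length of a Latin-1 buffer: length + number of bytes >= 0x80.
inline uint32_t utf8_byte_length(const uint8_t* data, size_t length) {
  uint64_t count = length;
#if defined(__ARM_NEON)
  // A NEON fast path (any of the loops shown in this thread) would process
  // 16-byte blocks here and leave the remainder to the scalar loop below.
#endif
  for (size_t i = 0; i < length; i++) {
    count += data[i] >> 7;  // portable scalar path / tail
  }
  return static_cast<uint32_t>(count);
}

}  // namespace simd
}  // namespace node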
Yet another way to do it...

for (; data < simd_end; data += simd_lanes) {
  uint8x16_t chunk = vld1q_u8(data);
  uint8x16_t high_bits = vshrq_n_u8(chunk, 7);  // 1 per byte >= 0x80, 0 otherwise
  result += vaddvq_u8(high_bits);               // horizontal add: count for this 16-byte block
}
@bnoordhuis I agree. I moved the implementation to
Force-pushed from 9494cb7 to a29a70d
Co-authored-by: Keyhan Vakil <kvakil@sylph.kvakil.me>
Co-authored-by: Daniel Lemire <daniel@lemire.me>
I wouldn't mind a few comments in the code :D. What's unroll and how is it different from simd?
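For context, an illustrative sketch (not from the PR): loop unrolling repeats the scalar body several times per iteration to cut loop overhead, while SIMD processes many bytes with a single instruction, as in the NEON snippets above.

#include <cstddef>
#include <cstdint>

// Plain scalar loop: one byte per iteration.
uint64_t count_high_scalar(const uint8_t* p, size_t n) {
  uint64_t c = 0;
  for (size_t i = 0; i < n; i++) c += p[i] >> 7;
  return c;
}

// Unrolled by 4: still scalar, but fewer branches per byte processed.
uint64_t count_high_unrolled(const uint8_t* p, size_t n) {
  uint64_t c = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    c += (p[i] >> 7) + (p[i + 1] >> 7) + (p[i + 2] >> 7) + (p[i + 3] >> 7);
  }
  for (; i < n; i++) c += p[i] >> 7;  // remainder
  return c;
}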
Does this need a test, or is it covered by existing ones?
Closing the pull request since @lemire is pursuing adding latin1 support to simdutf.
Ref nodejs/performance#52
Since the benchmark CI does not have an ARM processor, here are the benchmark results from my local machine: