
buffer: add SIMD Neon optimization for byteLength #48009

Closed
anonrig wants to merge 1 commit from the simd-bytelength branch

Conversation

anonrig (Member) commented May 14, 2023

Ref nodejs/performance#52

Since the benchmark CI does not have an ARM processor, here are the benchmark results from my local machine:

node benchmark/compare.js --old ./node-main --new ./out/Release/node --filter buffer-bytelength buffers > bytelength.csv && node-benchmark-compare bytelength.csv
[00:18:29|% 100| 2/2 files | 60/60 runs | 32/32 configs]: Done
                                                                                               confidence improvement accuracy (*)   (**)  (***)
buffers/buffer-bytelength-buffer.js n=4000000 len=16                                                          0.39 %       ±0.77% ±1.03% ±1.35%
buffers/buffer-bytelength-buffer.js n=4000000 len=2                                                          -0.02 %       ±0.45% ±0.60% ±0.78%
buffers/buffer-bytelength-buffer.js n=4000000 len=256                                                         1.97 %       ±3.41% ±4.60% ±6.10%
buffers/buffer-bytelength-string.js n=1000000 repeat=1 encoding='base64' type='four_bytes'                    1.71 %       ±3.63% ±4.84% ±6.31%
buffers/buffer-bytelength-string.js n=1000000 repeat=1 encoding='base64' type='one_byte'                      0.03 %       ±0.68% ±0.90% ±1.17%
buffers/buffer-bytelength-string.js n=1000000 repeat=1 encoding='base64' type='three_bytes'                   2.16 %       ±2.97% ±3.98% ±5.24%
buffers/buffer-bytelength-string.js n=1000000 repeat=1 encoding='base64' type='two_bytes'                    -1.08 %       ±2.05% ±2.76% ±3.65%
buffers/buffer-bytelength-string.js n=1000000 repeat=1 encoding='utf8' type='four_bytes'                     -0.13 %       ±0.96% ±1.28% ±1.67%
buffers/buffer-bytelength-string.js n=1000000 repeat=1 encoding='utf8' type='one_byte'               ***      4.59 %       ±0.77% ±1.02% ±1.33%
buffers/buffer-bytelength-string.js n=1000000 repeat=1 encoding='utf8' type='three_bytes'                     0.21 %       ±0.61% ±0.82% ±1.07%
buffers/buffer-bytelength-string.js n=1000000 repeat=1 encoding='utf8' type='two_bytes'                      -0.30 %       ±0.43% ±0.57% ±0.74%
buffers/buffer-bytelength-string.js n=1000000 repeat=16 encoding='base64' type='four_bytes'                  -0.68 %       ±3.18% ±4.24% ±5.53%
buffers/buffer-bytelength-string.js n=1000000 repeat=16 encoding='base64' type='one_byte'                     0.65 %       ±2.05% ±2.75% ±3.61%
buffers/buffer-bytelength-string.js n=1000000 repeat=16 encoding='base64' type='three_bytes'                 -0.89 %       ±1.98% ±2.66% ±3.51%
buffers/buffer-bytelength-string.js n=1000000 repeat=16 encoding='base64' type='two_bytes'                    0.10 %       ±1.44% ±1.92% ±2.50%
buffers/buffer-bytelength-string.js n=1000000 repeat=16 encoding='utf8' type='four_bytes'                    -1.39 %       ±2.75% ±3.70% ±4.91%
buffers/buffer-bytelength-string.js n=1000000 repeat=16 encoding='utf8' type='one_byte'                      -0.97 %       ±1.35% ±1.82% ±2.40%
buffers/buffer-bytelength-string.js n=1000000 repeat=16 encoding='utf8' type='three_bytes'                    0.94 %       ±1.67% ±2.25% ±2.98%
buffers/buffer-bytelength-string.js n=1000000 repeat=16 encoding='utf8' type='two_bytes'                      0.19 %       ±0.47% ±0.63% ±0.83%
buffers/buffer-bytelength-string.js n=1000000 repeat=2 encoding='base64' type='four_bytes'                   -1.67 %       ±2.60% ±3.50% ±4.63%
buffers/buffer-bytelength-string.js n=1000000 repeat=2 encoding='base64' type='one_byte'                      0.77 %       ±1.95% ±2.63% ±3.48%
buffers/buffer-bytelength-string.js n=1000000 repeat=2 encoding='base64' type='three_bytes'                   2.70 %       ±3.23% ±4.34% ±5.73%
buffers/buffer-bytelength-string.js n=1000000 repeat=2 encoding='base64' type='two_bytes'                     0.02 %       ±1.06% ±1.41% ±1.83%
buffers/buffer-bytelength-string.js n=1000000 repeat=2 encoding='utf8' type='four_bytes'                      0.24 %       ±0.61% ±0.81% ±1.06%
buffers/buffer-bytelength-string.js n=1000000 repeat=2 encoding='utf8' type='one_byte'                        0.04 %       ±2.03% ±2.70% ±3.53%
buffers/buffer-bytelength-string.js n=1000000 repeat=2 encoding='utf8' type='three_bytes'                     1.33 %       ±2.73% ±3.67% ±4.87%
buffers/buffer-bytelength-string.js n=1000000 repeat=2 encoding='utf8' type='two_bytes'                       0.10 %       ±0.40% ±0.53% ±0.69%
buffers/buffer-bytelength-string.js n=1000000 repeat=256 encoding='base64' type='four_bytes'                 -0.16 %       ±3.65% ±4.85% ±6.32%
buffers/buffer-bytelength-string.js n=1000000 repeat=256 encoding='base64' type='one_byte'                    0.47 %       ±2.66% ±3.54% ±4.60%
buffers/buffer-bytelength-string.js n=1000000 repeat=256 encoding='base64' type='three_bytes'                -0.51 %       ±0.78% ±1.04% ±1.37%
buffers/buffer-bytelength-string.js n=1000000 repeat=256 encoding='base64' type='two_bytes'                  -1.66 %       ±2.60% ±3.49% ±4.62%
buffers/buffer-bytelength-string.js n=1000000 repeat=256 encoding='utf8' type='four_bytes'                   -0.42 %       ±0.54% ±0.72% ±0.94%
buffers/buffer-bytelength-string.js n=1000000 repeat=256 encoding='utf8' type='one_byte'                     -0.07 %       ±0.32% ±0.43% ±0.56%
buffers/buffer-bytelength-string.js n=1000000 repeat=256 encoding='utf8' type='three_bytes'                  -0.49 %       ±0.69% ±0.93% ±1.21%
buffers/buffer-bytelength-string.js n=1000000 repeat=256 encoding='utf8' type='two_bytes'              *     -0.53 %       ±0.49% ±0.65% ±0.85%

Be aware that when doing many comparisons the risk of a false-positive result increases.
In this case, there are 35 comparisons, you can thus expect the following amount of false-positive results:
  1.75 false positives, when considering a   5% risk acceptance (*, **, ***),
  0.35 false positives, when considering a   1% risk acceptance (**, ***),
  0.04 false positives, when considering a 0.1% risk acceptance (***)

nodejs-github-bot added the buffer, c++ and needs-ci labels on May 14, 2023
anonrig (Member, Author) commented May 14, 2023

cc @nodejs/cpp-reviewers

anonrig marked this pull request as ready for review on May 14, 2023 20:52
anonrig requested a review from ronag on May 14, 2023 20:55
anonrig added the request-ci label (start a Jenkins CI on the PR) on May 14, 2023
github-actions bot removed the request-ci label on May 14, 2023

kvakil (Contributor) left a comment

On my ARM system this is already correctly auto-vectorized in the current node release:

$ objdump -d --disassemble-symbols='__ZN4node6Buffer12_GLOBAL__N_118FastByteLengthUtf8EN2v85LocalINS2_5ValueEEERKNS2_17FastOneByteStringE' node-v20.1.0-darwin-arm64/bin/node
...
1000abae4: 74 d5 7f ad  ldp     q20, q21, [x11, #-16]
1000abae8: 94 06 09 6f  ushr.16b        v20, v20, #7
1000abaec: b5 06 09 6f  ushr.16b        v21, v21, #7
1000abaf0: 96 02 01 4e  tbl.16b v22, { v20 }, v1
...

Is auto-vectorization not happening for the other builds? If so, can we figure out why? I think it would be nicer than hand-coding SIMD equivalents for each system.
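
For reference, the scalar loop the compiler is vectorizing here is essentially a per-byte count of high-bit characters; paraphrased (not a verbatim copy of node_buffer.cc, and with a simplified signature instead of v8::FastOneByteString):

// Paraphrased shape of the scalar fast path: the UTF-8 length of a Latin-1
// string is its length plus one extra byte for every byte >= 0x80.
uint32_t ByteLengthUtf8Scalar(const uint8_t* data, uint32_t length) {
  uint32_t result = length;
  for (uint32_t i = 0; i < length; ++i)
    result += data[i] >> 7;  // 1 if the high bit is set, else 0
  return result;
}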

lemire (Member) commented May 15, 2023

@kvakil Auto-vectorization is happening, but it is unclear whether it is producing highly optimized code. What is your sentiment?

kvakil (Contributor) left a comment

@lemire: the produced auto-vectorized code looks largely fine to me. (edit: of course, I am sure we can do better.)

Higher values of repeat don't exercise this code path, since at repeat != 1 V8 creates cons strings rather than sequential strings. It would be nice to actually benchmark with larger strings by creating a flat string.

Requesting changes for the out-of-bounds read & correctness problems.

(Four review threads on src/node_buffer.cc, now outdated and resolved.)
kvakil (Contributor) commented May 15, 2023

The fastest I was able to get on my system was shrn + popcount; it's ~3x faster than the compiler's version for large sizes in a tight loop.

#include <arm_neon.h>

// UTF-8 byte length of a Latin-1 buffer: the length plus one extra byte for
// every byte >= 0x80 (those need two bytes when transcoded to UTF-8).
uint64_t mine(const uint8_t *data, uint32_t length) {
  uint64_t res = 0;
  const int lanes = 16;
  uint8_t rem = length % lanes;
  const auto *simd_end = data + (length / lanes) * lanes;
  const auto threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes)
    // vcgeq_u8 yields 0xFF for each byte >= 0x80. The shrn-by-4 trick narrows
    // that 16-byte mask to a 64-bit value with one nibble per input byte, so
    // the popcount is 4x the number of high bytes.
    res += __builtin_popcountll(vget_lane_u64(
        vreinterpret_u64_u8(
            vshrn_n_u16(vreinterpretq_u16_u8(
                            vcgeq_u8(vld1q_u8(data), threshold)),
                        4)),
        0));

  res >>= 2;  // undo the 4x overcount from the nibble mask

  // Scalar tail for the remaining (length % 16) bytes.
  // This unrolling is a little greedy & I would probably not do it.
  if (rem >= lanes) __builtin_unreachable();
  #pragma clang loop unroll_count(16)
  for (uint8_t j = 0; j < rem; j++)
    res += simd_end[j] >> 7;
  return res + length;
}

benchmark gist with the results

I'm not sure how well this generalizes to other ARM systems. I'm also not sure whether the loop unrolling at the end is actually useful in practice or whether it just games the benchmark.
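
For anyone trying this locally, a minimal correctness check against a scalar reference (a hypothetical harness, compiled in the same file as the mine() function above; not part of the gist) could look like:

#include <cassert>
#include <cstdint>
#include <vector>

// Scalar reference for the same quantity: UTF-8 byte length of Latin-1 input.
static uint64_t scalar_ref(const uint8_t* data, uint32_t length) {
  uint64_t res = length;
  for (uint32_t i = 0; i < length; i++) res += data[i] >> 7;
  return res;
}

int main() {
  // Cover sizes around the 16-byte lane boundary plus a larger buffer.
  for (uint32_t len : {1u, 15u, 16u, 17u, 31u, 4096u}) {
    std::vector<uint8_t> buf(len);
    for (uint32_t i = 0; i < len; i++) buf[i] = (i % 3 == 0) ? 0xE9 : 'a';
    assert(mine(buf.data(), len) == scalar_ref(buf.data(), len));
  }
  return 0;
}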

lemire (Member) commented May 15, 2023

@kvakil I think that @anonrig should adopt your approach. It is going to be hard to beat it, at least on an Apple laptop:

Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)

We will borrow this (with credit) for the simdutf library.

anonrig (Member, Author) commented May 15, 2023

@kvakil Your solution is 2 times faster than mine. Amazing.

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_BufferByteLength/noop/1                    0.410 ns        0.409 ns   1000000000 bytes_per_second=2.2744G/s
BM_BufferByteLength/compiler/1                0.544 ns        0.544 ns   1000000000 bytes_per_second=1.71059G/s
BM_BufferByteLength/compiler/8                 1.48 ns         1.48 ns    471510653 bytes_per_second=5.03215G/s
BM_BufferByteLength/compiler/64                5.53 ns         5.53 ns    126561681 bytes_per_second=10.7864G/s
BM_BufferByteLength/compiler/512               38.2 ns         38.2 ns     18339923 bytes_per_second=12.4724G/s
BM_BufferByteLength/compiler/4096               304 ns          304 ns      2306554 bytes_per_second=12.5513G/s
BM_BufferByteLength/compiler/32768             2402 ns         2402 ns       291926 bytes_per_second=12.7074G/s
BM_BufferByteLength/compiler/262144           19215 ns        19215 ns        36522 bytes_per_second=12.7059G/s
BM_BufferByteLength/compiler/1048576          76736 ns        76736 ns         9125 bytes_per_second=12.7262G/s
BM_BufferByteLength/mine/1                     1.55 ns         1.55 ns    449940865 bytes_per_second=613.437M/s
BM_BufferByteLength/mine/8                     2.25 ns         2.24 ns    315696426 bytes_per_second=3.32561G/s
BM_BufferByteLength/mine/64                    2.84 ns         2.83 ns    246636060 bytes_per_second=21.0259G/s
BM_BufferByteLength/mine/512                   13.2 ns         13.1 ns     47738231 bytes_per_second=36.349G/s
BM_BufferByteLength/mine/4096                   110 ns          109 ns      6415544 bytes_per_second=35.1572G/s
BM_BufferByteLength/mine/32768                  814 ns          813 ns       847560 bytes_per_second=37.5341G/s
BM_BufferByteLength/mine/262144                6564 ns         6538 ns       108111 bytes_per_second=37.3417G/s
BM_BufferByteLength/mine/1048576              26455 ns        26218 ns        26744 bytes_per_second=37.2475G/s
BM_BufferByteLength/mineNoPragma/1             2.17 ns         2.17 ns    323482520 bytes_per_second=439.842M/s
BM_BufferByteLength/mineNoPragma/8             1.60 ns         1.58 ns    446799004 bytes_per_second=4.7201G/s
BM_BufferByteLength/mineNoPragma/64            2.48 ns         2.48 ns    278893351 bytes_per_second=24.0302G/s
BM_BufferByteLength/mineNoPragma/512           15.6 ns         15.6 ns     44450653 bytes_per_second=30.4909G/s
BM_BufferByteLength/mineNoPragma/4096           125 ns          125 ns      5606638 bytes_per_second=30.4899G/s
BM_BufferByteLength/mineNoPragma/32768          963 ns          963 ns       727424 bytes_per_second=31.6868G/s
BM_BufferByteLength/mineNoPragma/262144        7672 ns         7672 ns        89608 bytes_per_second=31.8238G/s
BM_BufferByteLength/mineNoPragma/1048576      30665 ns        30663 ns        22734 bytes_per_second=31.8485G/s
BM_BufferByteLength/yagiz/1                    2.18 ns         2.18 ns    321632053 bytes_per_second=437.255M/s
BM_BufferByteLength/yagiz/8                    2.19 ns         2.19 ns    321373643 bytes_per_second=3.40666G/s
BM_BufferByteLength/yagiz/64                   2.25 ns         2.23 ns    320019018 bytes_per_second=26.7843G/s
BM_BufferByteLength/yagiz/512                  11.3 ns         11.3 ns     62082055 bytes_per_second=42.0978G/s
BM_BufferByteLength/yagiz/4096                  173 ns          173 ns      4064380 bytes_per_second=22.103G/s
BM_BufferByteLength/yagiz/32768                1854 ns         1853 ns       375639 bytes_per_second=16.4687G/s
BM_BufferByteLength/yagiz/262144              15233 ns        15233 ns        45953 bytes_per_second=16.0276G/s
BM_BufferByteLength/yagiz/1048576             61335 ns        61316 ns        11405 bytes_per_second=15.9268G/s

anonrig force-pushed the simd-bytelength branch 6 times, most recently from 2035fa0 to d0d1eb7 on May 15, 2023 16:40
(Two more review threads on src/node_buffer.cc, outdated and resolved. A further thread concerns these lines of src/node_buffer.cc:)
auto data = reinterpret_cast<const uint8_t*>(source.data);
auto length = source.length;

uint32_t result{0};
Reviewer (Member):
Because result is a 32-bit integer and we are overcounting by a factor of 4, there might be overflow if a string exceeds a gigabyte or so. If you use a 64-bit counter, then you are more likely to be fine.

anonrig (Member, Author):
The V8 Fast API does not support returning uint64_t. Therefore, I don't think it will ever exceed the limit and overflow.

Reviewer (Member):
@anonrig I meant using a 64-bit counter internally. I did not mean that you should change the function signature. Suppose you have a 2 GB input string made of the character é. Then you'd have an overflow.

(This is nitpicking. Your function is correct, and will only fail in really huge strings where you'd have other problems anyhow...)

Reviewer (Member):
Maybe at least add an assertion?

Reviewer (Member):
No, you can’t say “This won’t overflow because we can’t use a larger type”.

What you can do is use a uint64_t for result, then check before returning that it fits into a uint32_t. If you don't want to do that, you could at least add something like static_assert(String::kMaxLength < std::numeric_limits<uint32_t>::max() / 8); (but I'm not sure whether that assertion actually holds – the first approach here is definitely preferable).
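
A minimal sketch of that first approach (64-bit accumulator plus a range check before narrowing); the function and parameter names are illustrative, not the PR's actual code:

#include <cassert>
#include <cstdint>
#include <limits>

// Accumulate into a 64-bit counter, then check that the value fits the
// 32-bit return type imposed by the V8 Fast API before narrowing.
uint32_t Latin1ToUtf8ByteLength(const uint8_t* data, uint32_t length) {
  uint64_t result = length;
  for (uint32_t i = 0; i < length; i++)
    result += data[i] >> 7;  // one extra UTF-8 byte per byte >= 0x80
  // In Node this would likely be a CHECK; assert() keeps the sketch standalone.
  assert(result <= std::numeric_limits<uint32_t>::max());
  return static_cast<uint32_t>(result);
}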

lemire (Member) commented May 15, 2023

@anonrig I think that there is a simpler approach that might be faster on some systems...

  for (; data < simd_end; data += lanes) {
    // load 16 bytes
    uint8x16_t input = vld1q_u8(data);
    // compare to threshold (0x80): 0xFF where the high bit is set
    uint8x16_t withhighbit = vcgeq_u8(input, threshold);
    // addition across the vector as signed bytes: each match contributes -1,
    // so subtracting the sum adds the number of matching bytes
    result -= vaddvq_s8(vreinterpretq_s8_u8(withhighbit));
  }

Check out my blog post.

bnoordhuis (Member) commented:

General observation: I don't really want to see big chunks of ifdef'ed platform- or architecture-specific code in general-purpose source files. That kind of code tends to scare away new contributors.

Try to abstract it away so that people who don't know or care about neon or simd don't have to look at or even think about it.

lemire (Member) commented May 16, 2023

Yet another way to do it...

  for (; data < simd_end; data += simd_lanes) {
    uint8x16_t chunk = vld1q_u8(data);
    // shift each byte right by 7: 1 where the high bit is set, 0 otherwise
    uint8x16_t high_bits = vshrq_n_u8(chunk, 7);
    // sum the 16 lanes (at most 16, so the uint8_t result cannot overflow)
    result += vaddvq_u8(high_bits);
  }

anonrig (Member, Author) commented May 17, 2023

Try to abstract it away so that people who don't know or care about neon or simd don't have to look at or even think about it.

@bnoordhuis I agree. I moved the implementation to node_simd.h
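
Roughly, such a header can abstract the NEON path behind a feature check with a scalar fallback; the following is only a sketch of that idea (the actual node_simd.h added in this PR may differ):

// node_simd.h -- sketch of one way to hide the platform-specific code;
// not necessarily what the PR's header looks like.
#ifndef SRC_NODE_SIMD_H_
#define SRC_NODE_SIMD_H_

#include <cstdint>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

namespace node {
namespace simd {

// UTF-8 byte length of a Latin-1 buffer: length plus one extra byte for
// every byte >= 0x80. Callers never see the NEON details.
inline uint64_t utf8_byte_length(const uint8_t* data, uint32_t length) {
  uint64_t result = length;
  uint32_t i = 0;
#if defined(__ARM_NEON) && defined(__aarch64__)
  constexpr uint32_t lanes = 16;
  for (; i + lanes <= length; i += lanes) {
    uint8x16_t chunk = vld1q_u8(data + i);
    uint8x16_t high_bits = vshrq_n_u8(chunk, 7);  // 1 per byte >= 0x80
    result += vaddvq_u8(high_bits);               // horizontal sum, at most 16
  }
#endif
  for (; i < length; i++)  // scalar path on other platforms, and the tail
    result += data[i] >> 7;
  return result;
}

}  // namespace simd
}  // namespace node

#endif  // SRC_NODE_SIMD_H_

In a layout like this, the scalar loop doubles as both the fallback for non-NEON builds and the tail handler, which keeps the #ifdef surface small and out of node_buffer.cc.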

anonrig requested a review from bnoordhuis on May 17, 2023 00:07
anonrig force-pushed the simd-bytelength branch 2 times, most recently from 9494cb7 to a29a70d on May 17, 2023 00:35
Co-authored-by: Keyhan Vakil <kvakil@sylph.kvakil.me>
Co-authored-by: Daniel Lemire <daniel@lemire.me>
anonrig added the request-ci label on May 17, 2023
github-actions bot removed the request-ci label on May 17, 2023

ronag (Member) left a comment

I wouldn't mind a few comments in the code :D. What's unroll and how is it different from SIMD?

ronag (Member) commented May 17, 2023

Does this need a test, or is it covered by the existing ones?

anonrig (Member, Author) commented Aug 16, 2023

Closing the pull request, since @lemire is pursuing adding latin1 support to simdutf.

anonrig closed this on Aug 16, 2023
anonrig deleted the simd-bytelength branch on August 16, 2023 15:36
Labels
buffer (Issues and PRs related to the buffer subsystem), c++ (Issues and PRs that require attention from people who are familiar with C++), needs-ci (PRs that need a full CI run)
8 participants