
fix simd accumulator overflow #1

Merged
merged 3 commits into tommyip:master on Jul 16, 2023

Conversation

PeterFaiman
Contributor


The intrinsics version (`opt4_simd`) has a bug:

>```rust
>// Flush accumulator every 256 iterations to avoid overflow
>if block_i % (u8::MAX as usize + 1) == u8::MAX as usize {
>```
>
>The function is logically the same as our previous algorithm, with the only change being that of flushing our vector accumulator every 256 iterations to prevent overflowing each u8 lane.

Flushing every 256 iterations isn't enough, because you can potentially encounter 256 `'s'` in 256 iterations, thus overflowing the max of 255. Using the function definitions from the article:

```rust
fn main() {
    let input = "s".repeat(1024*1024);
    println!("{}", baseline(&input));
    println!("{}", opt4_simd(&input));
}
```

Produces:

```
1048576
-1040384
```

The fix, performance untested:

```rust
// Flush accumulator every 255 iterations to avoid overflow
if block_i % u8::MAX as usize == (u8::MAX - 1) as usize {
```

Correctly produces:

```
1048576
1048576
```

https://lobste.rs/s/sqn7m0/n_times_faster_than_c_where_n_128#c_iooycv
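
For context, here is a minimal sketch of the accumulator pattern under discussion. This is not the article's exact `opt4_simd`: the function name, the 's'-only counting, and the tail handling are simplifications of mine, and it assumes aarch64 NEON intrinsics.

```rust
#[cfg(target_arch = "aarch64")]
unsafe fn count_s_neon(input: &[u8]) -> i64 {
    use std::arch::aarch64::*;

    let n_blocks = input.len() / 16;
    let v_s = vdupq_n_u8(b's');
    let mut v_acc = vdupq_n_u8(0); // 16 per-lane u8 counters
    let mut total: i64 = 0;

    for block_i in 0..n_blocks {
        let block = vld1q_u8(input.as_ptr().add(block_i * 16));
        // vceqq_u8 yields 0xFF in each matching lane; AND with 1 turns that into +1 per match.
        let ones = vandq_u8(vceqq_u8(block, v_s), vdupq_n_u8(1));
        v_acc = vaddq_u8(v_acc, ones);
        // Each lane gains at most 1 per iteration, so it must be drained before
        // 256 iterations have passed; flushing every 255 iterations is the latest safe point.
        if block_i % u8::MAX as usize == (u8::MAX - 1) as usize {
            total += vaddlvq_u8(v_acc) as i64; // widening horizontal sum of all 16 lanes
            v_acc = vdupq_n_u8(0);
        }
    }
    total += vaddlvq_u8(v_acc) as i64;
    // The real code also handles the input.len() % 16 tail bytes; omitted here.
    total
}
```

With the original `% (u8::MAX as usize + 1)` condition, the drain happens one iteration too late, which is exactly the overflow demonstrated above.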
@PeterFaiman
Contributor Author

PeterFaiman commented Jul 15, 2023

The new test I added also caused the chunked versions to overflow, so I committed an additional fix changing the chunk size to 255.
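
For context, the chunked versions sum each chunk's matches into a `u8` before widening, so a chunk of 256 `'s'` bytes overflows in exactly the same way. A rough sketch of the idea (not the repository's exact `opt6_chunk_count`; the names and details here are illustrative):

```rust
const CHUNK_SIZE: usize = 255; // was 256; a chunk of 256 's' bytes overflows the u8 sum

fn chunk_count(input: &str) -> i64 {
    let n = input.len() as i64;
    let count_s: i64 = input
        .as_bytes()
        .chunks(CHUNK_SIZE)
        .map(|chunk| {
            // At most CHUNK_SIZE matches per chunk, so the u8 sum stays in range
            // as long as CHUNK_SIZE <= 255.
            chunk.iter().map(|&b| (b == b's') as u8).sum::<u8>() as i64
        })
        .sum();
    // The input contains only 's' and 'p', so count('s') - count('p') = 2 * count('s') - n.
    2 * count_s - n
}
```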

Unfortunately this regresses performance:

```
run_switches/opt6_chunk_exact_count
                        time:   [14.737 µs 14.740 µs 14.744 µs]
                        thrpt:  [63.167 GiB/s 63.182 GiB/s 63.196 GiB/s]
                 change:
                        time:   [+35.081% +35.355% +35.637%] (p = 0.00 < 0.05)
                        thrpt:  [-26.274% -26.120% -25.970%]
                        Performance has regressed.
```

Test machine: MacBook Air, 15-inch, M2, 2023.

@PeterFaiman
Contributor Author

The fix also regresses the original fastest (10x unrolled), but not by as much:

```
run_switches/opt5_simd_unrolled_10x
                        time:   [25.888 µs 25.892 µs 25.898 µs]
                        thrpt:  [35.962 GiB/s 35.969 GiB/s 35.976 GiB/s]
                 change:
                        time:   [+5.4901% +5.7168% +5.9516%] (p = 0.00 < 0.05)
                        thrpt:  [-5.6173% -5.4077% -5.2044%]
                        Performance has regressed.
```

```
$ RUST_BACKTRACE=1 cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.03s
     Running unittests src/lib.rs (target/debug/deps/n_times_faster_than_c-39ece82b67ddb14c)

running 3 tests
test tests::test_simple ... ok
test tests::test_all_s ... FAILED
test tests::test_large ... ok

failures:

---- tests::test_all_s stdout ----
thread 'tests::test_all_s' panicked at 'attempt to add with overflow', /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:149:1
stack backtrace:
   0: rust_begin_unwind
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:117:5
   3: <u8 as core::iter::traits::accum::Sum>::sum::{{closure}}
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:53:28
   4: core::iter::adapters::map::map_fold::{{closure}}
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:84:21
   5: core::iter::traits::iterator::Iterator::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:2481:21
   6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9
   7: <u8 as core::iter::traits::accum::Sum>::sum
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:50:17
   8: core::iter::traits::iterator::Iterator::sum
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:3476:9
   9: n_times_faster_than_c::opt6_chunk_count::{{closure}}
             at ./src/lib.rs:125:22
  10: core::iter::adapters::map::map_fold::{{closure}}
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:84:28
  11: core::iter::traits::iterator::Iterator::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:2481:21
  12: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9
  13: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9
  14: <i64 as core::iter::traits::accum::Sum>::sum
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:50:17
  15: core::iter::traits::iterator::Iterator::sum
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:3476:9
  16: n_times_faster_than_c::opt6_chunk_count
             at ./src/lib.rs:122:15
  17: n_times_faster_than_c::tests::test_all_s
             at ./src/lib.rs:195:9
  18: n_times_faster_than_c::tests::test_all_s::{{closure}}
             at ./src/lib.rs:192:21
  19: core::ops::function::FnOnce::call_once
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:250:5
  20: core::ops::function::FnOnce::call_once
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

failures:
    tests::test_all_s

test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.29s

error: test failed, to rerun pass `--lib`
```
@PeterFaiman
Contributor Author

The large performance regression in the chunked versions was clearly because the chunks were no longer well aligned. The real fix is to pick a chunk size below 256 that is still well aligned: 128. I amended my commit to use 128, which recovered some of the losses, and then amended it again to use 192, which recovers all of the losses for opt6_chunk_exact_count and most of them for opt6_chunk_count. I tried going higher, to 224 (256 - 32) and 240 (256 - 16), but both were worse than 192. The final difference between 256 and 192, as measured on my machine:

```
run_switches/opt6_chunk_count
                        time:   [13.791 µs 13.793 µs 13.796 µs]
                        thrpt:  [67.507 GiB/s 67.519 GiB/s 67.529 GiB/s]
                 change:
                        time:   [+14.869% +15.136% +15.373%] (p = 0.00 < 0.05)
                        thrpt:  [-13.325% -13.146% -12.945%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
run_switches/opt6_chunk_exact_count
                        time:   [10.885 µs 10.886 µs 10.888 µs]
                        thrpt:  [85.536 GiB/s 85.550 GiB/s 85.562 GiB/s]
                 change:
                        time:   [+0.2330% +0.4504% +0.6870%] (p = 0.00 < 0.05)
                        thrpt:  [-0.6823% -0.4484% -0.2324%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
```

Flushing every 255 iterations regressed performance by ~5%, but this
eliminates the regression.

```
run_switches/opt5_simd_unrolled_10x
                        time:   [24.636 µs 24.641 µs 24.647 µs]
                        thrpt:  [37.786 GiB/s 37.795 GiB/s 37.804 GiB/s]
                 change:
                        time:   [+0.2771% +0.5216% +0.7572%] (p = 0.00 < 0.05)
                        thrpt:  [-0.7516% -0.5189% -0.2763%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
```
@PeterFaiman
Contributor Author

Flushing the accumulator every 128 iterations (vs 255) in the intrinsics versions reversed the regressions there as well.
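
Presumably the flush condition in the intrinsics versions then becomes something along these lines (hypothetical snippet, not the merged diff; 128 is a power of two, so the modulus stays a cheap bit test):

```rust
// Flush accumulator every 128 iterations to avoid overflow; 128 < 256 keeps
// each u8 lane in range, and the power-of-two modulus compiles to a bit test.
if block_i % 128 == 127 {
```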

@tommyip
Owner

tommyip commented Jul 16, 2023

Good catch! This only slows the best version down by ~0.5%, which is good news. Do you know why flushing at an unaligned interval causes such a large slowdown for the SIMD intrinsics version?

tommyip merged commit c85c899 into tommyip:master on Jul 16, 2023
@PeterFaiman
Contributor Author

Not sure. Looking at the simd intrinsics version without unrolling, doing a mod by 255 adds a lot of instructions to the inner loop:

```diff
 LBB2_17:
-        add     x10, x10, #1
         add     x2, x2, #16
-        cmp     x9, x10
-        b.eq    LBB2_4
+        sub     x11, x11, #1
+        add     x10, x10, #1
+        sub     x9, x9, #1
+        cbz     x9, LBB2_4
 LBB2_21:
+        umulh   x13, x10, x12
+        lsr     x13, x13, #7
+        sub     x14, x11, x13
+        add     x13, x14, x13, lsl #8
         ldr     q2, [x0, x2]
         and.16b v2, v2, v1
         add.16b v0, v2, v0
-        mvn     w11, w10
-        tst     x11, #0x7f
-        b.ne    LBB2_17
+        cbnz    x13, LBB2_17

I'd have thought it wouldn't matter as much for the 10x version, but I didn't expand the macro to look at that.
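
That instruction-count difference is what you would expect from the modulus alone: 256 is a power of two, so the flush check is a single bit test of the counter, while a modulus by 255 has to be lowered to a multiply-high reciprocal sequence (the `umulh`/`lsr`/`sub` above). A standalone illustration, independent of the article's code:

```rust
// In release builds, the first check reduces to testing the low 8 bits of `i`,
// while the second is typically lowered to a umulh-based reciprocal sequence
// on aarch64, adding several instructions per loop iteration.
pub fn flush_every_256(i: usize) -> bool {
    i % 256 == 255
}

pub fn flush_every_255(i: usize) -> bool {
    i % 255 == 254
}
```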
