
fix simd accumulator overflow #1

Merged
merged 3 commits into tommyip:master on Jul 16, 2023

Conversation

PeterFaiman
Contributor


The intrinsics version (`opt4_simd`) has a bug:

>```rust
>// Flush accumulator every 256 iterations to avoid overflow
>if block_i % (u8::MAX as usize + 1) == u8::MAX as usize {
>```
>
>The function is logically the same as our previous algorithm, with the only change being that of flushing our vector accumulator every 256 iterations to prevent overflowing each u8 lane.

Flushing every 256 iterations isn't enough, because you can potentially encounter 256 `'s'` in 256 iterations, thus overflowing the max of 255. Using the function definitions from the article:

```rust
fn main() {
    let input = "s".repeat(1024*1024);
    println!("{}", baseline(&input));
    println!("{}", opt4_simd(&input));
}
```

Produces:

```
1048576
-1040384
```

The fix, performance untested:

```rust
// Flush accumulator every 255 iterations to avoid overflow
if block_i % u8::MAX as usize == (u8::MAX - 1) as usize {
```

Correctly produces:

```
1048576
1048576
```

https://lobste.rs/s/sqn7m0/n_times_faster_than_c_where_n_128#c_iooycv
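
For context, here is a minimal sketch of the accumulator pattern under discussion. This is not the article's exact `opt4_simd`: the function name, the 's'-only counting, and the tail handling are simplifications of mine, and it assumes aarch64 NEON intrinsics.

```rust
#[cfg(target_arch = "aarch64")]
unsafe fn count_s_neon(input: &[u8]) -> i64 {
    use std::arch::aarch64::*;

    let n_blocks = input.len() / 16;
    let v_s = vdupq_n_u8(b's');
    let mut v_acc = vdupq_n_u8(0); // 16 per-lane u8 counters
    let mut total: i64 = 0;

    for block_i in 0..n_blocks {
        let block = vld1q_u8(input.as_ptr().add(block_i * 16));
        // vceqq_u8 yields 0xFF in each matching lane; AND with 1 turns that into +1 per match.
        let ones = vandq_u8(vceqq_u8(block, v_s), vdupq_n_u8(1));
        v_acc = vaddq_u8(v_acc, ones);
        // Each lane gains at most 1 per iteration, so it must be drained before
        // 256 iterations have passed; flushing every 255 iterations is the latest safe point.
        if block_i % u8::MAX as usize == (u8::MAX - 1) as usize {
            total += vaddlvq_u8(v_acc) as i64; // widening horizontal sum of all 16 lanes
            v_acc = vdupq_n_u8(0);
        }
    }
    total += vaddlvq_u8(v_acc) as i64;
    // The real code also handles the input.len() % 16 tail bytes; omitted here.
    total
}
```

With the original `% (u8::MAX as usize + 1)` condition, the drain happens one iteration too late, which is exactly the overflow demonstrated above.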
@PeterFaiman
Contributor Author

PeterFaiman commented Jul 15, 2023

The new test I added also caused the chunked versions to overflow, so I committed an additional fix changing the chunk size to 255.
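
For context, the chunked versions sum each chunk's matches into a `u8` before widening, so a chunk of 256 `'s'` bytes overflows in exactly the same way. A rough sketch of the idea (not the repository's exact `opt6_chunk_count`; the names and details here are illustrative):

```rust
const CHUNK_SIZE: usize = 255; // was 256; a chunk of 256 's' bytes overflows the u8 sum

fn chunk_count(input: &str) -> i64 {
    let n = input.len() as i64;
    let count_s: i64 = input
        .as_bytes()
        .chunks(CHUNK_SIZE)
        .map(|chunk| {
            // At most CHUNK_SIZE matches per chunk, so the u8 sum stays in range
            // as long as CHUNK_SIZE <= 255.
            chunk.iter().map(|&b| (b == b's') as u8).sum::<u8>() as i64
        })
        .sum();
    // The input contains only 's' and 'p', so count('s') - count('p') = 2 * count('s') - n.
    2 * count_s - n
}
```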

Unfortunately this regresses performance:

```
run_switches/opt6_chunk_exact_count
                        time:   [14.737 µs 14.740 µs 14.744 µs]
                        thrpt:  [63.167 GiB/s 63.182 GiB/s 63.196 GiB/s]
                 change:
                        time:   [+35.081% +35.355% +35.637%] (p = 0.00 < 0.05)
                        thrpt:  [-26.274% -26.120% -25.970%]
                        Performance has regressed.
```

Test machine: MacBook Air, 15-inch, M2, 2023.

@PeterFaiman
Contributor Author

The fix also regresses the original fastest (10x unrolled), but not by as much:

```
run_switches/opt5_simd_unrolled_10x
                        time:   [25.888 µs 25.892 µs 25.898 µs]
                        thrpt:  [35.962 GiB/s 35.969 GiB/s 35.976 GiB/s]
                 change:
                        time:   [+5.4901% +5.7168% +5.9516%] (p = 0.00 < 0.05)
                        thrpt:  [-5.6173% -5.4077% -5.2044%]
                        Performance has regressed.
```

```
$ RUST_BACKTRACE=1 cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.03s
     Running unittests src/lib.rs (target/debug/deps/n_times_faster_than_c-39ece82b67ddb14c)

running 3 tests
test tests::test_simple ... ok
test tests::test_all_s ... FAILED
test tests::test_large ... ok

failures:

---- tests::test_all_s stdout ----
thread 'tests::test_all_s' panicked at 'attempt to add with overflow', /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:149:1
stack backtrace:
   0: rust_begin_unwind
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:117:5
   3: <u8 as core::iter::traits::accum::Sum>::sum::{{closure}}
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:53:28
   4: core::iter::adapters::map::map_fold::{{closure}}
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:84:21
   5: core::iter::traits::iterator::Iterator::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:2481:21
   6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9
   7: <u8 as core::iter::traits::accum::Sum>::sum
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:50:17
   8: core::iter::traits::iterator::Iterator::sum
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:3476:9
   9: n_times_faster_than_c::opt6_chunk_count::{{closure}}
             at ./src/lib.rs:125:22
  10: core::iter::adapters::map::map_fold::{{closure}}
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:84:28
  11: core::iter::traits::iterator::Iterator::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:2481:21
  12: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9
  13: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9
  14: <i64 as core::iter::traits::accum::Sum>::sum
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:50:17
  15: core::iter::traits::iterator::Iterator::sum
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:3476:9
  16: n_times_faster_than_c::opt6_chunk_count
             at ./src/lib.rs:122:15
  17: n_times_faster_than_c::tests::test_all_s
             at ./src/lib.rs:195:9
  18: n_times_faster_than_c::tests::test_all_s::{{closure}}
             at ./src/lib.rs:192:21
  19: core::ops::function::FnOnce::call_once
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:250:5
  20: core::ops::function::FnOnce::call_once
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

failures:
    tests::test_all_s

test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.29s

error: test failed, to rerun pass `--lib`
```
@PeterFaiman
Contributor Author

The large performance regression in the chunked versions was clearly because the chunks were no longer well aligned. The real fix is to pick a chunk size below 256 that is still well aligned: 128. I amended my commit to use 128, which recovered some of the losses, and then amended it again to use 192, which recovers all of the losses for opt6_chunk_exact_count and most of them for opt6_chunk_count. I tried going higher, to 224 (256 - 32) and 240 (256 - 16), but both were worse than 192. The final difference between 256 and 192, as measured on my machine:

```
run_switches/opt6_chunk_count
                        time:   [13.791 µs 13.793 µs 13.796 µs]
                        thrpt:  [67.507 GiB/s 67.519 GiB/s 67.529 GiB/s]
                 change:
                        time:   [+14.869% +15.136% +15.373%] (p = 0.00 < 0.05)
                        thrpt:  [-13.325% -13.146% -12.945%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
run_switches/opt6_chunk_exact_count
                        time:   [10.885 µs 10.886 µs 10.888 µs]
                        thrpt:  [85.536 GiB/s 85.550 GiB/s 85.562 GiB/s]
                 change:
                        time:   [+0.2330% +0.4504% +0.6870%] (p = 0.00 < 0.05)
                        thrpt:  [-0.6823% -0.4484% -0.2324%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
```

Flushing every 255 iterations regressed performance by ~5%, but this
eliminates the regression.

```
run_switches/opt5_simd_unrolled_10x
                        time:   [24.636 µs 24.641 µs 24.647 µs]
                        thrpt:  [37.786 GiB/s 37.795 GiB/s 37.804 GiB/s]
                 change:
                        time:   [+0.2771% +0.5216% +0.7572%] (p = 0.00 < 0.05)
                        thrpt:  [-0.7516% -0.5189% -0.2763%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
```
@PeterFaiman
Contributor Author

Flushing the accumulator every 128 iterations (vs 255) in the intrinsics versions reversed the regressions there as well.
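
Presumably the flush condition in the intrinsics versions then becomes something along these lines (hypothetical snippet, not the merged diff; 128 is a power of two, so the modulus stays a cheap bit test):

```rust
// Flush accumulator every 128 iterations to avoid overflow; 128 < 256 keeps
// each u8 lane in range, and the power-of-two modulus compiles to a bit test.
if block_i % 128 == 127 {
```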

@tommyip
Owner

tommyip commented Jul 16, 2023

Good catch! This only slows the best version down by ~0.5%, which is good news. Do you know why flushing at an unaligned interval causes such a large slowdown for the SIMD intrinsics version?

tommyip merged commit c85c899 into tommyip:master on Jul 16, 2023
@PeterFaiman
Contributor Author

Not sure. Looking at the simd intrinsics version without unrolling, doing a mod by 255 adds a lot of instructions to the inner loop:

```diff
 LBB2_17:
-        add     x10, x10, #1
         add     x2, x2, #16
-        cmp     x9, x10
-        b.eq    LBB2_4
+        sub     x11, x11, #1
+        add     x10, x10, #1
+        sub     x9, x9, #1
+        cbz     x9, LBB2_4
 LBB2_21:
+        umulh   x13, x10, x12
+        lsr     x13, x13, #7
+        sub     x14, x11, x13
+        add     x13, x14, x13, lsl #8
         ldr     q2, [x0, x2]
         and.16b v2, v2, v1
         add.16b v0, v2, v0
-        mvn     w11, w10
-        tst     x11, #0x7f
-        b.ne    LBB2_17
+        cbnz    x13, LBB2_17

I'd have thought it wouldn't matter as much for the 10x version, but I didn't expand the macro to look at that.
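
That instruction-count difference is what you would expect from the modulus alone: 256 is a power of two, so the flush check is a single bit test of the counter, while a modulus by 255 has to be lowered to a multiply-high reciprocal sequence (the `umulh`/`lsr`/`sub` above). A standalone illustration, independent of the article's code:

```rust
// In release builds, the first check reduces to testing the low 8 bits of `i`,
// while the second is typically lowered to a umulh-based reciprocal sequence
// on aarch64, adding several instructions per loop iteration.
pub fn flush_every_256(i: usize) -> bool {
    i % 256 == 255
}

pub fn flush_every_255(i: usize) -> bool {
    i % 255 == 254
}
```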
