Inline the fast path of BitWriter::send_bits #258

brian-pane · 2024-12-08T04:55:22Z

The "equal" and "greater" branches yield the same result when total_bits == Self::BIT_BUF_SIZE, so this change combines them to eliminate one conditional branch whenever the bit_buffer is full.

Benchmark results before:

hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar"
Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar
  Time (mean ± σ):      92.2 ms ±   1.2 ms    [User: 79.9 ms, System: 12.3 ms]
  Range (min … max):    90.1 ms …  96.6 ms    100 runs

and with this change applied:

hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar"
Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar
  Time (mean ± σ):      91.3 ms ±   0.9 ms    [User: 79.0 ms, System: 12.2 ms]
  Range (min … max):    89.7 ms …  95.2 ms    100 runs

The "equal" and "greater" branches yield the same result when total_bits == Self::BIT_BUF_SIZE, so this change combines them to eliminate one conditional branch whenever the bit_buffer is full. Benchmark results before: ``` hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar" Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar Time (mean ± σ): 92.2 ms ± 1.2 ms [User: 79.9 ms, System: 12.3 ms] Range (min … max): 90.1 ms … 96.6 ms 100 runs ``` and with this change applied: ``` hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar" Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar Time (mean ± σ): 91.3 ms ± 0.9 ms [User: 79.0 ms, System: 12.2 ms] Range (min … max): 89.7 ms … 95.2 ms 100 runs ```

folkertdev · 2024-12-08T10:25:59Z

Hi,

For benchmarking it is crucial that you actually race the two implementations against each other: just comparing wall clock times is unreliable, especially when the times are so close together (and in fact their ranges overlap).

When I run this commit agains its base commit, I actually observe a regression

Benchmark 1 (65 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          77.2ms ±  844us    76.2ms … 82.6ms          4 ( 6%)        0%
  peak_rss           26.7MB ± 64.3KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          283M  ± 2.02M      279M  …  292M           4 ( 6%)        0%
  instructions        633M  ±  277       633M  …  633M           0 ( 0%)        0%
  cache_references   20.0M  ±  162K     19.7M  … 20.7M           2 ( 3%)        0%
  cache_misses        466K  ± 67.0K      374K  …  735K           2 ( 3%)        0%
  branch_misses      3.06M  ± 7.05K     3.05M  … 3.09M           1 ( 2%)        0%
Benchmark 2 (64 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          79.1ms ±  422us    78.5ms … 80.3ms          2 ( 3%)        💩+  2.5% ±  0.3%
  peak_rss           26.7MB ± 68.6KB    26.5MB … 26.7MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          292M  ± 1.31M      290M  …  297M           2 ( 3%)        💩+  3.3% ±  0.2%
  instructions        659M  ±  300       659M  …  659M           1 ( 2%)        💩+  4.1% ±  0.0%
  cache_references   19.9M  ±  190K     19.6M  … 20.5M           1 ( 2%)          -  0.8% ±  0.3%
  cache_misses        455K  ± 79.0K      353K  …  723K           1 ( 2%)          -  2.3% ±  5.4%
  branch_misses      3.11M  ± 2.09K     3.11M  … 3.12M           0 ( 0%)        💩+  1.6% ±  0.1%

(I'm using https://github.com/andrewrk/poop here)

The branch that you remove is not actually conditional in practice: it is very well-predicted. While it doesn't look like it, this code is actually fairly optimized and it's hard to do better (I've tried!).

So, I guess, see what you get when you run the two benchmarks alongside each other (hyperfine accepts multiple things to benchmark, you just need to copy the binary from before your change somewhere so you can use it after applying your change).

brian-pane · 2024-12-08T17:12:37Z

Thanks, I'll adopt the same benchmark method. In my initial test run using the same benchmark driver, I see an improvement in branch mispredictions, although it's offset by an increase in instructions executed:

poop "./compress-baseline 1 rs silesia-small.tar" "./target/release/examples/compress 1 rs silesia-small.tar" 
Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          91.6ms ± 1.51ms    90.3ms …  101ms          4 ( 7%)        0%
  peak_rss           26.7MB ± 67.0KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          341M  ±  706K      340M  …  343M           0 ( 0%)        0%
  instructions        748M  ±  234       748M  …  748M           0 ( 0%)        0%
  cache_references    399K  ± 2.02K      396K  …  406K           3 ( 5%)        0%
  cache_misses        294K  ± 8.94K      266K  …  313K           6 (11%)        0%
  branch_misses      3.28M  ± 4.31K     3.27M  … 3.29M           0 ( 0%)        0%
Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          90.5ms ±  541us    88.6ms … 92.0ms          6 (11%)          -  1.2% ±  0.5%
  peak_rss           26.7MB ± 64.8KB    26.5MB … 26.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          339M  ±  578K      338M  …  341M           0 ( 0%)          -  0.6% ±  0.1%
  instructions        771M  ±  257       771M  …  771M           0 ( 0%)        💩+  3.1% ±  0.0%
  cache_references    398K  ± 2.55K      396K  …  412K           7 (13%)          -  0.1% ±  0.2%
  cache_misses        293K  ± 9.10K      266K  …  304K           9 (16%)          -  0.5% ±  1.2%
  branch_misses      3.20M  ± 10.6K     3.18M  … 3.22M           0 ( 0%)        ⚡-  2.4% ±  0.1%

I suspect the difference in branch mispredictions compared to your results might be due to running on different CPU models. But I haven't yet figured out why my patch causes an increase in instructions. I'll look at it more later.

Benchmark at compression level 1: ``` Benchmark 1 (60 runs): ./compress-baseline 1 rs silesia-small.tar measurement mean ± σ min … max outliers delta wall_time 83.5ms ± 2.24ms 81.0ms … 89.9ms 8 (13%) 0% peak_rss 26.7MB ± 65.2KB 26.6MB … 26.7MB 0 ( 0%) 0% cpu_cycles 298M ± 1.48M 297M … 307M 4 ( 7%) 0% instructions 645M ± 270 645M … 645M 0 ( 0%) 0% cache_references 403K ± 10.9K 396K … 462K 5 ( 8%) 0% cache_misses 301K ± 10.3K 262K … 319K 3 ( 5%) 0% branch_misses 3.06M ± 9.56K 3.04M … 3.08M 0 ( 0%) 0% Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar measurement mean ± σ min … max outliers delta wall_time 81.8ms ± 1.57ms 78.1ms … 88.2ms 1 ( 2%) ⚡- 2.0% ± 0.8% peak_rss 26.7MB ± 73.8KB 26.5MB … 26.7MB 0 ( 0%) - 0.1% ± 0.1% cpu_cycles 293M ± 2.67M 292M … 313M 2 ( 3%) ⚡- 1.7% ± 0.3% instructions 612M ± 280 612M … 612M 0 ( 0%) ⚡- 5.1% ± 0.0% cache_references 402K ± 11.6K 394K … 472K 7 (11%) - 0.1% ± 1.0% cache_misses 304K ± 6.26K 287K … 325K 1 ( 2%) + 1.2% ± 1.0% branch_misses 3.05M ± 3.70K 3.04M … 3.06M 0 ( 0%) - 0.3% ± 0.1% ``` The improvement decreases as the compression level increases. For example, at level 3, ``` Benchmark 1 (34 runs): ./compress-baseline 3 rs silesia-small.tar measurement mean ± σ min … max outliers delta wall_time 150ms ± 3.71ms 147ms … 169ms 3 ( 9%) 0% peak_rss 24.7MB ± 73.0KB 24.5MB … 24.8MB 0 ( 0%) 0% cpu_cycles 596M ± 4.91M 594M … 619M 3 ( 9%) 0% instructions 1.47G ± 340 1.47G … 1.47G 1 ( 3%) 0% cache_references 377K ± 6.86K 371K … 408K 4 (12%) 0% cache_misses 287K ± 4.20K 277K … 295K 3 ( 9%) 0% branch_misses 7.72M ± 7.17K 7.71M … 7.74M 0 ( 0%) 0% Benchmark 2 (34 runs): ./target/release/examples/compress 3 rs silesia-small.tar measurement mean ± σ min … max outliers delta wall_time 150ms ± 1.42ms 148ms … 156ms 1 ( 3%) + 0.4% ± 0.9% peak_rss 24.7MB ± 62.2KB 24.6MB … 24.8MB 0 ( 0%) + 0.0% ± 0.1% cpu_cycles 601M ± 4.16M 598M … 624M 1 ( 3%) + 0.7% ± 0.4% instructions 1.46G ± 270 1.46G … 1.46G 0 ( 0%) ⚡- 1.1% ± 0.0% cache_references 376K ± 4.08K 371K … 388K 1 ( 3%) - 0.3% ± 0.7% cache_misses 290K ± 5.37K 275K … 297K 5 (15%) + 0.9% ± 0.8% branch_misses 7.70M ± 5.88K 7.69M … 7.71M 0 ( 0%) - 0.3% ± 0.0% ```

brian-pane · 2024-12-12T21:31:58Z

I tried a different approach in 76b6c14: keeping the original logic, but inlining the fast path. It seems to work on my test system.

folkertdev · 2024-12-12T22:23:52Z

zlib-rs/src/deflate.rs

+        }
+    }
+
+    fn send_bits_overflow(&mut self, val: u64, total_bits: u8) {


you can try if adding #[cold] here makes a difference. It doesn't really for me. Otherwise this looks good!

brian-pane · 2024-12-12T22:30:34Z

I just tried with #[cold] and the results look basically unchanged on my system.

brian-pane and others added 2 commits December 12, 2024 11:14

Merge branch 'trifectatechfoundation:main' into deflate-perf

6639d2f

brian-pane changed the title ~~Small speedup by simplifying BitWriter::send_bits~~ Inline the fast path of BitWriter::send_bits Dec 12, 2024

folkertdev reviewed Dec 12, 2024

View reviewed changes

folkertdev approved these changes Dec 12, 2024

View reviewed changes

folkertdev merged commit 4dda12a into trifectatechfoundation:main Dec 12, 2024
20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inline the fast path of BitWriter::send_bits #258

Inline the fast path of BitWriter::send_bits #258

brian-pane commented Dec 8, 2024

folkertdev commented Dec 8, 2024

brian-pane commented Dec 8, 2024

brian-pane commented Dec 12, 2024

folkertdev Dec 12, 2024

brian-pane commented Dec 12, 2024

Inline the fast path of BitWriter::send_bits #258

Inline the fast path of BitWriter::send_bits #258

Conversation

brian-pane commented Dec 8, 2024

folkertdev commented Dec 8, 2024

brian-pane commented Dec 8, 2024

brian-pane commented Dec 12, 2024

folkertdev Dec 12, 2024

Choose a reason for hiding this comment

brian-pane commented Dec 12, 2024