Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline the fast path of BitWriter::send_bits #258

Merged
merged 3 commits into from
Dec 12, 2024

Conversation

brian-pane
Copy link

The "equal" and "greater" branches yield the same result when total_bits == Self::BIT_BUF_SIZE, so this change combines them to eliminate one conditional branch whenever the bit_buffer is full.

Benchmark results before:

hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar"
Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar
  Time (mean ± σ):      92.2 ms ±   1.2 ms    [User: 79.9 ms, System: 12.3 ms]
  Range (min … max):    90.1 ms …  96.6 ms    100 runs

and with this change applied:

hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar"
Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar
  Time (mean ± σ):      91.3 ms ±   0.9 ms    [User: 79.0 ms, System: 12.2 ms]
  Range (min … max):    89.7 ms …  95.2 ms    100 runs

The "equal" and "greater" branches yield the same result when
total_bits == Self::BIT_BUF_SIZE, so this change combines them
to eliminate one conditional branch whenever the bit_buffer is
full.

Benchmark results before:
```
hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar"
Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar
  Time (mean ± σ):      92.2 ms ±   1.2 ms    [User: 79.9 ms, System: 12.3 ms]
  Range (min … max):    90.1 ms …  96.6 ms    100 runs
```
and with this change applied:
```
hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar"
Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar
  Time (mean ± σ):      91.3 ms ±   0.9 ms    [User: 79.0 ms, System: 12.2 ms]
  Range (min … max):    89.7 ms …  95.2 ms    100 runs
```
@folkertdev
Copy link
Collaborator

Hi,

For benchmarking it is crucial that you actually race the two implementations against each other: just comparing wall clock times is unreliable, especially when the times are so close together (and in fact their ranges overlap).

When I run this commit agains its base commit, I actually observe a regression

Benchmark 1 (65 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          77.2ms ±  844us    76.2ms … 82.6ms          4 ( 6%)        0%
  peak_rss           26.7MB ± 64.3KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          283M  ± 2.02M      279M  …  292M           4 ( 6%)        0%
  instructions        633M  ±  277       633M  …  633M           0 ( 0%)        0%
  cache_references   20.0M  ±  162K     19.7M  … 20.7M           2 ( 3%)        0%
  cache_misses        466K  ± 67.0K      374K  …  735K           2 ( 3%)        0%
  branch_misses      3.06M  ± 7.05K     3.05M  … 3.09M           1 ( 2%)        0%
Benchmark 2 (64 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          79.1ms ±  422us    78.5ms … 80.3ms          2 ( 3%)        💩+  2.5% ±  0.3%
  peak_rss           26.7MB ± 68.6KB    26.5MB … 26.7MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          292M  ± 1.31M      290M  …  297M           2 ( 3%)        💩+  3.3% ±  0.2%
  instructions        659M  ±  300       659M  …  659M           1 ( 2%)        💩+  4.1% ±  0.0%
  cache_references   19.9M  ±  190K     19.6M  … 20.5M           1 ( 2%)          -  0.8% ±  0.3%
  cache_misses        455K  ± 79.0K      353K  …  723K           1 ( 2%)          -  2.3% ±  5.4%
  branch_misses      3.11M  ± 2.09K     3.11M  … 3.12M           0 ( 0%)        💩+  1.6% ±  0.1%

(I'm using https://github.com/andrewrk/poop here)

The branch that you remove is not actually conditional in practice: it is very well-predicted. While it doesn't look like it, this code is actually fairly optimized and it's hard to do better (I've tried!).

So, I guess, see what you get when you run the two benchmarks alongside each other (hyperfine accepts multiple things to benchmark, you just need to copy the binary from before your change somewhere so you can use it after applying your change).

@brian-pane
Copy link
Author

Thanks, I'll adopt the same benchmark method. In my initial test run using the same benchmark driver, I see an improvement in branch mispredictions, although it's offset by an increase in instructions executed:

poop "./compress-baseline 1 rs silesia-small.tar" "./target/release/examples/compress 1 rs silesia-small.tar" 
Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          91.6ms ± 1.51ms    90.3ms …  101ms          4 ( 7%)        0%
  peak_rss           26.7MB ± 67.0KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          341M  ±  706K      340M  …  343M           0 ( 0%)        0%
  instructions        748M  ±  234       748M  …  748M           0 ( 0%)        0%
  cache_references    399K  ± 2.02K      396K  …  406K           3 ( 5%)        0%
  cache_misses        294K  ± 8.94K      266K  …  313K           6 (11%)        0%
  branch_misses      3.28M  ± 4.31K     3.27M  … 3.29M           0 ( 0%)        0%
Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          90.5ms ±  541us    88.6ms … 92.0ms          6 (11%)          -  1.2% ±  0.5%
  peak_rss           26.7MB ± 64.8KB    26.5MB … 26.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          339M  ±  578K      338M  …  341M           0 ( 0%)          -  0.6% ±  0.1%
  instructions        771M  ±  257       771M  …  771M           0 ( 0%)        💩+  3.1% ±  0.0%
  cache_references    398K  ± 2.55K      396K  …  412K           7 (13%)          -  0.1% ±  0.2%
  cache_misses        293K  ± 9.10K      266K  …  304K           9 (16%)          -  0.5% ±  1.2%
  branch_misses      3.20M  ± 10.6K     3.18M  … 3.22M           0 ( 0%)        ⚡-  2.4% ±  0.1%

I suspect the difference in branch mispredictions compared to your results might be due to running on different CPU models. But I haven't yet figured out why my patch causes an increase in instructions. I'll look at it more later.

brian-pane and others added 2 commits December 12, 2024 11:14
Benchmark at compression level 1:
```
Benchmark 1 (60 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          83.5ms ± 2.24ms    81.0ms … 89.9ms          8 (13%)        0%
  peak_rss           26.7MB ± 65.2KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          298M  ± 1.48M      297M  …  307M           4 ( 7%)        0%
  instructions        645M  ±  270       645M  …  645M           0 ( 0%)        0%
  cache_references    403K  ± 10.9K      396K  …  462K           5 ( 8%)        0%
  cache_misses        301K  ± 10.3K      262K  …  319K           3 ( 5%)        0%
  branch_misses      3.06M  ± 9.56K     3.04M  … 3.08M           0 ( 0%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.8ms ± 1.57ms    78.1ms … 88.2ms          1 ( 2%)        ⚡-  2.0% ±  0.8%
  peak_rss           26.7MB ± 73.8KB    26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          293M  ± 2.67M      292M  …  313M           2 ( 3%)        ⚡-  1.7% ±  0.3%
  instructions        612M  ±  280       612M  …  612M           0 ( 0%)        ⚡-  5.1% ±  0.0%
  cache_references    402K  ± 11.6K      394K  …  472K           7 (11%)          -  0.1% ±  1.0%
  cache_misses        304K  ± 6.26K      287K  …  325K           1 ( 2%)          +  1.2% ±  1.0%
  branch_misses      3.05M  ± 3.70K     3.04M  … 3.06M           0 ( 0%)          -  0.3% ±  0.1%
```

The improvement decreases as the compression level increases.
For example, at level 3,
```
Benchmark 1 (34 runs): ./compress-baseline 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           150ms ± 3.71ms     147ms …  169ms          3 ( 9%)        0%
  peak_rss           24.7MB ± 73.0KB    24.5MB … 24.8MB          0 ( 0%)        0%
  cpu_cycles          596M  ± 4.91M      594M  …  619M           3 ( 9%)        0%
  instructions       1.47G  ±  340      1.47G  … 1.47G           1 ( 3%)        0%
  cache_references    377K  ± 6.86K      371K  …  408K           4 (12%)        0%
  cache_misses        287K  ± 4.20K      277K  …  295K           3 ( 9%)        0%
  branch_misses      7.72M  ± 7.17K     7.71M  … 7.74M           0 ( 0%)        0%
Benchmark 2 (34 runs): ./target/release/examples/compress 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           150ms ± 1.42ms     148ms …  156ms          1 ( 3%)          +  0.4% ±  0.9%
  peak_rss           24.7MB ± 62.2KB    24.6MB … 24.8MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          601M  ± 4.16M      598M  …  624M           1 ( 3%)          +  0.7% ±  0.4%
  instructions       1.46G  ±  270      1.46G  … 1.46G           0 ( 0%)        ⚡-  1.1% ±  0.0%
  cache_references    376K  ± 4.08K      371K  …  388K           1 ( 3%)          -  0.3% ±  0.7%
  cache_misses        290K  ± 5.37K      275K  …  297K           5 (15%)          +  0.9% ±  0.8%
  branch_misses      7.70M  ± 5.88K     7.69M  … 7.71M           0 ( 0%)          -  0.3% ±  0.0%
```
@brian-pane brian-pane changed the title Small speedup by simplifying BitWriter::send_bits Inline the fast path of BitWriter::send_bits Dec 12, 2024
@brian-pane
Copy link
Author

I tried a different approach in 76b6c14: keeping the original logic, but inlining the fast path. It seems to work on my test system.

}
}

fn send_bits_overflow(&mut self, val: u64, total_bits: u8) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can try if adding #[cold] here makes a difference. It doesn't really for me. Otherwise this looks good!

@brian-pane
Copy link
Author

I just tried with #[cold] and the results look basically unchanged on my system.

@folkertdev folkertdev merged commit 4dda12a into trifectatechfoundation:main Dec 12, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants