Inline the fast path of BitWriter::send_bits #258
Conversation
The "equal" and "greater" branches yield the same result when total_bits == Self::BIT_BUF_SIZE, so this change combines them to eliminate one conditional branch whenever the bit_buffer is full. Benchmark results before: ``` hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar" Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar Time (mean ± σ): 92.2 ms ± 1.2 ms [User: 79.9 ms, System: 12.3 ms] Range (min … max): 90.1 ms … 96.6 ms 100 runs ``` and with this change applied: ``` hyperfine -m 100 --warmup 10 "./target/release/examples/compress 1 rs silesia-small.tar" Benchmark 1: ./target/release/examples/compress 1 rs silesia-small.tar Time (mean ± σ): 91.3 ms ± 0.9 ms [User: 79.0 ms, System: 12.2 ms] Range (min … max): 89.7 ms … 95.2 ms 100 runs ```
Hi! For benchmarking, it is crucial that you actually race the two implementations against each other: just comparing wall-clock times is unreliable, especially when the times are so close together (and in fact their ranges overlap). When I run this commit against its base commit, I actually observe a regression.
(I'm using https://github.com/andrewrk/poop here.) The branch that you remove is not really conditional in practice: it is very well-predicted. While it doesn't look like it, this code is actually fairly optimized and it's hard to do better (I've tried!). So, I guess, see what you get when you run the two benchmarks alongside each other (hyperfine accepts multiple commands to benchmark; you just need to copy the binary from before your change somewhere so you can still use it after applying your change).
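For example (binary name and paths assumed), saving the baseline binary and passing both commands to a single hyperfine invocation makes it print a relative comparison:

```
cp ./target/release/examples/compress ./compress-baseline   # built at the base commit
hyperfine -m 100 --warmup 10 \
  "./compress-baseline 1 rs silesia-small.tar" \
  "./target/release/examples/compress 1 rs silesia-small.tar"
```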
Thanks, I'll adopt the same benchmarking method. In my initial test run using the same benchmark driver, I see an improvement in branch mispredictions, although it's offset by an increase in instructions executed:

I suspect the difference in branch mispredictions compared to your results might be due to running on different CPU models. But I haven't yet figured out why my patch causes an increase in instructions. I'll look at it more later.
Benchmark at compression level 1:

```
Benchmark 1 (60 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max          outliers        delta
  wall_time          83.5ms ± 2.24ms    81.0ms … 89.9ms       8 (13%)         0%
  peak_rss           26.7MB ± 65.2KB    26.6MB … 26.7MB       0 ( 0%)         0%
  cpu_cycles           298M ± 1.48M       297M …   307M       4 ( 7%)         0%
  instructions         645M ±   270       645M …   645M       0 ( 0%)         0%
  cache_references     403K ± 10.9K       396K …   462K       5 ( 8%)         0%
  cache_misses         301K ± 10.3K       262K …   319K       3 ( 5%)         0%
  branch_misses       3.06M ± 9.56K      3.04M …  3.08M       0 ( 0%)         0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max          outliers        delta
  wall_time          81.8ms ± 1.57ms    78.1ms … 88.2ms       1 ( 2%)       ⚡- 2.0% ± 0.8%
  peak_rss           26.7MB ± 73.8KB    26.5MB … 26.7MB       0 ( 0%)        - 0.1% ± 0.1%
  cpu_cycles           293M ± 2.67M       292M …   313M       2 ( 3%)       ⚡- 1.7% ± 0.3%
  instructions         612M ±   280       612M …   612M       0 ( 0%)       ⚡- 5.1% ± 0.0%
  cache_references     402K ± 11.6K       394K …   472K       7 (11%)        - 0.1% ± 1.0%
  cache_misses         304K ± 6.26K       287K …   325K       1 ( 2%)        + 1.2% ± 1.0%
  branch_misses       3.05M ± 3.70K      3.04M …  3.06M       0 ( 0%)        - 0.3% ± 0.1%
```

The improvement decreases as the compression level increases. For example, at level 3:

```
Benchmark 1 (34 runs): ./compress-baseline 3 rs silesia-small.tar
  measurement          mean ± σ            min … max          outliers        delta
  wall_time           150ms ± 3.71ms      147ms …  169ms      3 ( 9%)         0%
  peak_rss           24.7MB ± 73.0KB    24.5MB … 24.8MB       0 ( 0%)         0%
  cpu_cycles           596M ± 4.91M       594M …   619M       3 ( 9%)         0%
  instructions        1.47G ±   340      1.47G …  1.47G       1 ( 3%)         0%
  cache_references     377K ± 6.86K       371K …   408K       4 (12%)         0%
  cache_misses         287K ± 4.20K       277K …   295K       3 ( 9%)         0%
  branch_misses       7.72M ± 7.17K      7.71M …  7.74M       0 ( 0%)         0%
Benchmark 2 (34 runs): ./target/release/examples/compress 3 rs silesia-small.tar
  measurement          mean ± σ            min … max          outliers        delta
  wall_time           150ms ± 1.42ms      148ms …  156ms      1 ( 3%)        + 0.4% ± 0.9%
  peak_rss           24.7MB ± 62.2KB    24.6MB … 24.8MB       0 ( 0%)        + 0.0% ± 0.1%
  cpu_cycles           601M ± 4.16M       598M …   624M       1 ( 3%)        + 0.7% ± 0.4%
  instructions        1.46G ±   270      1.46G …  1.46G       0 ( 0%)       ⚡- 1.1% ± 0.0%
  cache_references     376K ± 4.08K       371K …   388K       1 ( 3%)        - 0.3% ± 0.7%
  cache_misses         290K ± 5.37K       275K …   297K       5 (15%)        + 0.9% ± 0.8%
  branch_misses       7.70M ± 5.88K      7.69M …  7.71M       0 ( 0%)        - 0.3% ± 0.0%
```
I tried a different approach in 76b6c14: keeping the original logic, but inlining the fast path. It seems to work on my test system.
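Reusing the toy `BitWriter` from the sketch earlier in this thread (still only an assumption about the real code's shape, and the `#[inline(always)]` hint is likewise a guess), the split looks roughly like this: the original branching is kept, but only the small fast path gets inlined into callers, while the overflow handling moves into the separate `send_bits_overflow` seen in the diff below. This `send_bits` replaces the merged-arm version from the earlier sketch:

```rust
impl BitWriter {
    /// Fast path kept small so the compiler can inline it into every caller.
    #[inline(always)]
    fn send_bits(&mut self, val: u64, len: u8) {
        debug_assert!(len < Self::BIT_BUF_SIZE);
        let total_bits = self.bits_used + len;
        if total_bits < Self::BIT_BUF_SIZE {
            // common case: the bits fit in the buffer
            self.bit_buffer |= val << self.bits_used;
            self.bits_used = total_bits;
        } else {
            self.send_bits_overflow(val, total_bits);
        }
    }

    /// Slow path: flush the full 64-bit buffer, then keep the leftover bits.
    fn send_bits_overflow(&mut self, val: u64, total_bits: u8) {
        self.bit_buffer |= val << self.bits_used;
        self.out.extend_from_slice(&self.bit_buffer.to_le_bytes());
        self.bit_buffer = val >> (Self::BIT_BUF_SIZE - self.bits_used);
        self.bits_used = total_bits - Self::BIT_BUF_SIZE;
    }
}
```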
The review comment below is anchored on the new `send_bits_overflow` function in the diff:

```rust
}
}

fn send_bits_overflow(&mut self, val: u64, total_bits: u8) {
```
You can try whether adding `#[cold]` here makes a difference. It doesn't really for me. Otherwise this looks good!
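As a placement sketch against the toy code above (not the real source), the suggestion amounts to annotating the slow-path function:

```rust
impl BitWriter {
    #[cold] // hint that this function is unlikely to be called, which
            // discourages inlining and keeps it out of the hot code's layout
    fn send_bits_overflow(&mut self, val: u64, total_bits: u8) {
        /* body unchanged from the sketch above */
    }
}
```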
I just tried with `#[cold]`:
The "equal" and "greater" branches yield the same result when total_bits == Self::BIT_BUF_SIZE, so this change combines them to eliminate one conditional branch whenever the bit_buffer is full.
Benchmark results before:
and with this change applied: