-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rearrange fields in struct State for cache locality #277
Rearrange fields in struct State for cache locality #277
Conversation
Well, the change is not statistically significant (that is what the lightning ⚡ emoji indicates), so on its own this doesn't really do anything. When combined with the other changes, is the performance gap smaller? |
With the additional changes in b2c6494, I'm getting an improvement in CPU cycles.
|
(Tested on aarch64-apple-darwin)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting some mixed results, with a small regression at level 3
Benchmark 2 (33 runs): target/release/examples/blogpost-compress 3 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 154ms ± 2.95ms 152ms … 168ms 1 ( 3%) 💩+ 2.1% ± 0.8%
peak_rss 24.7MB ± 54.4KB 24.6MB … 24.8MB 7 (21%) + 0.1% ± 0.1%
cpu_cycles 639M ± 11.5M 631M … 695M 3 ( 9%) 💩+ 1.9% ± 0.8%
instructions 1.56G ± 360 1.56G … 1.56G 2 ( 6%) + 0.5% ± 0.0%
cache_references 43.8M ± 467K 43.1M … 44.9M 1 ( 3%) + 0.1% ± 0.6%
cache_misses 1.14M ± 290K 837K … 1.99M 0 ( 0%) + 13.7% ± 14.5%
branch_misses 7.79M ± 6.15K 7.78M … 7.81M 3 ( 9%) - 0.0% ± 0.0%
I think the best path forward is to make the changes to the types that we want to make (use u16 instead of usize, compute instead of load values), add whatever padding is required, and then when the remaining fields have the desired types, try to remove the padding.
Any progress this PR might make could just be negated by later changes when field sizes change.
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))] | ||
#[test] | ||
fn state_layout() { | ||
use memoffset::offset_of; | ||
|
||
// Empirically, deflate performance depends on the layout of fields within | ||
// the State structure. If you change the order of fields and this test starts | ||
// failing, the recommended action is to run some benchmark tests. If there | ||
// is no surprise performance regression with the new layout, please move | ||
// the _cache_line_N markers to the corresponding right locations in the | ||
// struct to make this test pass once again. | ||
assert_eq!(offset_of!(State, status), 0); | ||
assert_eq!(offset_of!(State, strstart), 8); | ||
assert_eq!(offset_of!(State, _cache_line_1), 64); | ||
assert_eq!(offset_of!(State, _cache_line_2), 128); | ||
assert_eq!(offset_of!(State, _cache_line_3), 192); | ||
assert_eq!(offset_of!(State, _cache_line_4), 256); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could instead by a bunch of compile-time assertions
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
mod _cache_lines {
use super::State;
use core::mem::offset_of;
// Empirically, deflate performance depends on the layout of fields within
// the State structure. If you change the order of fields and this test starts
// failing, the recommended action is to run some benchmark tests. If there
// is no surprise performance regression with the new layout, please move
// the _cache_line_N markers to the corresponding right locations in the
// struct to make this test pass once again.
const _: () = assert!(offset_of!(State, status) == 0);
const _: () = assert!(offset_of!(State, strstart) == 8);
const _: () = assert!(offset_of!(State, _cache_line_1) == 64);
const _: () = assert!(offset_of!(State, _cache_line_2) == 128);
const _: () = assert!(offset_of!(State, _cache_line_3) == 192);
const _: () = assert!(offset_of!(State, _cache_line_4) == 256);
}
also with this approach, no external dependency is needed (I think right now that might cause issues with our msrv, because offset_of
is kind of new, but we could consider bumping it from 1.75 to 1.77 to work around that dependency.
I think it would be a good idea to split these cache line markers into their own PR, so they are easy to disentangle from any changes that influence performance.
maybe try if CI passes with the core::mem::offset_of
, otherwise using the external crate is fine for now.
Thanks for reviewing! I’ll work on separate PRs for the cache line checks, replacing usize with u16/calculations, and then rearranging the fields to move fields that are often used together into the same cache line. |
On my test system, this yields a small improvement in CPU cycles,