
Rearrange fields in struct State for cache locality #277

Closed

Conversation

brian-pane

On my test system, this yields a small improvement in CPU cycles:

Benchmark 1 (68 runs): ./blogpost-compress-baseline-native 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.8ms ± 1.80ms    72.0ms … 86.1ms          1 ( 1%)        0%
  peak_rss           26.6MB ± 84.6KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          279M  ± 1.11M      278M  …  287M           2 ( 3%)        0%
  instructions        568M  ±  234       568M  …  568M           1 ( 1%)        0%
  cache_references    265K  ± 4.64K      262K  …  298K           7 (10%)        0%
  cache_misses        233K  ± 6.59K      207K  …  244K           6 ( 9%)        0%
  branch_misses      2.90M  ± 4.04K     2.89M  … 2.91M           1 ( 1%)        0%
Benchmark 2 (69 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.0ms ±  884us    71.4ms … 77.8ms          6 ( 9%)          -  1.0% ±  0.6%
  peak_rss           26.6MB ± 97.7KB    26.5MB … 26.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          277M  ± 1.76M      276M  …  290M           1 ( 1%)          -  0.9% ±  0.2%
  instructions        568M  ±  269       568M  …  568M           2 ( 3%)          +  0.0% ±  0.0%
  cache_references    265K  ± 5.08K      262K  …  303K           3 ( 4%)          -  0.0% ±  0.6%
  cache_misses        233K  ± 5.87K      210K  …  239K           6 ( 9%)          -  0.0% ±  0.9%
  branch_misses      2.88M  ± 4.12K     2.87M  … 2.89M           0 ( 0%)          -  0.7% ±  0.0%

@folkertdev
Collaborator

Well, the change is not statistically significant (the lightning ⚡ emoji marks statistically significant results, and there is none here), so on its own this doesn't really do anything.

When combined with the other changes, is the performance gap smaller?

@brian-pane
Author

With the additional changes in b2c6494, I'm getting an improvement in CPU cycles.

Benchmark 1 (67 runs): ./blogpost-compress-baseline-native 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          75.2ms ± 1.88ms    72.6ms … 87.8ms          1 ( 1%)        0%
  peak_rss           26.6MB ± 89.1KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          280M  ± 1.05M      278M  …  284M           4 ( 6%)        0%
  instructions        568M  ±  391       568M  …  568M           1 ( 1%)        0%
  cache_references    270K  ± 8.40K      264K  …  302K           9 (13%)        0%
  cache_misses        237K  ± 6.24K      219K  …  246K           9 (13%)        0%
  branch_misses      2.90M  ± 3.55K     2.89M  … 2.91M           0 ( 0%)        0%
Benchmark 2 (68 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          74.0ms ± 1.18ms    71.8ms … 76.9ms          0 ( 0%)          -  1.7% ±  0.7%
  peak_rss           26.6MB ± 100.0KB   26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          276M  ± 1.57M      274M  …  286M           6 ( 9%)        ⚡-  1.4% ±  0.2%
  instructions        568M  ±  233       568M  …  568M           0 ( 0%)          +  0.0% ±  0.0%
  cache_references    269K  ± 7.41K      263K  …  300K           9 (13%)          -  0.3% ±  1.0%
  cache_misses        237K  ± 5.89K      219K  …  243K           9 (13%)          -  0.1% ±  0.9%
  branch_misses      2.88M  ± 5.22K     2.86M  … 2.89M           0 ( 0%)          -  0.8% ±  0.1%

(Tested on aarch64-apple-darwin)
@folkertdev (Collaborator) left a comment

I'm getting some mixed results, with a small regression at level 3

Benchmark 2 (33 runs): target/release/examples/blogpost-compress 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           154ms ± 2.95ms     152ms …  168ms          1 ( 3%)        💩+  2.1% ±  0.8%
  peak_rss           24.7MB ± 54.4KB    24.6MB … 24.8MB          7 (21%)          +  0.1% ±  0.1%
  cpu_cycles          639M  ± 11.5M      631M  …  695M           3 ( 9%)        💩+  1.9% ±  0.8%
  instructions       1.56G  ±  360      1.56G  … 1.56G           2 ( 6%)          +  0.5% ±  0.0%
  cache_references   43.8M  ±  467K     43.1M  … 44.9M           1 ( 3%)          +  0.1% ±  0.6%
  cache_misses       1.14M  ±  290K      837K  … 1.99M           0 ( 0%)          + 13.7% ± 14.5%
  branch_misses      7.79M  ± 6.15K     7.78M  … 7.81M           3 ( 9%)          -  0.0% ±  0.0%

I think the best path forward is to make the changes to the types that we want to make (use `u16` instead of `usize`, compute values instead of loading them), add whatever padding is required, and then, once the remaining fields have the desired types, try to remove the padding.

Any progress this PR might make could just be negated by later changes when field sizes change.
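For illustration, the padding approach could look roughly like this (a minimal sketch with made-up field names and sizes, assuming a 64-bit target; this is not the actual zlib-rs `State` struct):

```rust
use core::mem::offset_of; // stable since Rust 1.77

// Hypothetical sketch: hot fields grouped at the front, with explicit
// padding so that a zero-sized marker lands exactly at byte 64.
// Field names and sizes are illustrative only; offsets assume a
// 64-bit target (usize = 8 bytes).
#[repr(C)]
struct State {
    status: u64,            // offset 0
    strstart: usize,        // offset 8
    match_start: u16,       // offset 16, narrowed from usize
    lookahead: u16,         // offset 18
    _pad0: [u8; 44],        // fills the rest of cache line 0
    _cache_line_1: [u8; 0], // zero-sized marker, must sit at offset 64
    level: i32,             // colder fields follow
}

// Compile-time check that the marker is where we expect it.
const _: () = assert!(offset_of!(State, _cache_line_1) == 64);

fn main() {
    println!("offset of _cache_line_1: {}", offset_of!(State, _cache_line_1));
}
```

Once every remaining field has its final type, `_pad0` can be shrunk or removed while the `const` assertion keeps the marker honest.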

Comment on lines +4252 to 4270
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
#[test]
fn state_layout() {
    use memoffset::offset_of;

    // Empirically, deflate performance depends on the layout of fields within
    // the State structure. If you change the order of fields and this test
    // starts failing, the recommended action is to run some benchmark tests.
    // If there is no surprise performance regression with the new layout,
    // please move the _cache_line_N markers to the corresponding locations
    // in the struct to make this test pass once again.
    assert_eq!(offset_of!(State, status), 0);
    assert_eq!(offset_of!(State, strstart), 8);
    assert_eq!(offset_of!(State, _cache_line_1), 64);
    assert_eq!(offset_of!(State, _cache_line_2), 128);
    assert_eq!(offset_of!(State, _cache_line_3), 192);
    assert_eq!(offset_of!(State, _cache_line_4), 256);
}
Collaborator
This could instead be a bunch of compile-time assertions:

#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
mod _cache_lines {
    use super::State;
    use core::mem::offset_of;

    // Empirically, deflate performance depends on the layout of fields within
    // the State structure. If you change the order of fields and this test starts
    // failing, the recommended action is to run some benchmark tests. If there
    // is no surprise performance regression with the new layout, please move
    // the _cache_line_N markers to the corresponding right locations in the
    // struct to make this test pass once again.
    const _: () = assert!(offset_of!(State, status) == 0);
    const _: () = assert!(offset_of!(State, strstart) == 8);
    const _: () = assert!(offset_of!(State, _cache_line_1) == 64);
    const _: () = assert!(offset_of!(State, _cache_line_2) == 128);
    const _: () = assert!(offset_of!(State, _cache_line_3) == 192);
    const _: () = assert!(offset_of!(State, _cache_line_4) == 256);
}

Also, with this approach no external dependency is needed (I think right now that might cause issues with our MSRV, because `offset_of` is kind of new, but we could consider bumping it from 1.75 to 1.77 to work around needing that dependency).

I think it would be a good idea to split these cache line markers into their own PR, so they are easy to disentangle from any changes that influence performance.

Maybe try whether CI passes with `core::mem::offset_of`; otherwise, using the external crate is fine for now.

@brian-pane
Author

Thanks for reviewing! I'll work on separate PRs for the cache line checks, for replacing `usize` with `u16`/computed values, and then for rearranging the fields so that fields that are often used together share a cache line.
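As a sketch of the "compute instead of load" idea (field names borrowed from zlib's window parameters for illustration; this is not the actual zlib-rs code), a derived value such as the window mask can be recomputed from `w_bits` on demand rather than stored as a separate field, shrinking the struct and freeing cache-line space:

```rust
// Hypothetical sketch: derive w_size and w_mask from w_bits instead of
// storing them as fields. Recomputing a shift and a subtraction is cheap
// compared to a potential cache miss on a wider struct.
struct Window {
    w_bits: u8, // e.g. 15 for a 32 KiB window
}

impl Window {
    fn w_size(&self) -> usize {
        1 << self.w_bits
    }

    fn w_mask(&self) -> usize {
        self.w_size() - 1
    }
}

fn main() {
    let w = Window { w_bits: 15 };
    assert_eq!(w.w_size(), 32768);
    assert_eq!(w.w_mask(), 0x7fff);
    println!("w_size = {}, w_mask = {:#x}", w.w_size(), w.w_mask());
}
```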
