
Rearrange fields in struct State for cache locality #277

Closed

Conversation

brian-pane

On my test system, this yields a small improvement in CPU cycles:

Benchmark 1 (68 runs): ./blogpost-compress-baseline-native 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.8ms ± 1.80ms    72.0ms … 86.1ms          1 ( 1%)        0%
  peak_rss           26.6MB ± 84.6KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          279M  ± 1.11M      278M  …  287M           2 ( 3%)        0%
  instructions        568M  ±  234       568M  …  568M           1 ( 1%)        0%
  cache_references    265K  ± 4.64K      262K  …  298K           7 (10%)        0%
  cache_misses        233K  ± 6.59K      207K  …  244K           6 ( 9%)        0%
  branch_misses      2.90M  ± 4.04K     2.89M  … 2.91M           1 ( 1%)        0%
Benchmark 2 (69 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.0ms ±  884us    71.4ms … 77.8ms          6 ( 9%)          -  1.0% ±  0.6%
  peak_rss           26.6MB ± 97.7KB    26.5MB … 26.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          277M  ± 1.76M      276M  …  290M           1 ( 1%)          -  0.9% ±  0.2%
  instructions        568M  ±  269       568M  …  568M           2 ( 3%)          +  0.0% ±  0.0%
  cache_references    265K  ± 5.08K      262K  …  303K           3 ( 4%)          -  0.0% ±  0.6%
  cache_misses        233K  ± 5.87K      210K  …  239K           6 ( 9%)          -  0.0% ±  0.9%
  branch_misses      2.88M  ± 4.12K     2.87M  … 2.89M           0 ( 0%)          -  0.7% ±  0.0%

@folkertdev
Collaborator

Well, the change is not statistically significant (the lightning ⚡ emoji marks statistically significant results, and there is none here), so on its own this doesn't really do anything.

When combined with the other changes, is the performance gap smaller?

@brian-pane
Author

With the additional changes in b2c6494, I'm getting an improvement in CPU cycles.

Benchmark 1 (67 runs): ./blogpost-compress-baseline-native 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          75.2ms ± 1.88ms    72.6ms … 87.8ms          1 ( 1%)        0%
  peak_rss           26.6MB ± 89.1KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          280M  ± 1.05M      278M  …  284M           4 ( 6%)        0%
  instructions        568M  ±  391       568M  …  568M           1 ( 1%)        0%
  cache_references    270K  ± 8.40K      264K  …  302K           9 (13%)        0%
  cache_misses        237K  ± 6.24K      219K  …  246K           9 (13%)        0%
  branch_misses      2.90M  ± 3.55K     2.89M  … 2.91M           0 ( 0%)        0%
Benchmark 2 (68 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          74.0ms ± 1.18ms    71.8ms … 76.9ms          0 ( 0%)          -  1.7% ±  0.7%
  peak_rss           26.6MB ± 100.0KB   26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          276M  ± 1.57M      274M  …  286M           6 ( 9%)        ⚡-  1.4% ±  0.2%
  instructions        568M  ±  233       568M  …  568M           0 ( 0%)          +  0.0% ±  0.0%
  cache_references    269K  ± 7.41K      263K  …  300K           9 (13%)          -  0.3% ±  1.0%
  cache_misses        237K  ± 5.89K      219K  …  243K           9 (13%)          -  0.1% ±  0.9%
  branch_misses      2.88M  ± 5.22K     2.86M  … 2.89M           0 ( 0%)          -  0.8% ±  0.1%

(Tested on aarch64-apple-darwin)
@folkertdev (Collaborator) left a comment

I'm getting some mixed results, with a small regression at level 3

Benchmark 2 (33 runs): target/release/examples/blogpost-compress 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           154ms ± 2.95ms     152ms …  168ms          1 ( 3%)        💩+  2.1% ±  0.8%
  peak_rss           24.7MB ± 54.4KB    24.6MB … 24.8MB          7 (21%)          +  0.1% ±  0.1%
  cpu_cycles          639M  ± 11.5M      631M  …  695M           3 ( 9%)        💩+  1.9% ±  0.8%
  instructions       1.56G  ±  360      1.56G  … 1.56G           2 ( 6%)          +  0.5% ±  0.0%
  cache_references   43.8M  ±  467K     43.1M  … 44.9M           1 ( 3%)          +  0.1% ±  0.6%
  cache_misses       1.14M  ±  290K      837K  … 1.99M           0 ( 0%)          + 13.7% ± 14.5%
  branch_misses      7.79M  ± 6.15K     7.78M  … 7.81M           3 ( 9%)          -  0.0% ±  0.0%

I think the best path forward is to make the changes to the types that we want to make (use `u16` instead of `usize`, compute values instead of loading them), add whatever padding is required, and then, once the remaining fields have the desired types, try to remove the padding.

Any progress this PR might make could just be negated by later changes when field sizes change.
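For illustration, the padding approach could look roughly like this (a minimal sketch with made-up field names and sizes, assuming a 64-bit target; this is not the actual zlib-rs `State` struct):

```rust
use core::mem::offset_of; // stable since Rust 1.77

// Hypothetical sketch: hot fields grouped at the front, with explicit
// padding so that a zero-sized marker lands exactly at byte 64.
// Field names and sizes are illustrative only; offsets assume a
// 64-bit target (usize = 8 bytes).
#[repr(C)]
struct State {
    status: u64,            // offset 0
    strstart: usize,        // offset 8
    match_start: u16,       // offset 16, narrowed from usize
    lookahead: u16,         // offset 18
    _pad0: [u8; 44],        // fills the rest of cache line 0
    _cache_line_1: [u8; 0], // zero-sized marker, must sit at offset 64
    level: i32,             // colder fields follow
}

// Compile-time check that the marker is where we expect it.
const _: () = assert!(offset_of!(State, _cache_line_1) == 64);

fn main() {
    println!("offset of _cache_line_1: {}", offset_of!(State, _cache_line_1));
}
```

Once every remaining field has its final type, `_pad0` can be shrunk or removed while the `const` assertion keeps the marker honest.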

Comment on lines +4252 to 4270
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
#[test]
fn state_layout() {
    use memoffset::offset_of;

    // Empirically, deflate performance depends on the layout of fields within
    // the State structure. If you change the order of fields and this test
    // starts failing, the recommended action is to run some benchmark tests.
    // If there is no surprise performance regression with the new layout,
    // please move the _cache_line_N markers to the corresponding locations
    // in the struct to make this test pass once again.
    assert_eq!(offset_of!(State, status), 0);
    assert_eq!(offset_of!(State, strstart), 8);
    assert_eq!(offset_of!(State, _cache_line_1), 64);
    assert_eq!(offset_of!(State, _cache_line_2), 128);
    assert_eq!(offset_of!(State, _cache_line_3), 192);
    assert_eq!(offset_of!(State, _cache_line_4), 256);
}
Collaborator
This could instead be a bunch of compile-time assertions:

#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
mod _cache_lines {
    use super::State;
    use core::mem::offset_of;

    // Empirically, deflate performance depends on the layout of fields within
    // the State structure. If you change the order of fields and this test starts
    // failing, the recommended action is to run some benchmark tests. If there
    // is no surprise performance regression with the new layout, please move
    // the _cache_line_N markers to the corresponding right locations in the
    // struct to make this test pass once again.
    const _: () = assert!(offset_of!(State, status) == 0);
    const _: () = assert!(offset_of!(State, strstart) == 8);
    const _: () = assert!(offset_of!(State, _cache_line_1) == 64);
    const _: () = assert!(offset_of!(State, _cache_line_2) == 128);
    const _: () = assert!(offset_of!(State, _cache_line_3) == 192);
    const _: () = assert!(offset_of!(State, _cache_line_4) == 256);
}

Also, with this approach no external dependency is needed (I think right now that might cause issues with our MSRV, because `offset_of` is kind of new, but we could consider bumping it from 1.75 to 1.77 to work around needing that dependency).

I think it would be a good idea to split these cache line markers into their own PR, so they are easy to disentangle from any changes that influence performance.

Maybe try whether CI passes with `core::mem::offset_of`; otherwise, using the external crate is fine for now.

@brian-pane
Author

Thanks for reviewing! I'll work on separate PRs for the cache line checks, for replacing `usize` with `u16`/computed values, and then for rearranging the fields so that fields that are often used together share a cache line.
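As a sketch of the "compute instead of load" idea (field names borrowed from zlib's window parameters for illustration; this is not the actual zlib-rs code), a derived value such as the window mask can be recomputed from `w_bits` on demand rather than stored as a separate field, shrinking the struct and freeing cache-line space:

```rust
// Hypothetical sketch: derive w_size and w_mask from w_bits instead of
// storing them as fields. Recomputing a shift and a subtraction is cheap
// compared to a potential cache miss on a wider struct.
struct Window {
    w_bits: u8, // e.g. 15 for a 32 KiB window
}

impl Window {
    fn w_size(&self) -> usize {
        1 << self.w_bits
    }

    fn w_mask(&self) -> usize {
        self.w_size() - 1
    }
}

fn main() {
    let w = Window { w_bits: 15 };
    assert_eq!(w.w_size(), 32768);
    assert_eq!(w.w_mask(), 0x7fff);
    println!("w_size = {}, w_mask = {:#x}", w.w_size(), w.w_mask());
}
```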
