
[WIP] Test out removing lots of #[inline] from hashbrown #64846

Closed

Conversation

alexcrichton
Member

Some [recent analysis](https://rust-lang.zulipchat.com/#narrow/stream/187831-t-compiler.2Fwg-self-profile/topic/dumping.20sources.20of.20LLVM.20slowness/near/176740749)
shows that quite a lot of code is being instantiated from `hashbrown`
for rustc in particular, so this is an attempt to measure the impact of
removing as much `#[inline]` as we can from the `HashMap`
implementation. The steps taken to make this PR were:

  • Start with hashbrown's master branch
  • Run cargo bench
  • Remove all #[inline] annotations in the crate
  • Run cargo bench
  • Use results of cargo bench as well as profiling data to guide
    re-insertion of #[inline] on some functions. Repeat until there's
    rough parity with master branch's benchmarks

That's where we're at, so I'm curious now to run this through rust's
perf suite and see what the effect is, both when compiling hashmap-heavy
crates like Cargo as well as for rustc's own runtime which is a giant
hashmap benchmark.
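
The selective re-insertion described above can be sketched as follows. This is a hedged illustration using a hypothetical `Table` type, not hashbrown's actual code: `#[inline]` is kept only on small, hot entry points that benchmarks showed regressing, while larger, colder functions drop the annotation and leave the decision to LLVM.

```rust
// Hypothetical sketch, not hashbrown's real API.
pub struct Table {
    len: usize,
    cap: usize,
}

impl Table {
    pub fn new() -> Table {
        Table { len: 0, cap: 0 }
    }

    // Kept `#[inline]`: a tiny accessor called on every map operation,
    // where call overhead would dominate the body.
    #[inline]
    pub fn len(&self) -> usize {
        self.len
    }

    // `#[inline]` removed: the resize path is comparatively large and
    // cold, so forcing it into every caller mostly costs compile time.
    pub fn reserve(&mut self, additional: usize) {
        let needed = self.len + additional;
        if needed > self.cap {
            self.cap = needed.next_power_of_two();
        }
    }
}
```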

@rust-highfive
Collaborator

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 27, 2019
@alexcrichton
Member Author

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@alexcrichton
Member Author

r? @ghost

@bors
Contributor

bors commented Sep 27, 2019

⌛ Trying commit 7ce3096f0d38c102603fb84e24cfbecc93f9004e with merge 050928df2832b1d1304b9aa0e435fd60738583b7...

@bors
Contributor

bors commented Sep 27, 2019

☀️ Try build successful - checks-azure
Build commit: 050928df2832b1d1304b9aa0e435fd60738583b7

@rust-timer
Collaborator

Queued 050928df2832b1d1304b9aa0e435fd60738583b7 with parent 590ae0e, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit 050928df2832b1d1304b9aa0e435fd60738583b7, comparison URL.

@alexcrichton
Member Author

OK, the results here are pretty interesting. As expected, when you don't `#[inline]` everything the effect is twofold:

  • rustc itself is slightly slower because we're no longer effectively forcing everything to be inlined. The slowdown is worst in clean incremental builds (which are huge hashmap benchmarks) and hovers around 5%.
  • The compile time of some crates actually goes down, presumably because there's less to inline. For example, the original performance measurements showed a 20% slowdown when compiling Cargo and a 10% slowdown when compiling hyper (all in optimized mode). The cargo-opt benchmark appears to have been removed, but hyper-opt and regex-opt are 10% faster after this PR.

This shows me that it's worthwhile to pursue this, although I think a few more methods need `#[inline]` to mitigate the performance impact on the compiler.

@alexcrichton
Member Author

Ok I've done some profile comparisons locally and used that to guide the insertion of #[inline] in a few locations. I'm now curious to see the impact on perf numbers:

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Sep 30, 2019

⌛ Trying commit bd911ef with merge bbd2bf26b7d64cc96426b669126c4cb9f5330ade...

@bors
Contributor

bors commented Sep 30, 2019

☀️ Try build successful - checks-azure
Build commit: bbd2bf26b7d64cc96426b669126c4cb9f5330ade

@rust-timer
Collaborator

Queued bbd2bf26b7d64cc96426b669126c4cb9f5330ade with parent 22bc9e1, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit bbd2bf26b7d64cc96426b669126c4cb9f5330ade, comparison URL.

alexcrichton added a commit to alexcrichton/hashbrown that referenced this pull request Oct 1, 2019
This commit goes through and deletes almost all `#[inline]` annotations
in this crate. It looks like before this commit basically every single
function is `#[inline]`, but this is generally not necessary for
performance and can have a severe impact on compile times in both debug
and release modes, most severely in release mode.

Some `#[inline]` annotations are definitely necessary, however. Most
functions in this crate are already candidates for inlining because
they're generic, but functions like `Group` and `BitMask` aren't
candidates for inlining without `#[inline]`. Additionally LLVM is by no
means perfect, so some `#[inline]` may still be necessary to get some
further speedups.
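
The distinction above (generic functions are already inlining candidates, while concrete functions like those in the `Group` and `BitMask` implementations are not) can be sketched like this; the function names are illustrative, not hashbrown's actual API:

```rust
// Generic: monomorphized in the *calling* crate, so LLVM sees the body
// there and can inline it even without `#[inline]`.
pub fn eq_slot<T: Eq>(slot: &T, key: &T) -> bool {
    slot == key
}

// Concrete: compiled once into this crate. Without `#[inline]` (and
// without LTO), downstream crates see only an opaque symbol, so the
// annotation is what makes cross-crate inlining possible here.
#[inline]
pub fn match_byte(group: u64, byte: u8) -> bool {
    group.to_le_bytes().contains(&byte)
}
```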

The procedure used to generate this commit looked like:

* Remove all `#[inline]` annotations.
* Run `cargo bench`, comparing against the `master` branch, and add
  `#[inline]` to hot spots as necessary.
* A [PR] was made against rust-lang/rust to [evaluate the impact][run1]
  on the compiler for more performance data.
* Using this data, `perf diff` was used locally to determine further hot
  spots and more `#[inline]` annotations were added.
* A [second round of benchmarking][run2] was done

The numbers are at the point where I think this should land in the
crate and be published so it can move into the standard library. There
are up to 20% wins in compile time for hashmap-heavy crates (like
Cargo) and milder wins (up to 10%) for a number of other large crates.
The regressions are all in the 1-3% range and are largely on benchmarks
taking only a handful of milliseconds anyway, which I'd personally say
is a worthwhile tradeoff.

For comparison, the benchmarks of this crate before and after this
commit look like so:

```
   name                         baseline ns/iter  new ns/iter  diff ns/iter   diff %  speedup
   insert_ahash_highbits        7,137             9,044               1,907   26.72%   x 0.79
   insert_ahash_random          7,575             9,789               2,214   29.23%   x 0.77
   insert_ahash_serial          9,833             9,476                -357   -3.63%   x 1.04
   insert_erase_ahash_highbits  15,824            19,164              3,340   21.11%   x 0.83
   insert_erase_ahash_random    16,933            20,353              3,420   20.20%   x 0.83
   insert_erase_ahash_serial    20,857            27,675              6,818   32.69%   x 0.75
   insert_erase_std_highbits    35,117            38,385              3,268    9.31%   x 0.91
   insert_erase_std_random      35,357            37,236              1,879    5.31%   x 0.95
   insert_erase_std_serial      30,617            34,136              3,519   11.49%   x 0.90
   insert_std_highbits          15,675            18,180              2,505   15.98%   x 0.86
   insert_std_random            16,566            17,803              1,237    7.47%   x 0.93
   insert_std_serial            14,612            16,025              1,413    9.67%   x 0.91
   iter_ahash_highbits          1,715             1,640                 -75   -4.37%   x 1.05
   iter_ahash_random            1,721             1,634                 -87   -5.06%   x 1.05
   iter_ahash_serial            1,723             1,636                 -87   -5.05%   x 1.05
   iter_std_highbits            1,715             1,634                 -81   -4.72%   x 1.05
   iter_std_random              1,715             1,637                 -78   -4.55%   x 1.05
   iter_std_serial              1,722             1,637                 -85   -4.94%   x 1.05
   lookup_ahash_highbits        4,565             5,809               1,244   27.25%   x 0.79
   lookup_ahash_random          4,632             4,047                -585  -12.63%   x 1.14
   lookup_ahash_serial          4,612             4,906                 294    6.37%   x 0.94
   lookup_fail_ahash_highbits   4,206             3,976                -230   -5.47%   x 1.06
   lookup_fail_ahash_random     4,327             4,211                -116   -2.68%   x 1.03
   lookup_fail_ahash_serial     8,999             4,386              -4,613  -51.26%   x 2.05
   lookup_fail_std_highbits     13,284            13,342                 58    0.44%   x 1.00
   lookup_fail_std_random       13,172            13,614                442    3.36%   x 0.97
   lookup_fail_std_serial       11,240            11,539                299    2.66%   x 0.97
   lookup_std_highbits          13,075            13,333                258    1.97%   x 0.98
   lookup_std_random            13,257            13,193                -64   -0.48%   x 1.00
   lookup_std_serial            10,782            10,917                135    1.25%   x 0.99
```

My read of this is that the microbenchmarks are all over the place,
neither consistently regressing nor improving, as expected. In general
I would be surprised if there's a significant performance regression
attributable to this commit, and `#[inline]` can always be selectively
added back where needed without applying it to every function in the
crate.

[PR]: rust-lang/rust#64846
[run1]: rust-lang/rust#64846 (comment)
[run2]: rust-lang/rust#64846 (comment)
@alexcrichton
Member Author

I've taken this data to create rust-lang/hashbrown#119

@alexcrichton alexcrichton deleted the less-inline-hashbrown branch October 1, 2019 13:04
@alexcrichton
Member Author

cc @nnethercote, you might be interested in this and the companion PR to hashbrown

@nnethercote
Contributor

This is an interesting one. There are big wins on a few workloads and small losses on many, so it's tricky to evaluate. It's probably best to consider opt, debug and check builds separately. I will exclude the clean incremental numbers because they are less interesting.

  • opt builds have the best results. About a third have big wins, up to 18%. About a third have small losses, though mostly less than 1%.
  • debug builds have a few wins (best is 5%), and lots of <1% losses.
  • check builds have a few wins (best is 5%), and lots of 1-3% losses.

A good thing about this is that opt and debug builds (but especially opt) are generally harder to improve than check builds. A bad thing is that check builds represent the front-end, and slowing down the front-end hurts pipelining.

Overall I'm leaning towards accepting this, at least from a rustc POV. It sure would be nice to have the tool proposed in rust-lang/measureme#51.

@alexcrichton
Member Author

Excellent points! (And they inspire me to open an issue.) This is indeed targeted directly at opt builds. For debug builds `#[inline]` is effectively ignored, and all rustc operations are likely to regress slightly since everything is no longer forcibly inlined and different optimization decisions get made.

alexcrichton added a commit to alexcrichton/hashbrown that referenced this pull request Oct 9, 2019
bors added a commit to rust-lang/hashbrown that referenced this pull request Oct 15, 2019
Remove most `#[inline]` annotations
