
[WIP] Test out removing lots of #[inline] from hashbrown #64846

Closed

Conversation

alexcrichton
Member

Some [recent analysis](https://rust-lang.zulipchat.com/#narrow/stream/187831-t-compiler.2Fwg-self-profile/topic/dumping.20sources.20of.20LLVM.20slowness/near/176740749)
shows that quite a lot of code is being instantiated from `hashbrown`
for rustc in particular, so this is an attempt to measure the impact of
removing as much `#[inline]` as we can from the `HashMap`
implementation. The steps taken to make this PR were:

  • Start with hashbrown's master branch
  • Run cargo bench
  • Remove all #[inline] annotations in the crate
  • Run cargo bench
  • Use results of cargo bench as well as profiling data to guide
    re-insertion of #[inline] on some functions. Repeat until there's
    rough parity with master branch's benchmarks

That's where we're at, so I'm curious now to run this through rust's
perf suite and see what the effect is, both when compiling hashmap-heavy
crates like Cargo as well as for rustc's own runtime which is a giant
hashmap benchmark.
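
The selective re-insertion described above can be sketched as follows. This is a hedged illustration using a hypothetical `Table` type, not hashbrown's actual code: `#[inline]` is kept only on small, hot entry points that benchmarks showed regressing, while larger, colder functions drop the annotation and leave the decision to LLVM.

```rust
// Hypothetical sketch, not hashbrown's real API.
pub struct Table {
    len: usize,
    cap: usize,
}

impl Table {
    pub fn new() -> Table {
        Table { len: 0, cap: 0 }
    }

    // Kept `#[inline]`: a tiny accessor called on every map operation,
    // where call overhead would dominate the body.
    #[inline]
    pub fn len(&self) -> usize {
        self.len
    }

    // `#[inline]` removed: the resize path is comparatively large and
    // cold, so forcing it into every caller mostly costs compile time.
    pub fn reserve(&mut self, additional: usize) {
        let needed = self.len + additional;
        if needed > self.cap {
            self.cap = needed.next_power_of_two();
        }
    }
}
```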

@rust-highfive
Collaborator

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 27, 2019
@alexcrichton
Member Author

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@alexcrichton
Member Author

r? @ghost

@bors
Contributor

bors commented Sep 27, 2019

⌛ Trying commit 7ce3096f0d38c102603fb84e24cfbecc93f9004e with merge 050928df2832b1d1304b9aa0e435fd60738583b7...

@bors
Contributor

bors commented Sep 27, 2019

☀️ Try build successful - checks-azure
Build commit: 050928df2832b1d1304b9aa0e435fd60738583b7

@rust-timer
Collaborator

Queued 050928df2832b1d1304b9aa0e435fd60738583b7 with parent 590ae0e, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit 050928df2832b1d1304b9aa0e435fd60738583b7, comparison URL.

@alexcrichton
Member Author

OK, the results here are pretty interesting. As expected, when you don't `#[inline]` everything the effect is twofold:

  • rustc itself is slightly slower because we're no longer effectively forcing everything to be inlined. The slowdown is worst in clean incremental builds (which are huge hashmap benchmarks) and hovers around 5%.
  • The compile time of some crates actually goes down, presumably because there's less to inline. For example, the original performance measurements showed a 20% slowdown when compiling Cargo and a 10% slowdown when compiling hyper (all in optimized mode). The cargo-opt benchmark appears to have been removed, but hyper-opt and regex-opt are 10% faster after this PR.

This shows me that it's worthwhile to pursue this, although I think a few more methods need `#[inline]` to mitigate the performance impact on the compiler.

@alexcrichton
Member Author

Ok I've done some profile comparisons locally and used that to guide the insertion of #[inline] in a few locations. I'm now curious to see the impact on perf numbers:

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Sep 30, 2019

⌛ Trying commit bd911ef with merge bbd2bf26b7d64cc96426b669126c4cb9f5330ade...

@bors
Contributor

bors commented Sep 30, 2019

☀️ Try build successful - checks-azure
Build commit: bbd2bf26b7d64cc96426b669126c4cb9f5330ade

@rust-timer
Collaborator

Queued bbd2bf26b7d64cc96426b669126c4cb9f5330ade with parent 22bc9e1, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit bbd2bf26b7d64cc96426b669126c4cb9f5330ade, comparison URL.

alexcrichton added a commit to alexcrichton/hashbrown that referenced this pull request Oct 1, 2019
This commit goes through and deletes almost all `#[inline]` annotations
in this crate. It looks like before this commit basically every single
function is `#[inline]`, but this is generally not necessary for
performance and can have a severe impact on compile times in both debug
and release modes, most severely in release mode.

Some `#[inline]` annotations are definitely necessary, however. Most
functions in this crate are already candidates for inlining because
they're generic, but functions like `Group` and `BitMask` aren't
candidates for inlining without `#[inline]`. Additionally LLVM is by no
means perfect, so some `#[inline]` may still be necessary to get some
further speedups.
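
The distinction above (generic functions are already inlining candidates, while concrete functions like those in the `Group` and `BitMask` implementations are not) can be sketched like this; the function names are illustrative, not hashbrown's actual API:

```rust
// Generic: monomorphized in the *calling* crate, so LLVM sees the body
// there and can inline it even without `#[inline]`.
pub fn eq_slot<T: Eq>(slot: &T, key: &T) -> bool {
    slot == key
}

// Concrete: compiled once into this crate. Without `#[inline]` (and
// without LTO), downstream crates see only an opaque symbol, so the
// annotation is what makes cross-crate inlining possible here.
#[inline]
pub fn match_byte(group: u64, byte: u8) -> bool {
    group.to_le_bytes().contains(&byte)
}
```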

The procedure used to generate this commit looked like:

* Remove all `#[inline]` annotations.
* Run `cargo bench`, comparing against the `master` branch, and add
  `#[inline]` to hot spots as necessary.
* A [PR] was made against rust-lang/rust to [evaluate the impact][run1]
  on the compiler for more performance data.
* Using this data, `perf diff` was used locally to determine further hot
  spots and more `#[inline]` annotations were added.
* A [second round of benchmarking][run2] was done

The numbers are at the point where I think this should land in the
crate and be published so it can move into the standard library. There
are up to 20% wins in compile time for hashmap-heavy crates (like
Cargo) and milder wins (up to 10%) for a number of other large crates.
The regressions are all in the 1-3% range and are largely on benchmarks
taking only a handful of milliseconds anyway, which I'd personally say
is a worthwhile tradeoff.

For comparison, the benchmarks of this crate before and after this
commit look like so:

```
   name                         baseline ns/iter  new ns/iter  diff ns/iter   diff %  speedup
   insert_ahash_highbits        7,137             9,044               1,907   26.72%   x 0.79
   insert_ahash_random          7,575             9,789               2,214   29.23%   x 0.77
   insert_ahash_serial          9,833             9,476                -357   -3.63%   x 1.04
   insert_erase_ahash_highbits  15,824            19,164              3,340   21.11%   x 0.83
   insert_erase_ahash_random    16,933            20,353              3,420   20.20%   x 0.83
   insert_erase_ahash_serial    20,857            27,675              6,818   32.69%   x 0.75
   insert_erase_std_highbits    35,117            38,385              3,268    9.31%   x 0.91
   insert_erase_std_random      35,357            37,236              1,879    5.31%   x 0.95
   insert_erase_std_serial      30,617            34,136              3,519   11.49%   x 0.90
   insert_std_highbits          15,675            18,180              2,505   15.98%   x 0.86
   insert_std_random            16,566            17,803              1,237    7.47%   x 0.93
   insert_std_serial            14,612            16,025              1,413    9.67%   x 0.91
   iter_ahash_highbits          1,715             1,640                 -75   -4.37%   x 1.05
   iter_ahash_random            1,721             1,634                 -87   -5.06%   x 1.05
   iter_ahash_serial            1,723             1,636                 -87   -5.05%   x 1.05
   iter_std_highbits            1,715             1,634                 -81   -4.72%   x 1.05
   iter_std_random              1,715             1,637                 -78   -4.55%   x 1.05
   iter_std_serial              1,722             1,637                 -85   -4.94%   x 1.05
   lookup_ahash_highbits        4,565             5,809               1,244   27.25%   x 0.79
   lookup_ahash_random          4,632             4,047                -585  -12.63%   x 1.14
   lookup_ahash_serial          4,612             4,906                 294    6.37%   x 0.94
   lookup_fail_ahash_highbits   4,206             3,976                -230   -5.47%   x 1.06
   lookup_fail_ahash_random     4,327             4,211                -116   -2.68%   x 1.03
   lookup_fail_ahash_serial     8,999             4,386              -4,613  -51.26%   x 2.05
   lookup_fail_std_highbits     13,284            13,342                 58    0.44%   x 1.00
   lookup_fail_std_random       13,172            13,614                442    3.36%   x 0.97
   lookup_fail_std_serial       11,240            11,539                299    2.66%   x 0.97
   lookup_std_highbits          13,075            13,333                258    1.97%   x 0.98
   lookup_std_random            13,257            13,193                -64   -0.48%   x 1.00
   lookup_std_serial            10,782            10,917                135    1.25%   x 0.99
```

My read of this is that the microbenchmarks are all over the place,
neither consistently regressing nor improving, as expected. In general
I would be surprised if there's a significant performance regression
attributable to this commit, and `#[inline]` can always be selectively
added back where needed without applying it to every function in the
crate.

[PR]: rust-lang/rust#64846
[run1]: rust-lang/rust#64846 (comment)
[run2]: rust-lang/rust#64846 (comment)
@alexcrichton
Member Author

I've taken this data to create rust-lang/hashbrown#119

@alexcrichton alexcrichton deleted the less-inline-hashbrown branch October 1, 2019 13:04
@alexcrichton
Member Author

cc @nnethercote, you might be interested in this and the companion PR to hashbrown

@nnethercote
Contributor

This is an interesting one. There are big wins on a few workloads and small losses on many, so it's tricky to evaluate. It's probably best to consider opt, debug and check builds separately. I will exclude the clean incremental numbers because they are less interesting.

  • opt builds have the best results. About a third have big wins, up to 18%. About a third have small losses, though mostly less than 1%.
  • debug builds have a few wins (best is 5%), and lots of <1% losses.
  • check builds have a few wins (best is 5%), and lots of 1-3% losses.

A good thing about this is that opt and debug builds (but especially opt) are generally harder to improve than check builds. A bad thing is that check builds represent the front-end, and slowing down the front-end hurts pipelining.

Overall I'm leaning towards accepting this, at least from a rustc POV. It sure would be nice to have the tool proposed in rust-lang/measureme#51.

@alexcrichton
Member Author

Excellent points! (And they inspire me to open an issue.) This is indeed targeted directly at opt builds. For debug builds `#[inline]` is effectively ignored, and all rustc operations are likely to regress slightly since everything is no longer forcibly inlined and different optimization decisions get made.

alexcrichton added a commit to alexcrichton/hashbrown that referenced this pull request Oct 9, 2019
bors added a commit to rust-lang/hashbrown that referenced this pull request Oct 15, 2019
Remove most `#[inline]` annotations
