
ThinLTO + O2 build time regression in rust-doom #45188

Closed · alexcrichton opened this issue Oct 10, 2017 · 7 comments
Labels: A-codegen (Area: Code generation)

@alexcrichton (Member)

Creating an issue from https://internals.rust-lang.org/t/help-test-out-thinlto/6017/3 so it can be investigated further.

cc @cristicbz

@michaelwoerister (Member)

We should definitely re-investigate this with different settings for -Z inline-in-all-cgus, since I don't think that flag existed yet when the initial benchmarks were done.
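For reference, a sketch of how such a comparison might be run (my invocation, not from this thread; both flags are nightly-only, and 16 codegen units is just an example value):

```sh
# Build twice from scratch, toggling only -Z inline-in-all-cgus.
cargo clean
time RUSTFLAGS="-C codegen-units=16 -Z inline-in-all-cgus=yes" cargo build --release
cargo clean
time RUSTFLAGS="-C codegen-units=16 -Z inline-in-all-cgus=no" cargo build --release
```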

@alexcrichton (Member, Author)

Ok, here's what I did:

Results:

Build entire project from scratch

Here I just ran cargo build from a completely clean target directory and timed how long it took (all times below are in seconds).
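The measurement amounts to something like the following (a sketch; the exact script isn't reproduced in this thread, and the ThinLTO rows add the corresponding nightly flag):

```sh
# From-scratch build time for each codegen-unit count.
for cgus in 1 2 4 8 16 32 64; do
    cargo clean
    time RUSTFLAGS="-C codegen-units=$cgus" cargo build --release
done
```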

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| debug #1 | 79.414 | 68.783 | 64.883 | 65.293 | 64.567 | 64.743 | 66.243 |
| debug #2 | 80.465 | 70.181 | 65.977 | 64.997 | 65.252 | 65.869 | 66.912 |
| release thinlto=no #1 | 175.212 | 131.287 | 117.496 | 114.854 | 114.018 | 114.566 | 114.391 |
| release thinlto=no #2 | 171.829 | 131.514 | 119.942 | 115.897 | 112.881 | 113.379 | 114.711 |
| release thinlto=yes #1 | 172.188 | 174.210 | 161.224 | 159.588 | 154.327 | 154.382 | 161.319 |
| release thinlto=yes #2 | 171.179 | 172.778 | 160.880 | 156.812 | 157.029 | 156.233 | 156.742 |

My conclusion from this is that on my machine I get clear wins across the board from using more CGUs. The sweet spot appears to be around 16 CGUs, and I don't see any clear or obvious regressions in this area from using multiple CGUs.

Build project after changing the math crate

Here I changed the math crate to force a recompilation of that crate and everything that depends on it. All other dependencies were cached.
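Per configuration this looks roughly like the following (a sketch; I'm assuming the math crate lives at math/ in the rust-doom workspace):

```sh
export RUSTFLAGS="-C codegen-units=16"   # example configuration
cargo build --release                    # warm the cache for this configuration
touch math/src/lib.rs                    # invalidate only the math crate
time cargo build --release               # rebuild math and its dependents
```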

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| debug #1 | 18.573 | 14.590 | 12.318 | 11.827 | 11.733 | 11.936 | 12.206 |
| debug #2 | 18.558 | 14.533 | 12.246 | 11.904 | 11.659 | 11.991 | 12.250 |
| release thinlto=no #1 | 59.602 | 35.658 | 23.906 | 22.403 | 21.920 | 21.862 | 21.932 |
| release thinlto=no #2 | 60.016 | 35.438 | 24.806 | 22.935 | 21.835 | 21.914 | 21.902 |
| release thinlto=yes #1 | 59.449 | 46.711 | 33.257 | 33.333 | 31.071 | 31.026 | 31.190 |
| release thinlto=yes #2 | 60.491 | 47.160 | 33.320 | 33.021 | 31.036 | 30.964 | 30.347 |

Here we see even more drastic wins (as expected), and again no clear regressions relative to one CGU.


@michaelwoerister I haven't done much data collection on inline-in-all-cgus in release mode, but @sfackler posted some worrisome numbers which indicate that -Z inline-in-all-cgus=no actually regresses release build time. Presumably it shifts work to the post-LTO stage, where it ends up costing more than it would have pre-LTO?

@alexcrichton (Member, Author)

In the thread about testing ThinLTO, the main conclusions I drew were:

  • Enabling multiple CGUs and ThinLTO requires the compiler to perform more work overall. In general there's more management, some synchronization, and especially the handling of #[inline] functions in release mode (see the sketch just after this list).
  • It's not a 100% win for all projects. Some got over 2x faster to compile; some got a few dozen percent slower.
  • I've personally found locally that the biggest improvements come when compiling incrementally, when there are fewer crates in play (more idle cores available).
  • I've personally never been able to reproduce any regressions on my machine; I've always seen wins across the board.
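To illustrate the #[inline] point with a hypothetical math-style helper (my example, not code from rust-doom):

```rust
// Because this function is marked #[inline], rustc makes its body available
// across codegen units and typically instantiates a copy in each CGU that
// calls it. In release mode the optimizer then processes each copy, which is
// part of the extra work that multiple CGUs introduce.
#[inline]
pub fn dot(a: [f32; 3], b: [f32; 3]) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}
```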

With that in mind I'm sort of tempted to close this and keep pushing hard on #45320, but I'm curious if others can reproduce my own results!

@cristicbz (Contributor)

I'm doing another run with the newest nightly, with more values for the number of CGUs and for inline-in-all-cgus. I'll report back for posterity's sake.

> I've personally never been able to reproduce any regressions on my machine; I've always seen wins across the board.

I want that engraved on my tombstone :P

@alexcrichton (Member, Author)

Fair point!

@cristicbz (Contributor)

So, using your script (thanks btw!), this is what I've got with rustc 1.22.0-nightly (4e9527cf6 2017-10-16); times in seconds:

| config / cgus | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| fresh, debug #1 | 84.241 | 78.396 | 66.445 | 64.908 | 66.476 | 67.808 |
| fresh, debug #2 | 84.017 | 70.373 | 66.309 | 66.053 | 65.734 | 68.415 |
| fresh, release #1 | 164.202 | 135.773 | 129.870 | 122.565 | 114.855 | 116.007 |
| fresh, release #2 | 163.273 | 143.971 | 121.138 | 124.187 | 113.137 | 115.537 |
| fresh, release, thinlto #1 | 169.568 | 171.795 | 163.108 | 178.536 | 156.117 | 158.511 |
| fresh, release, thinlto #2 | 165.670 | 176.277 | 162.612 | 173.914 | 155.648 | 165.749 |
| math, debug #1 | 20.781 | 15.920 | 14.698 | 14.141 | 14.056 | 14.349 |
| math, debug #2 | 20.737 | 15.901 | 14.717 | 14.056 | 14.101 | 14.400 |
| math, release #1 | 64.861 | 37.807 | 25.958 | 25.829 | 23.400 | 23.247 |
| math, release #2 | 64.856 | 37.211 | 25.930 | 26.070 | 23.292 | 23.264 |
| math, release, thinlto #1 | 62.796 | 51.280 | 35.819 | 36.367 | 32.440 | 32.596 |
| math, release, thinlto #2 | 71.395 | 50.464 | 36.009 | 35.762 | 32.796 | 33.840 |

These results match yours more closely than before, with modest or no improvements in the 8-32 CGU regime for the full rebuild and significant speed-ups in the single-crate case. There was no difference in the runtime benchmarks.

So what happened before? I'll put it down to thermal throttling: all the earlier runs took much longer, and maybe with more cores in use things got hotter. Sorry for the noise; it sounds like you can close this issue!

@alexcrichton (Member, Author)

Ok awesome! Thanks so much for gathering that data! In that case I'll...
