Less codegen parallelism than expected with -C codegen-units=16
#64913
Comments
The problem is the single-threaded MIR to LLVM IR codegen, which cannot provide work for all the threads in the parallel LLVM IR to machine code codegen. As this is a debug build there is not that much for LLVM to do, resulting in low CPU usage; it may be better for a release build. thread0 shows the MIR to LLVM IR codegen, and the other rows show parallel work (not real threads) doing the LLVM IR to machine code codegen. The image is the result of passing the -Z self-profile option to rustc and then filtering out a lot of traces from the output of crox --compact_thread, because Chrome was not able to show the 1.4 GB JSON file on my computer :) |
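The bottleneck described above can be put into a rough back-of-the-envelope model. This is only a sketch under stated assumptions, not how rustc schedules work: `busy_workers`, `translate_ms` and `llvm_ms` are hypothetical names, and it assumes every CGU costs the same on both sides of the pipeline.

```rust
/// Rough upper bound on how many LLVM worker threads stay busy when a single
/// thread produces all the work. All parameters are hypothetical inputs.
///
/// * `workers`: number of LLVM worker threads (e.g. the codegen-units count)
/// * `translate_ms`: time the single thread needs to translate one CGU to LLVM IR
/// * `llvm_ms`: time one worker needs to turn that IR into machine code
fn busy_workers(workers: u32, translate_ms: u32, llvm_ms: u32) -> u32 {
    if translate_ms == 0 {
        // Work arrives instantly, so every worker can be fed.
        return workers;
    }
    // A new CGU arrives every `translate_ms` and occupies a worker for
    // `llvm_ms`, so on average llvm_ms / translate_ms workers are busy.
    // Round up to get a bound, then cap it at the number of workers.
    let needed = (llvm_ms + translate_ms - 1) / translate_ms;
    needed.min(workers)
}
```

With purely illustrative numbers, `busy_workers(16, 100, 30)` gives 1 (LLVM finishes each module before the next one arrives, as in the debug build), while `busy_workers(16, 100, 350)` gives 4. Only when LLVM work per CGU dwarfs the translation time does the bound reach all 16 workers.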
Thanks for looking into it @andjo403! Below is |
This is actually something you may want to investigate with the Translation to LLVM IR is currently single-threaded, so in debug mode we're firing off LLVM modules to get codegen'd in the background, but they codegen so quickly that CPU usage never seems to get off the ground (most of the time is spent translating to LLVM IR). The release mode case is similar, but has one more synchronization point. Instead of going directly to codegen the phases look like:
From the start of This looks very familiar to a case I diagnosed a long time ago and I explained how I found that awhile back. We don't have |
I did a run with -Z self-profile for the release build, and @alexcrichton is correct: there is one CGU that takes extra long. I do not know what the CGU contains, because I do not know how to build only the script crate to save the LLVM IR files. |
Man those graphs are so cool! It's so visually appealing what's happening there. You can easily see the waterfall of codegen units being spawned for optimization, you can see the brief serial period, and then everything getting optimized at once. Neato! But yeah, from that we can clearly see that the ThinLTO passes are taking forever on one particular codegen unit, and the normal optimization passes are taking ~2x longer on it than on all other CGUs. All that needs to be done is to go find the name of that CGU in the rlib and dig it out. The best way to do that is probably to:
In general though I don't really know great ways to dig into this. I'm not sure how to take a slow CGU in LLVM and then diagnose why it's slow. Step one is getting the LLVM IR out, but that's difficult enough to do with rustc. One thing that may also work is to compile with |
I made a little POC that adds the clang time-trace feature to the self-profiling, and with that I can see that 34% of the time in LLVM for the long CGU is spent compiling |
That likely comes from huge numbers of generated functions from this code generator calling this panic catching helper. |
If a single generic function has many instantiations, do they all end up in the same CGU? |
I think that all generic functions that are instantiated in the same module end up in the same CGU, if I understand the description here correctly: https://github.com/rust-lang/rust/blob/master/src/librustc_mir/monomorphize/partitioning.rs#L44 @michaelwoerister do you have any idea how to split the CGU further? rustc also has some generic functions that make one CGU larger than the rest. |
I have been looking in more detail at what happens in the partitioning for script, and here are the stats:
Some conclusions from the data:
|
What is CGU merging? |
it is part of the partitioning algorithm when either |
So, does "unmerged" mean that one unit for the purpose of codegen only contains one unit for the purpose of incremental? |
Yes, "unmerged" means a CGU that was not merged during the merge phase. I will rename it to "not merged". |
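To make the merge phase discussed above concrete, here is a minimal sketch of one possible merging strategy: repeatedly fold the smallest unit into the second smallest until a target count is reached. The `Cgu` type, the `merge_cgus` function and the name-joining scheme are hypothetical and only illustrate the idea; this is an assumption about the general shape of such an algorithm, not a copy of rustc's partitioning code.

```rust
/// A codegen unit reduced to a name and a size estimate (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Cgu {
    name: String,
    size: usize,
}

/// Merge the two smallest units until at most `target` units remain.
fn merge_cgus(mut cgus: Vec<Cgu>, target: usize) -> Vec<Cgu> {
    // Never try to merge down to zero units.
    let target = target.max(1);
    while cgus.len() > target {
        // Sort largest first, so the two smallest sit at the end.
        cgus.sort_by(|a, b| b.size.cmp(&a.size));
        // The loop condition guarantees at least two units here.
        let smallest = cgus.pop().unwrap();
        let second = cgus.last_mut().unwrap();
        second.size += smallest.size;
        second.name = format!("{}+{}", second.name, smallest.name);
    }
    cgus
}
```

Under this sketch, units of size 100, 5, 7 and 50 merged down to two units end up as 100 and 62. Merging only ever combines small units; a single oversized unit, like the slow one seen in the profiles, is left exactly as large as it was.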
add more info in debug traces for gcu merging to help in investigation of CGU partitioning problems e.g rust-lang#64913
I have made #65281 to handle the first of my points from above. |
Hmm, it seems someone has already tested the LTO idea: #65052 |
I wonder if it is possible to look not only at the size of the CGUs but also at which inline items they have in common, to get more items with internal linkage and less duplication between CGUs. |
for a more even partitioning inline before merge: consider the size of the inlined items for a more even partitioning. For me this change takes the compile time for script-servo-opt from 306s to 249s. Edit: the times are for a 32-thread CPU. cc #64913
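One way to read the change described above: when picking merge candidates, count the inline items copied into a CGU as part of its size. The sketch below is an illustration under that assumption only; `estimated_size`, `smallest_cgu` and the tuple layout (sizes of owned items, sizes of inlined items) are hypothetical and not taken from the actual patch.

```rust
/// Hypothetical size estimate for one CGU: the items it owns plus the
/// inline items that get copied into it.
fn estimated_size(root_items: &[usize], inlined_items: &[usize]) -> usize {
    root_items.iter().sum::<usize>() + inlined_items.iter().sum::<usize>()
}

/// Index of the smallest CGU under this estimate, i.e. the next merge
/// candidate. Each CGU is given as (owned item sizes, inlined item sizes).
fn smallest_cgu(cgus: &[(Vec<usize>, Vec<usize>)]) -> Option<usize> {
    cgus.iter()
        .enumerate()
        .min_by_key(|(_, (roots, inlined))| estimated_size(roots, inlined))
        .map(|(i, _)| i)
}
```

A CGU that owns little code but pulls in many inline items looks small if only its own items are counted, so it would be merged early and grow further. Counting the inlined items makes it look as large as it really is after inlining.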
This is the output of cargo build -Z timings for Servo, on a 14 cores / 28 threads machine, filtered with "Min unit time" at 10 seconds.
Edit: with Rust nightly-2019-09-28
For most of codegen for the script crate only 3 to 4 threads seem to be CPU-bound, leaving other cores idle, despite codegen-units=16 being the default. (Results are similar if I specify it explicitly.)
Shouldn’t CPU usage be much closer to min(ncpu, codegen-units)?