-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inline overlap based CGU merging #113777
Inline overlap based CGU merging #113777
Conversation
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit 78be95ec10df27dbd635a3a5a420b29b8272b4b5 with merge 53bc7211881d8549388de313642b05ad9f9f4f56... |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (53bc7211881d8549388de313642b05ad9f9f4f56): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 657.644s -> 648.768s (-1.35%) |
Some good results:
|
78be95e
to
ea66bf9
Compare
@wesleywiser: this is ready for review! Meanwhile, let's do another perf run, just to see if anything changes. @bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit ea66bf9e6e3f29911168b53ff37e2a36c344ed06 with merge 8fe5a9539a402e69d0fed84df87f3645f271468a... |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (8fe5a9539a402e69d0fed84df87f3645f271468a): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 658.015s -> 646.273s (-1.78%) |
The second perf run gives results similar to the first. |
@bors r+ |
📌 Commit ea66bf9e6e3f29911168b53ff37e2a36c344ed06 has been approved by It is now in the queue for this repository. |
Instead of repeatedly merging the two smallest CGUs, we now use a merging algorithm that aims to minimize the duplication of inlined functions. `exa-0.10.1` was one benchmark that saw particularly good results. The old CGU stats: ``` INTERNALIZE - unique items: 2774 (1216 root + 1558 inlined), unique size: 122065 (77219 root + 44846 inlined) - placed items: 3834 (1216 root + 2618 inlined), placed size: 154552 (77219 root + 77333 inlined) - placed/unique items ratio: 1.38, placed/unique size ratio: 1.27 - CGUs: 16, mean size: 9659.5, sizes: [11791, 11634, 11173, 10987, 10939, 10507, 9992, 9813, 9593, 9580, 9030, 8447, 7975, 7961, 7876, 7254] ``` The new CGU stats: ``` INTERNALIZE - unique items: 2774 (1216 root + 1558 inlined), unique size: 122065 (77219 root + 44846 inlined) - placed items: 3626 (1216 root + 2410 inlined), placed size: 147201 (77219 root + 69982 inlined) - placed/unique items ratio: 1.31, placed/unique size ratio: 1.21 - CGUs: 16, mean size: 9200.1, sizes: [11634, 10939, 10227, 9555, 9178, 9167, 8879, 8804, 8604, 8603 (x3), 8602 (x2), 8601, 8600] ``` The difference is in the number of inlined items. There are 1558 unique inlined items. With the old algorithm these were placed 2618 times, resulting in 1060 duplicates. With the new algorithm these were placed 2410 times, resulting in 852 duplicates. Also, the mean CGU size dropped from 9659.5 to 9200.1, and the CGU size distribution tightened, with the biggest one a little smaller and the smallest ones a little bigger.
ea66bf9
to
05de5d6
Compare
I fixed the typo. @bors r=pnkfelix |
☀️ Test successful - checks-actions |
1 similar comment
☀️ Test successful - checks-actions |
Finished benchmarking commit (0d6a9b2): comparison URL. Overall result: ✅ improvements - no action needed@rustbot label: -perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 656.931s -> 647.107s (-1.50%) |
In `base.rs`, tweak how the CGU size interleaving works. Since rust-lang#113777, it's much more common to have multiple CGUs with identical sizes. With the existing code these same-sized items ended up in the opposite-to-desired order due to the stable sorting. The code now starts with a reverse sort (like is done in `partitioning.rs`) which gives the behaviour we want. This doesn't matter much for perf, but makes profiles in `samply` look more like what we expect. In `partitioning.rs`, we can use `sort_by_key` instead of `sort_by_cached_key` because `CGU::size_estimate()` is cheap. (There is an identical CGU sort earlier in that function that already uses `sort_by_key`.)
…nkfelix Tweak CGU sorting in a couple of places. In `base.rs`, tweak how the CGU size interleaving works. Since rust-lang#113777, it's much more common to have multiple CGUs with identical sizes. With the existing code these same-sized items ended up in the opposite-to-desired order due to the stable sorting. The code now starts with a reverse sort (like is done in `partitioning.rs`) which gives the behaviour we want. This doesn't matter much for perf, but makes profiles in `samply` look more like what we expect. In `partitioning.rs`, we can use `sort_by_key` instead of `sort_by_cached_key` because `CGU::size_estimate()` is cheap. (There is an identical CGU sort earlier in that function that already uses `sort_by_key`.) r? `@pnkfelix`
…nkfelix Tweak CGU sorting in a couple of places. In `base.rs`, tweak how the CGU size interleaving works. Since rust-lang#113777, it's much more common to have multiple CGUs with identical sizes. With the existing code these same-sized items ended up in the opposite-to-desired order due to the stable sorting. The code now starts with a reverse sort (like is done in `partitioning.rs`) which gives the behaviour we want. This doesn't matter much for perf, but makes profiles in `samply` look more like what we expect. In `partitioning.rs`, we can use `sort_by_key` instead of `sort_by_cached_key` because `CGU::size_estimate()` is cheap. (There is an identical CGU sort earlier in that function that already uses `sort_by_key`.) r? ``@pnkfelix``
…nkfelix Tweak CGU sorting in a couple of places. In `base.rs`, tweak how the CGU size interleaving works. Since rust-lang#113777, it's much more common to have multiple CGUs with identical sizes. With the existing code these same-sized items ended up in the opposite-to-desired order due to the stable sorting. The code now starts with a reverse sort (like is done in `partitioning.rs`) which gives the behaviour we want. This doesn't matter much for perf, but makes profiles in `samply` look more like what we expect. In `partitioning.rs`, we can use `sort_by_key` instead of `sort_by_cached_key` because `CGU::size_estimate()` is cheap. (There is an identical CGU sort earlier in that function that already uses `sort_by_key`.) r? ```@pnkfelix```
…nkfelix Tweak CGU sorting in a couple of places. In `base.rs`, tweak how the CGU size interleaving works. Since rust-lang#113777, it's much more common to have multiple CGUs with identical sizes. With the existing code these same-sized items ended up in the opposite-to-desired order due to the stable sorting. The code now starts with a reverse sort (like is done in `partitioning.rs`) which gives the behaviour we want. This doesn't matter much for perf, but makes profiles in `samply` look more like what we expect. In `partitioning.rs`, we can use `sort_by_key` instead of `sort_by_cached_key` because `CGU::size_estimate()` is cheap. (There is an identical CGU sort earlier in that function that already uses `sort_by_key`.) r? `@pnkfelix`
I imagine that repeatedly merging Out of curiosity @nnethercote, did you see anything while working on this that would suggest the results are, or are not, very stable with respect to small changes in the number of codegen units? |
I have no data one way or the other. I never tried changing the default N away from 16. |
Introduce a new CGU merging algorithm that aims to minimize the number of duplicated inlined items.
r? @wesleywiser