-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MIR per-block copy elimination #86430
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @petrochenkov (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
Effect of this pass works the best with the following "SimplifyLocals" pass.
and then "SimplifyLocals" eliminates a lot of locals
Logic of the pass is simple itself:
I think that local state analyzer can be simplified, and I'd want to rework the "safeguard" logic, but first would like to check if this pass most likely sound and is worth it. May be @oli-obk or @bjorn3 can review it as they have reviewed my previous PR and it's kind of logical step after it |
What if a reference to a local is taken and then the local is mutated through this reference? With fixes for this (and maybe related cases) I think it is a worthwhile optimization. |
I've identified this weakness and few other, will make an update soon. I expect it to be quite expensive pass, but it should eliminate quite a lot |
r=@bjorn3 most likely it's in a good shape to try for performance impact |
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit c2245e2c8519d40fa967fb5d1bcca17ba7f639b1 with merge 28c1cfbaee6d9198e43621666275d1d53336d5bd... |
This comment has been minimized.
This comment has been minimized.
💔 Test failed - checks-actions |
Hm, that is interesting. In theory, if I have a copy into local like _2 = Expr(_1) and _1 is not mutated or dropped until the end of the block then what can invalidate _1, so that expression can not be propagated further as substitutions for _2? Obvious cases of mutation of _1 are rejected to be optimized already. Case when _2 itself is mutated in some way is discarded by this pass (mutation of the local copy). Between the place of copy into _2 and the substitution it's only allowed to copy _2, and make non-use actions. |
r? @tmiasko |
r? @bjorn3 |
r? @tmiasko I don't feel confident enough in reviewing this PR. It is way too susceptible to causing miscompilations. |
This comment has been minimized.
This comment has been minimized.
r=@bjorn3 can you start a CI job please? Looks better locally, tests pass, for now it's less zealous with optimizations, will tune later on |
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit a4d288d8387af1a8fb43cf713683f9e2af30a5ae with merge 0a02c3670999e4954a874ff832bcc92354eb47cf... |
Oh, you just wanted CI, not a perf run. My bad. Still can't hurt to get a perf run I guess. |
Well, it could have been indirectly touched. As mentioned above it's strictly not one MIR pass but two: lower slice::len and this copy elimination. For a clear measurement I'd need the first to be merged and then rebase |
depends on what you mean by indirectly. It could just be random LLVM decisions making rustc slower, even on stage 1. We get this kind of noise pretty often. but there is indeed one real regression https://perf.rust-lang.org/detailed-query.html?commit=0a02c3670999e4954a874ff832bcc92354eb47cf&base_commit=9543636cd6609cd2a9880da8f93aac96192ae40a&benchmark=html5ever-opt&run_name=full |
This may be logical: if we eliminate intermediates that have made some parts of code look "far" from each other then we can get more work for LTO In any case I only consider LLVM as a black box that works somewhere "after" MIR, so less lines in MIR was kind of a goal (even while you only see a reduction in number of lines after "SimplifyLocals" actually runs) |
For completeness I should also note that this pass is kind of lightweight (O(locals * blocks) instead of O(locals^2 * blocks)) and most likely sound analog of https://github.com/rust-lang/rust/blob/master/compiler/rustc_mir/src/transform/dest_prop.rs Most likely sound because it only works on the block level and replaces destinations with sources in case if destination was not messed up between copy and potential substitution place, and if destination is not dead by the end of the block then last copy statement is preserved as it may be used in other blocks. It's also conservative on what messes up with a source and where to substitute the destination |
How much is this getting from being quadratic? I agree that pattern is super-common, but I wonder if perhaps most of the value it gives could be done in a single walk -- rather than one per local -- by just looking at the most recent assignment. (And if the use is a |
I think my estimate of O(locals * blocks) is not 100% correct, it's still quadratic, but with small constants:
May be destination propagation is better in global, but it's under unsound + mir-opt-level=3 and this on is intended to be simpler and sound |
@rustbot label: +S-blocked |
☔ The latest upstream changes (presumably #86383) made this pull request unmergeable. Please resolve the merge conflicts. |
aee3c6f
to
f9c6cc9
Compare
@rustbot label: -S-blocked |
☔ The latest upstream changes (presumably #86143) made this pull request unmergeable. Please resolve the merge conflicts. |
_3 = ((_2 as Some).0: i32); // scope 0 at $DIR/issue-73223.rs:3:14: 3:15 | ||
_1 = _3; // scope 2 at $DIR/issue-73223.rs:3:20: 3:21 | ||
_1 = ((_2 as Some).0: i32); // scope 2 at $DIR/issue-73223.rs:3:20: 3:21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised our DCE didn't get rid of _3
, situations like this could explain why we don't see much gain in the perf runner, all the dead code is still passed to LLVM and eliminated there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... maybe we don't have any pass that removes useless assignments? That should be a fairly easy pass... just walk every block backwards and remember any StorageDead
. As long as you don't see a read from the locals that have a StorageDead
, then you can remove assignments to the local.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually did see a result of the pass (SimplifyLocals I think) that does what you say - if there is a variable that is just Live - copy-assign- Dead then it gets removed. The problem is that such patters it not present in MIR by default and only emerges e.g. as a result of this pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think there is a pass that does what you say (not a pass from this PR), but it's under opt-level = 4 and also under "unsound mir optimizations". While this pass basically does what you say with some extra heuristics like copying into the variable more then once, and handling a case when it's not dead by the end of the block.
So... I tried to review your pass's logic, but ran out of time in the hour I allocated for it. I don't think your categorization of "simple" fits what it is, as it is rather complex and it's not trivial to reason about its correctness (at least not for me). I don't know how to keep reviewing this pass and whether the maintenance effort to keep it sound and working is worth it when we could spend that effort on getting destination propagation bug-free and get a more powerful optimization out of it. |
I've made this pass because it's for sure much more trivial than full destination propagation and it's easier to make it sound. Regarding the logic, it's simple
This allows to handle cases like
made into
that is later simplified by SimplifyLocals, as well as
that is replaced only as
|
ping from triage: |
ping from triage: |
There were many attempts to cleanup a MIR patterns like
This is a simple pass that only works per basic blocks and doesn't try to do harder liveness analysis on a body level.
It also depends on my work from PR #86383 , so I'll rebase when it's accepted.