[mono][interp] Add SSA transformation and SSA based optimization #96315
Tagging subscribers to this area: @BrzVlad, @kotlarmilos
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
The commits up to and including '[Remove flags and use bit fields instead]' look OK to me; maybe they can be merged to reduce the size of this PR.
@vargaz #97249. I initially planned to completely replace old optimization with SSA based ones. Over the course of the implementation I stumbled across several scenarios where SSA is awkward (weirdly linked bblocks in CFG, bblocks from exception handlers, huge methods where SSA transformation takes ages). All these scenarios are properly handled by having a no-ssa mode. So default is SSA enabled, but SSA can be disabled for the entire method ( |
@dotnet/jit-contrib Not sure if this is the proper way to tag. Feel free to take a look over this, mainly at the descriptions and comments. |
Is this about propagating the use of SSA vars (that have phi nodes) in other bblocks ?
Yes. There are two main scenarios where this is helpful. First, if a var has its address taken, then no optimizations can be applied to it, and that var won't be in SSA form. If a var is no longer indirect, a lot of the work would need to be redone (like inserting phi nodes and doing renaming for the var in question). Second, following optimizations, the CFG can change, so we might no longer need to insert phi nodes in some cases, potentially generating better code. While it is not optimal, for simplicity, we completely rebuild SSA.
Well, currently, the only scenario where we forcefully enter no-ssa mode is when we have a throw that is caught in the same method. We don't have the bblocks properly linked to the end of the catch handler, so preliminary CFG related algorithms for SSA don't work. For simplicity we just disable SSA transformation, since these cases shouldn't be common. The no_ssa restriction can also apply to individual variables, and it pretty much limits the optimizations on that var to be within the same bblock where the var is defined. So we still run a pretty decent amount of optimizations in these cases. Note that even in isolated code like |
I think what Andy is referring to is that today, the JIT treats locals that are used across blocks the same as locals that are used only within single blocks. We could have the single-block locals not count towards the tracking limit and not spend any time during interblock dataflow computations on them (e.g. during liveness), which would be more efficient. We have an open issue about it -- #72740.
I think Andy has mentioned it before, but we do not have an out-of-SSA phase in the JIT, so if I understand it correctly we essentially only have what you describe as "Renamed fixed SSA vars". We are conservative in some places due to it, e.g. runtime/src/coreclr/jit/copyprop.cpp Lines 209 to 231 in e5cebac
It sounds like you do allow this with your "Normal SSA vars". I'm curious how you eliminate phis for them -- do you have an out-of-SSA phase? Or have I misunderstood what they represent? It seems like our liveness representations are a bit different. For us, we store liveness of all the tracked locals at the boundaries of every basic block. In addition, we store single-bit annotations on uses of locals that indicate whether or not this is a last use. It is not possible to efficiently query liveness for a particular local at an arbitrary point within a basic block (which I assume your |
The example from the comments is actually something I didn't think about initially and later had to adjust the implementation. So consider this code:
x is dead at the point of calling print. There are two situations here. First case is if x is a "tracked" var with no restrictions. This means that The second scenario is when x is what I call a fixed var (this happens if somewhere else in the code there was a phi node introduced for x, or if x is part of the locals state at a tiering patchpoint location). If that is the case, we no longer have the freedom of splitting up x into separate vars x0 and x1. The out of ssa phase for these vars means that they will be renamed back to the original var x. The way to then prevent incorrect cprop is by having liveness extension limit for the renamed vars. This is stored either as a bitset In this particular example, before |
One thing I've wanted to do in RyuJit is to rename disjoint SSA webs right after building SSA. That is, like you mention above, ssa defs are tied together if they appear in the same phi. So after building SSA we'd find the transitive closure of the "in same phi" relation to build these webs; they exhaustively partition the SSA defs. Each web then gets renamed. Having too many locals has a cost for us so I was going to temper this with a second relation, so that if the last use of a def feeds into another def we also tie those defs together. So something like a long chain of The hope is that this would both simplify life for the allocator and provide a bit more flexibility for the optimizer. We don't currently track the uses of each def so this rewriting requires some more bookkeeping than we do right now. This also messes up our optimized code debugging story somewhat, as it's now possible for a user local to have multiple values at a program point, but we've never made any guarantees about optimized code debuggability and we don't do that much code motion, so it's likely not a big degradation in debug info quality. |
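The "in same phi" partitioning described above can be sketched with a union-find structure. This is only an illustration under assumed data shapes (SSA defs as strings, phis as `(dst, args)` pairs, and a hypothetical `build_webs` helper); it is not RyuJIT code and it ignores the second "last use feeds a def" relation.

```python
def build_webs(ssa_defs, phis):
    """Partition SSA defs into webs: defs joined by a phi share a web.

    ssa_defs: iterable of SSA names (hypothetical representation).
    phis: list of (dst, [args]) pairs, one per phi node.
    """
    parent = {d: d for d in ssa_defs}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # A phi ties its destination to every one of its arguments.
    for dst, args in phis:
        for a in args:
            union(dst, a)

    webs = {}
    for d in parent:
        webs.setdefault(find(d), set()).add(d)
    return sorted(sorted(w) for w in webs.values())


webs = build_webs(["x1", "x2", "x3", "x4", "x5"], [("x3", ["x1", "x2"])])
print(webs)
```

Each resulting web could then be given its own local; defs that never meet in a phi (`x4` and `x5` here) end up in singleton webs.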
So am I understanding correctly that, currently on RyuJIT, all renamed SSA vars for an original var will be renamed back to the original var when exiting SSA, this being the reason for the above example not being handled optimally ? Even if there are no phi nodes involved with these renames ? |
Might be a jiterpreter bug exposed by these changes, too |
This one's now meaningfully lower, so it looks like SSA also improved performance on wasm as we'd expect |
This PR makes the interpreter optimizing compiler leverage SSA form. Before this change the interpreter was running optimizations mainly at a basic block level. As we iterated over each instruction in a basic block, we updated the current value of a var and used it for various optimizations. We also kept track of global ref/use counts for each var, which were useful for determining whether we can kill a variable. While this PR was intended as a precursor for future optimizations, some optimizations come naturally with it. With SSA transformation, vars now have a global value which can be used by optimizations outside of a single basic block boundary. In addition, since we rename every single redefinition of a var, we can now easily optimize out unused definitions. This is particularly relevant for optimizing out MINT_INITLOCALS, an opcode which was present in most methods before.
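To illustrate why renaming makes unused definitions easy to drop, here is a minimal sketch, not the interpreter's actual code. It assumes a toy instruction shape `(dst, op, srcs)` with made-up opcode names, where every `dst` is defined exactly once (the SSA property), and a hypothetical `remove_unused_defs` helper.

```python
def remove_unused_defs(instrs):
    """Drop side-effect-free definitions whose result is never used.

    instrs: list of (dst, op, srcs); dst is None when nothing is defined.
    Assumes SSA form: each dst appears as a destination exactly once.
    """
    side_effects = {"call", "store", "ret"}  # hypothetical opcode names
    use_count = {}
    for _, _, srcs in instrs:
        for s in srcs:
            use_count[s] = use_count.get(s, 0) + 1
    def_of = {ins[0]: ins for ins in instrs if ins[0] is not None}

    dead = set()
    work = [d for d in def_of if use_count.get(d, 0) == 0]
    while work:
        d = work.pop()
        _, op, srcs = def_of[d]
        if op in side_effects or d in dead:
            continue
        dead.add(d)
        # Removing a definition may leave its own operands unused.
        for s in srcs:
            use_count[s] -= 1
            if use_count[s] == 0 and s in def_of:
                work.append(s)
    return [ins for ins in instrs if ins[0] not in dead]


code = [
    ("x0", "ldc", ()),            # initial value, overwritten before any use
    ("x1", "ldc", ()),            # renamed redefinition of x
    ("t0", "add", ("x1", "x1")),
    ("t1", "mul", ("t0", "t0")),  # result never used
    (None, "ret", ("x1",)),
]
print(remove_unused_defs(code))
```

Because the initial value `x0` and the redefinition `x1` are distinct vars after renaming, a plain use count is enough to see that `x0` is dead; without renaming, both would share one var and one count.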
SSA transformation steps
SSA transformation follows several steps. We do a DFS traversal of all basic blocks and assign a dfs_index to each bblock. This is then used to compute the immediate dominator and the dominance frontier of each bblock. We then traverse all the instructions to determine which vars are relevant for SSA transformation. These are vars that are defined multiple times or used in multiple basic blocks. We then compute live_in/live_out information about this subset of global vars for each basic block. This information is used for generating pruned SSA (when we insert PHI nodes, we don't insert a phi node for a var in a bblock if that var is not live at bblock entry). Once we have PHI nodes inserted, we scan the bblocks in dominator tree order and rename each var when it gets redefined, while tracking the current renamed var for it and renaming all following uses. At this point the code is in valid SSA form and is ready for all optimizations to run on it. When we are done with all optimizing passes, we exit SSA form. This means clearing out PHI opcodes and renaming back fixed SSA vars (*).
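The first steps above (DFS numbering, immediate dominators, dominance frontiers, pruned phi placement) can be sketched as follows. This is a minimal illustration, not the PR's implementation: the PR does not say which dominator algorithm it uses, so the sketch assumes the iterative Cooper-Harvey-Kennedy scheme, takes `dfs_index` to be a postorder number, and uses hypothetical helper names and dict-based CFG and liveness inputs.

```python
from collections import defaultdict


def dfs_postorder(succ, entry):
    """Return the reachable blocks in DFS postorder (iterative)."""
    order, seen = [], {entry}
    stack = [(entry, iter(succ.get(entry, ())))]
    while stack:
        node, it = stack[-1]
        for nxt in it:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, iter(succ.get(nxt, ()))))
                break
        else:  # all successors visited
            order.append(node)
            stack.pop()
    return order


def immediate_dominators(succ, entry):
    post = dfs_postorder(succ, entry)
    dfs_index = {b: i for i, b in enumerate(post)}  # assumed: postorder number
    preds = defaultdict(list)
    for b in post:
        for s in succ.get(b, ()):
            preds[s].append(b)
    idom = {entry: entry}

    def intersect(a, b):
        # Walk both blocks up the dominator tree until they meet.
        while a != b:
            while dfs_index[a] < dfs_index[b]:
                a = idom[a]
            while dfs_index[b] < dfs_index[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for b in reversed(post):  # reverse postorder
            if b == entry:
                continue
            done = [p for p in preds[b] if p in idom]
            new = done[0]
            for p in done[1:]:
                new = intersect(new, p)
            if idom.get(b) != new:
                idom[b] = new
                changed = True
    return idom, preds


def dominance_frontiers(succ, entry):
    idom, preds = immediate_dominators(succ, entry)
    df = {b: set() for b in idom}
    for b in idom:
        if len(preds[b]) >= 2:  # only join points appear in a frontier
            for p in preds[b]:
                runner = p
                while runner != idom[b]:
                    df[runner].add(b)
                    runner = idom[runner]
    return idom, df


def place_phis(df, defs, live_in):
    """Pruned placement: skip a phi when the var is not live at block entry."""
    phis = defaultdict(set)  # block -> vars that need a phi there
    for var, blocks in defs.items():
        work = list(blocks)
        while work:
            b = work.pop()
            for f in df[b]:
                if var in live_in[f] and var not in phis[f]:
                    phis[f].add(var)
                    work.append(f)  # the phi is itself a new definition
    return phis


# Diamond CFG: A branches to B and C, which both join in D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
idom, df = dominance_frontiers(succ, "A")
live_in = {"A": set(), "B": set(), "C": set(), "D": {"x"}}
print(dict(place_phis(df, {"x": {"B", "C"}, "y": {"B", "C"}}, live_in)))
```

In the diamond example both `x` and `y` are defined in `B` and `C`, but only `x` is live on entry to `D`, so only `x` gets a phi there. Renaming in dominator tree order would then walk the tree given by `idom`.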
Additional var tables
Interp variables are identified by an index into the global `td->vars` table. This table contains basic information for each var. It can get very large, which is problematic for SSA algorithms that only need to operate on a subset of these vars (most vars are created from simple push/pop operations from IL and are already in SSA form). We create 2 new tables with information attached to an interp var via an `ext_index`. The first table is `renamable_vars`. This is a subset of the list of original vars that will end up being renamed. It also contains additional information to be used when renaming vars (e.g. the renamed var stack) and it provides a way for a renamed var to reference the original var that it is renamed from. We also create a new table, `renamed_fixed_vars`, that contains additional liveness information necessary for each renamed var of a fixed original var.
Types of vars
When SSA transformation is complete, optimizations can encounter the following types of vars (as operands to instructions):
Vars that have their address taken via `ldloca` (indirect vars). They will not be in SSA form, they can have multiple definitions and we will not apply any optimizations to them. We will still apply optimizations to the result of `ldloca`, so the address can end up not being used and bblocks containing `ldloca` can be optimized out. In these cases, some vars can no longer be indirect once we finish the optimization step. If that happens, after we exit SSA, we go back in, this time applying optimizations to these vars.
Renamed fixed vars. For these we keep `live_out_bblocks` (bblocks where the renamed var is live at the end of the bblock) and `live_limit_bblocks` (for each bblock we remember the last instruction index where the renamed var is still alive). Whenever we apply an optimization to a fixed renamed var, we ensure that the new var liveness is not outside of these precomputed ranges for the current bblock we are in. This ensures that we will not have overlapping liveness between two renamed fixed vars. Fixed renamable vars are necessary for the following reasons:
Exceptional path
Before this PR, the linking of bblocks from exception handlers was rather arbitrary, not following a clear plan, with many bblocks not having correctly set OUT bblocks. This was problematic for SSA transformation, which expects a well-formed CFG. This PR makes it such that the CFG only contains the non-exceptional path. No exception handlers (not even finally blocks) are linked to the rest of the bblocks. All variables used in an exception handler will be marked as no-ssa. We will still run optimizations on code in handlers, but we will not do any cross-bblock propagation of values.
The PR is best reviewed commit by commit. Each commit follows an incremental approach to the problem, and additional explanations are present in the description/code comments. From what I could tell in the methods that I tested, this PR leads to roughly 3-12% improvement in code size/performance. Running the optimizations seems to take about 3x as long as before; this could likely be improved.