
[mono][interp] Add SSA transformation and SSA based optimization #96315

Merged
merged 45 commits into dotnet:main on Feb 10, 2024

Conversation

BrzVlad
Member

@BrzVlad BrzVlad commented Dec 26, 2023

This PR makes the interpreter's optimizing compiler leverage SSA form. Before this change the interpreter was running optimizations mainly at a basic block level. As we iterated over each instruction in a basic block we were updating the current value of a var and using it for various optimizations. We were also keeping track of global ref/use counts for each var, which were useful to determine whether we can kill a variable. While this PR was intended to be a prerequisite for future optimizations, some optimizations come naturally with it. With SSA transformation, vars now have a global value which can be used by optimizations beyond a single basic block boundary. In addition, since we rename every single redefinition of a var, we can now easily optimize out unused definitions. This is particularly relevant for optimizing out MINT_INITLOCALS, an opcode which was present in most methods before.

SSA transformation steps

SSA transformation follows several steps. We do a DFS traversal of all basic blocks and assign a dfs_index to each bblock. This is relevant for then computing the immediate dominator and the dominance frontier for each bblock. We then traverse all the instructions to determine which vars are relevant for SSA transformation: these are vars that are defined multiple times or used in multiple basic blocks. We then compute live_in/live_out information for this subset of global vars for each basic block. This information is used for generating pruned SSA (when we insert PHI nodes, we don't insert a phi node for a var in a bblock if that var is not live at bblock entry). Once we have PHI nodes inserted, we scan the bblocks in dominator tree order and rename each var when it gets redefined, while tracking the current renamed var for it and renaming all following uses. At this point the code is in valid SSA form and it is ready to run all optimizations on it. When we are done with all optimizing passes, we exit SSA code. This means clearing out PHI opcodes and renaming back fixed SSA vars (*).
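
The phi-insertion step above can be sketched as the classic worklist over dominance frontiers. This is a minimal illustration in C; the diamond CFG, the hand-computed DF sets, and all names are assumptions for the sketch, not the actual mono/interp code (and the pruning step, which checks liveness at the frontier block, is omitted):

```c
#include <assert.h>

#define NBB 4
/* Diamond CFG: B0 -> B1, B0 -> B2, B1 -> B3, B2 -> B3.
   Dominance frontiers, precomputed by hand for this shape:
   DF(B1) = {B3}, DF(B2) = {B3}, DF(B0) = DF(B3) = {}. */
static const int df[NBB][NBB] = { {0}, {0, 0, 0, 1}, {0, 0, 0, 1}, {0} };

/* For one var: given the blocks that define it, mark the blocks that
   need a phi, iterating because a phi is itself a new definition. */
static void place_phis(const int *def_in, int *phi_in) {
    int work[NBB], n = 0, on_work[NBB] = {0};
    for (int b = 0; b < NBB; b++)
        if (def_in[b]) { work[n++] = b; on_work[b] = 1; }
    while (n > 0) {
        int b = work[--n];
        for (int f = 0; f < NBB; f++) {
            if (df[b][f] && !phi_in[f]) {
                phi_in[f] = 1;          /* insert a phi for the var in f */
                if (!on_work[f]) { work[n++] = f; on_work[f] = 1; }
            }
        }
    }
}
```

With a var defined in both B1 and B2, the only phi lands at the join block B3, matching the intuition that a phi is needed exactly where two definitions meet.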

Additional var tables

Interp variables are identified by an index into the global td->vars table. This table contains basic information for each var. It can get very large, which is problematic for SSA algorithms that only need to operate on a subset of these vars (most vars are created from simple push/pop operations from IL and are already in SSA form). We create two new tables with information attached to an interp var via an ext_index. The first table is renamable_vars. This is the subset of the original vars that will end up being renamed. It also contains additional information to be used when renaming vars (e.g. the renamed var stack) and it provides a way for a renamed var to reference the original var it was renamed from. We also create a new table, renamed_fixed_vars, that contains additional liveness information necessary for each renamed var of a fixed original var.
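
A rough sketch of the two-level table layout described above. The type and field names here are illustrative assumptions, not the real td->vars layout, which carries much more per-var state:

```c
#include <assert.h>

typedef struct {
    int ext_index;   /* index into renamable_vars, or -1 if not renamable */
    int offset;      /* stack offset, allocated late in codegen */
} InterpVar;

typedef struct {
    int var_index;       /* back-reference to the original var */
    int ssa_stack_top;   /* top of the renamed-var stack during renaming */
} RenamableVar;

/* Follow var -> renamable entry -> original var. Returns -1 when the
   var has no renamable entry (it was already in SSA form). */
static int original_var(const InterpVar *vars, const RenamableVar *ren, int v) {
    if (vars[v].ext_index < 0)
        return -1;
    return ren[vars[v].ext_index].var_index;
}
```

The point of the indirection is that only the small renamable subset pays for the extra bookkeeping; the bulk of the vars keep ext_index == -1 and cost nothing extra.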

Types of vars

When SSA transformation is complete, optimizations can encounter the following types of vars (as operands to instructions):

  • Indirect vars. These are vars that are a source var to ldloca. They will not be in SSA form, they can have multiple definitions and we will not do any type of optimization on them. We will still apply optimizations to the result of ldloca, so the address can end up not being used and bblocks containing ldloca can be optimized out. In these cases, some vars may no longer be indirect once we finish the optimization step. If that happens, after we exit SSA, we go back in, this time applying optimizations to these vars.
  • No SSA vars that are not indirect. Currently these are variables that are referenced from EH handlers. These are variables that will be optimized pretty much the same way as before this PR. They will have their value tracked over a single basic block and their value will not be propagated from another basic block.
  • Normal SSA vars. These are vars that are renamed from non-fixed vars, or vars that were not marked as no-ssa and were not renamable (aka vars that were already in SSA form from the start). This is the most common set of vars and it supports all optimizations. Aside from taking part in cross-bblock optimizations, another difference from no-ssa vars is that they can actually have their value stored in a different order. (For no-ssa vars we preserve the store order since an exception can be thrown at any point.) Renamed vars of this type will not be renamed back; they will continue to be used as standalone vars once we exit SSA form.
  • Renamed fixed SSA vars. These are vars that are renamed from a fixed renamable var. Renamed vars from a fixed renamable var must all have the same offset. This is implemented by reverting the renaming of all vars once we exit SSA, meaning that, in the final code, we will end up using the same original fixed renamable var. All optimizations applied to these vars will have to respect this offset allocation constraint. This translates to the following condition: at any point in the code, there must be a single live value of any of the renamed vars of the same fixed renamable var. This is implemented by having additional liveness information computed as we do the var renaming, specifically live_out_bblocks (bblocks where the renamed var is live at the end of the bblock) and live_limit_bblocks (for each bblock we remember the last instruction index where the renamed var is still alive). Whenever we apply an optimization to a fixed renamed var, we will ensure that the new var liveness is not outside of these precomputed ranges for the current bblock we are in. This ensures that we will not have overlapping liveness between two renamed fixed vars. Fixed renamable vars are necessary for the following reasons:
    • the theoretical implementation of phi nodes would be to fetch the correct value of the var from each incoming bblock. This would mean inserting moves on each incoming edge, which would be extremely expensive, especially for the interpreter. The alternative solution is to simply remove the phi nodes when exiting SSA, while ensuring all these vars (vars that are args to the phi opcode) are allocated to the same offset, so no moving is necessary.
    • vars that are part of the patchpoint state during tiering have to be fixed. When we tier up a method from a patchpoint, we copy the entire IL locals space. The optimized method needs to be able to obtain the values of the IL locals from the same offset as in the unoptimized method. This means that all vars that are live at the location of a patchpoint will be marked as fixed (luckily for us, we already have this liveness information readily available as part of the pruned SSA computation).
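
The offset-sharing constraint on renamed fixed vars boils down to a bounded liveness-extension check. A minimal sketch follows; the names mirror live_out_bblocks/live_limit_bblocks from the description above, but the API shape is an assumption, not the real mono code:

```c
#include <assert.h>
#include <stdbool.h>

/* Per (renamed fixed var, bblock) liveness summary, precomputed during
   renaming as described above. */
typedef struct {
    bool live_out;   /* var is in live_out_bblocks for this bblock */
    int  live_limit; /* last instruction index where the var is alive */
} FixedVarLiveness;

/* An optimization may extend the use of a fixed renamed var to
   instruction ins_index in this bblock only if the var is live out of
   the block, or the index stays within the recorded live limit.
   Otherwise it could overlap another rename sharing the same offset. */
static bool can_extend_use(const FixedVarLiveness *lv, int ins_index) {
    return lv->live_out || ins_index <= lv->live_limit;
}
```

Any propagation that would push a use past the limit is simply skipped, which is cheaper than recomputing full liveness after every transformation.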

Exceptional path

Before this PR, the linking of bblocks from exception handlers was rather arbitrary, not following a clear plan, with many bblocks not having correctly set OUT bblocks. This was problematic for SSA transformation, which expects a well-formed CFG. This PR makes it so that the CFG only contains the non-exceptional path. No exception handlers (not even finally blocks) are linked to the rest of the bblocks. All variables used in an exception handler will be marked as no-ssa. We will still run optimizations on code in handlers, but we will not do any cross-bblock propagation of values.



The PR is best reviewed commit by commit. Each commit follows an incremental approach to the problem and additional explanations are present in the commit descriptions/code comments. From what I could tell in the methods that I tested, this PR leads to roughly a 3-12% improvement in code size/performance. Running the optimizations seems to take about 3x as long as before; this could likely be improved.

@ghost

ghost commented Dec 26, 2023

Tagging subscribers to this area: @BrzVlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.


Author: BrzVlad
Assignees: BrzVlad
Labels:

area-Codegen-Interpreter-mono

Milestone: -

@BrzVlad
Member Author

BrzVlad commented Jan 18, 2024

/azp run runtime-extra-platforms


Azure Pipelines successfully started running 1 pipeline(s).

@vargaz
Contributor

vargaz commented Jan 20, 2024

The commits up to and including '[Remove flags and use bit fields instead]' look OK to me; maybe they can be merged to reduce the size of this PR.
Will it be possible to turn off SSA, or is it going to be the default?

@BrzVlad
Member Author

BrzVlad commented Jan 20, 2024

@vargaz #97249. I initially planned to completely replace the old optimizations with SSA based ones. Over the course of the implementation I stumbled across several scenarios where SSA is awkward (weirdly linked bblocks in the CFG, bblocks from exception handlers, huge methods where SSA transformation takes ages). All these scenarios are properly handled by having a no-ssa mode. So the default is SSA enabled, but SSA can be disabled for the entire method (--interp=-ssa). No-ssa mode is pretty much identical to what we had before, but it is expected to generate slightly worse code in some cases because I didn't prioritize its codegen quality. All optimizations were tweaked with SSA enabled in mind.

@BrzVlad
Member Author

BrzVlad commented Jan 22, 2024

@dotnet/jit-contrib Not sure if this is the proper way to tag. Feel free to take a look over this, mainly at the descriptions and comments.

@BrzVlad
Member Author

BrzVlad commented Jan 31, 2024

We also don't keep track of variables that are potentially live cross-block, though we have thought of doing so.

Is this about propagating the use of SSA vars (that have phi nodes) in other bblocks ?

Can you explain more about the "redo optimizations" aspect? Does this mean you are dropping and rebuilding SSA multiple times? Ryujit only does this once (currently) since we worry about the cost.

Yes. There are two main scenarios where this is helpful. First, if a var has its address taken, then no optimizations can be applied to it; also, that var won't be in SSA form. If a var is no longer indirect, a lot of the work would need to be redone (like inserting phi nodes and doing renaming for the var in question). Also, following optimizations, the CFG can change, and we might no longer need to insert phi nodes in some cases, potentially generating better code. While it is not optimal, for simplicity, we completely rebuild SSA.

I'm curious how Mono detects these cases and switches into the no-ssa mode.

Well, currently, the only scenario where we forcefully enter no-ssa mode is when we have a throw that is caught in the same method. We don't have the bblocks properly linked to the end of the catch handler, so preliminary CFG related algorithms for SSA don't work. For simplicity we just disable SSA transformation, since these cases shouldn't be common. The no_ssa restriction can also apply to individual variables, and it pretty much limits the optimizations on that var to be within the same bblock where the var is defined. So we still run a pretty decent amount of optimizations in these cases.

Note that even in isolated code like push v; pop v, sometimes I'm referring to the var v as being a SSA var, even though it is in SSA form by default. Other vars, that are defined multiple times, will be part of the renamable vars set, and these are the vars for which we do liveness computation, for pruned SSA. We have no limit on the renamable vars set size. A subset of the renamable vars are fixed (all of their renamed vars will have to share the same final offset, e.g. a var that has a phi), and for them we will do even heavier liveness-related computation, marking the limit to which we can extend the use of the var without conflicting with other fixed vars sharing the same offset.

@jakobbotsch
Member

Is this about propagating the use of SSA vars (that have phi nodes) in other bblocks ?

I think what Andy is referring to is that today, the JIT treats locals that are used across blocks the same as locals that are used only within single blocks. We could have the single-block locals not count towards the tracking limit and not spend any time during interblock dataflow computations on them (e.g. during liveness), which would be more efficient. We have an open issue about it -- #72740.

A subset of the renamable vars are fixed (all of the renamed vars will have to share same final offset, ex. a var that has phi), and for them we will do even heavier liveness related computation, marking the limit to which we can extend the use of the var, without conflicting with other fixed vars sharing same offset.

I think Andy has mentioned it before, but we do not have an out-of-SSA phase in the JIT, so if I understand it correctly we essentially only have what you describe as "Renamed fixed SSA vars". We are conservative in some places due to it, e.g.

// Check whether the newLclNum is live before being substituted. Otherwise, we could end
// up in a situation where there must've been a phi node that got pruned because the variable
// is not live anymore. For example,
// if
// x0 = 1
// else
// x1 = 2
// print(c) <-- x is not live here. Let's say 'c' shares the value number with "x0."
//
// If we simply substituted 'c' with "x0", we would be wrong. Ideally, there would be a phi
// node x2 = phi(x0, x1) which can then be used to substitute 'c' with. But because of pruning
// there would be no such phi node. To solve this we'll check if 'x' is live, before replacing
// 'c' with 'x.'
// We compute liveness only on tracked variables. And all SSA locals are tracked.
assert(newLclVarDsc->lvTracked);
// Because of this dependence on live variable analysis, CopyProp phase is immediately
// after Liveness, SSA and VN.
if ((newLclNum != info.compThisArg) && !VarSetOps::IsMember(this, compCurLife, newLclVarDsc->lvVarIndex))
{
    continue;
}

It sounds like you do allow this with your "Normal SSA vars". I'm curious how you eliminate phis for them -- do you have an out-of-SSA phase? Or have I misunderstood what they represent?

It seems like our liveness representations are a bit different. For us, we store liveness of all the tracked locals at the boundaries of every basic block. In addition, we store single-bit annotations on uses of locals that indicate whether or not this is a last use. It is not possible to efficiently query liveness for a particular local at an arbitrary point within a basic block (which I assume your live_limit_bblocks allows) -- to do that you must "replay" the IR from the beginning of the basic block up to the point. The replay makes use of the "last use" information to update a single liveness set that you can then query. This is usually not a problem for us since the passes that need information like this (such as copy prop) are processing all the IR in order anyway.
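
The replay described above can be sketched as follows. This is a hedged illustration only; the instruction shape and names are assumptions for the sketch, not RyuJit's actual IR or data structures:

```c
#include <assert.h>
#include <stdbool.h>

enum { MAXLOCALS = 8 };

typedef struct {
    int  local;      /* which local this instruction touches */
    bool is_def;     /* definition of the local */
    bool last_use;   /* "last use" bit produced by the liveness pass */
} Ins;

/* Query whether `local` is live just before instruction k: start from
   the block's live-in set and replay the IR forward, setting a local
   at a def and clearing it at its last use. */
static bool live_before(const bool *live_in, const Ins *ir, int n,
                        int k, int local) {
    assert(k <= n && local < MAXLOCALS);
    bool live[MAXLOCALS];
    for (int i = 0; i < MAXLOCALS; i++) live[i] = live_in[i];
    for (int i = 0; i < k; i++) {
        if (ir[i].is_def)        live[ir[i].local] = true;
        else if (ir[i].last_use) live[ir[i].local] = false;
    }
    return live[local];
}
```

The O(block length) replay cost is why, as noted above, this works well only for passes that already walk the IR in order and can maintain a single running live set.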

@BrzVlad
Member Author

BrzVlad commented Feb 1, 2024

The example from the comments is actually something I didn't think about initially and later had to adjust the implementation. So consider this code:

c = x0
if (?)
    x1 = 1;
print(c)

x is dead at the point of calling print. There are two situations here. First case is if x is a "tracked" var with no restrictions. This means that print(c) can be replaced with print(x0), however x1 will no longer share the same allocated offset with x0 in the end. This means that when we exit SSA, x0 and x1 will not be renamed back to x, but we will end up with 2 separate vars, independent of each other. These vars have no phi nodes, so there is no out of SSA work related to phis. The var offset allocator, towards the end of codegen, will see that these vars can be alive at the same time and separate stack offsets will be allocated for them.

The second scenario is when x is what I call a fixed var (this happens if somewhere else in the code there was a phi node introduced for x, or if x is part of the locals state at a tiering patchpoint location). If that is the case, we no longer have the freedom of splitting up x into separate vars x0 and x1. The out of ssa phase for these vars means that they will be renamed back to the original var x. The way to then prevent incorrect cprop is by having liveness extension limit for the renamed vars. This is stored either as a bitset live_out_bblocks for blocks where the var is live at the end of the bblock (so we can freely forward propagate its use within the bblock), or more granular with the live_limit_bblocks (which saves the location where another conflicting renamed var gets declared in the bblock).

In this particular example, before print(c) there would normally have been a phi(x), which pruned SSA gets rid of. However, we still remember this information here by emitting a special dead_phi(x) opcode. Dead phis have the purpose of limiting the extension of liveness for all renamed vars of x that reach that point, so in our case the bblock containing print will no longer be in the live_out_bblocks set for var x0, preventing its propagation. All these calculations are done in the renaming phase. Note that there can be a bazillion dead phi opcodes generated, but it didn't seem to impact perf that badly; I prioritized codegen quality however (this is also one of the reasons I run the first optimization iteration in no-ssa mode for huge methods).

@AndyAyersMS
Member

The second scenario is when x is what I call a fixed var (this happens if somewhere else in the code there was a phi node introduced for x, or if x is part of the locals state at a tiering patchpoint location). If that is the case, we no longer have the freedom of splitting up x into separate vars x0 and x1.

One thing I've wanted to do in RyuJit is to rename disjoint SSA webs right after building SSA. That is, like you mention above, ssa defs are tied together if they appear in the same phi. So after building SSA we'd find the transitive closure of the "in same phi" relation to build these webs; they exhaustively partition the SSA defs. Each web then gets renamed. Having too many locals has a cost for us so I was going to temper this with a second relation, so that if the last use of a def feeds into another def we also tie those defs together. So something like a long chain of x += 1 doesn't create hundreds of renames with little benefit.

The hope is that this would both simplify life for the allocator and provide a bit more flexibility for the optimizer. We don't currently track the uses of each def so this rewriting requires some more bookkeeping than we do right now.

This also messes up our optimized code debugging story somewhat, as it's now possible for a user local to have multiple values at a program point, but we've never made any guarantees about optimized code debuggability and we don't do that much code motion, so it's likely not a big degradation in debug info quality.

@BrzVlad
Member Author

BrzVlad commented Feb 1, 2024

So am I understanding correctly that, currently in RyuJIT, all renamed SSA vars for an original var will be renamed back to the original var when exiting SSA, this being the reason the above example is not handled optimally? Even if there are no phi nodes involved with these renames?

@BrzVlad BrzVlad merged commit 304cedf into dotnet:main Feb 10, 2024
108 of 111 checks passed
@kg
Member

kg commented Feb 12, 2024

Might be a jiterpreter bug exposed by these changes, too

@BrzVlad
Member Author

BrzVlad commented Feb 13, 2024

@kg Seems like your intuition was correct that it was jiterpreter related. #98345

@BrzVlad
Member Author

BrzVlad commented Feb 13, 2024

dotnet/perf-autofiling-issues#29159

@kg
Member

kg commented Feb 13, 2024

@BrzVlad we're seeing a regression in json benchmark after this

https://radekdoulik.github.io/WasmPerformanceMeasurements/?startDate=2024-02-10T03%3A02%3A22.000Z&endDate=2024-02-10T14%3A07%3A27.000Z&tasks=%2CJson&flavors=0%2C1%2C4%2C5%2C6%2C7%2C8%2C9%2C10%2C11%2C12%2C13%2C14%2C15%2C2%2C3

This one's now meaningfully lower, so it looks like SSA also improved performance on wasm as we'd expect

@kg
Member

kg commented Feb 14, 2024

SequenceEqual looks like it got worse, though.

@kg
Member

kg commented Feb 19, 2024

SequenceEqual looks like it got worse, though.

SequenceEqual was a jiterp bug, which is now fixed. The fix looks like it made some other measurements slightly slower, likely because the traces were missing branch targets before.
