Investigate improving JIT heuristics with machine learning #92915
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. Issue details are reproduced below.
cc @dotnet/jit-contrib
Initial Notes on the Current JIT CSE Heuristic
Introduce two config vars for experimenting with CSEs:
* `JitCSEHash`: identifies a method for CSE experimentation
* `JitCSEMask`: specifies a bitmask of allowed CSEs (by "attempt")

When the hash is nonzero, any method whose hash matches will perform only the subset of CSEs specified by the mask (up to 32 CSEs).

Also introduce a config var to dump the total number of CSEs to either the assembly listing footer or the one-liner from the disassembly summary:
* `JitMetrics`

This can perhaps be generalized eventually to report more metrics, and perhaps to report them back to SPMI when it is the jit host.

Finally, note CSE lcl vars that represent hoisted trees and/or are "multiple-def" CSEs in the local var table.

Contributes to dotnet#92915.
Perf vs Perf Score

Here's a chart of about 3000 measurements across 50 or so benchmarks, varying the set of CSEs we do in the hottest method. Each data point is the ratio of the perf vs the default JIT perf, plotted against the ratio of the perf score vs the default JIT perf score. Thus values above 1.0 on the y-axis are benchmarks that ran more slowly, and values to the right of 1.0 on the x-axis are benchmarks that were predicted to run more slowly. We see most data in the "slower and predicted slower" (upper right) quadrant because CSEs generally improve perf, so doing fewer of them than we'd normally do will likely slow down the benchmark. The linear fit has a correlation coefficient of about 0.3, so there is a fair amount of unexplained variation in perf. Also, I have excluded one benchmark's data from this as it seems anomalous.

Ideally, we'd be able to get a better correlation out of perf scores, since these can be produced very quickly as we vary JIT behavior, on the order of 1000 experiments per second. If we actually run benchmarks, it takes around 30 seconds for one experiment, so there is a massive benefit to leveraging perf scores. The main contributors to poor perf predictability are thought to be (in no particular order):
So the task here is to dig into these issues and see if and how perf scores can be improved. Also, I need to look into why MaximizeSchwarzCriterion has anomalous results; including data from this benchmark drops R^2 to essentially zero.
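For concreteness, here is a minimal sketch (not the actual analysis script) of how each point on such a chart and the linear fit can be computed. The experiment values below are placeholders; real data would come from benchmark runs and the JIT's reported perf scores.

```python
# Sketch: relate perf ratios (vs default JIT) to perf-score ratios and fit a line.
import numpy as np

# Default JIT measurement for one benchmark (placeholder numbers).
default_perf_us, default_perf_score = 59_311.0, 649_125_874.85

# (perf_us, perf_score) for each CSE-mask experiment (placeholder numbers).
experiments = [
    (60_250.0, 655_000_000.0),
    (63_100.0, 700_500_000.0),
    (59_900.0, 650_200_000.0),
]

perf_ratio = np.array([p / default_perf_us for p, _ in experiments])    # y-axis
score_ratio = np.array([s / default_perf_score for _, s in experiments])  # x-axis

# Least-squares fit and correlation; R^2 ~ 0.3 in the data described above.
slope, intercept = np.polyfit(score_ratio, perf_ratio, 1)
r = np.corrcoef(score_ratio, perf_ratio)[0, 1]
print(f"perf_ratio ~= {slope:.3f} * score_ratio + {intercept:.3f}, R^2 = {r**2:.3f}")
```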
This is a great initiative. Just wondering, do other popular (JIT-based) programming languages (e.g., Java) have something like this in place? Would love to read about their outcomes. :)
I think I may have read some papers by the Graal folks where they leveraged ML, but I don't have the references handy. Presumably they'd be somewhere in the big list of papers I referred to above.
Have been digging into why the PerfScores have such poor correlation on MaximizeSchwarzCriterion and it seems like it is a combination of a few effects:
Thus the block weights (in particular the large weights for innermost loop blocks) can vary quite a bit from one run to the next, and that seems to be the major factor. You can see this by just running the benchmark multiple times with the same jit configuration:
In particular the first line of each set above shows two "identical" runs with default JIT behavior, the first took 59,311us with PerfScore 649,125,874.85 and the second 61,020us with PerfScore 100,478,799.14. So perf varied by about 1.03x and PerfScore about 6x. For the purposes of this investigation, we can just exclude methods that have OSR methods. This variation in weights is possibly less of a problem in practice though it may be a contributor to the kind of fluctuating performance we noted in #87324. We should think about better strategies to counteract it:
Isn't this already the case for R2R'd code?
Yes -- the idea there was to avoid needing to go from R2R (optimized) -> Tier0 (unoptimized) to gather PGO data. Here the idea would be to go from Tier0+instr -> (OSR) -> Tier1+instr -> Tier1, for methods that are called infrequently (but more than once) and contain hot loops. The second instrumented run is there to fill out the missing bits of profile in the parts of the method that currently only run under OSR.
Some more thoughts on applying ML to the CSE heuristic.

My current thinking is to approach this as a model-based reinforcement learning application. We think we know which CSEs are good ones (our initial model), we don't have any repository of known good results (aka labels), and we may incur significant costs in obtaining data. Our goal is to improve on or find a better model. For the initial cut we can just rely on perf scores (flawed as they may be), as the overall approach won't be much impacted; if and when we have it working well, we can think about mixing in actual measurements. Perf scores (if viable) offer a lot of advantages, including the ability to assess results on methods that are not prominent in any benchmarks.

As a prelude to all that, I want to elaborate the current model and figure out the right way to parameterize it. Currently CSE candidates are ranked by a sort function and then attempted in rank order. As CSEs are performed, the acceptance thresholds ratchet upwards, so there is an order dependence. We might hope that the reinforcement-learned model could be expressed as some numeric score, with higher values indicating higher preferences and lower values lower ones, and perhaps zero or negative values indicating CSEs that should not be done.

The ratcheting is an attempt to model register pressure. The thresholds are loosely set based upon properties of the weighted local var counts (here I am ignoring for now the "optimize for size" mode); e.g., the aggressive cutoff is the weight of the 13th highest ranked local var (pre-CSE). More broadly, one can imagine that the model would have access to general properties of the lcl var weight distribution, expressed as a sort of CDF; e.g., we would like to know roughly how many lcl vars we have to consider to reach, say, 90% of all (weighted) local var occurrences. Broadly speaking, a particular CSE will add a new local var whose weight we can compute (from the def and use weights) and remove weight from some of the existing locals. The current heuristic roughly estimates the first (via the ratchet) but does not take the second into account. Thus, for instance, if we CSEd a large computation tree to effect a hoist within a loop, we might actually expect the overall register pressure to decrease.

Thinking of this all as a sort of Markov Decision Process: as we do one CSE, the viability of the others will shift. If we want the model to handle this sort of order dependence, then after each CSE we need to re-evaluate the scores of the remaining candidates. I will note along these lines that the current CSE heuristic has some quirks that seem interesting. For example, in a simple case with just two candidates A & B, if we order them A, B we do both CSEs; if we order them B, A, we do just B, because B is call-crossing and has a higher ratchet and makes A look unattractive. This struck me as odd.

So, factors that might feed into the model:
Some preliminary subproblems to examine:
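To make the order-dependence quirk mentioned above concrete, here is a highly simplified, assumed model of the ratcheting behavior -- a sketch for intuition only, not the JIT's actual C++ heuristic; the weights, multipliers, and threshold are invented.

```python
# Sketch: candidates attempted in order; each accepted CSE "ratchets" the
# acceptance threshold upward (more so if it crosses a call), so attempting
# B before A can cause A to be rejected.
def run_cse_heuristic(candidates, initial_threshold):
    """candidates: list of (name, weight, crosses_call) in attempt order."""
    threshold = initial_threshold
    performed = []
    for name, weight, crosses_call in candidates:
        if weight < threshold:
            continue  # looks unattractive at the current threshold
        performed.append(name)
        # Crude stand-in for the register-pressure ratchet.
        threshold *= 2.5 if crosses_call else 1.2
    return performed

a = ("A", 100.0, False)
b = ("B", 150.0, True)   # call-crossing, so it ratchets the threshold harder
print(run_cse_heuristic([a, b], 50.0))  # ['A', 'B']
print(run_cse_heuristic([b, a], 50.0))  # ['B'] -- B's ratchet makes A unattractive
```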
Here is some data on the current heuristic, from the asp.net SPMI collection [full table here]
I was surprised to see such low odds for accepting candidates, on average only about 0.25. So there is certainly headroom to be more aggressive. For these methods with small numbers of candidates, it should be feasible to exhaustively enumerate all the possible sets of CSEs, look at the resulting spectrum of perf scores, and chart out how well the existing heuristic performs (as measured by perf score) vs the best score. For larger sets we can at least attempt to gather a decent sample of data. Working on that now.
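A sketch of that exhaustive enumeration, assuming a hypothetical `evaluate_perf_score(method, mask)` helper that replays the method over SPMI with the given CSE mask (in the spirit of `JitCSEHash`/`JitCSEMask` above) and returns the resulting perf score:

```python
# Sketch: compare the default heuristic's perf score against the best score
# obtainable by any subset of the CSE candidates.
def best_and_default_scores(method, num_candidates, evaluate_perf_score):
    default_score = evaluate_perf_score(method, None)          # default heuristic
    best_mask, best_score = 0, evaluate_perf_score(method, 0)  # start from "no CSEs"
    for mask in range(1, 1 << num_candidates):                 # every CSE subset
        score = evaluate_perf_score(method, mask)
        if score < best_score:                                 # lower score is better
            best_mask, best_score = mask, score
    return default_score, best_score, best_mask
```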
Here is some initial data on the performance of the current heuristic. Using the asp.net SPMI collection, I sorted 100 candidates from each of the categories above and ran the CSE subset experiments over them. The preliminary data looks like this:
So the rough trend is that as the number of candidates increases, the likelihood of the default heuristic getting the best possible perf score decreases, and in aggregate the default heuristic falls short of the best perf score by around 0.6% or so. This data took a few minutes to gather; I'll let it run on more cases overnight. I have some rough edges to polish here, and to explore some of the even larger candidate cases I'll need some kind of Monte Carlo strategy, as the exponential number of runs starts becoming a factor.
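For the larger candidate counts, a simple random-sampling sketch along the lines of the "Monte Carlo strategy" mentioned above (again using the hypothetical `evaluate_perf_score` helper; trial count and seed are arbitrary):

```python
# Sketch: when 2^N runs is infeasible, sample random CSE masks instead of
# enumerating them all, and track the best perf score seen.
import random

def sample_best_score(method, num_candidates, evaluate_perf_score,
                      trials=1000, seed=0):
    rng = random.Random(seed)
    best_mask, best_score = None, float("inf")
    for _ in range(trials):
        mask = rng.getrandbits(num_candidates)   # random subset of CSEs
        score = evaluate_perf_score(method, mask)
        if score < best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score
```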
Here is the "complete" set of results from asp.net, up through cases with 11 cse candidates. My current approach is going to take to long to get data for higher numbers of candidates, eg for an 11 candidate case we need 2^11 runs of SPMI over the method context varying the CSE mask, so the entire set of 526 case required over 1 million runs.
Next steps from here:
I figured out what was leading to (2) above -- I'd forgotten that the config settings get parsed as hex, so for the 4+ candidate cases the tool was not properly exploring the full set of subsets. With that fixed, I don't see any cases where the default heuristic can do better than a subset. Updated data shows that for the 4+ candidate cases the heuristic is now a bit worse (not surprising, as some of those missing subsets were going to be good ones) and the upside opportunity is a bit larger (also not surprising). Also added numbers on how many SPMI runs are required; note that as the candidate counts increase, the evaluation takes exponentially more runs, but the number of such cases falls off, so the overall growth is a bit milder...
Fixed some other issues; latest full run data
I have a version of MCMC running for larger instances, and have verified that it matches up pretty well to exhaustive enumeration in the medium-sized cases. The trend noted above continues:
Using the above I can now generate "perf-score" optimal results for cases with small numbers of CSE candidates, and can elaborate the entire value function. Here green shows the optimal policy, dog-eared shows the current policy, and square boxes are terminal states.

As noted in (4) above, there doesn't seem to be order-dependence in the values; if so, that greatly cuts down the size of the state space. If there is order dependence, then the state space grows much faster. So I plan to screen all this data to see if I ever see a case where the order of CSEs matters for the final perf score. If it never happens or is suitably rare, we'll just ignore it.

The data above can't directly be used to craft a policy, since the state is a specific method instance plus a sequence of CSEs. In reality the JIT won't have already seen a given method... so we need to generalize the state and try building a policy that way. My initial plan is to use a policy gradient method. We know some of the factors that should influence CSEs; these are already considered by the current policy, though perhaps not in optimal form:
So we can start with these. Since there is a large dynamic range for weighted costs, I am tempted to use a log to keep the data a bit more reasonable. We also need to assign some preference to stopping the process, so there is one extra action to model. The initial model will likely be linear, so the preference for an action (CSE) will be the dot product of an observation vector based on that CSE and the ambient state, and a parameter vector that is all numeric and provides weights for each observation. This can be generalized to include nonlinear functions of observations (or combinations of observations) if necessary. For the reward we can use absolute perf scores, but this doesn't generalize well across methods, so I am tempted to use the relative perf score over doing no CSEs (or it could also be the relative score compared to the best known score).
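A minimal sketch of that linear preference model: each candidate (plus a synthetic "stop" action) gets an observation vector, the preference is a dot product with a parameter vector, and a softmax turns preferences into action probabilities. The feature names, feature values, and parameters below are illustrative assumptions, not the JIT's actual feature set.

```python
# Sketch: linear softmax policy over CSE candidates plus a "stop" action.
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())       # subtract max for numerical stability
    return e / e.sum()

def choose_action(observations, params, rng, greedy=False):
    """observations: (num_actions, num_features); last row is the 'stop' action."""
    prefs = observations @ params          # preference = dot product per action
    probs = softmax(prefs)
    if greedy:
        return int(np.argmax(prefs)), probs
    return int(rng.choice(len(probs), p=probs)), probs

rng = np.random.default_rng(0)
# Illustrative features: [log weighted cost, crosses-call flag, stop bias]
observations = np.array([
    [np.log(400.0), 0.0, 0.0],   # candidate 0
    [np.log(50.0),  1.0, 0.0],   # candidate 1
    [0.0,           0.0, 1.0],   # "stop" action
])
params = np.array([0.5, -0.3, 0.1])  # placeholder parameter vector
action, probs = choose_action(observations, params, rng)
print(action, probs)
```

For the update step, the reward could then be the relative perf score versus doing no CSEs (or versus the best known score), as suggested above.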
Adds special CSE heuristic modes to the JIT to support learning a good CSE heuristic via Policy Gradient, a form of reinforcement learning. The learning must be orchestrated by an external process, but the JIT does all of the actual gradient computations. The orchestration program will be added to jitutils. The overall process also relies on SPMI, and the goal is to minimize perf score.

Introduce two new CSE heuristic policies:
* Replay: simply perform the indicated sequence of CSEs
* RL: used for the Policy Gradient, with 3 modes:
  * Stochastic: based on current parameters, but allows random variation
  * Greedy: based on current parameters, deterministic
  * Update: compute updated parameters per Policy Gradient

Also rework the Random policy to be a bit more random; it now alters both the CSEs performed and the order they are performed in.

Add the ability to have jit config options that specify sequences of ints or doubles. Add the ability to just dump metric info for a jitted method, and add more details (perhaps considerably more) for CSEs. This is all still simple text format.

Also factor out a common check for "non-viable" candidates -- these are CSE candidates that won't actually be CSEs. This leads to some minor diffs as the check is now slightly different for CSEs with zero uses and/or zero weighted uses.

Contributes to dotnet#92915.
For reasons I find hard to justify, I decided the top number in each state is the min perf score and the bottom is the average. As the heuristic evolves, one hopes the average approaches the min, as the fumbling initial steps are more or less forgotten (image below is after 100 runs). Double-circle indicates what the current JIT CSE heuristic does.
Some initial attempts to improve the modelling of "stopping early" when doing CSEs. This adds a simplistic register pressure estimate modelled on the one we have now, where we find the weight of the Nth tracked local and use that as a reference weight for deciding when a CSE might start becoming costly. Contributes to dotnet#92915.
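A sketch of the kind of register-pressure proxy that commit describes, consistent with the earlier note that the aggressive cutoff is the weight of the 13th highest-ranked local; the choice of N and the weights here are illustrative.

```python
# Sketch: use the weight of the Nth highest-weighted tracked local as a
# reference weight for when an extra CSE temp starts to look costly.
def pressure_reference_weight(local_weights, n=13):
    ordered = sorted(local_weights, reverse=True)
    if len(ordered) < n:
        return 0.0            # fewer than N tracked locals: little pressure
    return ordered[n - 1]     # weight of the Nth highest-weighted local
```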
At this point I'm mainly watching how the PG evolves and looking at methods that it doesn't handle well, trying to understand what about the candidates in those cases is different/unusual, to see if it's a mistaken feature value, a missing feature value, or whatnot.

An alternative idea is to use MCMC (or whatever) to broadly classify candidates as good or bad (since for the most part order does not seem to matter, so goodness/badness is not contextual, and features don't evolve much over the course of an episode): good if they are part of the best CSE sequence, and bad if not. Then grab all these candidate features and compute some sort of nearness/clustering analysis in the feature space (or build a classifier), to see if we can distinguish good from bad using the features; if we can't, then how can we expect a policy to distinguish them? If we find a pair of candidates with similar features where one is good and the other is bad, we can study those cases to figure out what is going on.
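One simple way to surface such pairs, sketched below under the assumption that features and good/bad labels come from the MCMC/enumeration runs: flag candidates whose nearest neighbor in feature space has the opposite label.

```python
# Sketch: find candidates whose nearest neighbor (by feature distance) has the
# opposite good/bad label -- a rough signal the features can't distinguish them.
import numpy as np

def nearest_neighbor_confusions(features, labels):
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    confusions = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                     # ignore self
        j = int(np.argmin(dists))
        if y[i] != y[j]:
            confusions.append((i, j))         # similar features, different label
    return confusions
```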
Also intrigued by the ideas in Simple random search provides a competitive approach to reinforcement learning.
Also, for features: once we have decent/stable results we can try dropping features from the set to see how much the lack of a feature impairs overall results. Might also search for correlated features and try to decorrelate them (e.g., Gram-Schmidt orthogonalization or similar). Say the value of feature A predicts the value of feature B pretty well; then it might be worth subtracting off the predicted B (via A) from the actual B, so the reported features are A and B-(A's prediction of B). The two ideas above are similar: if A and B are correlated, then having them both provides little additional benefit over having just one of them.
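A small sketch of that residualization idea (a least-squares fit of B on A, then reporting the part of B that A does not explain):

```python
# Sketch: replace feature B with the residual of B after removing A's
# linear prediction of B, so the pair (A, residual) is decorrelated.
import numpy as np

def residualize(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    slope, intercept = np.polyfit(a, b, 1)     # fit B as a linear function of A
    return b - (slope * a + intercept)         # part of B not explained by A
```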
Based on analysis of cases where the machine learning is struggling, add some more observations and tweak some of the existing ones:
* where we use `log` for dynamic compression, bias results so they are always non-negative
* only consider integral vars for the pressure estimate
* note if a CSE has a call
* note weighted tree costs
* note weighted local occurrences (approx pressure relief)
* note spread of occurrences (as a fraction of BBs)
* note if a CSE is something that can be contained (guess)
* note if a CSE is cheap (cost 2 or 3) and is something that can be contained
* note if a CSE might be "live across" a call in the LSRA block ordering

The block spread and LSRA live-across observations use RPO artifacts that may no longer be up to date. It's not clear this matters, as LSRA does not use RPO for block ordering.

Contributes to #92915.
Have been looking into improving efficiency. The PolicyGradient has two phases: a one-off phase that runs method by method, and a batch mode.
So the method-by-method mode has ~6x overhead... I have tried building all the stuff needed into release; it is (perhaps) nearly there, but that won't explain the above, as both versions are running checked JITs. The batch mode is what I'm used to on this box, about 1K methods/sec.
Revise things a bit so that we can run the greedy ML heuristic in release mode, with a built-in set of "good" parameters. These parameters come from a policy gradient run over a 200 method training set, on the asp.net windows x64 collection. Contributes to dotnet#92915.
Add some minimal CI testing for this new CSE heuristic. Remove testing of cross-block assertion prop, since it's now on by default. Contributes to dotnet#92915.
Here's a trial that enables the new heuristic: #98776. I have added some notes there on the behaviors.
Add a new `--optimizeSize` option where PolicyGradient tries to reduce code size instead of perf score. Since there are now two possible optimization objectives, we have the potential for trading off one for the other. Update MCMC to track the "pareto frontier" for the methods it explores and save the data for visualization. Contributes to dotnet/runtime#92915.
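A sketch of what tracking such a Pareto frontier looks like with two lower-is-better objectives (code size and perf score); the representation is an assumption, not the tool's actual data structure.

```python
# Sketch: keep only the (code_size, perf_score) points not dominated by
# another point, i.e. the Pareto frontier of the explored experiments.
def pareto_frontier(points):
    """points: list of (code_size, perf_score) tuples, both lower-is-better."""
    frontier = []
    for p in points:
        # p is dominated if some other point is no worse in both objectives.
        if any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points):
            continue
        frontier.append(p)
    return sorted(frontier)
```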
Profiling showed that `GetFeatures` was a major factor in throughput. For the most part the features of CSE candidates don't change as we perform CSEs, so build in some logic to avoid recomputing the feature set unless there is some evidence features have changed. To avoid having to remove already performed candidates from the candidate vector, we now tag them as `m_performed`; these get ignored during subsequent processing, and discarded if we ever recompute features. This should cut the TP impact roughly in half; the remaining part seems to largely be from doing more CSEs (which we hope will show some perf benefit). Contributes to dotnet#92915.
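A rough sketch of that caching scheme (Python standing in for the JIT's C++; class and method names are invented for illustration): features are recomputed only when they may have changed, and performed candidates are tagged rather than removed.

```python
# Sketch: lazily recompute candidate features; tag performed candidates
# instead of erasing them from the candidate list.
class Candidate:
    def __init__(self, name):
        self.name = name
        self.performed = False     # analogous to the m_performed tag
        self.features = None

class CandidateSet:
    def __init__(self, candidates, compute_features):
        self.candidates = candidates
        self.compute_features = compute_features
        self.features_valid = False

    def mark_changed(self):
        self.features_valid = False   # evidence that features may be stale

    def get_features(self):
        if not self.features_valid:
            # Drop performed candidates only when we recompute anyway.
            self.candidates = [c for c in self.candidates if not c.performed]
            for c in self.candidates:
                c.features = self.compute_features(c)
            self.features_valid = True
        # Performed candidates are ignored during subsequent processing.
        return [(c.name, c.features) for c in self.candidates if not c.performed]
```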
Closing, as the planned activities for .NET 9 are done.
Overview
There is now fairly substantial literature on improving compiler optimization by leveraging machine learning (ML). A comprehensive survey compiled by Zheng Wang can be found here: https://github.com/zwang4/awesome-machine-learning-in-compilers.
Here's a closely relevant paper and GitHub repo: MLGO: a Machine Learning Guided Compiler Optimizations Framework. Like our efforts below, it leverages a Policy Gradient algorithm (reinforcement learning).
Several years ago we attempted to leverage ML to create better inline heuristics. That experiment was largely a failure, at least as far as using ML to predict if an inline would improve performance. But one of the key speculations from that work was that the lack of PGO was to blame. Now that we have PGO, it is a good time to revisit this area to see if PGO was indeed the key missing ingredient.
Proposed Work
During the .NET 9 product cycle, we plan to investigate applying ML techniques to heuristics used by the JIT. The items below represent the initial steps of this investigation. This list will change and grow as the work progresses.
We also want to tackle a relatively simple problem, at least initially. Thus our initial effort will be to try and refine and improve the heuristic the JIT uses for Common Subexpression Elimination (aka CSE):
Also worth noting is Lee Culver's work to adapt the JIT CSE problem into a more standard gymnasium setting: JIT CSE Optimization - Add a gymnasium environment for reinforcement learning #101856
Update May 2024.
Given the work above we have been able to produce heuristics that can improve the aggregate perf score for methods via CSE, with about a 0.4% geomean improvement across all methods with CSE candidates. So far those results haven't translated into widespread improvements in our perf lab data. Exactly why is unclear ... data below would suggest one or more of the following:
The training evaluations show that the best ML heuristic is obtaining about half of what could be gotten from an optimal heuristic. Some recent results:
We don't have any more active work planned on machine learning of heuristics for .NET 9. However, we expect to revisit this area as .NET 9 work winds down, so stay tuned for further development in the coming months.