Investigate improving JIT heuristics with machine learning #92915

Closed
3 of 8 tasks
AndyAyersMS opened this issue Oct 2, 2023 · 25 comments

@AndyAyersMS
Member

AndyAyersMS commented Oct 2, 2023

Overview

There is now fairly substantial literature on improving compiler optimization by leveraging machine learning (ML). A comprehensive survey compiled by Zheng Wang can be found here: https://github.com/zwang4/awesome-machine-learning-in-compilers.

Here's a closely related paper and GitHub repo: MLGO: a Machine Learning Guided Compiler Optimizations Framework. Like our efforts below, it leverages a Policy Gradient algorithm (reinforcement learning).

Several years ago we attempted to leverage ML to create better inline heuristics. That experiment was largely a failure, at least as far as using ML to predict whether an inline would improve performance. But one of the key speculations from that work was that the lack of PGO was to blame. Now that we have PGO, it is a good time to revisit this area and see whether PGO was indeed the key missing ingredient.

Proposed Work

During the .NET 9 product cycle, we plan to investigate applying ML techniques to heuristics used by the JIT. The items below represent the initial steps of this investigation. This list will change and grow as the work progresses.

We also want to tackle a relatively simple problem, at least initially. Thus our initial effort will be to refine and improve the heuristic the JIT uses for Common Subexpression Elimination (aka CSE):


Update May 2024.

Given the work above, we have been able to produce heuristics that improve the aggregate perf score for methods via CSE, with about a 0.4% geomean improvement across all methods with CSE candidates. So far those results haven't translated into widespread improvements in our perf lab data. Exactly why is unclear; the data below suggests one or more of the following:

  • poor correlation of perf score with actual perf
  • few benchmarks whose scores critically depend on CSEs, or
  • those benchmarks whose perf does critically depend on CSEs are all easy/obvious cases
  • difficulty extracting small signal from noise (still lots of noisy benchmark results)

The training evaluations show that the best ML heuristic obtains about half of the improvement available from an optimal heuristic. Some recent results:

Best/base: 0.9913 [optimizing for  score]
vs Base    0.9957 Better 440 Same 1476 Worse 84
vs Best    1.0044 Better 0 Same 1607 Worse 393

Params     0.6093, 0.7750,-0.5701,-0.6827, 0.5060,-0.0514,-1.3563, 0.4515,-2.0999, 0.0000,-1.0623,-1.3723, 0.0000,-0.6143,-0.8286,-0.2956,-1.1433, 0.3418, 1.3964,-0.0043,-0.4237, 0.6097,-1.9773,-0.3684, 0.7246

Collecting greedy policy data via SPMI... done (27583 ms)
Greedy/Base (score): 34965 methods, 8375 better, 24792 same, 1797 worse,  0.9960 geomean
Best:   76679 @  0.7731
Worst:  82000 @  1.4512

Greedy/Base (size): 34965 methods, 4628 better, 24694 same, 5642 worse,  1.0005 geomean
Best:   15489 @  0.7000
Worst:  82000 @  1.4205

 
We don't have any further active work planned on machine learning of heuristics for .NET 9. However, we expect to revisit this area as .NET 9 work winds down, so stay tuned for further developments in the coming months.

@AndyAyersMS AndyAyersMS added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 2, 2023
@AndyAyersMS AndyAyersMS added this to the 9.0.0 milestone Oct 2, 2023
@AndyAyersMS AndyAyersMS self-assigned this Oct 2, 2023
@ghost

ghost commented Oct 2, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Overview

There is now fairly substantial literature on improving compiler optimization by leveraging machine learning (ML). A comprehensive survey compiled by Zheng Wang can be found here: https://github.com/zwang4/awesome-machine-learning-in-compilers.

Several years ago we attempted to leverage ML to create better inline heuristics. That experiment was largely a failure, at least as far as using ML to predict whether an inline would improve performance. But one of the key speculations from that work was that the lack of PGO was to blame. Now that we have PGO, it is a good time to revisit this area and see whether PGO was indeed the key missing ingredient.

Proposed Work

During the .NET 9 product cycle, we plan to investigate applying ML techniques to heuristics used by the JIT. The items below represent the initial steps of this investigation. This list will change and grow as the work progresses.

We also want to tackle a relatively simple problem, at least initially. Thus our initial effort will be to refine and improve the heuristic the JIT uses for Common Subexpression Elimination (aka CSE):

  • Study the current CSE heuristic and try to understand what it is modelling. Identify apparent shortcomings and limitations.
  • Develop tooling for varying which CSEs the JIT can perform, either in concert with the current heuristic or independent of it. Measure the impact of this variation on key JIT metrics (performance, code size, JIT time, etc.).
  • See how well PerfScores can be used as a proxy for measuring actual performance. Ideally, we can leverage PerfScores to avoid needing to run benchmarks repeatedly. But if not, perhaps we can still rely on PerfScores to limit or prioritize the runs needed.
  • Try a number of different ML techniques for improving CSE heuristics, both "small step" (evaluating one CSE at a time) and "big step" (evaluating an entire set of CSEs). Identify the key predictive observations. Try both interpretable and black-box models.
  • See how feasible it is to have a single architecture-agnostic heuristic (perhaps parameterized like the current heuristic with register info) or whether different heuristics are warranted for, say, arm64 and x64.
  • Develop a common ML "infrastructure" that we can apply to other similar heuristics in the JIT.

cc @dotnet/jit-contrib

Author: AndyAyersMS
Assignees: AndyAyersMS
Labels:

area-CodeGen-coreclr

Milestone: 9.0.0

@AndyAyersMS
Member Author

AndyAyersMS commented Oct 2, 2023

Initial Notes on the Current JIT CSE Heuristic

  1. Does the initial frame size computation hold up well? Can we justify the 3/2/1 formula in CSEHeuristic::Initialize?
    a. Disable CSE, see how well the number of tracked locals in registers matches the prediction. If not, see if there is some way to make a better prediction.
    b. Then slowly enable CSE, and see if this continues to hold up well.
    c. See how well frame size matches estimates; if not good, see if we can improve this somehow

  2. Do the aggressive/moderate thresholds make sense? That is, should they be higher or lower? Based on something else?
    a. trickier to see how to model this; perhaps keep processing CSEs and see when spilling starts?
    (that is, do a local sensitivity analysis) -- might be counterbalanced by the promotion code.
    b. how about the overall approach of setting these thresholds based on IR observations? Seems sensible but maybe there are other ideas to try?
    c. at the end of CSEHeuristic::Initialize we increase the thresholds if they seem too low. Why?
    d. also, doesn't that increase mix up weighted and unweighted cases? How does it make sense to compare unweighted ref counts with BB_UNITY_WEIGHT?
    e. should we try some “true” register pressure estimate, or is that hopeless? We also don’t know how the pressure points intersect with potential CSE temp liveness.

  3. Is the assumption here that FP CSEs are rare actually true?

  4. How reliable are CostSz and CostEx?
    a. probably a can of worms... how would we even go about validating these?

  5. Should candidates be ranked by local benefit, by global benefit, or by some ratio of benefit to cost? (A rough sketch of the ranking idea from (a) appears after this list.)
    a. That is, why doesn’t Compiler::optCSEcostCmpEx consider csdUseWtCnt more prominently? The right figure of merit (when optimizing for speed) seems to be csdUseWtCnt * tree->CostEx(),
    so a relatively “cheap” CSE that is heavily used is preferable to a “costly” CSE that is not as heavily used.
    b. Should this factor in code size even in speed modes? Assuming CSEs generally decrease size a bit, we might want to bias the above slightly towards the heavier ones.
    c. Do the threshold increases in PerformCSE make sense? This gives a budget-like aspect to the ranking, so the order does matter. What if we ignored this?

  6. Are we missing candidates?
    a. review bail outs in Compiler::optIsCSEcandidate.
    b. review limitations in Compiler::optValnumCSE_Locate

  7. In ConsiderCandidates is the computation sensible?
    We have 3 models x 2 modes ({aggressive/moderate/conservative} x {size, speed}).
    Plus some stress modes (including random...)
    a. seems like we mix weighted/unweighted costs up here at times, e.g. extra_no_cost is unweighted (size) based, but we add it to no_cse_cost, which can be weighted (speed) based.
    b. various magic numbers and twisty decision logic
    c. various per-arch costs, some of them look dated

  8. Is the "shared constant" CSE code sensible? Here we disregard the low bits of constants and at use
    points we rematerialize the missing bits via an add.
    a. So the use cost is a bit higher... but this does not seem to be modelled by the heuristics?
    b. Is the "best base" value computation reasonable? Seems like we're trying to use small immediates here?
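Item 5a above suggests ranking candidates by weighted benefit. A minimal sketch of that idea follows; it is not the JIT's actual Compiler::optCSEcostCmpEx comparator, and the field names are hypothetical analogues of csdUseWtCnt and CostEx:

```python
# Sketch: order CSE candidates by weighted benefit (use weight x tree cost),
# optionally folding in a small code-size bias. Purely illustrative.
from dataclasses import dataclass

@dataclass
class CseCandidate:
    name: str
    use_wt_cnt: float   # weighted use count (csdUseWtCnt analogue)
    cost_ex: int        # estimated execution cost of the tree (CostEx analogue)
    size_delta: int     # estimated code-size change if CSE'd (negative = smaller)

def speed_score(c: CseCandidate, size_bias: float = 0.0) -> float:
    # Higher is better: heavily used, expensive trees rank first.
    return c.use_wt_cnt * c.cost_ex - size_bias * c.size_delta

candidates = [
    CseCandidate("cheap-but-hot", use_wt_cnt=900.0, cost_ex=3, size_delta=-4),
    CseCandidate("costly-but-cold", use_wt_cnt=10.0, cost_ex=20, size_delta=-8),
]
ranked = sorted(candidates, key=speed_score, reverse=True)
print([c.name for c in ranked])   # the cheap-but-heavily-used candidate ranks first
```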

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Oct 3, 2023
Introduce two config vars for experimenting with CSEs:
* `JitCSEHash`: identifies a method for CSE experimentation
* `JitCSEMask`: specifies a bitmask of allowed CSEs (by "attempt")

When the hash is nonzero, any method whose hash matches will perform
only the subset of CSEs specified by the mask (up to 32 CSEs).

Also introduce a config var to dump the total number of CSEs to either
the assembly listing footer or the one-liner from the disassembly summary.
* `JitMetrics`

This can perhaps be generalized eventually to report more metrics
and perhaps to report them back to SPMI when it is the jit host.

Finally, note CSE lcl vars that represent hoisted trees and/or are
"multiple-def" CSEs in the local var table.

Contributes to dotnet#92915.
@AndyAyersMS
Member Author

AndyAyersMS commented Oct 3, 2023

Perf vs Perf Score

Here's a chart of about 3000 measurements across 50 or so benchmarks, varying the set of CSEs we do in the hottest method. Each data point plots the ratio of measured perf to default-JIT perf (y-axis) against the ratio of perf score to default-JIT perf score (x-axis). Thus values above 1.0 on the y-axis are benchmarks that ran more slowly, and values to the right of 1.0 on the x-axis are benchmarks that were predicted to run more slowly.

We see most data in the "slower and predicted slower" (upper right) quadrant because CSEs generally improve perf, so doing fewer of them than we'd normally do will likely slow down the benchmark.

A linear fit has a correlation coefficient of about 0.3, so there is a fair amount of unexplained variation in perf. I have also excluded one benchmark's data from this, as it seems anomalous.

[image: scatter of perf ratio vs. perf score ratio across the CSE subset experiments]

Ideally, we'd be able to get a better correlation out of perf scores, since they can be produced very quickly as we vary JIT behavior, on the order of 1000 experiments per second. Actually running benchmarks takes around 30 seconds per experiment, so there is a massive benefit to leveraging perf scores.
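
The correlation figure quoted above can be computed directly from the per-experiment ratios. A small sketch, using made-up sample points rather than the measured data (the ~0.3 figure refers to the real measurements):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# perf_ratio[i]  = measured time / default time      (y-axis above)
# score_ratio[i] = perf score    / default perf score (x-axis above)
perf_ratio  = [1.02, 1.10, 0.99, 1.25, 1.05]   # made-up sample points
score_ratio = [1.01, 1.15, 1.00, 1.10, 1.20]
print(f"R = {pearson_r(score_ratio, perf_ratio):.2f}")
```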

The main contributors to poor perf predictability are thought to be (in no particular order):

  • errors in the perf score per-block computations: wrong instruction latencies, lack of dependence modelling, other microarchitectural effects. Some of this we may be able to fix by invoking tools like llvm-mca to either give us accurate data or help us spot errors in our modelling. That in turn requires making the JIT disasm readily parseable by standard tools.
  • errors in block weights. Even though we have PGO data we may no longer have accurate block weights at the end of compilation. This is something we can improve, though fundamentally the JIT will be unable to correctly anticipate how weights may change because of optimizations, so some amount of error here is inevitable.
  • "noisy" benchmarks. We know many benchmarks show considerable within-run or cross-run variation. Some of these sources of noise include: intel JCC errata, alignment sensitivity of data, other subtle microarchitectural effects (say branch collision)
  • errors in PGO data. Note there is some deliberate randomness and some inherent randomness in PGO data. There may also be errors in the instrumentation, collection, or reconstruction processes. Also included here are potential phased behaviors by benchmarks (where PGO data does not reflect the actual benchmark block weights because the data was measured too early)

So the task here is to dig into these issues and see if and how perf scores can be improved.

Also I need to look into why MaximizeSchwarzCriterion has anomalous results; including data from this benchmark drops R^2 to essentially zero.

[image: the same scatter with the MaximizeSchwarzCriterion data included]

@ShreyasJejurkar
Contributor

This is a great initiative, just wondering do other popular (JIT-based) programming languages (e.g. - Java) have something like this in place? Would love to read their outcome. :)

@AndyAyersMS
Member Author

This is a great initiative, just wondering do other popular (JIT-based) programming languages (e.g. - Java) have something like this in place? Would love to read their outcome. :)

I think I may have read some papers by the Graal folks where they leveraged ML, but I don't have the references handy. Presumably they'd be somewhere in the big list of papers I referred to above.

AndyAyersMS added a commit that referenced this issue Oct 6, 2023
Introduce two config vars for experimenting with CSEs:
* `JitCSEHash`: identifies a method for CSE experimentation
* `JitCSEMask`: specifies a bitmask of allowed CSEs (by "attempt")

When the hash is nonzero, any method whose hash matches will perform
only the subset of CSEs specified by the mask (up to 32 CSEs).

Also introduce a config var to dump the total number of CSEs to either
the assembly listing footer or the one-liner from the disassembly summary.
* `JitMetrics`

This can perhaps be generalized eventually to report more metrics
and perhaps to report them back to SPMI when it is the jit host.

Finally, note CSE lcl vars that represent hoisted trees and/or are
"multiple-def" CSEs in the local var table.

Contributes to #92915.
@AndyAyersMS
Member Author

Have been digging into why the PerfScores have such poor correlation on MaximizeSchwarzCriterion, and it seems to be a combination of a few effects:

  • Scalable profiles are notably approximate and will vary from run to run
  • The exact timing of tiering up can vary, so the profile data will also vary from run to run
  • Block weight reconstruction is ill-conditioned (small input changes can lead to large output changes)
  • Methods that rely on OSR can end up with partial profiles. In particular for Tier0-instr methods that are called and never return we often do not have profile weight for the entry block since that is deduced from the returns.

Thus the block weights (in particular the large weights for innermost loop blocks) can vary quite a bit from one run to the next, and that seems to be the major factor. You can see this by just running the benchmark multiple times with the same jit configuration:

Benchmark,Index,Mask,NumCse,CodeSize,PerfScore,PerfScoreRatio,Perf,PerfRatio
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,0,ffffffff,17,1643,649125874.85,1.000,59311.6714,1.000
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,1,00000000,0,1533,80505239.17,0.124,61848.4732,1.043
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,2,00000001,1,1532,421090857.67,0.649,65608.5883,1.106
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,3,00000002,1,1528,1307172767.70,2.014,59236.0308,0.999
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,4,00000004,1,1584,108418141.52,0.167,65119.4767,1.098
Benchmark,Index,Mask,NumCse,CodeSize,PerfScore,PerfScoreRatio,Perf,PerfRatio
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,0,ffffffff,17,1643,100478799.14,1.000,61020.7827,1.000
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,1,00000000,0,1533,69888793.36,0.696,63267.6950,1.037
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,2,00000001,1,1530,69621450.93,0.693,66241.6050,1.086
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,3,00000002,1,1528,108183513.60,1.077,58789.5833,0.963
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,4,00000004,1,1582,70447641.30,0.701,64906.9033,1.064

In particular, the first line of each set above shows two "identical" runs with default JIT behavior: the first took 59,311us with PerfScore 649,125,874.85, and the second 61,020us with PerfScore 100,478,799.14. So perf varied by about 1.03x while PerfScore varied by about 6x.
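
A quick way to quantify this run-to-run spread is to group the CSV rows above by (Benchmark, Mask) and compare PerfScore and Perf across runs. A sketch, assuming the dump is saved to a file (the filename is hypothetical):

```python
import csv
from collections import defaultdict

def ratio_spread(path):
    """Max/min spread of PerfScore and Perf for each (Benchmark, Mask) group."""
    groups = defaultdict(lambda: ([], []))          # key -> (perf_scores, perf_times)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["PerfScore"] == "PerfScore":     # the dump repeats its header between runs
                continue
            key = (row["Benchmark"], row["Mask"])
            groups[key][0].append(float(row["PerfScore"]))
            groups[key][1].append(float(row["Perf"]))
    for key, (scores, times) in groups.items():
        if len(scores) > 1:
            print(key,
                  f"PerfScore spread {max(scores) / min(scores):.2f}x,",
                  f"Perf spread {max(times) / min(times):.2f}x")

# ratio_spread("cse_runs.csv")   # hypothetical file containing the rows above
```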

For the purposes of this investigation, we can just exclude methods that rely on OSR.

This variation in weights is possibly less of a problem in practice, though it may be a contributor to the kind of fluctuating performance we noted in #87324. We should think about better strategies to counteract it:

  • more aggressive blending? We currently use 0.99 for blend factor, perhaps something more like 0.90?
  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?
  • adding extra probes based on patchpoint location that can recover missing counts. The interesting case is the one where the patchpoint helper transitions to the OSR method—the helper would need to increment the counter. And since this probe would not normally appear in the schema we'd have to figure out how to fit it in without disrupting anything else. We do something like this for throws already, but those are easy to anticipate.

@MichalPetryka
Contributor

  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?

Isn't this already the case for R2Red code?

@AndyAyersMS
Member Author

  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?

Isn't this already the case for R2Red code?

Yes -- the idea there was to avoid needing to go from R2R (optimized) -> Tier0 (unoptimized) to gather PGO data. Here the idea would be to go from Tier0+instr -> (OSR) -> Tier1+instr -> Tier1, for methods that are called infrequently (but more than once) and contain hot loops. The second instrumented run is there to fill out the missing bits of profile in the parts of the method that currently only run under OSR.

@JulieLeeMSFT JulieLeeMSFT added the User Story A single user-facing feature. Can be grouped under an epic. label Oct 6, 2023
@AndyAyersMS
Member Author

Some more thoughts on applying ML to the CSE heuristic.

My current thinking is to approach this as a model-based reinforcement learning application. We think we know which CSEs are good ones (our initial model), we don't have any repository of known-good results (aka labels), and we may incur significant costs in obtaining data. Our goal is to improve on or find a better model.

For the initial cut we can just rely on perf scores (flawed as they may be), as the overall approach won't be much affected; if and when we have it working well, we can think about mixing in actual measurements. Perf scores (if viable) offer a lot of advantages, including the ability to assess results on methods that are not prominent in any benchmarks.

As a prelude to all that, I want to elaborate the current model and figure out the right way to parameterize it. Currently CSE candidates are ranked by a sort function and then attempted in rank order. As CSEs are performed, the acceptance thresholds ratchet upwards, so there is an order dependence. We might hope that the reinforcement-learned model could be expressed as a numeric score, with higher values indicating stronger preferences and zero or negative values indicating CSEs that should not be done.

The ratcheting is an attempt to model register pressure. The thresholds are loosely set based upon properties of the weighted local var counts (here I am ignoring the "optimize for size" mode for now); e.g. the aggressive cutoff is the weight of the 13th highest ranked local var (pre-CSE). More broadly, one can imagine that the model would have access to general properties of the lcl var weight distribution, expressed as a sort of CDF; e.g. we would like to know roughly how many lcl vars we have to consider to reach, say, 90% of all (weighted) local var occurrences.

Broadly speaking, a particular CSE will add a new local var whose weight we can compute (from the def and use weights) and remove weight from some of the existing locals. The current heuristic roughly estimates the first effect (via the ratchet) but does not take the second into account. Thus, for instance, if we CSEd a large computation tree to effect a hoist within a loop, we might actually expect the overall register pressure to decrease. Thinking of this all as a sort of Markov Decision Process: as we do one CSE, the viability of the others will shift.

If we want the model to handle this sort of order dependence, then after each CSE we need to re-evaluate the scores of the remaining candidates.
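
A schematic of that sequential process, with the scoring function standing in for whatever model we end up learning (nothing here reflects the JIT's actual data structures):

```python
def greedy_cse(candidates, score, apply_cse):
    """Repeatedly pick the highest-scoring viable candidate until none score > 0.

    candidates: mutable list of candidate descriptions (hypothetical objects).
    score(c, done): numeric preference for candidate c given CSEs already done;
                    <= 0 means "don't do this CSE".
    apply_cse(c):   performs the CSE, implicitly changing the method state.
    """
    done = []
    while candidates:
        scored = [(score(c, done), c) for c in candidates]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= 0:
            break                      # stopping action: nothing remaining looks profitable
        apply_cse(best)
        done.append(best)
        candidates.remove(best)
        # Features of the remaining candidates may have shifted (e.g. register
        # pressure), so the next iteration re-scores them from scratch.
    return done
```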

I will note along these lines that the current CSE heuristic has some quirks that seem interesting. For example, in a simple case with just two candidates A & B, if we order them A, B we do both CSEs; if we order them B, A, we do just B, because B is call-crossing and has a higher ratchet, which makes A look unattractive. This struck me as odd.

So, factors that might feed into the model:

  • current distribution of local var weights
    • perhaps per register class (?) right now we just look at int regs
    • with callee saves and call crossing CSEs taken into account
  • impact on this distribution for each candidate
  • so overall "score" includes benefit from the CSE plus anticipated extra costs in prolog & for spills
  • some kind of code complexity metric (maybe: how likely is the distribution going to lead to spills)
  • whether this CSE is GTF_MAKE_CSE (presumably those do better)
  • current cost data (weighted and unweighted costs, number of defs & uses, etc)

Some preliminary sub problems to examine:

  • what do these weighted local distributions look like? Can we model them simply somehow?
  • do we need a pressure heuristic?
  • do we see cases now where even with the current pressure heuristic we CSE too much?
    • modify existing policy to keep going, and look for cases where perf score increases
  • how good is the current heuristic
    • CSE randomly and see how well the current model fares overall versus the empirical distribution
    • here the figure of merit might be perf score / no CSE perf score

@AndyAyersMS
Member Author

AndyAyersMS commented Dec 19, 2023

Here is some data on the current heuristic, from the asp.net SPMI collection [full table here]

115838 methods, 34802 methods with cses; 193807 cse candidates, 56973 cses
NumCandiates [count, odds]: ... details ...
 1 [n: 10259 o: 0.53]:  0: 4836;  1: 5423
 2 [n:  4913 o: 0.44]:  0: 1760;  1: 2028;  2: 1125
 3 [n:  4058 o: 0.36]:  0: 1351;  1: 1329;  2: 1031;  3:  347
 4 [n:  3454 o: 0.37]:  0:  788;  1: 1100;  2:  860;  3:  570;  4:  136
 5 [n:  2576 o: 0.35]:  0:  586;  1:  497;  2:  792;  3:  470;  4:  175;  5:   56
 6 [n:  1627 o: 0.31]:  0:  264;  1:  452;  2:  474;  3:  182;  4:  227;  5:   21;  6:    7
 ...

I was surprised to see such low odds for accepting candidates, on average only about 0.25. So there is certainly headroom to be more aggressive.

For these methods with small numbers of candidates it should be feasible to exhaustively enumerate all the possible sets of CSEs and look at the resulting spectrum of perf scores, and chart out how well the existing heuristic performs (as measured by perf score) vs the best score. For larger sets we can at least attempt to gather a decent sample of data.

Working on that now.

@AndyAyersMS
Member Author

AndyAyersMS commented Dec 20, 2023

Here is some initial data on the performance of the current heuristic.

Using the asp.net SPMI collection I selected 100 cases from each of the categories above, and then ran $2^n + 1$ experiments per case: the baseline run plus one jit for each possible subset of CSEs (say there were 5 CSE candidates; that's 33 runs). I then looked at how often the default heuristic got the best possible perf score, and at the geomean ratio of the default perf score to the best across those 100 runs.

The preliminary data looks like this:

  --- 1 candidate cases: default heuristic was optimal in 90 of 100 cases; geomean 1.003
  --- 2 candidate cases: default heuristic was optimal in 81 of 100 cases; geomean 1.006
  --- 3 candidate cases: default heuristic was optimal in 65 of 100 cases; geomean 1.008
  --- 4 candidate cases: default heuristic was optimal in 62 of 100 cases; geomean 1.006
  --- 5 candidate cases: default heuristic was optimal in 50 of 100 cases; geomean 1.012
  --- 6 candidate cases: default heuristic was optimal in 49 of 100 cases; geomean 1.006

So the rough trend is that as the number of candidates increases, the likelihood of the default heuristic getting the best possible perf score decreases, and in aggregate the default heuristic falls short of the best perf score by around 0.6% or so.

This data took a few minutes to gather; I'll allow it to run on more cases overnight.

I have some rough edges to polish here, and to explore some of the even larger candidate cases I'll need some kind of Monte Carlo strategy, as the exponential number of runs starts becoming a factor.
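
For reference, the per-method enumeration is essentially a loop over CSE bitmasks. A sketch, where perf_score(mask) stands in for an SPMI replay of the method with JitCSEMask set to that value (the actual orchestration tooling is not shown):

```python
def best_mask(num_candidates, perf_score):
    """Try every subset of CSEs (as a bitmask) and return the best (mask, score) pair.

    perf_score(mask) is a stand-in for "replay the method via SPMI with
    JitCSEMask=mask and read back the resulting perf score".
    """
    best = None
    for mask in range(1 << num_candidates):          # 2^n subsets, including "no CSEs"
        score = perf_score(mask)
        if best is None or score < best[1]:          # lower perf score is better
            best = (mask, score)
    return best

# For 5 candidates this is 32 replays plus the default-heuristic baseline,
# matching the 2^n + 1 experiments per method described above.
```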

@AndyAyersMS
Member Author

Here is the "complete" set of results from asp.net, up through cases with 11 cse candidates. My current approach is going to take to long to get data for higher numbers of candidates, eg for an 11 candidate case we need 2^11 runs of SPMI over the method context varying the CSE mask, so the entire set of 526 case required over 1 million runs.

  ---  1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005
  ---  2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007
  ---  3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006
  ---  4 candidate cases: baseline heuristic was optimal in 2087 of  3454 cases 60.42%; geomean 1.005
  ---  5 candidate cases: baseline heuristic was optimal in 1382 of  2576 cases 53.65%; geomean 1.010
  ---  6 candidate cases: baseline heuristic was optimal in  825 of  1627 cases 50.71%; geomean 1.006
  ---  7 candidate cases: baseline heuristic was optimal in  492 of   999 cases 49.25%; geomean 1.006
  ---  8 candidate cases: baseline heuristic was optimal in  522 of   979 cases 53.32%; geomean 1.004
  ---  9 candidate cases: baseline heuristic was optimal in  357 of   711 cases 50.21%; geomean 1.004
  --- 10 candidate cases: baseline heuristic was optimal in  278 of   637 cases 43.64%; geomean 1.004
  --- 11 candidate cases: baseline heuristic was optimal in  236 of   526 cases 44.87%; geomean 1.004

Next steps from here:

  1. Consider making the above faster. We should be able to use a release jit (if we make perf score and other metrics available) and run the various cases concurrently. That would speed up the data gathering by maybe 20x. We could also fan out to multiple machines...
  2. In some of the runs above the default heuristic is able to get a better perf score than any of the masked variants. Figure out why.
  3. In some of the runs above the SPMI replay fails with some CSE subsets. The ones I looked at were missing helper call infos for write barrier helpers. See if we can perhaps extend the SPMI collection to make replay more robust.
  4. I have been assuming there is no order dependence in the masked CSE replays. Say there are 3 candidates: the above process will do 9 runs: default, {}, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}. But if there is order dependence, we should also be running the various permutations: {2, 1}, {3, 1}, {3, 2}, {1, 3, 2}, {2, 1, 3}, {2, 3, 1}, {3, 1, 2}, {3, 2, 1}. We might need to do this anyway, since we will likely be estimating the reward (benefit) of each CSE via some sort of "temporal difference" scheme. At any rate we should validate the order dependence or independence (order dependence may explain point 2 above).
  5. Even if there is no order dependence in results there might be one in modelling. Consider the case of a nested CSE. If 2 nests within 1, then we get slightly different IR if we do {1, 2} vs {2, 1} -- abstractly this should not matter for codegen but in practice it might. But when looking at rewards, the reward for 2 will be diminished if we do 1 first, and the reward for 1 will be diminished if we do 2 first (and it's possible that 2 becomes non-viable after 1, if all its uses are subsumed by uses of 1).
  6. Assuming we're happy with the data gathered above, the next step is to figure out how to make use of it to craft a policy. The rough thinking is to come up with a set of observations that we think are relevant (we already have some) and build some sort of parameterized evaluation function. We then redo the runs above, recording the metrics at each step, and then use a reinforcement learning algorithm to gradually modify the evaluation function so that it is increasingly able to pick the optimal sequences of CSEs.
  7. We should employ good hygiene here and not over-fit, so doing some kind of 5- or 10-fold cross validation seems sensible (a sketch of the split appears after this list). That means running the above 10 times.
  8. We should also try other collections / arches / OSes to see how well that aspect generalizes.
  9. All of this is "perf score optimal" and may not reflect reality, so we still need to sanity check with actual results somehow.
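
Regarding item 7, the cross-validation split itself is straightforward. A minimal sketch over hypothetical method-context identifiers (the actual training and evaluation steps belong to the PolicyGradient tooling and are not shown):

```python
import random

def k_fold_split(method_ids, k=10, seed=42):
    """Partition method contexts into k folds; train on k-1 folds, evaluate on the held-out one."""
    ids = list(method_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, folds[i]

# Example with hypothetical method-context numbers:
for train, test in k_fold_split(range(100), k=5):
    pass  # train the policy on `train`, measure perf-score regressions on `test`
```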

@AndyAyersMS
Member Author

AndyAyersMS commented Dec 22, 2023

I figured out what was leading to (2) above -- I'd forgotten that the config settings get parsed as hex, so for the 4+ candidate cases the tool was not properly exploring the full set of subsets. With that fixed I don't see any cases where the default heuristic can do better than a subset.

Updated data shows that for the 4+ cases the heuristic is now a bit worse (not surprising, as some of those missing subsets were going to be good ones) and the upside opportunity is a bit larger (also not surprising). I've also added numbers on how many SPMI runs are required; note that as the candidate count increases the evaluation takes exponentially more runs, but the number of cases falls off, so the growth overall is a bit milder...

  --- 1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005 (30777 runs)
  --- 2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007 (24565 runs)
  --- 3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006 (36522 runs)
  --- 4 candidate cases: baseline heuristic was optimal in 2030 of  3454 cases 58.77%; geomean 1.006 (58718 runs)
  --- 5 candidate cases: baseline heuristic was optimal in 1311 of  2576 cases 50.89%; geomean 1.012 (85008 runs)
  --- 6 candidate cases: baseline heuristic was optimal in  734 of  1627 cases 45.11%; geomean 1.009 (105755 runs)
  --- 7 candidate cases: baseline heuristic was optimal in  415 of   999 cases 41.54%; geomean 1.008 (128871 runs)

@AndyAyersMS
Member Author

Fixed some other issues; latest full run data

  --- 1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005 (30777 runs)
  --- 2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007 (24565 runs)
  --- 3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006 (36522 runs)
  --- 4 candidate cases: baseline heuristic was optimal in 2030 of  3454 cases 58.77%; geomean 1.006 (58718 runs)
  --- 5 candidate cases: baseline heuristic was optimal in 1311 of  2576 cases 50.89%; geomean 1.012 (85008 runs)
  --- 6 candidate cases: baseline heuristic was optimal in  734 of  1627 cases 45.11%; geomean 1.009 (105755 runs)
  --- 7 candidate cases: baseline heuristic was optimal in  415 of   999 cases 41.54%; geomean 1.008 (128871 runs)
  --- 8 candidate cases: baseline heuristic was optimal in  454 of   979 cases 46.37%; geomean 1.009 (251603 runs)

@AndyAyersMS
Member Author

I have a version of MCMC running for larger instances, and have verified that it matches up pretty well to exhaustive enumeration in the medium-sized cases. The trend noted above continues:

  ---  6 candidate cases: baseline heuristic was optimal in 734 of 1627 cases 45.11%; geomean 1.009 baseline win from CSE 0.960 max win 0.951 (105755 runs in    775,494ms)
  ---  7 candidate cases: baseline heuristic was optimal in 415 of  999 cases 41.54%; geomean 1.008 baseline win from CSE 0.957 max win 0.949 (128871 runs in  1,742,522ms)
  ---  8 candidate cases: baseline heuristic was optimal in 451 of  979 cases 46.07%; geomean 1.009 baseline win from CSE 0.958 max win 0.949 (251603 runs in  3,634,848ms)
  ---  9 candidate cases: baseline heuristic was optimal in 278 of  711 cases 39.10%; geomean 1.008 baseline win from CSE 0.956 max win 0.949 (182727 runs in  5,070,407ms)
  --- 10 candidate cases: baseline heuristic was optimal in 183 of  637 cases 28.73%; geomean 1.008 baseline win from CSE 0.959 max win 0.951 (163709 runs in  6,406,325ms)
  --- 11 candidate cases: baseline heuristic was optimal in 142 of  526 cases 27.00%; geomean 1.008 baseline win from CSE 0.957 max win 0.949 (135182 runs in  7,543,211ms)
  --- 12 candidate cases: baseline heuristic was optimal in 231 of  535 cases 43.18%; geomean 1.009 baseline win from CSE 0.966 max win 0.957 (137495 runs in  8,664,782ms)
  --- 13 candidate cases: baseline heuristic was optimal in  86 of  342 cases 25.15%; geomean 1.009 baseline win from CSE 0.962 max win 0.953 (87894  runs in  9,376,476ms)
  --- 14 candidate cases: baseline heuristic was optimal in  63 of  265 cases 23.77%; geomean 1.009 baseline win from CSE 0.943 max win 0.935 (68105  runs in  9,937,161ms)
  --- 15 candidate cases: baseline heuristic was optimal in  51 of  237 cases 21.52%; geomean 1.009 baseline win from CSE 0.959 max win 0.951 (60909  runs in 10,473,161ms)
  --- 16 candidate cases: baseline heuristic was optimal in  75 of  262 cases 28.63%; geomean 1.007 baseline win from CSE 0.967 max win 0.960 (67334  runs in 11,120,183ms)
  --- 17 candidate cases: baseline heuristic was optimal in  72 of  246 cases 29.27%; geomean 1.007 baseline win from CSE 0.976 max win 0.969 (63222  runs in 11,702,941ms)
  --- 18 candidate cases: baseline heuristic was optimal in  34 of  196 cases 17.35%; geomean 1.007 baseline win from CSE 0.959 max win 0.952 (50372  runs in 12,185,341ms)
  --- 19 candidate cases: baseline heuristic was optimal in  29 of  198 cases 14.65%; geomean 1.010 baseline win from CSE 0.968 max win 0.959 (50886  runs in 12,697,428ms)
  --- 20 candidate cases: baseline heuristic was optimal in  24 of  172 cases 13.95%; geomean 1.010 baseline win from CSE 0.961 max win 0.951 (44204  runs in 13,134,515ms)
  --- 21 candidate cases: baseline heuristic was optimal in   6 of  110 cases  5.45%; geomean 1.010 baseline win from CSE 0.953 max win 0.944 (28270  runs in 13,445,214ms)
  --- 22 candidate cases: baseline heuristic was optimal in  13 of   97 cases 13.40%; geomean 1.006 baseline win from CSE 0.955 max win 0.949 (24929  runs in 13,706,788ms)
  --- 23 candidate cases: baseline heuristic was optimal in   4 of   92 cases  4.35%; geomean 1.011 baseline win from CSE 0.966 max win 0.955 (23644  runs in 14,004,816ms)
  --- 24 candidate cases: baseline heuristic was optimal in   5 of  119 cases  4.20%; geomean 1.013 baseline win from CSE 0.967 max win 0.955 (30583  runs in 14,316,467ms)
  --- 25 candidate cases: baseline heuristic was optimal in   7 of   63 cases 11.11%; geomean 1.011 baseline win from CSE 0.967 max win 0.957 (16191  runs in 14,510,376ms)
  --- 26 candidate cases: baseline heuristic was optimal in   2 of   97 cases  2.06%; geomean 1.010 baseline win from CSE 0.974 max win 0.964 (24929  runs in 14,789,976ms)
  --- 27 candidate cases: baseline heuristic was optimal in   1 of   74 cases  1.35%; geomean 1.022 baseline win from CSE 0.954 max win 0.934 (19018  runs in 15,023,453ms)
  --- 28 candidate cases: baseline heuristic was optimal in   7 of   73 cases  9.59%; geomean 1.015 baseline win from CSE 0.974 max win 0.960 (18761  runs in 15,263,888ms)
  --- 29 candidate cases: baseline heuristic was optimal in   1 of   69 cases  1.45%; geomean 1.009 baseline win from CSE 0.975 max win 0.966 (17733  runs in 15,469,362ms)
  --- 30 candidate cases: baseline heuristic was optimal in  17 of   66 cases 25.76%; geomean 1.011 baseline win from CSE 0.945 max win 0.934 (16962  runs in 15,683,842ms)

Using the above I can now generate perf-score-optimal results for cases with small numbers of CSEs, and can elaborate the entire value function $Q(s, a)$ either explicitly or graphically... e.g.

[image: graph of CSE subset states annotated with per-state perf scores]

Here green shows the optimal policy, dog-eared nodes show the current policy, and square boxes are terminal states.

As noted in (4) above, there doesn't seem to be order dependence in the values; if so, that greatly cuts down the size of the state space. If there is order dependence, the state space grows faster than $n!$ -- the actual growth is https://oeis.org/A000522. If there is no order dependence, then it's just $2^n$. We still need MCMC for larger cases, but we can cover a much greater fraction.

So I plan to screen all this data to see if I ever see a case where the order of CSEs matters for final perf score. If it never happens or is suitably rare, we'll just ignore it.
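
For a sense of scale, the two state-space sizes can be computed directly; a quick sketch (A000522 counts all ordered sequences of distinct candidates, including the empty one):

```python
import math

def unordered_states(n):
    return 2 ** n                     # each candidate is either done or not done

def ordered_states(n):
    # OEIS A000522: total arrangements of an n-element set,
    # i.e. sum over k of n! / (n - k)!.
    return sum(math.factorial(n) // math.factorial(n - k) for k in range(n + 1))

for n in (3, 5, 10):
    print(n, unordered_states(n), ordered_states(n))
# 3 -> 8 vs 16;  5 -> 32 vs 326;  10 -> 1024 vs 9864101
```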


The data above can't directly be used to craft a policy, since the state is a specific method instance plus a sequence of CSEs. In reality the jit won't have already seen a given method... so we need to generalize the state and try building a policy that way.

My initial plan is to use a policy gradient method. We know some of the factors that should influence CSEs, these are already considered by the current policy, though perhaps not in optimal form:

  • weighted and unweighted costs and sizes of the CSE uses and defs
  • numbers of defs and uses
  • whether the CSE is call-crossing or not
  • an estimate of register pressure -- currently we use the weighted cost of the Nth integer local, where N is initially found from a scan of the local var table, and increases somewhat with each successive CSE

So we can start with these. Since there is a large dynamic range for weighted costs, I am tempted to use a log to keep the data a bit more reasonable. We also need to assign some preference to stopping the process, so there is one extra action to model. The initial model will likely be linear, so the preference for an action (CSE) will be the dot product of an observation vector (based on that CSE and the ambient state) and a parameter vector that is all numeric and provides weights for each observation. This can be generalized to include nonlinear functions of observations (or combinations of observations) if necessary.
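
A minimal sketch of that kind of linear policy: each candidate, plus a synthetic "stop" action, gets a preference equal to the dot product of its observation vector and the parameter vector, and a softmax turns preferences into selection probabilities for the stochastic/training mode. The feature names and numbers below are placeholders, not the JIT's actual observation set:

```python
import math
import random

def preference(features, params):
    # Linear preference: dot product of the observation vector and the parameter vector.
    return sum(f * p for f, p in zip(features, params))

def softmax_choice(actions, params, rng=None):
    """actions: list of (name, feature_vector), including a synthetic 'stop' action."""
    rng = rng or random.Random(0)
    prefs = [preference(f, params) for _, f in actions]
    m = max(prefs)
    weights = [math.exp(p - m) for p in prefs]       # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(actions, weights=probs, k=1)[0], probs

# Hypothetical features: [log(weighted use cost), num uses, crosses call, pressure estimate]
actions = [
    ("cse #1", [math.log(1 + 450.0), 3.0, 1.0, 0.2]),
    ("cse #2", [math.log(1 + 12.0),  1.0, 0.0, 0.2]),
    ("stop",   [0.0, 0.0, 0.0, 0.0]),
]
params = [0.6, 0.1, -0.4, -1.0]                      # made-up parameter vector
print(softmax_choice(actions, params))
```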

For the reward we can use absolute perf scores, but these don't generalize well across methods, so I am tempted to use the perf score relative to doing no CSEs (or it could also be the relative score compared to the best known score).

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jan 22, 2024
Adds special CSE heuristic modes to the JIT to support learning a good CSE
heuristic via Policy Gradient, a form of reinforcement learning. The learning
must be orchestrated by an external process, but the JIT does all of the
actual gradient computations.

The orchestration program will be added to jitutils. The overall process
also relies on SPMI and the goal is to minimize perf score.

Introduce two new CSE heuristic policies:
* Replay: simply perform indicated sequence of CSEs
* RL: used for the Policy Gradient, with 3 modes:
  * Stochastic: based on current parameters but allows random variation
  * Greedy: based on current parameters, deterministic
  * Update: compute updated parameters per Policy Gradient

Also rework the Random policy to be a bit more random, it now alters
both the CSEs performed and the order they are performed in.

Add the ability to have jit config options that specify sequences of ints
or doubles.

Add the ability to just dump metric info for a jitted method, and add
more details (perhaps considerably more) for CSEs. This is all still
simple text format.

Also factor out a common check for "non-viable" candidates -- these are
CSE candidates that won't actually be CSEs. This leads to some minor
diffs as the check is now slightly different for CSEs with zero uses
and/or zero weighted uses.

Contributes to dotnet#92915.
@AndyAyersMS
Member Author

For reasons I find hard to justify I decided my $V(s)$ and $Q(s,a)$ graph needed colormapping. This is from the exploration done by Policy Gradient. Here (yellow) doing all 3 CSEs is best, in any order.

The top number in each state is the min perf score, the bottom is the average. As the heuristic evolves one hopes the average approaches the min, as the fumbling initial steps are more or less forgotten (the image below is after 100 runs).

Double-circle indicates what the current jit CSE heuristic does.

[image: colormapped $V(s)$ / $Q(s,a)$ state graph after 100 runs]

AndyAyersMS added a commit that referenced this issue Jan 29, 2024
)

Adds special CSE heuristic modes to the JIT to support learning a good CSE
heuristic via Policy Gradient, a form of reinforcement learning. The learning
must be orchestrated by an external process, but the JIT does all of the
actual gradient computations.

The orchestration program will be added to jitutils. The overall process
also relies on SPMI and the goal is to minimize perf score.

Introduce two new CSE heuristic policies:
* Replay: simply perform indicated sequence of CSEs
* RL: used for the Policy Gradient, with 3 modes:
  * Stochastic: based on current parameters but allows random variation
  * Greedy: based on current parameters, deterministic
  * Update: compute updated parameters per Policy Gradient

Also rework the Random policy to be a bit more random, it now alters
both the CSEs performed and the order they are performed in.

Add the ability to have jit config options that specify sequences of ints
or doubles.

Add the ability to just dump metric info for a jitted method, and add
more details (perhaps considerably more) for CSEs. This is all still
simple text format.

Also factor out a common check for "non-viable" candidates -- these are
CSE candidates that won't actually be CSEs. This leads to some minor
diffs as the check is now slightly different for CSEs with zero uses
and/or zero weighted uses.

Contributes to #92915.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 6, 2024
Some initial attempts to improve the modelling of "stopping early" when
doing CSEs. This adds a simplistic register pressure estimate modelled
on the one we have now, where we find the weight of the Nth tracked local
and use that as a reference weight for deciding when a CSE might start
becoming costly.

Contributes to dotnet#92915.
@AndyAyersMS
Member Author

At this point I'm mainly watching how the PG evolves and looking at methods that it doesn't handle well, trying to understand what about the candidates in those cases is different/unusual, to see if it's a mistaken feature value, a missing feature value, or whatnot.

An alternative idea is to use MCMC (or whatever) to broadly classify candidates as good or bad (since for the most part order does not seem to matter, goodness/badness is not contextual, and features don't evolve much over the course of an episode). Good if they are part of the best CSE sequence, bad if not. Then grab all the candidate features and compute some sort of nearness/clustering analysis in the feature space (or build a classifier), to see if we can distinguish good from bad using the features; if we can't, then how can we expect a policy to distinguish them?

If we find a pair of candidates with similar features where one is good and the other is bad, study those cases to figure out what is going on.
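
One cheap version of that separability check is a leave-one-out nearest-neighbor test over whatever feature vectors the tooling emits; if a candidate's nearest neighbor frequently has the opposite label, the features probably can't support a good policy either. A sketch with illustrative data:

```python
import math

def nearest_neighbor_agreement(samples):
    """samples: list of (feature_vector, label) where label is 'good' or 'bad'.

    Returns the fraction of samples whose nearest neighbor (excluding themselves)
    has the same label. Values near 0.5 suggest the features don't separate the
    two classes, so no policy over these features could either.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    agree = 0
    for i, (fi, li) in enumerate(samples):
        j = min((j for j in range(len(samples)) if j != i),
                key=lambda j: dist(fi, samples[j][0]))
        agree += (samples[j][1] == li)
    return agree / len(samples)

# Tiny illustrative data set: two features per candidate.
data = [([1.0, 0.1], "good"), ([0.9, 0.2], "good"), ([0.1, 1.0], "bad"), ([0.2, 0.9], "bad")]
print(nearest_neighbor_agreement(data))   # 1.0: perfectly separable in this toy example
```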

@AndyAyersMS
Member Author

I'm also intrigued by the ideas in Simple random search provides a competitive approach to reinforcement learning, which operates in the parameter space directly. It might be an interesting alternative to implement, and is similar to things I've tried in the past.

@AndyAyersMS
Member Author

Also, for features: once we have decent/stable results we can try dropping features from the set to see how much the lack of a feature impairs overall results.

We might also search for correlated features and try to decorrelate them (e.g. Gram-Schmidt orthogonalization or similar). Say the value of feature A predicts the value of feature B pretty well; then it might be worth subtracting off the predicted B (via A) from the actual B, so the reported features are A and B-(A's prediction of B).

The two ideas above are similar: if A and B are correlated, then having both provides little additional benefit over having just one of them.
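
The decorrelation idea amounts to regressing B on A and reporting B's residual in its place; a small sketch with made-up feature samples:

```python
def decorrelate(a_vals, b_vals):
    """Replace feature B with its residual after removing A's linear prediction of B."""
    n = len(a_vals)
    ma, mb = sum(a_vals) / n, sum(b_vals) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(a_vals, b_vals))
    var = sum((a - ma) ** 2 for a in a_vals)
    slope = cov / var                      # A's prediction of B is: mb + slope * (a - ma)
    return [b - (mb + slope * (a - ma)) for a, b in zip(a_vals, b_vals)]

# Made-up correlated features: B is roughly 2*A plus noise.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.1, 3.9, 6.2, 7.8]
print(decorrelate(a, b))                   # residuals: the part of B not explained by A
```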

AndyAyersMS added a commit that referenced this issue Feb 10, 2024
Based on analysis of cases where the machine learning is struggling, add some more observations
and tweak some of the existing ones:
* where we use `log` for dynamic compression, bias results so they are always non-negative
* only consider integral vars for pressure estimate
* note if a CSE has a call
* note weighted tree costs
* note weighted local occurrences (approx pressure relief)
* note spread of occurrences (as fraction of BBs)
* note if CSE is something that can be contained (guess)
* note if CSE is cheap (cost 2 or 3) and is something that can be contained
* note if CSE might be "live across" a call in LSRA block ordering

The block spread and LSRA live-across features use RPO artifacts that may no longer be up to date.
It's not clear this matters, as LSRA does not use RPO for block ordering.

Contributes to #92915.
@AndyAyersMS
Member Author

Have been looking into improving efficiency. PolicyGradient has two phases: a one-off phase where it runs method by method (using Parallel.For to launch multiple SPMI instances), and a batch phase where it processes them all (using SPMI's built-in parallelism).

;; method by method
 6231 methods in 45716ms
;; batch
35482 methods in 45099ms

So the method by method has ~6x overhead...

I have tried building all the stuff needed into release, it is (perhaps) nearly there, but that won't explain the above, as both versions are running checked jits.

The batch mode is what I'm used to on this box, about 1K methods/sec.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 21, 2024
Revise things a bit so that we can run the greedy ML heuristic in release mode,
with a built-in set of "good" parameters.

These parameters come from a policy gradient run over a 200 method training
set, on the asp.net windows x64 collection.

Contributes to dotnet#92915.
AndyAyersMS added a commit that referenced this issue Feb 21, 2024
…8729)

Revise things a bit so that we can run the greedy ML heuristic in release mode,
with a built-in set of "good" parameters.

These parameters come from a policy gradient run over a 200 method training
set, on the asp.net windows x64 collection.

Contributes to #92915.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 21, 2024
Add some minimal CI testing for this new CSE heuristic.

Remove testing of cross-block assertion prop, since it's now on by default.

Contributes to dotnet#92915.
AndyAyersMS added a commit that referenced this issue Feb 23, 2024
Add some minimal CI testing for this new CSE heuristic.

Remove testing of cross-block assertion prop, since it's now on by default.

Contributes to #92915.
@AndyAyersMS
Member Author

Here's a trial that enables the new heuristic: #98776

Have added some notes there on the behaviors.

AndyAyersMS added a commit to AndyAyersMS/jitutils that referenced this issue Feb 24, 2024
Add a new `--optimizeSize` option where PolicyGradient tries to reduce
code size instead of perf score.

Since there are now two possible optimization objectives we have the
potential for trading off one for the other.  Update MCMC to track the
"pareto frontier" for the methods it explores, and save the data for
visualization.

Contributes to dotnet/runtime#92915.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 25, 2024
Profiling showed that `GetFeatures` was a major factor in throughput. For the
most part the features of CSE candidates don't change as we perform CSEs, so
build in some logic to avoid recomputing the feature set unless there is some
evidence features have changed.

To avoid having to remove already performed candidates from the candidate vector
we now tag them as `m_performed`; these get ignored during subsequent processing,
and discarded if we ever recompute features.

This should cut the TP impact roughly in half, the remaining part seems to
largely be from doing more CSEs (which we hope will show some perf benefit).

Contributes to dotnet#92915.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 26, 2024
Profiling showed that `GetFeatures` was a major factor in throughput. For the
most part the features of CSE candidates don't change as we perform CSEs, so
build in some logic to avoid recomputing the feature set unless there is some
evidence features have changed.

To avoid having to remove already performed candidates from the candidate vector
we now tag them as `m_performed`; these get ignored during subsequent processing,
and discarded if we ever recompute features.

This should cut the TP impact roughly in half, the remaining part seems to
largely be from doing more CSEs (which we hope will show some perf benefit).

Contributes to dotnet#92915.
AndyAyersMS added a commit that referenced this issue Feb 26, 2024
Profiling showed that `GetFeatures` was a major factor in throughput. For the
most part the features of CSE candidates don't change as we perform CSEs, so
build in some logic to avoid recomputing the feature set unless there is some
evidence features have changed.

To avoid having to remove already performed candidates from the candidate vector
we now tag them as `m_performed`; these get ignored during subsequent processing,
and discarded if we ever recompute features.

This should cut the TP impact roughly in half, the remaining part seems to
largely be from doing more CSEs (which we hope will show some perf benefit).

Contributes to #92915.
AndyAyersMS added a commit to dotnet/jitutils that referenced this issue Feb 29, 2024
Add a new `--optimizeSize` option where PolicyGradient tries to reduce
code size instead of perf score.

Since there are now two possible optimization objectives, we have the
potential for trading off one for the other. Update MCMC to track the
"pareto frontier" for the methods it explores and save the data for
visualization.

Contributes to dotnet/runtime#92915.
@AndyAyersMS AndyAyersMS modified the milestones: 9.0.0, Future May 3, 2024
@JulieLeeMSFT JulieLeeMSFT moved this to Team User Stories in .NET Core CodeGen Jun 5, 2024
@JulieLeeMSFT
Member

Closing, as the .NET 9 activities are done.
We will revisit this work during .NET 10 planning.

@github-project-automation github-project-automation bot moved this from Team User Stories to Done in .NET Core CodeGen Jul 30, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 30, 2024