Investigate improving JIT heuristics with machine learning #92915

Closed
3 of 8 tasks
AndyAyersMS opened this issue Oct 2, 2023 · 25 comments

@AndyAyersMS
Member

AndyAyersMS commented Oct 2, 2023

Overview

There is now fairly substantial literature on improving compiler optimization by leveraging machine learning (ML). A comprehensive survey compiled by Zheng Wang can be found here: https://github.com/zwang4/awesome-machine-learning-in-compilers.

Here's a closely related paper and GitHub repo: MLGO: a Machine Learning Guided Compiler Optimizations Framework. Like our efforts below, it leverages a Policy Gradient algorithm (reinforcement learning).

Several years ago we attempted to leverage ML to create better inline heuristics. That experiment was largely a failure, at least as far as using ML to predict whether an inline would improve performance. But one of the key speculations from that work was that the lack of PGO was to blame. Now that we have PGO, it is a good time to revisit this area and see whether PGO was indeed the key missing ingredient.

Proposed Work

During the .NET 9 product cycle, we plan to investigate applying ML techniques to heuristics used by the JIT. The items below represent the initial steps of this investigation. This list will change and grow as the work progresses.

We also want to tackle a relatively simple problem, at least initially. Thus our initial effort will be to refine and improve the heuristic the JIT uses for Common Subexpression Elimination (aka CSE):


Update May 2024.

Given the work above, we have been able to produce heuristics that improve the aggregate perf score for methods via CSE, with about a 0.4% geomean improvement across all methods with CSE candidates. So far those results haven't translated into widespread improvements in our perf lab data. Exactly why is unclear; the data below suggests one or more of the following:

  • poor correlation of perf score with actual perf
  • few benchmarks whose scores critically depend on CSEs, or
  • those benchmarks whose perf does critically depend on CSEs are all easy/obvious cases
  • difficulty extracting small signal from noise (still lots of noisy benchmark results)

The training evaluations show that the best ML heuristic obtains about half of the improvement available from an optimal heuristic. Some recent results:

Best/base: 0.9913 [optimizing for  score]
vs Base    0.9957 Better 440 Same 1476 Worse 84
vs Best    1.0044 Better 0 Same 1607 Worse 393

Params     0.6093, 0.7750,-0.5701,-0.6827, 0.5060,-0.0514,-1.3563, 0.4515,-2.0999, 0.0000,-1.0623,-1.3723, 0.0000,-0.6143,-0.8286,-0.2956,-1.1433, 0.3418, 1.3964,-0.0043,-0.4237, 0.6097,-1.9773,-0.3684, 0.7246

Collecting greedy policy data via SPMI... done (27583 ms)
Greedy/Base (score): 34965 methods, 8375 better, 24792 same, 1797 worse,  0.9960 geomean
Best:   76679 @  0.7731
Worst:  82000 @  1.4512

Greedy/Base (size): 34965 methods, 4628 better, 24694 same, 5642 worse,  1.0005 geomean
Best:   15489 @  0.7000
Worst:  82000 @  1.4205

 
We don't have any further active work planned on machine learning of heuristics for .NET 9. However, we expect to revisit this area as .NET 9 work winds down, so stay tuned for further developments in the coming months.

@AndyAyersMS AndyAyersMS added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 2, 2023
@AndyAyersMS AndyAyersMS added this to the 9.0.0 milestone Oct 2, 2023
@AndyAyersMS AndyAyersMS self-assigned this Oct 2, 2023
@ghost

ghost commented Oct 2, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Overview

There is now fairly substantial literature on improving compiler optimization by leveraging machine learning (ML). A comprehensive survey compiled by Zheng Wang can be found here: https://github.com/zwang4/awesome-machine-learning-in-compilers.

Several years ago we attempted to leverage ML to create better inline heuristics. That experiment was largely a failure, at least as far as using ML to predict whether an inline would improve performance. But one of the key speculations from that work was that the lack of PGO was to blame. Now that we have PGO, it is a good time to revisit this area and see whether PGO was indeed the key missing ingredient.

Proposed Work

During the .NET 9 product cycle, we plan to investigate applying ML techniques to heuristics used by the JIT. The items below represent the initial steps of this investigation. This list will change and grow as the work progresses.

We also want to tackle a relatively simple problem, at least initially. Thus our initial effort will be to refine and improve the heuristic the JIT uses for Common Subexpression Elimination (aka CSE):

  • Study the current CSE heuristic and try to understand what it is modelling. Identify apparent shortcomings and limitations.
  • Develop tooling for varying which CSEs the JIT can perform, either in concert with the current heuristic or independent of it. Measure the impact of this variation on key JIT metrics (performance, code size, JIT time, etc.).
  • See how well PerfScores can be used as a proxy for measuring actual performance. Ideally, we can leverage PerfScores to avoid needing to run benchmarks repeatedly. But if not, perhaps we can still rely on PerfScores to limit or prioritize the runs needed.
  • Try a number of different ML techniques for improving CSE heuristics, both "small step" (evaluating one CSE at a time) and "big step" (evaluating an entire set of CSEs). Identify the key predictive observations. Try both interpretable and black-box models.
  • See how feasible it is to have a single architecture-agnostic heuristic (perhaps parameterized like the current heuristic with register info) or whether different heuristics are warranted for, say, arm64 and x64.
  • Develop a common ML "infrastructure" that we can apply to other similar heuristics in the JIT.

cc @dotnet/jit-contrib

Author: AndyAyersMS
Assignees: AndyAyersMS
Labels:

area-CodeGen-coreclr

Milestone: 9.0.0

@AndyAyersMS
Member Author

AndyAyersMS commented Oct 2, 2023

Initial Notes on the Current JIT CSE Heuristic

  1. Does the initial frame size computation hold up well? Can we justify the 3/2/1 formula in CSEHeuristic::Initialize?
    a. Disable CSE, see how well the number of tracked locals in registers matches the prediction. If not, see if there is some way to make a better prediction.
    b. Then slowly enable CSE, and see if this continues to hold up well.
    c. See how well frame size matches estimates; if not good, see if we can improve this somehow

  2. Do the aggressive/moderate thresholds make sense? That is, should they be higher or lower? Based on something else?
    a. trickier to see how to model this; perhaps keep processing CSEs and see when spilling starts?
    (that is, do a local sensitivity analysis) -- might be counterbalanced by the promotion code.
    b. how about the overall approach of setting these thresholds based on IR observations? Seems sensible but maybe there are other ideas to try?
    c. at the end of CSEHeuristic::Initialize we increase the thresholds if they seem too low. Why?
    d. also, doesn't that increase mix up weighted and unweighted cases? How does it make sense to compare unweighted ref counts with BB_UNITY_WEIGHT?
    e. should we try some “true” register pressure estimate, or is that hopeless? We also don’t know how the pressure points intersect with potential CSE temp liveness.

  3. Is the assumption here that FP CSEs are rare actually true?

  4. How reliable are CostSz and CostEx?
    a. probably a can of worms... how would we even go about validating these?

  5. Should candidates be ranked by local benefit, by global benefit, or by some ratio of benefit to cost? (A rough sketch of the ranking idea from (a) appears after this list.)
    a. That is, why doesn’t Compiler::optCSEcostCmpEx consider csdUseWtCnt more prominently? The right figure of merit (when optimizing for speed) seems to be csdUseWtCnt * tree->CostEx(),
    so a relatively “cheap” CSE that is heavily used is preferable to a “costly” CSE that is not as heavily used.
    b. Should this factor in code size even in speed modes? Assuming CSEs generally decrease size a bit, we might want to bias the above slightly towards the heavier ones.
    c. Do the threshold increases in PerformCSE make sense? This gives a budget-like aspect to the ranking, so the order does matter. What if we ignored this?

  6. Are we missing candidates?
    a. review bail outs in Compiler::optIsCSEcandidate.
    b. review limitations in Compiler::optValnumCSE_Locate

  7. In ConsiderCandidates is the computation sensible?
    We have 3 models x 2 modes ({aggressive/moderate/conservative} x {size, speed}).
    Plus some stress modes (including random...)
    a. seems like we mix weighted/unweighted costs up here at times, e.g. extra_no_cost is unweighted (size) based, but we add it to no_cse_cost, which can be weighted (speed) based.
    b. various magic numbers and twisty decision logic
    c. various per-arch costs, some of them look dated

  8. Is the "shared constant" CSE code sensible? Here we disregard the low bits of constants and at use
    points we rematerialize the missing bits via an add.
    a. So the use cost is a bit higher... but this does not seem to be modelled by the heuristics?
    b. Is the "best base" value computation reasonable? Seems like we're trying to use small immediates here?
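Item 5a above suggests ranking candidates by weighted benefit. A minimal sketch of that idea follows; it is not the JIT's actual Compiler::optCSEcostCmpEx comparator, and the field names are hypothetical analogues of csdUseWtCnt and CostEx:

```python
# Sketch: order CSE candidates by weighted benefit (use weight x tree cost),
# optionally folding in a small code-size bias. Purely illustrative.
from dataclasses import dataclass

@dataclass
class CseCandidate:
    name: str
    use_wt_cnt: float   # weighted use count (csdUseWtCnt analogue)
    cost_ex: int        # estimated execution cost of the tree (CostEx analogue)
    size_delta: int     # estimated code-size change if CSE'd (negative = smaller)

def speed_score(c: CseCandidate, size_bias: float = 0.0) -> float:
    # Higher is better: heavily used, expensive trees rank first.
    return c.use_wt_cnt * c.cost_ex - size_bias * c.size_delta

candidates = [
    CseCandidate("cheap-but-hot", use_wt_cnt=900.0, cost_ex=3, size_delta=-4),
    CseCandidate("costly-but-cold", use_wt_cnt=10.0, cost_ex=20, size_delta=-8),
]
ranked = sorted(candidates, key=speed_score, reverse=True)
print([c.name for c in ranked])   # the cheap-but-heavily-used candidate ranks first
```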

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Oct 3, 2023
Introduce two config vars for experimenting with CSEs:
* `JitCSEHash`: identifies a method for CSE experimentation
* `JitCSEMask`: specifies a bitmask of allowed CSEs (by "attempt")

When the hash is nonzero, any method whose hash matches will perform
only the subset of CSEs specified by the mask (up to 32 CSEs).

Also introduce a config var to dump the total number of CSEs to either
the assembly listing footer or the one-liner from the disassembly summary.
* `JitMetrics`

This can perhaps be generalized eventually to report more metrics
and perhaps to report them back to SPMI when it is the jit host.

Finally, note CSE lcl vars that represent hoisted trees and/or are
"multiple-def" CSEs in the local var table.

Contributes to dotnet#92915.
@AndyAyersMS
Member Author

AndyAyersMS commented Oct 3, 2023

Perf vs Perf Score

Here's a chart of about 3000 measurements across 50 or so benchmarks, varying the set of CSEs we do in the hottest method. Each data point plots the ratio of measured perf to default-JIT perf (y-axis) against the ratio of perf score to default-JIT perf score (x-axis). Thus values above 1.0 on the y-axis are benchmarks that ran more slowly, and values to the right of 1.0 on the x-axis are benchmarks that were predicted to run more slowly.

We see most data in the "slower and predicted slower" (upper right) quadrant because CSEs generally improve perf, so doing fewer of them than we'd normally do will likely slow down the benchmark.

A linear fit has a correlation coefficient of about 0.3, so there is a fair amount of unexplained variation in perf. I have also excluded one benchmark's data from this, as it seems anomalous.

[image: scatter of perf ratio vs. perf score ratio across the CSE subset experiments]

Ideally, we'd be able to get a better correlation out of perf scores, since they can be produced very quickly as we vary JIT behavior, on the order of 1000 experiments per second. Actually running benchmarks takes around 30 seconds per experiment, so there is a massive benefit to leveraging perf scores.
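
The correlation figure quoted above can be computed directly from the per-experiment ratios. A small sketch, using made-up sample points rather than the measured data (the ~0.3 figure refers to the real measurements):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# perf_ratio[i]  = measured time / default time      (y-axis above)
# score_ratio[i] = perf score    / default perf score (x-axis above)
perf_ratio  = [1.02, 1.10, 0.99, 1.25, 1.05]   # made-up sample points
score_ratio = [1.01, 1.15, 1.00, 1.10, 1.20]
print(f"R = {pearson_r(score_ratio, perf_ratio):.2f}")
```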

The main contributors to poor perf predictability are thought to be (in no particular order):

  • errors in the perf score per-block computations: wrong instruction latencies, lack of dependence modelling, other microarchitectural effects. Some of this we may be able to fix by invoking tools like llvm-mca to either give us accurate data or help us spot errors in our modelling. That in turn requires making the JIT disasm readily parseable by standard tools.
  • errors in block weights. Even though we have PGO data we may no longer have accurate block weights at the end of compilation. This is something we can improve, though fundamentally the JIT will be unable to correctly anticipate how weights may change because of optimizations, so some amount of error here is inevitable.
  • "noisy" benchmarks. We know many benchmarks show considerable within-run or cross-run variation. Some of these sources of noise include: intel JCC errata, alignment sensitivity of data, other subtle microarchitectural effects (say branch collision)
  • errors in PGO data. Note there is some deliberate randomness and some inherent randomness in PGO data. There may also be errors in the instrumentation, collection, or reconstruction processes. Also included here are potential phased behaviors by benchmarks (where PGO data does not reflect the actual benchmark block weights because the data was measured too early)

So the task here is to dig into these issues and see if and how perf scores can be improved.

Also I need to look into why MaximizeSchwarzCriterion has anomalous results; including data from this benchmark drops R^2 to essentially zero.

[image: the same scatter with the MaximizeSchwarzCriterion data included]

@ShreyasJejurkar
Contributor

This is a great initiative, just wondering do other popular (JIT-based) programming languages (e.g. - Java) have something like this in place? Would love to read their outcome. :)

@AndyAyersMS
Member Author

This is a great initiative, just wondering do other popular (JIT-based) programming languages (e.g. - Java) have something like this in place? Would love to read their outcome. :)

I think I may have read some papers by the Graal folks where they leveraged ML, but I don't have the references handy. Presumably they'd be somewhere in the big list of papers I referred to above.

AndyAyersMS added a commit that referenced this issue Oct 6, 2023
Introduce two config vars for experimenting with CSEs:
* `JitCSEHash`: identifies a method for CSE experimentation
* `JitCSEMask`: specifies a bitmask of allowed CSEs (by "attempt")

When the hash is nonzero, any method whose hash matches will perform
only the subset of CSEs specified by the mask (up to 32 CSEs).

Also introduce a config var to dump the total number of CSEs to either
the assembly listing footer or the one-liner from the disassembly summary.
* `JitMetrics`

This can perhaps be generalized eventually to report more metrics
and perhaps to report them back to SPMI when it is the jit host.

Finally, note CSE lcl vars that represent hoisted trees and/or are
"multiple-def" CSEs in the local var table.

Contributes to #92915.
@AndyAyersMS
Member Author

Have been digging into why the PerfScores have such poor correlation on MaximizeSchwarzCriterion, and it seems to be a combination of a few effects:

  • Scalable profiles are notably approximate and will vary from run to run
  • The exact timing of tiering up can vary, so the profile data will also vary from run to run
  • Block weight reconstruction is ill-conditioned (small input changes can lead to large output changes)
  • Methods that rely on OSR can end up with partial profiles. In particular for Tier0-instr methods that are called and never return we often do not have profile weight for the entry block since that is deduced from the returns.

Thus the block weights (in particular the large weights for innermost loop blocks) can vary quite a bit from one run to the next, and that seems to be the major factor. You can see this by just running the benchmark multiple times with the same jit configuration:

Benchmark,Index,Mask,NumCse,CodeSize,PerfScore,PerfScoreRatio,Perf,PerfRatio
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,0,ffffffff,17,1643,649125874.85,1.000,59311.6714,1.000
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,1,00000000,0,1533,80505239.17,0.124,61848.4732,1.043
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,2,00000001,1,1532,421090857.67,0.649,65608.5883,1.106
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,3,00000002,1,1528,1307172767.70,2.014,59236.0308,0.999
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,4,00000004,1,1584,108418141.52,0.167,65119.4767,1.098
Benchmark,Index,Mask,NumCse,CodeSize,PerfScore,PerfScoreRatio,Perf,PerfRatio
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,0,ffffffff,17,1643,100478799.14,1.000,61020.7827,1.000
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,1,00000000,0,1533,69888793.36,0.696,63267.6950,1.037
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,2,00000001,1,1530,69621450.93,0.693,66241.6050,1.086
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,3,00000002,1,1528,108183513.60,1.077,58789.5833,0.963
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,4,00000004,1,1582,70447641.30,0.701,64906.9033,1.064

In particular, the first line of each set above shows two "identical" runs with default JIT behavior: the first took 59,311us with PerfScore 649,125,874.85, and the second 61,020us with PerfScore 100,478,799.14. So perf varied by about 1.03x while PerfScore varied by about 6x.
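
A quick way to quantify this run-to-run spread is to group the CSV rows above by (Benchmark, Mask) and compare PerfScore and Perf across runs. A sketch, assuming the dump is saved to a file (the filename is hypothetical):

```python
import csv
from collections import defaultdict

def ratio_spread(path):
    """Max/min spread of PerfScore and Perf for each (Benchmark, Mask) group."""
    groups = defaultdict(lambda: ([], []))          # key -> (perf_scores, perf_times)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["PerfScore"] == "PerfScore":     # the dump repeats its header between runs
                continue
            key = (row["Benchmark"], row["Mask"])
            groups[key][0].append(float(row["PerfScore"]))
            groups[key][1].append(float(row["Perf"]))
    for key, (scores, times) in groups.items():
        if len(scores) > 1:
            print(key,
                  f"PerfScore spread {max(scores) / min(scores):.2f}x,",
                  f"Perf spread {max(times) / min(times):.2f}x")

# ratio_spread("cse_runs.csv")   # hypothetical file containing the rows above
```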

For the purposes of this investigation, we can just exclude methods that rely on OSR.

This variation in weights is possibly less of a problem in practice, though it may be a contributor to the kind of fluctuating performance we noted in #87324. We should think about better strategies to counteract it:

  • more aggressive blending? We currently use 0.99 for blend factor, perhaps something more like 0.90?
  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?
  • adding extra probes based on patchpoint location that can recover missing counts. The interesting case is the one where the patchpoint helper transitions to the OSR method—the helper would need to increment the counter. And since this probe would not normally appear in the schema we'd have to figure out how to fit it in without disrupting anything else. We do something like this for throws already, but those are easy to anticipate.

@MichalPetryka
Contributor

  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?

Isn't this already the case for R2Red code?

@AndyAyersMS
Member Author

  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?

Isn't this already the case for R2Red code?

Yes -- the idea there was to avoid needing to go from R2R (optimized) -> Tier0 (unoptimized) to gather PGO data. Here the idea would be to go from Tier0+instr -> (OSR) -> Tier1+instr -> Tier1, for methods that are called infrequently (but more than once) and contain hot loops. The second instrumented run is there to fill out the missing bits of profile in the parts of the method that currently only run under OSR.

@JulieLeeMSFT JulieLeeMSFT added the User Story A single user-facing feature. Can be grouped under an epic. label Oct 6, 2023
@AndyAyersMS
Member Author

Some more thoughts on applying ML to the CSE heuristic.

My current thinking is to approach this as a model-based reinforcement learning application. We think we know which CSEs are good ones (our initial model), we don't have any repository of known-good results (aka labels), and we may incur significant costs in obtaining data. Our goal is to improve on or find a better model.

For the initial cut we can just rely on perf scores (flawed as they may be), as the overall approach won't be much affected; if and when we have it working well, we can think about mixing in actual measurements. Perf scores (if viable) offer a lot of advantages, including the ability to assess results on methods that are not prominent in any benchmarks.

As a prelude to all that, I want to elaborate the current model and figure out the right way to parameterize it. Currently CSE candidates are ranked by a sort function and then attempted in rank order. As CSEs are performed, the acceptance thresholds ratchet upwards, so there is an order dependence. We might hope that the reinforcement-learned model could be expressed as a numeric score, with higher values indicating stronger preferences and zero or negative values indicating CSEs that should not be done.

The ratcheting is an attempt to model register pressure. The thresholds are loosely set based upon properties of the weighted local var counts (here I am ignoring the "optimize for size" mode for now); e.g. the aggressive cutoff is the weight of the 13th highest ranked local var (pre-CSE). More broadly, one can imagine that the model would have access to general properties of the lcl var weight distribution, expressed as a sort of CDF; e.g. we would like to know roughly how many lcl vars we have to consider to reach, say, 90% of all (weighted) local var occurrences.

Broadly speaking, a particular CSE will add a new local var whose weight we can compute (from the def and use weights) and remove weight from some of the existing locals. The current heuristic roughly estimates the first effect (via the ratchet) but does not take the second into account. Thus, for instance, if we CSEd a large computation tree to effect a hoist within a loop, we might actually expect the overall register pressure to decrease. Thinking of this all as a sort of Markov Decision Process: as we do one CSE, the viability of the others will shift.

If we want the model to handle this sort of order dependence, then after each CSE we need to re-evaluate the scores of the remaining candidates.
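
A schematic of that sequential process, with the scoring function standing in for whatever model we end up learning (nothing here reflects the JIT's actual data structures):

```python
def greedy_cse(candidates, score, apply_cse):
    """Repeatedly pick the highest-scoring viable candidate until none score > 0.

    candidates: mutable list of candidate descriptions (hypothetical objects).
    score(c, done): numeric preference for candidate c given CSEs already done;
                    <= 0 means "don't do this CSE".
    apply_cse(c):   performs the CSE, implicitly changing the method state.
    """
    done = []
    while candidates:
        scored = [(score(c, done), c) for c in candidates]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= 0:
            break                      # stopping action: nothing remaining looks profitable
        apply_cse(best)
        done.append(best)
        candidates.remove(best)
        # Features of the remaining candidates may have shifted (e.g. register
        # pressure), so the next iteration re-scores them from scratch.
    return done
```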

I will note along these lines that the current CSE heuristic has some quirks that seem interesting. For example, in a simple case with just two candidates A & B, if we order them A, B we do both CSEs; if we order them B, A, we do just B, because B is call-crossing and has a higher ratchet, which makes A look unattractive. This struck me as odd.

So, factors that might feed into the model:

  • current distribution of local var weights
    • perhaps per register class (?) right now we just look at int regs
    • with callee saves and call crossing CSEs taken into account
  • impact on this distribution for each candidate
  • so overall "score" includes benefit from the CSE plus anticipated extra costs in prolog & for spills
  • some kind of code complexity metric (maybe: how likely is the distribution going to lead to spills)
  • whether this CSE is GTF_MAKE_CSE (presumably those do better)
  • current cost data (weighted and unweighted costs, number of defs & uses, etc)

Some preliminary sub problems to examine:

  • what do these weighted local distributions look like? Can we model them simply somehow?
  • do we need a pressure heuristic?
  • do we see cases now where even with the current pressure heuristic we CSE too much?
    • modify existing policy to keep going, and look for cases where perf score increases
  • how good is the current heuristic
    • CSE randomly and see how well the current model fares overall versus the empirical distribution
    • here the figure of merit might be perf score / no CSE perf score

@AndyAyersMS
Member Author

AndyAyersMS commented Dec 19, 2023

Here is some data on the current heuristic, from the asp.net SPMI collection [full table here]

115838 methods, 34802 methods with cses; 193807 cse candidates, 56973 cses
NumCandiates [count, odds]: ... details ...
 1 [n: 10259 o: 0.53]:  0: 4836;  1: 5423
 2 [n:  4913 o: 0.44]:  0: 1760;  1: 2028;  2: 1125
 3 [n:  4058 o: 0.36]:  0: 1351;  1: 1329;  2: 1031;  3:  347
 4 [n:  3454 o: 0.37]:  0:  788;  1: 1100;  2:  860;  3:  570;  4:  136
 5 [n:  2576 o: 0.35]:  0:  586;  1:  497;  2:  792;  3:  470;  4:  175;  5:   56
 6 [n:  1627 o: 0.31]:  0:  264;  1:  452;  2:  474;  3:  182;  4:  227;  5:   21;  6:    7
 ...

I was surprised to see such low odds for accepting candidates, on average only about 0.25. So there is certainly headroom to be more aggressive.

For these methods with small numbers of candidates it should be feasible to exhaustively enumerate all the possible sets of CSEs and look at the resulting spectrum of perf scores, and chart out how well the existing heuristic performs (as measured by perf score) vs the best score. For larger sets we can at least attempt to gather a decent sample of data.

Working on that now.

@AndyAyersMS
Member Author

AndyAyersMS commented Dec 20, 2023

Here is some initial data on the performance of the current heuristic.

Using the asp.net SPMI collection I selected 100 cases from each of the categories above, and then ran $2^n + 1$ experiments per case: the baseline run plus one jit for each possible subset of CSEs (say there were 5 CSE candidates; that's 33 runs). I then looked at how often the default heuristic got the best possible perf score, and at the geomean ratio of the default perf score to the best across those 100 runs.

The preliminary data looks like this:

  --- 1 candidate cases: default heuristic was optimal in 90 of 100 cases; geomean 1.003
  --- 2 candidate cases: default heuristic was optimal in 81 of 100 cases; geomean 1.006
  --- 3 candidate cases: default heuristic was optimal in 65 of 100 cases; geomean 1.008
  --- 4 candidate cases: default heuristic was optimal in 62 of 100 cases; geomean 1.006
  --- 5 candidate cases: default heuristic was optimal in 50 of 100 cases; geomean 1.012
  --- 6 candidate cases: default heuristic was optimal in 49 of 100 cases; geomean 1.006

So the rough trend is that as the number of candidates increases, the likelihood of the default heuristic getting the best possible perf score decreases, and in aggregate the default heuristic falls short of the best perf score by around 0.6% or so.

This data took a few minutes to gather; I'll allow it to run on more cases overnight.

I have some rough edges to polish here, and to explore some of the even larger candidate cases I'll need some kind of Monte Carlo strategy, as the exponential number of runs starts becoming a factor.
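
For reference, the per-method enumeration is essentially a loop over CSE bitmasks. A sketch, where perf_score(mask) stands in for an SPMI replay of the method with JitCSEMask set to that value (the actual orchestration tooling is not shown):

```python
def best_mask(num_candidates, perf_score):
    """Try every subset of CSEs (as a bitmask) and return the best (mask, score) pair.

    perf_score(mask) is a stand-in for "replay the method via SPMI with
    JitCSEMask=mask and read back the resulting perf score".
    """
    best = None
    for mask in range(1 << num_candidates):          # 2^n subsets, including "no CSEs"
        score = perf_score(mask)
        if best is None or score < best[1]:          # lower perf score is better
            best = (mask, score)
    return best

# For 5 candidates this is 32 replays plus the default-heuristic baseline,
# matching the 2^n + 1 experiments per method described above.
```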

@AndyAyersMS
Member Author

Here is the "complete" set of results from asp.net, up through cases with 11 cse candidates. My current approach is going to take to long to get data for higher numbers of candidates, eg for an 11 candidate case we need 2^11 runs of SPMI over the method context varying the CSE mask, so the entire set of 526 case required over 1 million runs.

  ---  1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005
  ---  2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007
  ---  3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006
  ---  4 candidate cases: baseline heuristic was optimal in 2087 of  3454 cases 60.42%; geomean 1.005
  ---  5 candidate cases: baseline heuristic was optimal in 1382 of  2576 cases 53.65%; geomean 1.010
  ---  6 candidate cases: baseline heuristic was optimal in  825 of  1627 cases 50.71%; geomean 1.006
  ---  7 candidate cases: baseline heuristic was optimal in  492 of   999 cases 49.25%; geomean 1.006
  ---  8 candidate cases: baseline heuristic was optimal in  522 of   979 cases 53.32%; geomean 1.004
  ---  9 candidate cases: baseline heuristic was optimal in  357 of   711 cases 50.21%; geomean 1.004
  --- 10 candidate cases: baseline heuristic was optimal in  278 of   637 cases 43.64%; geomean 1.004
  --- 11 candidate cases: baseline heuristic was optimal in  236 of   526 cases 44.87%; geomean 1.004

Next steps from here:

  1. Consider making the above faster. We should be able to use a release jit (if we make perf score and other metrics available) and run the various cases concurrently. That would speed up the data gathering by maybe 20x. We could also fan out to multiple machines...
  2. In some of the runs above the default heuristic is able to get a better perf score than any of the masked variants. Figure out why.
  3. In some of the runs above the SPMI replay fails with some CSE subsets. The ones I looked at were missing helper call infos for write barrier helpers. See if we can perhaps extend the SPMI collection to make replay more robust.
  4. I have been assuming there is no order dependence in the masked CSE replays. Say there are 3 candidates: the above process will do 9 runs: default, {}, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}. But if there is order dependence, we should also be running the various permutations: {2, 1}, {3, 1}, {3, 2}, {1, 3, 2}, {2, 1, 3}, {2, 3, 1}, {3, 1, 2}, {3, 2, 1}. We might need to do this anyway, since we will likely be estimating the reward (benefit) of each CSE via some sort of "temporal difference" scheme. At any rate we should validate the order dependence or independence (order dependence may explain point 2 above).
  5. Even if there is no order dependence in results there might be one in modelling. Consider the case of a nested CSE. If 2 nests within 1, then we get slightly different IR if we do {1, 2} vs {2, 1} -- abstractly this should not matter for codegen but in practice it might. But when looking at rewards, the reward for 2 will be diminished if we do 1 first, and the reward for 1 will be diminished if we do 2 first (and it's possible that 2 becomes non-viable after 1, if all its uses are subsumed by uses of 1).
  6. Assuming we're happy with the data gathered above, the next step is to figure out how to make use of it to craft a policy. The rough thinking is to come up with a set of observations that we think are relevant (we already have some) and build some sort of parameterized evaluation function. We then redo the runs above, recording the metrics at each step, and then use a reinforcement learning algorithm to gradually modify the evaluation function so that it is increasingly able to pick the optimal sequences of CSEs.
  7. We should employ good hygiene here and not over-fit, so doing some kind of 5- or 10-fold cross validation seems sensible (a sketch of the split appears after this list). That means running the above 10 times.
  8. We should also try other collections / arches / OSes to see how well that aspect generalizes.
  9. All of this is "perf score optimal" and may not reflect reality, so we still need to sanity check with actual results somehow.
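
Regarding item 7, the cross-validation split itself is straightforward. A minimal sketch over hypothetical method-context identifiers (the actual training and evaluation steps belong to the PolicyGradient tooling and are not shown):

```python
import random

def k_fold_split(method_ids, k=10, seed=42):
    """Partition method contexts into k folds; train on k-1 folds, evaluate on the held-out one."""
    ids = list(method_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, folds[i]

# Example with hypothetical method-context numbers:
for train, test in k_fold_split(range(100), k=5):
    pass  # train the policy on `train`, measure perf-score regressions on `test`
```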

@AndyAyersMS
Member Author

AndyAyersMS commented Dec 22, 2023

I figured out what was leading to (2) above -- I'd forgotten that the config settings get parsed as hex, so for the 4+ candidate cases the tool was not properly exploring the full set of subsets. With that fixed I don't see any cases where the default heuristic can do better than a subset.

Updated data shows that for the 4+ cases the heuristic is now a bit worse (not surprising, as some of those missing subsets were going to be good ones) and the upside opportunity is a bit larger (also not surprising). I've also added numbers on how many SPMI runs are required; note that as the candidate count increases the evaluation takes exponentially more runs, but the number of cases falls off, so the growth overall is a bit milder...

  --- 1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005 (30777 runs)
  --- 2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007 (24565 runs)
  --- 3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006 (36522 runs)
  --- 4 candidate cases: baseline heuristic was optimal in 2030 of  3454 cases 58.77%; geomean 1.006 (58718 runs)
  --- 5 candidate cases: baseline heuristic was optimal in 1311 of  2576 cases 50.89%; geomean 1.012 (85008 runs)
  --- 6 candidate cases: baseline heuristic was optimal in  734 of  1627 cases 45.11%; geomean 1.009 (105755 runs)
  --- 7 candidate cases: baseline heuristic was optimal in  415 of   999 cases 41.54%; geomean 1.008 (128871 runs)

@AndyAyersMS
Member Author

Fixed some other issues; latest full run data

  --- 1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005 (30777 runs)
  --- 2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007 (24565 runs)
  --- 3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006 (36522 runs)
  --- 4 candidate cases: baseline heuristic was optimal in 2030 of  3454 cases 58.77%; geomean 1.006 (58718 runs)
  --- 5 candidate cases: baseline heuristic was optimal in 1311 of  2576 cases 50.89%; geomean 1.012 (85008 runs)
  --- 6 candidate cases: baseline heuristic was optimal in  734 of  1627 cases 45.11%; geomean 1.009 (105755 runs)
  --- 7 candidate cases: baseline heuristic was optimal in  415 of   999 cases 41.54%; geomean 1.008 (128871 runs)
  --- 8 candidate cases: baseline heuristic was optimal in  454 of   979 cases 46.37%; geomean 1.009 (251603 runs)

@AndyAyersMS
Member Author

I have a version of MCMC running for larger instances, and have verified that it matches up pretty well to exhaustive enumeration in the medium-sized cases. The trend noted above continues:

  ---  6 candidate cases: baseline heuristic was optimal in 734 of 1627 cases 45.11%; geomean 1.009 baseline win from CSE 0.960 max win 0.951 (105755 runs in    775,494ms)
  ---  7 candidate cases: baseline heuristic was optimal in 415 of  999 cases 41.54%; geomean 1.008 baseline win from CSE 0.957 max win 0.949 (128871 runs in  1,742,522ms)
  ---  8 candidate cases: baseline heuristic was optimal in 451 of  979 cases 46.07%; geomean 1.009 baseline win from CSE 0.958 max win 0.949 (251603 runs in  3,634,848ms)
  ---  9 candidate cases: baseline heuristic was optimal in 278 of  711 cases 39.10%; geomean 1.008 baseline win from CSE 0.956 max win 0.949 (182727 runs in  5,070,407ms)
  --- 10 candidate cases: baseline heuristic was optimal in 183 of  637 cases 28.73%; geomean 1.008 baseline win from CSE 0.959 max win 0.951 (163709 runs in  6,406,325ms)
  --- 11 candidate cases: baseline heuristic was optimal in 142 of  526 cases 27.00%; geomean 1.008 baseline win from CSE 0.957 max win 0.949 (135182 runs in  7,543,211ms)
  --- 12 candidate cases: baseline heuristic was optimal in 231 of  535 cases 43.18%; geomean 1.009 baseline win from CSE 0.966 max win 0.957 (137495 runs in  8,664,782ms)
  --- 13 candidate cases: baseline heuristic was optimal in  86 of  342 cases 25.15%; geomean 1.009 baseline win from CSE 0.962 max win 0.953 (87894  runs in  9,376,476ms)
  --- 14 candidate cases: baseline heuristic was optimal in  63 of  265 cases 23.77%; geomean 1.009 baseline win from CSE 0.943 max win 0.935 (68105  runs in  9,937,161ms)
  --- 15 candidate cases: baseline heuristic was optimal in  51 of  237 cases 21.52%; geomean 1.009 baseline win from CSE 0.959 max win 0.951 (60909  runs in 10,473,161ms)
  --- 16 candidate cases: baseline heuristic was optimal in  75 of  262 cases 28.63%; geomean 1.007 baseline win from CSE 0.967 max win 0.960 (67334  runs in 11,120,183ms)
  --- 17 candidate cases: baseline heuristic was optimal in  72 of  246 cases 29.27%; geomean 1.007 baseline win from CSE 0.976 max win 0.969 (63222  runs in 11,702,941ms)
  --- 18 candidate cases: baseline heuristic was optimal in  34 of  196 cases 17.35%; geomean 1.007 baseline win from CSE 0.959 max win 0.952 (50372  runs in 12,185,341ms)
  --- 19 candidate cases: baseline heuristic was optimal in  29 of  198 cases 14.65%; geomean 1.010 baseline win from CSE 0.968 max win 0.959 (50886  runs in 12,697,428ms)
  --- 20 candidate cases: baseline heuristic was optimal in  24 of  172 cases 13.95%; geomean 1.010 baseline win from CSE 0.961 max win 0.951 (44204  runs in 13,134,515ms)
  --- 21 candidate cases: baseline heuristic was optimal in   6 of  110 cases  5.45%; geomean 1.010 baseline win from CSE 0.953 max win 0.944 (28270  runs in 13,445,214ms)
  --- 22 candidate cases: baseline heuristic was optimal in  13 of   97 cases 13.40%; geomean 1.006 baseline win from CSE 0.955 max win 0.949 (24929  runs in 13,706,788ms)
  --- 23 candidate cases: baseline heuristic was optimal in   4 of   92 cases  4.35%; geomean 1.011 baseline win from CSE 0.966 max win 0.955 (23644  runs in 14,004,816ms)
  --- 24 candidate cases: baseline heuristic was optimal in   5 of  119 cases  4.20%; geomean 1.013 baseline win from CSE 0.967 max win 0.955 (30583  runs in 14,316,467ms)
  --- 25 candidate cases: baseline heuristic was optimal in   7 of   63 cases 11.11%; geomean 1.011 baseline win from CSE 0.967 max win 0.957 (16191  runs in 14,510,376ms)
  --- 26 candidate cases: baseline heuristic was optimal in   2 of   97 cases  2.06%; geomean 1.010 baseline win from CSE 0.974 max win 0.964 (24929  runs in 14,789,976ms)
  --- 27 candidate cases: baseline heuristic was optimal in   1 of   74 cases  1.35%; geomean 1.022 baseline win from CSE 0.954 max win 0.934 (19018  runs in 15,023,453ms)
  --- 28 candidate cases: baseline heuristic was optimal in   7 of   73 cases  9.59%; geomean 1.015 baseline win from CSE 0.974 max win 0.960 (18761  runs in 15,263,888ms)
  --- 29 candidate cases: baseline heuristic was optimal in   1 of   69 cases  1.45%; geomean 1.009 baseline win from CSE 0.975 max win 0.966 (17733  runs in 15,469,362ms)
  --- 30 candidate cases: baseline heuristic was optimal in  17 of   66 cases 25.76%; geomean 1.011 baseline win from CSE 0.945 max win 0.934 (16962  runs in 15,683,842ms)

Using the above I can now generate perf-score-optimal results for cases with small numbers of CSEs, and can elaborate the entire value function $Q(s, a)$ either explicitly or graphically... e.g.

[image: graph of CSE subset states annotated with per-state perf scores]

Here green shows the optimal policy, dog-eared nodes show the current policy, and square boxes are terminal states.

As noted in (4) above, there doesn't seem to be order dependence in the values; if so, that greatly cuts down the size of the state space. If there is order dependence, the state space grows faster than $n!$ -- the actual growth is https://oeis.org/A000522. If there is no order dependence, then it's just $2^n$. We still need MCMC for larger cases, but we can cover a much greater fraction.

So I plan to screen all this data to see if I ever see a case where the order of CSEs matters for final perf score. If it never happens or is suitably rare, we'll just ignore it.
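
For a sense of scale, the two state-space sizes can be computed directly; a quick sketch (A000522 counts all ordered sequences of distinct candidates, including the empty one):

```python
import math

def unordered_states(n):
    return 2 ** n                     # each candidate is either done or not done

def ordered_states(n):
    # OEIS A000522: total arrangements of an n-element set,
    # i.e. sum over k of n! / (n - k)!.
    return sum(math.factorial(n) // math.factorial(n - k) for k in range(n + 1))

for n in (3, 5, 10):
    print(n, unordered_states(n), ordered_states(n))
# 3 -> 8 vs 16;  5 -> 32 vs 326;  10 -> 1024 vs 9864101
```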


The data above can't directly be used to craft a policy, since the state is a specific method instance plus a sequence of CSEs. In reality the jit won't have already seen a given method... so we need to generalize the state and try building a policy that way.

My initial plan is to use a policy gradient method. We know some of the factors that should influence CSEs, these are already considered by the current policy, though perhaps not in optimal form:

  • weighted and unweighted costs and sizes of the CSE uses and defs
  • numbers of defs and uses
  • whether the CSE is call-crossing or not
  • an estimate of register pressure -- currently we use the weighted cost of the Nth integer local, where N is initially found from a scan of the local var table, and increases somewhat with each successive CSE

So we can start with these. Since there is a large dynamic range for weighted costs, I am tempted to use a log to keep the data a bit more reasonable. We also need to assign some preference to stopping the process, so there is one extra action to model. The initial model will likely be linear, so the preference for an action (CSE) will be the dot product of an observation vector (based on that CSE and the ambient state) and a parameter vector that is all numeric and provides weights for each observation. This can be generalized to include nonlinear functions of observations (or combinations of observations) if necessary.
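
A minimal sketch of that kind of linear policy: each candidate, plus a synthetic "stop" action, gets a preference equal to the dot product of its observation vector and the parameter vector, and a softmax turns preferences into selection probabilities for the stochastic/training mode. The feature names and numbers below are placeholders, not the JIT's actual observation set:

```python
import math
import random

def preference(features, params):
    # Linear preference: dot product of the observation vector and the parameter vector.
    return sum(f * p for f, p in zip(features, params))

def softmax_choice(actions, params, rng=None):
    """actions: list of (name, feature_vector), including a synthetic 'stop' action."""
    rng = rng or random.Random(0)
    prefs = [preference(f, params) for _, f in actions]
    m = max(prefs)
    weights = [math.exp(p - m) for p in prefs]       # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(actions, weights=probs, k=1)[0], probs

# Hypothetical features: [log(weighted use cost), num uses, crosses call, pressure estimate]
actions = [
    ("cse #1", [math.log(1 + 450.0), 3.0, 1.0, 0.2]),
    ("cse #2", [math.log(1 + 12.0),  1.0, 0.0, 0.2]),
    ("stop",   [0.0, 0.0, 0.0, 0.0]),
]
params = [0.6, 0.1, -0.4, -1.0]                      # made-up parameter vector
print(softmax_choice(actions, params))
```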

For the reward we can use absolute perf scores, but these don't generalize well across methods, so I am tempted to use the perf score relative to doing no CSEs (or it could also be the relative score compared to the best known score).

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jan 22, 2024
Adds special CSE heuristic modes to the JIT to support learning a good CSE
heuristic via Policy Gradient, a form of reinforcement learning. The learning
must be orchestrated by an external process, but the JIT does all of the
actual gradient computations.

The orchestration program will be added to jitutils. The overall process
also relies on SPMI and the goal is to minimize perf score.

Introduce two new CSE heuristic policies:
* Replay: simply perform indicated sequence of CSEs
* RL: used for the Policy Gradient, with 3 modes:
  * Stochastic: based on current parameters but allows random variation
  * Greedy: based on current parameters, deterministic
  * Update: compute updated parameters per Policy Gradient

Also rework the Random policy to be a bit more random, it now alters
both the CSEs performed and the order they are performed in.

Add the ability to have jit config options that specify sequences of ints
or doubles.

Add the ability to just dump metric info for a jitted method, and add
more details (perhaps considerably more) for CSEs. This is all still
simple text format.

Also factor out a common check for "non-viable" candidates -- these are
CSE candidates that won't actually be CSEs. This leads to some minor
diffs as the check is now slightly different for CSEs with zero uses
and/or zero weighted uses.

Contributes to dotnet#92915.
@AndyAyersMS
Member Author

For reasons I find hard to justify I decided my $V(s)$ and $Q(s,a)$ graph needed colormapping. This is from the exploration done by Policy Gradient. Here (yellow) doing all 3 CSEs is best, in any order.

The top number in each state is the min perf score, the bottom is the average. As the heuristic evolves one hopes the average approaches the min, as the fumbling initial steps are more or less forgotten (the image below is after 100 runs).

Double-circle indicates what the current jit CSE heuristic does.

[image: colormapped $V(s)$ / $Q(s,a)$ state graph after 100 runs]

AndyAyersMS added a commit that referenced this issue Jan 29, 2024
)

Adds special CSE heuristic modes to the JIT to support learning a good CSE
heuristic via Policy Gradient, a form of reinforcement learning. The learning
must be orchestrated by an external process, but the JIT does all of the
actual gradient computations.

The orchestration program will be added to jitutils. The overall process
also relies on SPMI and the goal is to minimize perf score.

Introduce two new CSE heuristic policies:
* Replay: simply perform indicated sequence of CSEs
* RL: used for the Policy Gradient, with 3 modes:
  * Stochastic: based on current parameters but allows random variation
  * Greedy: based on current parameters, deterministic
  * Update: compute updated parameters per Policy Gradient

Also rework the Random policy to be a bit more random, it now alters
both the CSEs performed and the order they are performed in.

Add the ability to have jit config options that specify sequences of ints
or doubles.

Add the ability to just dump metric info for a jitted method, and add
more details (perhaps considerably more) for CSEs. This is all still
simple text format.

Also factor out a common check for "non-viable" candidates -- these are
CSE candidates that won't actually be CSEs. This leads to some minor
diffs as the check is now slightly different for CSEs with zero uses
and/or zero weighted uses.

Contributes to #92915.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 6, 2024
Some initial attempts to improve the modelling of "stopping early" when
doing CSEs. This adds a simplistic register pressure estimate modelled
on the one we have now, where we find the weight of the Nth tracked local
and use that as a reference weight for deciding when a CSE might start
becoming costly.

Contributes to dotnet#92915.
@AndyAyersMS
Member Author

At this point I'm mainly watching how the PG evolves and looking at methods that it doesn't handle well, trying to understand what about the candidates in those cases is different/unusual, to see if it's a mistaken feature value, a missing feature value, or whatnot.

An alternative idea is to use MCMC (or whatever) to broadly classify candidates as good or bad (since for the most part order does not seem to matter, goodness/badness is not contextual, and features don't evolve much over the course of an episode). Good if they are part of the best CSE sequence, bad if not. Then grab all the candidate features and compute some sort of nearness/clustering analysis in the feature space (or build a classifier), to see if we can distinguish good from bad using the features; if we can't, then how can we expect a policy to distinguish them?

If we find a pair of candidates with similar features where one is good and the other is bad, study those cases to figure out what is going on.
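
One cheap version of that separability check is a leave-one-out nearest-neighbor test over whatever feature vectors the tooling emits; if a candidate's nearest neighbor frequently has the opposite label, the features probably can't support a good policy either. A sketch with illustrative data:

```python
import math

def nearest_neighbor_agreement(samples):
    """samples: list of (feature_vector, label) where label is 'good' or 'bad'.

    Returns the fraction of samples whose nearest neighbor (excluding themselves)
    has the same label. Values near 0.5 suggest the features don't separate the
    two classes, so no policy over these features could either.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    agree = 0
    for i, (fi, li) in enumerate(samples):
        j = min((j for j in range(len(samples)) if j != i),
                key=lambda j: dist(fi, samples[j][0]))
        agree += (samples[j][1] == li)
    return agree / len(samples)

# Tiny illustrative data set: two features per candidate.
data = [([1.0, 0.1], "good"), ([0.9, 0.2], "good"), ([0.1, 1.0], "bad"), ([0.2, 0.9], "bad")]
print(nearest_neighbor_agreement(data))   # 1.0: perfectly separable in this toy example
```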

@AndyAyersMS
Member Author

I'm also intrigued by the ideas in Simple random search provides a competitive approach to reinforcement learning, which operates in the parameter space directly. It might be an interesting alternative to implement, and is similar to things I've tried in the past.

@AndyAyersMS
Member Author

Also, for features: once we have decent/stable results we can try dropping features from the set to see how much the lack of a feature impairs overall results.

We might also search for correlated features and try to decorrelate them (e.g. Gram-Schmidt orthogonalization or similar). Say the value of feature A predicts the value of feature B pretty well; then it might be worth subtracting off the predicted B (via A) from the actual B, so the reported features are A and B-(A's prediction of B).

The two ideas above are similar: if A and B are correlated, then having both provides little additional benefit over having just one of them.
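
The decorrelation idea amounts to regressing B on A and reporting B's residual in its place; a small sketch with made-up feature samples:

```python
def decorrelate(a_vals, b_vals):
    """Replace feature B with its residual after removing A's linear prediction of B."""
    n = len(a_vals)
    ma, mb = sum(a_vals) / n, sum(b_vals) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(a_vals, b_vals))
    var = sum((a - ma) ** 2 for a in a_vals)
    slope = cov / var                      # A's prediction of B is: mb + slope * (a - ma)
    return [b - (mb + slope * (a - ma)) for a, b in zip(a_vals, b_vals)]

# Made-up correlated features: B is roughly 2*A plus noise.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.1, 3.9, 6.2, 7.8]
print(decorrelate(a, b))                   # residuals: the part of B not explained by A
```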

AndyAyersMS added a commit that referenced this issue Feb 10, 2024
Based on analysis of cases where the machine learning is struggling, add some more observations
and tweak some of the existing ones:
* where we use `log` for dynamic compression, bias results so they are always non-negative
* only consider integral vars for pressure estimate
* note if a CSE has a call
* note weighted tree costs
* note weighted local occurrences (approx pressure relief)
* note spread of occurrences (as fraction of BBs)
* note if CSE is something that can be contained (guess)
* note if CSE is cheap (cost 2 or 3) and is something that can be contained
* note if CSE might be "live across" a call in LSRA block ordering

The block spread and LSRA live-across features use RPO artifacts that may no longer be up to date.
It's not clear this matters, as LSRA does not use RPO for block ordering.

Contributes to #92915.
@AndyAyersMS
Member Author

Have been looking into improving efficiency. PolicyGradient has two phases: a one-off phase where it runs method by method (using Parallel.For to launch multiple SPMI instances), and a batch phase where it processes them all (using SPMI's built-in parallelism).

;; method by method
 6231 methods in 45716ms
;; batch
35482 methods in 45099ms

So the method by method has ~6x overhead...

I have tried building all the stuff needed into release, it is (perhaps) nearly there, but that won't explain the above, as both versions are running checked jits.

The batch mode is what I'm used to on this box, about 1K methods/sec.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 21, 2024
Revise things a bit so that we can run the greedy ML heuristic in release mode,
with a built-in set of "good" parameters.

These parameters come from a policy gradient run over a 200 method training
set, on the asp.net windows x64 collection.

Contributes to dotnet#92915.
AndyAyersMS added a commit that referenced this issue Feb 21, 2024
…8729)

Revise things a bit so that we can run the greedy ML heuristic in release mode,
with a built-in set of "good" parameters.

These parameters come from a policy gradient run over a 200 method training
set, on the asp.net windows x64 collection.

Contributes to #92915.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 21, 2024
Add some minimal CI testing for this new CSE heuristic.

Remove testing of cross-block assertion prop, since it's now on by default.

Contributes to dotnet#92915.
AndyAyersMS added a commit that referenced this issue Feb 23, 2024
Add some minimal CI testing for this new CSE heuristic.

Remove testing of cross-block assertion prop, since it's now on by default.

Contributes to #92915.
@AndyAyersMS
Member Author

Here's a trial that enables the new heuristic: #98776

Have added some notes there on the behaviors.

AndyAyersMS added a commit to AndyAyersMS/jitutils that referenced this issue Feb 24, 2024
Add a new `--optimizeSize` option where PolicyGradient tries to reduce
code size instead of perf score.

Since there are now two possible optimization objectives we have the
potential for trading off one for the other.  Update MCMC to track the
"pareto frontier" for the methods it explores, and save the data for
visualization.

Contributes to dotnet/runtime#92915.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 25, 2024
Profiling showed that `GetFeatures` was a major factor in throughput. For the
most part the features of CSE candidates don't change as we perform CSEs, so
build in some logic to avoid recomputing the feature set unless there is some
evidence features have changed.

To avoid having to remove already performed candidates from the candidate vector
we now tag them as `m_performed`; these get ignored during subsequent processing,
and discarded if we ever recompute features.

This should cut the TP impact roughly in half, the remaining part seems to
largely be from doing more CSEs (which we hope will show some perf benefit).

Contributes to dotnet#92915.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 26, 2024
Profiling showed that `GetFeatures` was a major factor in throughput. For the
most part the features of CSE candidates don't change as we perform CSEs, so
build in some logic to avoid recomputing the feature set unless there is some
evidence features have changed.

To avoid having to remove already performed candidates from the candidate vector
we now tag them as `m_performed`; these get ignored during subsequent processing,
and discarded if we ever recompute features.

This should cut the TP impact roughly in half, the remaining part seems to
largely be from doing more CSEs (which we hope will show some perf benefit).

Contributes to dotnet#92915.
AndyAyersMS added a commit that referenced this issue Feb 26, 2024
Profiling showed that `GetFeatures` was a major factor in throughput. For the
most part the features of CSE candidates don't change as we perform CSEs, so
build in some logic to avoid recomputing the feature set unless there is some
evidence features have changed.

To avoid having to remove already performed candidates from the candidate vector
we now tag them as `m_performed`; these get ignored during subsequent processing,
and discarded if we ever recompute features.

This should cut the TP impact roughly in half, the remaining part seems to
largely be from doing more CSEs (which we hope will show some perf benefit).

Contributes to #92915.
AndyAyersMS added a commit to dotnet/jitutils that referenced this issue Feb 29, 2024
Add a new `--optimizeSize` option where PolicyGradient tries to reduce
code size instead of perf score.

Since there are now two possible optimization objectives, we have the
potential for trading off one for the other. Update MCMC to track the
"pareto frontier" for the methods it explores and save the data for
visualization.

Contributes to dotnet/runtime#92915.
@AndyAyersMS AndyAyersMS modified the milestones: 9.0.0, Future May 3, 2024
@JulieLeeMSFT JulieLeeMSFT moved this to Team User Stories in .NET Core CodeGen Jun 5, 2024
@JulieLeeMSFT
Member

Closing, as the .NET 9 activities are done.
We will revisit this work during .NET 10 planning.

@github-project-automation github-project-automation bot moved this from Team User Stories to Done in .NET Core CodeGen Jul 30, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 30, 2024