JIT: Run profile repair after frontend phases #111915
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
With this change, I'm seeing LSRA introduce more branches from critical edge splitting, which I wasn't expecting. I'll dig into this.
Profile synthesis currently isn't all that diligent in retaining the method's entry weight; if we are retaining likelihoods, it will use …
Part of #107749. Prerequisite to #111915. Regardless of the profile synthesis option used, we ought to maintain the method's entry weight, which is computed by summing all non-flow weight into the entry block. Ideally, we'd use fgCalledCount here, but that isn't computed until after morph, and pre-morph we need to tolerate multiple entry blocks for methods with OSR.
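For illustration, here is a minimal sketch of that entry-weight estimate, assuming a hypothetical `getEntryBlocks()` helper (not a real JIT method) standing in for "fgFirstBB, plus any OSR entry that still acts as a second entry pre-morph":

```cpp
// Rough sketch, not the JIT's actual code: estimate the method's entry weight by
// summing, over each entry block, the portion of its weight not explained by
// incoming flow edges.
weight_t entryWeight = 0.0;
for (BasicBlock* const entryBlock : getEntryBlocks()) // hypothetical helper
{
    weight_t inflow = 0.0;
    for (FlowEdge* const predEdge : entryBlock->PredEdges())
    {
        inflow += predEdge->getLikelyWeight();
    }

    // Weight that flow edges don't account for is assumed to come from calls.
    const weight_t nonFlowWeight = entryBlock->bbWeight - inflow;
    entryWeight += (nonFlowWeight > 0.0) ? nonFlowWeight : 0.0;
}
```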
With #111971 merged in, I'm not seeing profile repair overwrite …

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs and TP impact show that profile repair needs to kick in pretty often in our PGO collections by the time we get to the backend. I'd expect that having profile repair use a profile-aware DFS tree wouldn't change behavior, and then we could reuse this throughout LSRA (and layout, if LSRA doesn't introduce new blocks); this might bring down the TP cost a little. Thanks!
@AndyAyersMS no rush on this, but since I have a few high-churn layout PRs coming up, would you prefer to see the profile changes go in first, or should we get the 3-opt consolidation work in first? I have no preference.
Let's plan on taking this one first.
There seems to be a big size improvement/regression imbalance in some (but not all) collections for benchmarks.run_pgo -- any idea why this is so prominent in linux x64 or windows x86 but not in windows x64?
The jit-analyze summaries aren't all that enlightening... I think it's just a matter of numbers, in this case: on win-x64, only 45k methods make it to FullOpts. It's 81k methods on win-x86, and 117k methods on linux-x64. For these methods that are optimized on some platforms but not others, I suspect the ratio of size regressions to size improvements is larger, as the growth in regressions outpaces the increase in optimized methods.
That also seems strange... I would not expect the SPMI method mix to vary this widely across platforms. So I'm not sure what to make of the results anymore.
After offline discussion, I opened #112380 to track this issue of different numbers of duplicated contexts across platforms. |
Unsurprisingly, synthesis is prone to taking multiple iterations to converge for OSR methods, which typically leads to loop weights being flattened by increasingly blended likelihoods. It sounds counterintuitive, but I think if we tell profile consistency to retain edge likelihoods, we should avoid blending likelihoods at all, even if it means we cannot re-establish consistency. I suspect we're better off with a moderately inconsistent profile that still resembles the PGO data we collected, rather than something consistent but entirely different. As a motivating example, here's a method from …
And after multiple iterations, with a final blend factor of 1:
Notice how many loop blocks on the hot path have weights that suggest they run less than one time per method execution on average. This is an extreme example, but I think it's sufficient motivation to trade consistency for profile fidelity. Most of the time, we just need synthesis to do the trivial task of propagating flow into blocks that were expanded, consolidated, etc. Also, notice how the entry weight doesn't initially match the exit weight above. This is another problem with OSR methods: when creating a new method entry block to jump to the OSR entry, we guess that its weight should be 1% of the OSR entry block's weight. This almost never matches …
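To make the flattening effect concrete, here is a rough illustration, assuming a simple linear blend toward a heuristic likelihood; the formula and constants are illustrative, not the JIT's actual synthesis code:

```cpp
#include <cstdio>
#include <initializer_list>

// Illustrative only: repeatedly blending an observed back-edge likelihood toward a
// heuristic value flattens the implied loop weight; a final blend factor of 1
// discards the observed profile entirely.
int main()
{
    double likelihood = 0.99;     // observed back-edge likelihood: ~100 iterations per entry
    const double heuristic = 0.5; // assumed synthetic fallback likelihood

    for (double blend : {0.25, 0.5, 1.0})
    {
        likelihood = (1.0 - blend) * likelihood + blend * heuristic;
        printf("blend %.2f -> likelihood %.3f, implied trip count %.1f\n",
               blend, likelihood, 1.0 / (1.0 - likelihood));
    }
    // The final pass (blend factor 1) leaves likelihood == 0.5, so the loop now looks
    // like it runs ~2 times per entry instead of ~100.
    return 0;
}
```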
I noticed that tweaking the computation of the entry weight for OSR methods (in service of #111915) incurred large inlining diffs. This is because we use the entry block's weight to compute the normalized weight of a call site, which means we get the wrong normalized weight when the entry block's weight diverges from the method's call count. This is currently possible only when the entry block is part of a loop, or when we compute a weight for the OSR entry fixup block that differs from fgCalledCount (which we almost always do). The correct thing to do is to use the root method's call count to normalize the call site's weight.
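As a minimal sketch of the intended normalization (the helper name is hypothetical; fgCalledCount and bbWeight are the real fields being referenced):

```cpp
// Minimal sketch, assuming a hypothetical helper: normalize a call site's block weight
// by the root method's call count (fgCalledCount) rather than by the entry block's
// weight, so the result means "executions per call to the root method" even when the
// entry block sits in a loop or an OSR entry fixup block carries a guessed weight.
weight_t NormalizedCallSiteWeight(Compiler* rootCompiler, BasicBlock* callSiteBlock)
{
    const weight_t calledCount = rootCompiler->fgCalledCount;
    if (calledCount <= 0.0)
    {
        return 0.0; // no usable profile for the root method
    }
    return callSiteBlock->bbWeight / calledCount;
}
```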
@AndyAyersMS the diffs are still all over the place, but I'm feeling decently confident about this change's efficacy at this point. Now that #112662 is in, late profile repair should never modify …
@AndyAyersMS are you ok with this going in as-is?
Yes, let's go ahead with this.
Part of #107749. Introduce a repair phase that runs profile synthesis if the profile was marked inconsistent at some point, and call it at the end of the JIT frontend. LSRA and block layout are likely to benefit from profile consistency, and this is late enough in the JIT that the sources of diffs should be relatively obvious. We can save a bit of TP by precomputing a loop-aware DFS for profile repair to use, and then letting LSRA reuse this traversal to avoid redundant computations -- I'll save this for a follow-up PR.
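As a rough sketch of the shape of such a phase (the method name here is hypothetical, and the exact phase wiring and synthesis option may differ from the PR):

```cpp
// Rough sketch, not the PR's exact code: a late frontend phase that reruns profile
// synthesis only when some earlier phase marked the profile inconsistent, asking
// synthesis to keep the edge likelihoods we already have.
PhaseStatus Compiler::fgRepairProfile() // hypothetical phase name
{
    if (fgPgoConsistent)
    {
        // The profile survived the frontend intact; nothing to repair.
        return PhaseStatus::MODIFIED_NOTHING;
    }

    // Re-propagate block weights from the retained edge likelihoods.
    ProfileSynthesis::Run(this, ProfileSynthesisOption::RetainLikelihoods);
    return PhaseStatus::MODIFIED_EVERYTHING;
}
```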