
JIT: Run profile repair after frontend phases #111915

Merged
merged 7 commits into dotnet:main from late-profile-repair on Feb 21, 2025

Conversation

amanasifkhalid
Member

Part of #107749. Introduce a repair phase that runs profile synthesis if the profile was marked inconsistent at some point, and call it at the end of the JIT frontend. LSRA and block layout are likely to benefit from profile consistency, and this is late enough in the JIT that the sources of diffs should be relatively obvious. We can save a bit of TP by precomputing a loop-aware DFS for profile repair to use, and then letting LSRA reuse this traversal to avoid redundant computations -- I'll save this for a follow-up PR.
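
Roughly, the new phase amounts to the sketch below (approximate names and signatures, not the exact dotnet/runtime code): check whether any earlier phase marked the profile inconsistent, and if so, re-run synthesis in a mode that keeps the existing edge likelihoods.

```cpp
// Sketch of a late profile repair phase. Names (fgRepairProfile,
// fgPgoConsistent) and the synthesis entry point are approximations.
PhaseStatus Compiler::fgRepairProfile()
{
    if (fgPgoConsistent)
    {
        // No phase invalidated the profile; nothing to repair.
        return PhaseStatus::MODIFIED_NOTHING;
    }

    JITDUMP("Profile is inconsistent at the end of the frontend; running synthesis to repair it\n");

    // Recompute block weights from the edge likelihoods we already have,
    // rather than inventing new likelihoods from scratch.
    ProfileSynthesis::Run(this, ProfileSynthesisOption::RetainLikelihoods);

    return PhaseStatus::MODIFIED_EVERYTHING;
}
```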

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 28, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Member Author

With this change, I'm seeing LSRA introduce more branches from critical edge splitting, which I wasn't expecting. I'll dig into this.

@amanasifkhalid
Member Author

> With this change, I'm seeing LSRA introduce more branches from critical edge splitting, which I wasn't expecting. I'll dig into this.

Profile synthesis currently isn't all that diligent in retaining the method's entry weight; if we are retaining likelihoods, it will use BB_UNITY_WEIGHT instead of fgCalledCount without updating the latter, which throws off normalized weight computations in LSRA. I've opened #111971 to address this.
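
For context, backend phases see block weights normalized against fgCalledCount, roughly like the simplified sketch below (not the verbatim getBBWeight logic), so rescaling the entry to BB_UNITY_WEIGHT while leaving fgCalledCount at its old profiled value skews every weight LSRA looks at.

```cpp
// Simplified view of weight normalization: a block that runs once per call
// should normalize to BB_UNITY_WEIGHT (100). If synthesis rescales weights
// so that the entry block is BB_UNITY_WEIGHT but fgCalledCount still holds
// the old profiled call count, every normalized weight is off by that ratio.
weight_t normalizedWeight(weight_t bbWeight, weight_t fgCalledCount)
{
    if (fgCalledCount == 0)
    {
        return bbWeight;
    }
    return (bbWeight * BB_UNITY_WEIGHT) / fgCalledCount;
}
```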

amanasifkhalid added a commit that referenced this pull request Jan 31, 2025
Part of #107749. Prerequisite to #111915. Regardless of the profile synthesis option used, we ought to maintain the method's entry weight, which is computed by summing all non-flow weight into the entry block. Ideally, we'd use fgCalledCount here, but this isn't computed until after morph, and we need to tolerate the existence of multiple entry blocks for methods with OSR pre-morph.
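In other words, the entry weight that commit preserves is the weight reaching the entry block that did not arrive along an in-method flow edge, roughly as in this sketch (hypothetical helper name; the real change lives in profile synthesis):

```cpp
// The method's call count is the entry block's weight minus whatever flows
// in along flow edges (e.g. a loop back edge into the entry block).
// Hypothetical helper; names approximate.
weight_t Compiler::fgComputeEntryWeight(BasicBlock* entry)
{
    weight_t nonFlowWeight = entry->bbWeight;
    for (FlowEdge* const predEdge : entry->PredEdges())
    {
        nonFlowWeight -= predEdge->getLikelyWeight();
    }
    return (nonFlowWeight > BB_ZERO_WEIGHT) ? nonFlowWeight : BB_ZERO_WEIGHT;
}
```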
@amanasifkhalid
Member Author

With #111971 merged in, I'm not seeing profile repair overwrite fgCalledCount anymore.

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs and TP impact show that profile repair needs to kick in pretty often in our PGO collections by the time we get to the backend. I'd expect that having profile repair use a profile-aware DFS tree wouldn't change behavior, and then we could reuse this throughout LSRA (and layout, if LSRA doesn't introduce new blocks); this might bring down the TP cost a little. Thanks!

@amanasifkhalid
Member Author

@AndyAyersMS no rush on this, but since I have a few high-churn layout PRs coming up, would you prefer to see the profile changes go in first, or should we get the 3-opt consolidation work in first? I have no preference.

@AndyAyersMS
Member

Let's plan on taking this one first.

There seems to be a big size improvement/regression imbalance in some (but not all) collections for benchmarks.run_pgo -- any idea why this is so prominent in linux x64 or windows x86 but not in windows x64?

@amanasifkhalid
Member Author

amanasifkhalid commented Feb 10, 2025

> There seems to be a big size improvement/regression imbalance in some (but not all) collections for benchmarks.run_pgo -- any idea why this is so prominent in linux x64 or windows x86 but not in windows x64?

The jit-analyze summaries aren't all that enlightening... I think it's just a matter of numbers, in this case: on win-x64, only 45k methods make it to FullOpts. It's 81k methods on win-x86, and 117k methods on linux-x64. For these methods that are optimized on some platforms but not others, I suspect the ratio of size regressions to size improvements is larger, as the growth in regressions outpaces the increase in optimized methods.

@AndyAyersMS
Member

> There seems to be a big size improvement/regression imbalance in some (but not all) collections for benchmarks.run_pgo -- any idea why this is so prominent in linux x64 or windows x86 but not in windows x64?

> The jit-analyze summaries aren't all that enlightening... I think it's just a matter of numbers, in this case: on win-x64, only 45k methods make it to FullOpts. It's 81k methods on win-x86, and 117k methods on linux-x64. For these methods that are optimized on some platforms but not others, I suspect the ratio of size regressions to size improvements is larger, as the growth in regressions outpaces the increase in optimized methods.

That also seems strange... I would not expect the SPMI method mix to vary this widely across platforms. So I'm not sure what to make of the results anymore.

@amanasifkhalid
Member Author

> That also seems strange... I would not expect the SPMI method mix to vary this widely across platforms. So I'm not sure what to make of the results anymore.

After offline discussion, I opened #112380 to track this issue of different numbers of duplicated contexts across platforms.

@amanasifkhalid
Member Author

amanasifkhalid commented Feb 12, 2025

Unsurprisingly, synthesis is prone to taking multiple iterations to converge for OSR methods, which typically leads to loop weights being flattened by increasingly blended likelihoods. It sounds counterintuitive, but I think if we tell profile synthesis to retain edge likelihoods, we should avoid blending likelihoods at all, even if it means we cannot re-establish consistency. I suspect we're better off with a moderately inconsistent profile that still resembles the PGO data we collected, rather than something consistent but entirely different. As a motivating example, here's a method from aspnet before late profile repair:


---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0009]  1                             4.28   13 [???..???)-> BB06(1)                 (always)                     i IBC internal
BB03 [0001]  1       BB16                166.07  498 [004..01C)-> BB05(1)                 (always)                     i IBC bwd bwd-target
BB04 [0002]  2       BB09,BB15           344.62 1034 [01C..044)-> BB05(1)                 (always)                     i IBC bwd bwd-target no-cse
BB05 [0003]  2       BB03,BB04           522.54 1568 [044..048)-> BB16(0.161),BB06(0.839) ( cond )                     i IBC bwd
BB06 [0005]  2       BB05,BB01            35.35  106 [048..051)-> BB08(1),BB07(0)         ( cond )                     i IBC bwd bwd-src osr-entry
BB07 [0046]  1       BB06                  0       0 [050..051)-> BB16(1)                 (always)                     i IBC rare bwd
BB08 [0047]  1       BB06                 35.35  106 [050..051)-> BB10(1),BB09(0)         ( cond )                     i IBC bwd
BB09 [0048]  1       BB08                  0       0 [050..051)-> BB04(1)                 (always)                     i IBC rare bwd
BB10 [0049]  1       BB08                 35.35  106 [050..051)-> BB12(1),BB11(0)         ( cond )                     i IBC bwd
BB11 [0050]  1       BB10                  0       0 [050..051)-> BB16(1)                 (always)                     i IBC rare bwd
BB12 [0051]  1       BB10                 35.35  106 [050..051)-> BB14(0.597),BB13(0.403) ( cond )                     i IBC bwd
BB13 [0052]  1       BB12                 14.23   43 [050..051)-> BB15(1)                 (always)                     i IBC bwd
BB14 [0053]  1       BB12                 21.12   63 [050..051)-> BB15(1)                 (always)                     i IBC hascall gcsafe bwd
BB15 [0054]  2       BB13,BB14           438.42 1315 [050..065)-> BB04(0.806),BB16(0.194) ( cond )                     i IBC bwd no-cse
BB16 [0006]  4       BB05,BB07,BB11,BB15 167.07  501 [065..08C)-> BB03(0.994),BB18(0.00595)   ( cond )                     i IBC bwd no-cse
BB18 [0008]  1       BB16                  0.99    3 [08C..08D)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

And after multiple iterations, with a final blend factor of 1:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0009]  1                             1     13 [???..???)-> BB06(1)                 (always)                     i IBC internal
BB03 [0001]  1       BB16                  1.90  24 [004..01C)-> BB05(1)                 (always)                     i IBC bwd bwd-target
BB04 [0002]  2       BB09,BB15             0.85  11 [01C..044)-> BB05(1)                 (always)                     i IBC bwd bwd-target no-cse
BB05 [0003]  2       BB03,BB04             2.75  35 [044..048)-> BB16(0.344),BB06(0.656) ( cond )                     i IBC bwd
BB06 [0005]  2       BB05,BB01             2.81  36 [048..051)-> BB08(0.48),BB07(0.52)   ( cond )                     i IBC bwd bwd-src osr-entry
BB07 [0046]  1       BB06                  1.46  19 [050..051)-> BB16(1)                 (always)                     i IBC bwd
BB08 [0047]  1       BB06                  1.35  17 [050..051)-> BB10(0.48),BB09(0.52)   ( cond )                     i IBC bwd
BB09 [0048]  1       BB08                  0.70   9 [050..051)-> BB04(1)                 (always)                     i IBC bwd
BB10 [0049]  1       BB08                  0.65   8 [050..051)-> BB12(0.48),BB11(0.52)   ( cond )                     i IBC bwd
BB11 [0050]  1       BB10                  0.34   4 [050..051)-> BB16(1)                 (always)                     i IBC bwd
BB12 [0051]  1       BB10                  0.31   4 [050..051)-> BB14(0.48),BB13(0.52)   ( cond )                     i IBC bwd
BB13 [0052]  1       BB12                  0.16   2 [050..051)-> BB15(1)                 (always)                     i IBC bwd
BB14 [0053]  1       BB12                  0.15   2 [050..051)-> BB15(1)                 (always)                     i IBC hascall gcsafe bwd
BB15 [0054]  2       BB13,BB14             0.31   4 [050..065)-> BB04(0.48),BB16(0.52)   ( cond )                     i IBC bwd no-cse
BB16 [0006]  4       BB05,BB07,BB11,BB15   2.90  37 [065..08C)-> BB03(0.656),BB18(0.344) ( cond )                     i IBC bwd no-cse
BB18 [0008]  1       BB16                  1.00  13 [08C..08D)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Notice how many loop blocks on the hot path have weights that suggest they run less than one time per method execution on average. This is an extreme example, but I think it's sufficient motivation to trade consistency for profile fidelity. Most of the time, we just need synthesis to do the trivial task of propagating flow into blocks that were expanded, consolidated, etc.
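
For reference, blending dilutes the PGO likelihoods roughly as in the sketch below (illustrative only; the JIT's actual blending policy differs in detail). Each failed convergence attempt raises the blend factor, so by the time it reaches 1, the profiled likelihoods contribute nothing, which is how the 0.806 likelihood on BB15 -> BB04 above ends up at 0.48.

```cpp
// Illustrative only: repeated blending pulls PGO edge likelihoods toward the
// heuristic guess; at blendFactor == 1.0 the profiled likelihood is gone.
double blendLikelihood(double pgoLikelihood, double heuristicLikelihood, double blendFactor)
{
    return ((1.0 - blendFactor) * pgoLikelihood) + (blendFactor * heuristicLikelihood);
}
```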

Also, notice how the entry weight doesn't initially match the exit weight above. This is another problem with OSR methods: When creating a new method entry block to jump to the OSR entry, we guess that its weight should be 1% of the OSR entry block's weight. This almost never matches fgCalledCount, so when we run synthesis after morph has canonicalized the method entry, fgCalledCount is changed to this canned weight, which churns LSRA (and any other backend phase reliant on normalized weights). I think we ought to just use the early computation of fgCalledCount for the fixed entry block's weight, but I'll do that in a separate PR.
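
A rough sketch of that direction, i.e., seed the new method entry block with the early fgCalledCount instead of the 1% guess (hypothetical helper; the actual fix landed separately):

```cpp
// During OSR entry canonicalization, give the new method entry block the
// call count computed early from the profile, so a later synthesis pass
// never needs to overwrite fgCalledCount to restore consistency.
// Hypothetical sketch; names approximate.
void Compiler::fgSetOsrEntryWeight(BasicBlock* methodEntry, BasicBlock* osrEntry)
{
    methodEntry->setBBProfileWeight(fgCalledCount);

    // All of that weight flows along the single edge into the OSR entry.
    FlowEdge* const entryEdge = fgGetPredForBlock(osrEntry, methodEntry);
    entryEdge->setLikelihood(1.0);
}
```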

amanasifkhalid added a commit that referenced this pull request Feb 18, 2025
I noticed that tweaking the computation of the entry weight for OSR methods (in service of #111915) incurred large inlining diffs. This is due to the fact that we use the entry block's weight to compute the normalized weight of a call site. This means we will get the wrong normalized weight for call sites when the entry block's weight diverges from the method's call count. This is currently possible only when the entry block is part of a loop, or when we compute a weight for the OSR entry fixup block that differs from fgCalledCount (which we almost always do). The correct thing to do is to use the root method's call count to normalize the call site's weight.
amanasifkhalid added a commit that referenced this pull request Feb 19, 2025
Follow-up to #112499. By ensuring OSR method entry blocks' weights match fgCalledCount during entry canonicalization, we can guarantee that later runs of profile synthesis won't modify fgCalledCount. This should unblock #111915 (comment).
@amanasifkhalid
Member Author

@AndyAyersMS the diffs are still all over the place, but I'm feeling decently confident about this change's efficacy at this point. Now that #112662 is in, late profile repair should never modify fgCalledCount, so we shouldn't see any more/less critical edge splitting in LSRA due to normalized weights getting thrown off. Late profile repair also won't modify edge likelihoods at all, so we aren't giving up any fidelity here; at worst, we'll accept an inconsistent profile that otherwise resembles its initial shape.

@amanasifkhalid
Member Author

@AndyAyersMS are you ok with this going in as-is?

@AndyAyersMS
Member

Yes, let's go ahead with this.

@amanasifkhalid amanasifkhalid merged commit ea43e17 into dotnet:main Feb 21, 2025
112 checks passed
@amanasifkhalid amanasifkhalid deleted the late-profile-repair branch February 21, 2025 16:40