
JIT: Run profile repair after frontend phases #111915

Merged
merged 7 commits into dotnet:main from late-profile-repair on Feb 21, 2025

Conversation

amanasifkhalid
Member

Part of #107749. Introduce a repair phase that runs profile synthesis if the profile was marked inconsistent at some point, and call it at the end of the JIT frontend. LSRA and block layout are likely to benefit from profile consistency, and this is late enough in the JIT that the sources of diffs should be relatively obvious. We can save a bit of TP by precomputing a loop-aware DFS for profile repair to use, and then letting LSRA reuse this traversal to avoid redundant computations -- I'll save this for a follow-up PR.
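
Roughly, the new phase amounts to the sketch below (approximate names and signatures, not the exact dotnet/runtime code): check whether any earlier phase marked the profile inconsistent, and if so, re-run synthesis in a mode that keeps the existing edge likelihoods.

```cpp
// Sketch of a late profile repair phase. Names (fgRepairProfile,
// fgPgoConsistent) and the synthesis entry point are approximations.
PhaseStatus Compiler::fgRepairProfile()
{
    if (fgPgoConsistent)
    {
        // No phase invalidated the profile; nothing to repair.
        return PhaseStatus::MODIFIED_NOTHING;
    }

    JITDUMP("Profile is inconsistent at the end of the frontend; running synthesis to repair it\n");

    // Recompute block weights from the edge likelihoods we already have,
    // rather than inventing new likelihoods from scratch.
    ProfileSynthesis::Run(this, ProfileSynthesisOption::RetainLikelihoods);

    return PhaseStatus::MODIFIED_EVERYTHING;
}
```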

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 28, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Member Author

With this change, I'm seeing LSRA introduce more branches from critical edge splitting, which I wasn't expecting. I'll dig into this.

@amanasifkhalid
Member Author

> With this change, I'm seeing LSRA introduce more branches from critical edge splitting, which I wasn't expecting. I'll dig into this.

Profile synthesis currently isn't all that diligent in retaining the method's entry weight; if we are retaining likelihoods, it will use BB_UNITY_WEIGHT instead of fgCalledCount without updating the latter, which throws off normalized weight computations in LSRA. I've opened #111971 to address this.
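
For context, backend phases see block weights normalized against fgCalledCount, roughly like the simplified sketch below (not the verbatim getBBWeight logic), so rescaling the entry to BB_UNITY_WEIGHT while leaving fgCalledCount at its old profiled value skews every weight LSRA looks at.

```cpp
// Simplified view of weight normalization: a block that runs once per call
// should normalize to BB_UNITY_WEIGHT (100). If synthesis rescales weights
// so that the entry block is BB_UNITY_WEIGHT but fgCalledCount still holds
// the old profiled call count, every normalized weight is off by that ratio.
weight_t normalizedWeight(weight_t bbWeight, weight_t fgCalledCount)
{
    if (fgCalledCount == 0)
    {
        return bbWeight;
    }
    return (bbWeight * BB_UNITY_WEIGHT) / fgCalledCount;
}
```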

amanasifkhalid added a commit that referenced this pull request Jan 31, 2025
Part of #107749. Prerequisite to #111915. Regardless of the profile synthesis option used, we ought to maintain the method's entry weight, which is computed by summing all non-flow weight into the entry block. Ideally, we'd use fgCalledCount here, but this isn't computed until after morph, and we need to tolerate the existence of multiple entry blocks for methods with OSR pre-morph.
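In other words, the entry weight that commit preserves is the weight reaching the entry block that did not arrive along an in-method flow edge, roughly as in this sketch (hypothetical helper name; the real change lives in profile synthesis):

```cpp
// The method's call count is the entry block's weight minus whatever flows
// in along flow edges (e.g. a loop back edge into the entry block).
// Hypothetical helper; names approximate.
weight_t Compiler::fgComputeEntryWeight(BasicBlock* entry)
{
    weight_t nonFlowWeight = entry->bbWeight;
    for (FlowEdge* const predEdge : entry->PredEdges())
    {
        nonFlowWeight -= predEdge->getLikelyWeight();
    }
    return (nonFlowWeight > BB_ZERO_WEIGHT) ? nonFlowWeight : BB_ZERO_WEIGHT;
}
```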
@amanasifkhalid
Member Author

With #111971 merged in, I'm not seeing profile repair overwrite fgCalledCount anymore.

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs and TP impact show that profile repair needs to kick in pretty often in our PGO collections by the time we get to the backend. I'd expect that having profile repair use a profile-aware DFS tree wouldn't change behavior, and then we could reuse this throughout LSRA (and layout, if LSRA doesn't introduce new blocks); this might bring down the TP cost a little. Thanks!

@amanasifkhalid
Member Author

@AndyAyersMS no rush on this, but since I have a few high-churn layout PRs coming up, would you prefer to see the profile changes go in first, or should we get the 3-opt consolidation work in first? I have no preference.

@AndyAyersMS
Member

Let's plan on taking this one first.

There seems to be a big size improvement/regression imbalance in some (but not all) collections for benchmarks.run_pgo -- any idea why this is so prominent in linux x64 or windows x86 but not in windows x64?

@amanasifkhalid
Member Author

amanasifkhalid commented Feb 10, 2025

> There seems to be a big size improvement/regression imbalance in some (but not all) collections for benchmarks.run_pgo -- any idea why this is so prominent in linux x64 or windows x86 but not in windows x64?

The jit-analyze summaries aren't all that enlightening... I think it's just a matter of numbers, in this case: on win-x64, only 45k methods make it to FullOpts. It's 81k methods on win-x86, and 117k methods on linux-x64. For these methods that are optimized on some platforms but not others, I suspect the ratio of size regressions to size improvements is larger, as the growth in regressions outpaces the increase in optimized methods.

@AndyAyersMS
Member

> There seems to be a big size improvement/regression imbalance in some (but not all) collections for benchmarks.run_pgo -- any idea why this is so prominent in linux x64 or windows x86 but not in windows x64?

> The jit-analyze summaries aren't all that enlightening... I think it's just a matter of numbers, in this case: on win-x64, only 45k methods make it to FullOpts. It's 81k methods on win-x86, and 117k methods on linux-x64. For these methods that are optimized on some platforms but not others, I suspect the ratio of size regressions to size improvements is larger, as the growth in regressions outpaces the increase in optimized methods.

That also seems strange... I would not expect the SPMI method mix to vary this widely across platforms. So I'm not sure what to make of the results anymore.

@amanasifkhalid
Member Author

> That also seems strange... I would not expect the SPMI method mix to vary this widely across platforms. So I'm not sure what to make of the results anymore.

After offline discussion, I opened #112380 to track this issue of different numbers of duplicated contexts across platforms.

@amanasifkhalid
Member Author

amanasifkhalid commented Feb 12, 2025

Unsurprisingly, synthesis is prone to taking multiple iterations to converge for OSR methods, which typically leads to loop weights being flattened by increasingly blended likelihoods. It sounds counterintuitive, but I think if we tell profile synthesis to retain edge likelihoods, we should avoid blending likelihoods at all, even if it means we cannot re-establish consistency. I suspect we're better off with a moderately inconsistent profile that still resembles the PGO data we collected, rather than something consistent but entirely different. As a motivating example, here's a method from aspnet before late profile repair:


---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0009]  1                             4.28   13 [???..???)-> BB06(1)                 (always)                     i IBC internal
BB03 [0001]  1       BB16                166.07  498 [004..01C)-> BB05(1)                 (always)                     i IBC bwd bwd-target
BB04 [0002]  2       BB09,BB15           344.62 1034 [01C..044)-> BB05(1)                 (always)                     i IBC bwd bwd-target no-cse
BB05 [0003]  2       BB03,BB04           522.54 1568 [044..048)-> BB16(0.161),BB06(0.839) ( cond )                     i IBC bwd
BB06 [0005]  2       BB05,BB01            35.35  106 [048..051)-> BB08(1),BB07(0)         ( cond )                     i IBC bwd bwd-src osr-entry
BB07 [0046]  1       BB06                  0       0 [050..051)-> BB16(1)                 (always)                     i IBC rare bwd
BB08 [0047]  1       BB06                 35.35  106 [050..051)-> BB10(1),BB09(0)         ( cond )                     i IBC bwd
BB09 [0048]  1       BB08                  0       0 [050..051)-> BB04(1)                 (always)                     i IBC rare bwd
BB10 [0049]  1       BB08                 35.35  106 [050..051)-> BB12(1),BB11(0)         ( cond )                     i IBC bwd
BB11 [0050]  1       BB10                  0       0 [050..051)-> BB16(1)                 (always)                     i IBC rare bwd
BB12 [0051]  1       BB10                 35.35  106 [050..051)-> BB14(0.597),BB13(0.403) ( cond )                     i IBC bwd
BB13 [0052]  1       BB12                 14.23   43 [050..051)-> BB15(1)                 (always)                     i IBC bwd
BB14 [0053]  1       BB12                 21.12   63 [050..051)-> BB15(1)                 (always)                     i IBC hascall gcsafe bwd
BB15 [0054]  2       BB13,BB14           438.42 1315 [050..065)-> BB04(0.806),BB16(0.194) ( cond )                     i IBC bwd no-cse
BB16 [0006]  4       BB05,BB07,BB11,BB15 167.07  501 [065..08C)-> BB03(0.994),BB18(0.00595)   ( cond )                     i IBC bwd no-cse
BB18 [0008]  1       BB16                  0.99    3 [08C..08D)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

And after multiple iterations, with a final blend factor of 1:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0009]  1                             1     13 [???..???)-> BB06(1)                 (always)                     i IBC internal
BB03 [0001]  1       BB16                  1.90  24 [004..01C)-> BB05(1)                 (always)                     i IBC bwd bwd-target
BB04 [0002]  2       BB09,BB15             0.85  11 [01C..044)-> BB05(1)                 (always)                     i IBC bwd bwd-target no-cse
BB05 [0003]  2       BB03,BB04             2.75  35 [044..048)-> BB16(0.344),BB06(0.656) ( cond )                     i IBC bwd
BB06 [0005]  2       BB05,BB01             2.81  36 [048..051)-> BB08(0.48),BB07(0.52)   ( cond )                     i IBC bwd bwd-src osr-entry
BB07 [0046]  1       BB06                  1.46  19 [050..051)-> BB16(1)                 (always)                     i IBC bwd
BB08 [0047]  1       BB06                  1.35  17 [050..051)-> BB10(0.48),BB09(0.52)   ( cond )                     i IBC bwd
BB09 [0048]  1       BB08                  0.70   9 [050..051)-> BB04(1)                 (always)                     i IBC bwd
BB10 [0049]  1       BB08                  0.65   8 [050..051)-> BB12(0.48),BB11(0.52)   ( cond )                     i IBC bwd
BB11 [0050]  1       BB10                  0.34   4 [050..051)-> BB16(1)                 (always)                     i IBC bwd
BB12 [0051]  1       BB10                  0.31   4 [050..051)-> BB14(0.48),BB13(0.52)   ( cond )                     i IBC bwd
BB13 [0052]  1       BB12                  0.16   2 [050..051)-> BB15(1)                 (always)                     i IBC bwd
BB14 [0053]  1       BB12                  0.15   2 [050..051)-> BB15(1)                 (always)                     i IBC hascall gcsafe bwd
BB15 [0054]  2       BB13,BB14             0.31   4 [050..065)-> BB04(0.48),BB16(0.52)   ( cond )                     i IBC bwd no-cse
BB16 [0006]  4       BB05,BB07,BB11,BB15   2.90  37 [065..08C)-> BB03(0.656),BB18(0.344) ( cond )                     i IBC bwd no-cse
BB18 [0008]  1       BB16                  1.00  13 [08C..08D)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Notice how many loop blocks on the hot path have weights that suggest they run less than one time per method execution on average. This is an extreme example, but I think it's sufficient motivation to trade consistency for profile fidelity. Most of the time, we just need synthesis to do the trivial task of propagating flow into blocks that were expanded, consolidated, etc.
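
For reference, blending dilutes the PGO likelihoods roughly as in the sketch below (illustrative only; the JIT's actual blending policy differs in detail). Each failed convergence attempt raises the blend factor, so by the time it reaches 1, the profiled likelihoods contribute nothing, which is how the 0.806 likelihood on BB15 -> BB04 above ends up at 0.48.

```cpp
// Illustrative only: repeated blending pulls PGO edge likelihoods toward the
// heuristic guess; at blendFactor == 1.0 the profiled likelihood is gone.
double blendLikelihood(double pgoLikelihood, double heuristicLikelihood, double blendFactor)
{
    return ((1.0 - blendFactor) * pgoLikelihood) + (blendFactor * heuristicLikelihood);
}
```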

Also, notice how the entry weight doesn't initially match the exit weight above. This is another problem with OSR methods: When creating a new method entry block to jump to the OSR entry, we guess that its weight should be 1% of the OSR entry block's weight. This almost never matches fgCalledCount, so when we run synthesis after morph has canonicalized the method entry, fgCalledCount is changed to this canned weight, which churns LSRA (and any other backend phase reliant on normalized weights). I think we ought to just use the early computation of fgCalledCount for the fixed entry block's weight, but I'll do that in a separate PR.
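
A rough sketch of that direction, i.e., seed the new method entry block with the early fgCalledCount instead of the 1% guess (hypothetical helper; the actual fix landed separately):

```cpp
// During OSR entry canonicalization, give the new method entry block the
// call count computed early from the profile, so a later synthesis pass
// never needs to overwrite fgCalledCount to restore consistency.
// Hypothetical sketch; names approximate.
void Compiler::fgSetOsrEntryWeight(BasicBlock* methodEntry, BasicBlock* osrEntry)
{
    methodEntry->setBBProfileWeight(fgCalledCount);

    // All of that weight flows along the single edge into the OSR entry.
    FlowEdge* const entryEdge = fgGetPredForBlock(osrEntry, methodEntry);
    entryEdge->setLikelihood(1.0);
}
```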

amanasifkhalid added a commit that referenced this pull request Feb 18, 2025
I noticed that tweaking the computation of the entry weight for OSR methods (in service of #111915) incurred large inlining diffs. This is due to the fact that we use the entry block's weight to compute the normalized weight of a call site. This means we will get the wrong normalized weight for call sites when the entry block's weight diverges from the method's call count. This is currently possible only when the entry block is part of a loop, or when we compute a weight for the OSR entry fixup block that differs from fgCalledCount (which we almost always do). The correct thing to do is to use the root method's call count to normalize the call site's weight.
amanasifkhalid added a commit that referenced this pull request Feb 19, 2025
Follow-up to #112499. By ensuring OSR method entry blocks' weights match fgCalledCount during entry canonicalization, we can guarantee that later runs of profile synthesis won't modify fgCalledCount. This should unblock #111915 (comment).
@amanasifkhalid
Member Author

@AndyAyersMS the diffs are still all over the place, but I'm feeling decently confident about this change's efficacy at this point. Now that #112662 is in, late profile repair should never modify fgCalledCount, so we shouldn't see any more/less critical edge splitting in LSRA due to normalized weights getting thrown off. Late profile repair also won't modify edge likelihoods at all, so we aren't giving up any fidelity here; at worst, we'll accept an inconsistent profile that otherwise resembles its initial shape.

@amanasifkhalid
Member Author

@AndyAyersMS are you ok with this going in as-is?

@AndyAyersMS
Member

Yes, let's go ahead with this.

@amanasifkhalid amanasifkhalid merged commit ea43e17 into dotnet:main Feb 21, 2025
112 checks passed
@amanasifkhalid amanasifkhalid deleted the late-profile-repair branch February 21, 2025 16:40