
JIT: Use linear block order for MinOpts in LSRA #108147

Merged
amanasifkhalid merged 15 commits into dotnet:main on Feb 18, 2025

Conversation

amanasifkhalid
Member

Follow-up from #108086 (comment). Since LSRA shouldn't create cross-block live registers in MinOpts, LSRA's block ordering shouldn't matter, so just use the lexical order we already have on hand.

cc @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 23, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member

This is quite odd... probably worth figuring out what is going on in libraries pmi (likely very few MinOpts methods, but still...)

[image: superpmi throughput diff summary]

@amanasifkhalid
Member Author

This is quite odd... probably worth figuring out what is going on in libraries pmi (likely very few MinOpts methods, but still...)

I looked at the JitDisasm summary, and there are only a handful of MinOpts methods in libraries.pmi. The standout method is Microsoft.FSharp.Linq.RuntimeHelpers.LeafExpressionConverter:ConvExprToLinqInContext, which has 2459 basic blocks. I'm not able to run pin.exe on my AMD-based DevBox to take a closer look at the TP cost for this method, but perhaps there's some locality issue with using lexical order for very large methods? For example, if we need to grab a block's predecessor, the pred is less likely to be in cache if we didn't use an RPO traversal, but this only becomes a problem when there are lots of basic blocks.

@amanasifkhalid
Member Author

Per the discussion here, I'm going to try using the optimized layout all the time to see if we can get around some of the pitfalls of using an RPO sequence on its own.

@amanasifkhalid
Member Author

This is quite odd... probably worth figuring out what is going on in libraries pmi (likely very few MinOpts methods, but still...)

Sorry for getting back to you so late on this, but I finally got pin working, thanks to Jakob. The bulk of the TP regression for the above method comes from LinearScan::processKills. I did some rudimentary profiling: both traversals call processKills the same number of times, but when using the lexical order, processKills ends up doing a lot more work for this method -- the inner loop runs about 3.5x more iterations in total. I wonder if processing register kills in an order that doesn't match control flow results in duplicated work; perhaps the register-killing logic can be simplified quite a bit for MinOpts, since we don't have cross-block live registers. @kunalspathak would you expect the register-killing behavior to change in MinOpts when the block order changes?

@amanasifkhalid
Member Author

amanasifkhalid commented Feb 18, 2025

I took another look at the TP regression noted above, and it had to do with LinearScan::getKillSetForCall relying on Compiler::compFloatingPointUsed to determine if floating-point registers need to be killed or not. For this particular method, if LSRA iterates blocks in lexical order, it encounters a block that uses FP registers much earlier than it would've if it had used RPO, and because compFloatingPointUsed is global Compiler state, we end up unnecessarily processing FP register kills for hundreds of additional blocks. If we don't have cross-block live registers, then I think we can get away with setting this state on a per-block basis. With this approach, the TP diffs improve, and we only have one context in arm64 collections with diffs -- a trivial change in FP register allocation:

-            ldr     q16, [fp, #0xB8]	// [V12 tmp2]
-            str     q16, [fp, #0xD1FFAB1E]	// [V03 loc1]
-            ldr     q16, [fp, #0xC8]	// [V12 tmp2+0x10]
-            str     q16, [fp, #0xD1FFAB1E]	// [V04 loc2]
-            ldr     q16, [fp, #0xD1FFAB1E]	// [V03 loc1]
-            ldr     q17, [@RWD16]
-            tbl     v16.16b, {v16.16b}, v17.16b
-            str     q16, [fp, #0xD1FFAB1E]	// [V03 loc1]
+            ldr     q0, [fp, #0xB8]	// [V12 tmp2]
+            str     q0, [fp, #0xD1FFAB1E]	// [V03 loc1]
+            ldr     q0, [fp, #0xC8]	// [V12 tmp2+0x10]
+            str     q0, [fp, #0xD1FFAB1E]	// [V04 loc2]
+            ldr     q0, [fp, #0xD1FFAB1E]	// [V03 loc1]
+            ldr     q1, [@RWD16]
+            tbl     v0.16b, {v0.16b}, v1.16b
+            str     q0, [fp, #0xD1FFAB1E]	// [V03 loc1]

I also noticed that we were calling findPredBlockForLiveIn unnecessarily in MinOpts; for methods like the regressed one where some blocks have hundreds of predecessors, the pred iteration in findPredBlockForLiveIn can be expensive, so I've moved this logic onto the "enregister local vars" path.

@kunalspathak could you PTAL? Thanks!

Member

@kunalspathak kunalspathak left a comment

Changes LGTM, and it was a good catch regarding findPredBlockForLiveIn(). I was hoping to see an asmdiff for the MinOpts scenario with this change -- how is it a zero diff?

@amanasifkhalid
Member Author

I was hoping to see an asmdiff for the MinOpts scenario with this change -- how is it a zero diff?

If we don't have registers live across blocks, then LSRA will never need to use a block's "ideal" predecessor to identify incoming live registers, so skipping findPredBlockForLiveIn is expected to be a no-diff change. As for the change to compFloatingPointUsed, we can now skip killing FP registers more often, though perhaps heavy FP register use intermingled with calls within a single basic block is rare enough that FP register pressure was never a big deal in MinOpts. It seems like a narrow enough case that this wouldn't improve codegen all that often.

BasicBlock* prevBlock = nullptr;

// Initialize currentLiveVars to the empty set. We will set it to the current
// live-in at the entry to each block (this will include the incoming args on
// the first block).
VarSetOps::AssignNoCopy(compiler, currentLiveVars, VarSetOps::MakeEmpty(compiler));
const bool floatingPointUsed = compiler->compFloatingPointUsed;
Member


As I understand it compFloatingPointUsed is meant to be set before we get to LSRA. What is the case where it is not set and then gets set later? I think it would be better to have some blockUsedFloatingPoint variable that is deterministically set for each block and that LSRA can use for decisions that involve only the particular block.

Member


Ideally, yes, and we should audit the cases where that is not happening. I remember running into this a while back and had to rely on it only after buildIntervals(), where it was guaranteed to be set correctly.

Member Author


In LinearScan::BuildDef, we have a check that sets compFloatingPointUsed to true if the tree we're building the definition for can't use integer registers. I believe your understanding of compFloatingPointUsed is correct -- LSRA doesn't seem to be using it all that judiciously, hence this PR. I'll try your suggestion in a follow-up.

Member Author


In LinearScan::BuildDef, we have a check that sets compFloatingPointUsed to true if the tree we're building the definition for can't use integer registers.

We duplicate this check in LinearScan::identifyCandidates, so we have a few spots in LSRA to clean up.

Member


The problem with this is that it is going to come with TP regressions when someone decides to set compFloatingPointUsed correctly in the future. That's why I would like it to be explicit if we are changing LSRA to make "block-local" decisions in some cases where that is ok.

Member

@jakobbotsch jakobbotsch Feb 18, 2025


It looks like the only place is when we create kills for calls. So likely we could just have some bool needsToKillFloatRegs that LSRA uses for this purpose, and that is block-local in MinOpts and global (using compFloatingPointUsed) in FullOpts. It would probably come with even better MinOpts TP improvements.

Member Author


The problem with this is that it is going to come with TP regressions when someone decides to set compFloatingPointUsed correctly in the future. That's why I would like it to be explicit if we are changing LSRA to make "block-local" decisions in some cases where that is ok.

That's valid, I'll open a follow-up PR for this.

@kunalspathak
Member

I was hoping to see an asmdiff for the MinOpts scenario with this change -- how is it a zero diff?

If we don't have registers live across blocks, then LSRA will never need to use a block's "ideal" predecessor to identify incoming live registers, so skipping findPredBlockForLiveIn is expected to be a no-diff change. As for the change to compFloatingPointUsed, we can now skip killing FP registers more often, though perhaps heavy FP register use intermingled with calls within a single basic block is rare enough that FP register pressure was never a big deal in MinOpts. It seems like a narrow enough case that this wouldn't improve codegen all that often.

I meant the change in setBlockSequence() should have shown asmdiffs in MinOpts?

@amanasifkhalid
Member Author

I meant the change in setBlockSequence() should have shown asmdiffs in MinOpts?

If registers aren't live across blocks, then the order in which blocks are allocated shouldn't affect codegen, right?

@amanasifkhalid amanasifkhalid merged commit 230fea1 into dotnet:main Feb 18, 2025
112 checks passed
@amanasifkhalid amanasifkhalid deleted the lsra-minopts-block-order branch February 18, 2025 20:10
amanasifkhalid added a commit that referenced this pull request Feb 19, 2025
JIT: Don't use `Compiler::compFloatingPointUsed` to check if FP kills are needed (#112668)

Follow-up to #108147 (comment). Add a flag for determining if `LinearScan::getKillSetForCall` can skip killing floating-point registers so that we can reduce dependencies on global `Compiler` state.
grendello added a commit to grendello/runtime that referenced this pull request Feb 19, 2025
* main: (27 commits)
  Fold null checks against known non-null values (dotnet#109164)
  JIT: Always track the context for late devirt (dotnet#112396)
  JIT: array allocation fixes (dotnet#112676)
  [H/3] Fix test closing connection too fast (dotnet#112691)
  Fix LINQ handling of iterator.Take(...).Last(...) (dotnet#112680)
  [browser][MT] move wasm MT CI legs to extra-platforms (dotnet#112690)
  JIT: Don't use `Compiler::compFloatingPointUsed` to check if FP kills are needed (dotnet#112668)
  [LoongArch64] Fix a typo within PR#112166. (dotnet#112672)
  Fix new EH hang on DebugBreak (dotnet#112640)
  Use encode callback instead of renting a buffer to write to in DSAKeyFormatHelper
  Move some links to other doc (dotnet#112574)
  Reflection-based XmlSerializer - Deserialize empty collections and allow for sub-types in collection items. (dotnet#111723)
  JIT: Use `fgCalledCount` for OSR method entry weight (dotnet#112662)
  Use Avx10.2 Instructions in Floating Point Conversions (dotnet#111775)
  Expose StressLog via CDAC and port StressLogAnalyzer to managed code (dotnet#104999)
  JIT: Use linear block order for MinOpts in LSRA (dotnet#108147)
  Update dependencies from https://github.com/dotnet/arcade build 20250213.2 (dotnet#112625)
  JIT: Clean up and optimize call arg lowering (dotnet#112639)
  Update dependencies from https://github.com/dotnet/emsdk build 20250217.1 (dotnet#112645)
  JIT: Support `FIELD_LIST` for returns (dotnet#112308)
  ...