
JIT: Use linear block order for MinOpts in LSRA #108147

Merged
amanasifkhalid merged 15 commits into dotnet:main on Feb 18, 2025

Conversation

amanasifkhalid
Member

Follow-up from #108086 (comment). Since LSRA shouldn't create cross-block live registers in MinOpts, LSRA's block ordering shouldn't matter, so just use the lexical order we already have on hand.

cc @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 23, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member

This is quite odd... probably worth figuring out what is going on in libraries pmi (likely very few MinOpts methods, but still...)

[image: superpmi throughput diff summary]

@amanasifkhalid
Member Author

This is quite odd... probably worth figuring out what is going on in libraries pmi (likely very few MinOpts methods, but still...)

I looked at the JitDisasm summary, and there are only a handful of MinOpts methods in libraries.pmi. The standout method is Microsoft.FSharp.Linq.RuntimeHelpers.LeafExpressionConverter:ConvExprToLinqInContext, which has 2459 basic blocks. I'm not able to run pin.exe on my AMD-based DevBox to take a closer look at the TP cost for this method, but perhaps there's some locality issue with using lexical order for very large methods? For example, if we need to grab a block's predecessor, the pred is less likely to be in cache if we didn't use an RPO traversal, but this only becomes a problem when there are lots of basic blocks.

@amanasifkhalid
Member Author

Per the discussion here, I'm going to try using the optimized layout all the time to see if we can get around some of the pitfalls of using an RPO sequence on its own.

@amanasifkhalid
Member Author

This is quite odd... probably worth figuring out what is going on in libraries pmi (likely very few MinOpts methods, but still...)

Sorry for getting back to you so late on this, but I finally got pin working, thanks to Jakob. The bulk of the TP regression for the above method comes from LinearScan::processKills. I did some rudimentary profiling: both traversals call processKills the same number of times, but when using the lexical order, processKills ends up doing a lot more work for this method -- the inner loop runs about 3.5x more iterations in total. I wonder if processing register kills in an order that doesn't match control flow results in duplicated work; perhaps the register-killing logic can be simplified quite a bit for MinOpts, since we don't have cross-block live registers. @kunalspathak would you expect the register-killing behavior to change in MinOpts when the block order changes?

@amanasifkhalid
Member Author

amanasifkhalid commented Feb 18, 2025

I took another look at the TP regression noted above, and it had to do with LinearScan::getKillSetForCall relying on Compiler::compFloatingPointUsed to determine if floating-point registers need to be killed or not. For this particular method, if LSRA iterates blocks in lexical order, it encounters a block that uses FP registers much earlier than it would've if it had used RPO, and because compFloatingPointUsed is global Compiler state, we end up unnecessarily processing FP register kills for hundreds of additional blocks. If we don't have cross-block live registers, then I think we can get away with setting this state on a per-block basis. With this approach, the TP diffs improve, and we only have one context in arm64 collections with diffs -- a trivial change in FP register allocation:

-            ldr     q16, [fp, #0xB8]	// [V12 tmp2]
-            str     q16, [fp, #0xD1FFAB1E]	// [V03 loc1]
-            ldr     q16, [fp, #0xC8]	// [V12 tmp2+0x10]
-            str     q16, [fp, #0xD1FFAB1E]	// [V04 loc2]
-            ldr     q16, [fp, #0xD1FFAB1E]	// [V03 loc1]
-            ldr     q17, [@RWD16]
-            tbl     v16.16b, {v16.16b}, v17.16b
-            str     q16, [fp, #0xD1FFAB1E]	// [V03 loc1]
+            ldr     q0, [fp, #0xB8]	// [V12 tmp2]
+            str     q0, [fp, #0xD1FFAB1E]	// [V03 loc1]
+            ldr     q0, [fp, #0xC8]	// [V12 tmp2+0x10]
+            str     q0, [fp, #0xD1FFAB1E]	// [V04 loc2]
+            ldr     q0, [fp, #0xD1FFAB1E]	// [V03 loc1]
+            ldr     q1, [@RWD16]
+            tbl     v0.16b, {v0.16b}, v1.16b
+            str     q0, [fp, #0xD1FFAB1E]	// [V03 loc1]

I also noticed that we were calling findPredBlockForLiveIn unnecessarily in MinOpts; for methods like the regressed one where some blocks have hundreds of predecessors, the pred iteration in findPredBlockForLiveIn can be expensive, so I've moved this logic onto the "enregister local vars" path.

@kunalspathak could you PTAL? Thanks!

Member

@kunalspathak kunalspathak left a comment

Changes LGTM, and it was a good catch regarding findPredBlockForLiveIn(). I was hoping to see an asmdiff for the MinOpts scenario with this change -- how is it a zero diff?

@amanasifkhalid
Member Author

I was hoping to see an asmdiff for the MinOpts scenario with this change -- how is it a zero diff?

If we don't have registers live across blocks, then LSRA will never need to use a block's "ideal" predecessor to identify incoming live registers, so skipping findPredBlockForLiveIn is expected to be a no-diff change. As for the change to compFloatingPointUsed, we can now skip killing FP registers more often, though perhaps heavy FP register use intermingled with calls within a single basic block is rare enough that FP register pressure was never a big deal in MinOpts. It seems like a narrow enough case that this wouldn't improve codegen all that often.

BasicBlock* prevBlock = nullptr;

// Initialize currentLiveVars to the empty set. We will set it to the current
// live-in at the entry to each block (this will include the incoming args on
// the first block).
VarSetOps::AssignNoCopy(compiler, currentLiveVars, VarSetOps::MakeEmpty(compiler));
const bool floatingPointUsed = compiler->compFloatingPointUsed;
Member


As I understand it compFloatingPointUsed is meant to be set before we get to LSRA. What is the case where it is not set and then gets set later? I think it would be better to have some blockUsedFloatingPoint variable that is deterministically set for each block and that LSRA can use for decisions that involve only the particular block.

Member


Ideally, yes, and we should audit the cases where that is not happening. I remember running into this a while back and had to rely on it only after buildIntervals(), where it was guaranteed to be set correctly.

Member Author


In LinearScan::BuildDef, we have a check that sets compFloatingPointUsed to true if the tree we're building the definition for can't use integer registers. I believe your understanding of compFloatingPointUsed is correct -- LSRA doesn't seem to be using it all that judiciously, hence this PR. I'll try your suggestion in a follow-up.

Member Author


In LinearScan::BuildDef, we have a check that sets compFloatingPointUsed to true if the tree we're building the definition for can't use integer registers.

We duplicate this check in LinearScan::identifyCandidates, so we have a few spots in LSRA to clean up.

Member


The problem with this is that it is going to come with TP regressions when someone decides to set compFloatingPointUsed correctly in the future. That's why I would like it to be explicit if we are changing LSRA to make "block-local" decisions in some cases where that is ok.

Member

@jakobbotsch jakobbotsch Feb 18, 2025


It looks like the only place is when we create kills for calls. So likely we could just have some bool needsToKillFloatRegs that LSRA uses for this purpose, and that is block-local in MinOpts and global (using compFloatingPointUsed) in FullOpts. It would probably come with even better MinOpts TP improvements.

Member Author


The problem with this is that it is going to come with TP regressions when someone decides to set compFloatingPointUsed correctly in the future. That's why I would like it to be explicit if we are changing LSRA to make "block-local" decisions in some cases where that is ok.

That's valid, I'll open a follow-up PR for this.

@kunalspathak
Member

I was hoping to see an asmdiff for the MinOpts scenario with this change -- how is it a zero diff?

If we don't have registers live across blocks, then LSRA will never need to use a block's "ideal" predecessor to identify incoming live registers, so skipping findPredBlockForLiveIn is expected to be a no-diff change. As for the change to compFloatingPointUsed, we can now skip killing FP registers more often, though perhaps heavy FP register use intermingled with calls within a single basic block is rare enough that FP register pressure was never a big deal in MinOpts. It seems like a narrow enough case that this wouldn't improve codegen all that often.

I meant the change in setBlockSequence() should have shown asmdiffs in MinOpts?

@amanasifkhalid
Member Author

I meant the change in setBlockSequence() should have shown asmdiffs in MinOpts?

If registers aren't live across blocks, then the order in which blocks are allocated shouldn't affect codegen, right?

@amanasifkhalid amanasifkhalid merged commit 230fea1 into dotnet:main Feb 18, 2025
112 checks passed
@amanasifkhalid amanasifkhalid deleted the lsra-minopts-block-order branch February 18, 2025 20:10
amanasifkhalid added a commit that referenced this pull request Feb 19, 2025
JIT: Don't use `Compiler::compFloatingPointUsed` to check if FP kills are needed (#112668)

Follow-up to #108147 (comment). Add a flag for determining if `LinearScan::getKillSetForCall` can skip killing floating-point registers so that we can reduce dependencies on global `Compiler` state.
grendello added a commit to grendello/runtime that referenced this pull request Feb 19, 2025
* main: (27 commits)
  Fold null checks against known non-null values (dotnet#109164)
  JIT: Always track the context for late devirt (dotnet#112396)
  JIT: array allocation fixes (dotnet#112676)
  [H/3] Fix test closing connection too fast (dotnet#112691)
  Fix LINQ handling of iterator.Take(...).Last(...) (dotnet#112680)
  [browser][MT] move wasm MT CI legs to extra-platforms (dotnet#112690)
  JIT: Don't use `Compiler::compFloatingPointUsed` to check if FP kills are needed (dotnet#112668)
  [LoongArch64] Fix a typo within PR#112166. (dotnet#112672)
  Fix new EH hang on DebugBreak (dotnet#112640)
  Use encode callback instead of renting a buffer to write to in DSAKeyFormatHelper
  Move some links to other doc (dotnet#112574)
  Reflection-based XmlSerializer - Deserialize empty collections and allow for sub-types in collection items. (dotnet#111723)
  JIT: Use `fgCalledCount` for OSR method entry weight (dotnet#112662)
  Use Avx10.2 Instructions in Floating Point Conversions (dotnet#111775)
  Expose StressLog via CDAC and port StressLogAnalyzer to managed code (dotnet#104999)
  JIT: Use linear block order for MinOpts in LSRA (dotnet#108147)
  Update dependencies from https://github.com/dotnet/arcade build 20250213.2 (dotnet#112625)
  JIT: Clean up and optimize call arg lowering (dotnet#112639)
  Update dependencies from https://github.com/dotnet/emsdk build 20250217.1 (dotnet#112645)
  JIT: Support `FIELD_LIST` for returns (dotnet#112308)
  ...