
Tweak the register order for xarch to better account for callee saved #88151

Merged: 14 commits into dotnet:main from evex-reg-order on Jun 30, 2023

Conversation

tannergooding
Member

With the introduction of EVEX, we have 16 additional callee trash registers available which should be preferenced over the callee saved registers. This change ensures that is taken into account as part of the register ordering.

General purpose registers were already generally following the CALLEE_TRASH, CALLEE_SAVE ordering, but that ordering wasn't taking into account that some registers are more expensive to encode and so should be ordered later within the subgrouping (e.g. r8-r15 should come after the base 8 registers).
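As a rough sketch of the resulting shape on the general purpose side (the exact definitions live in src/coreclr/jit/targetamd64.h and are quoted in the review threads below):

```c
// Callee trash first, with the cheaper-to-encode base registers ahead of the
// REX-prefixed r8-r15; the callee saved group follows the same pattern.
#define REG_VAR_ORDER_CALLEE_TRASH REG_EAX,REG_ECX,REG_EDX,REG_R8,REG_R9,REG_R10,REG_R11
#define REG_VAR_ORDER_CALLEE_SAVED REG_EBX,REG_ESI,REG_EDI,REG_ETW_FRAMED_EBP_LIST REG_R14,REG_R15,REG_R12,REG_R13
```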

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 28, 2023
@ghost ghost assigned tannergooding Jun 28, 2023
@ghost

ghost commented Jun 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@tannergooding tannergooding force-pushed the evex-reg-order branch 2 times, most recently from b826927 to 9b23a15 Compare June 28, 2023 19:03
@tannergooding
Member Author

tannergooding commented Jun 28, 2023

Diffs are based on 1,563,220 contexts (466,721 MinOpts, 1,096,499 FullOpts).

MISSED contexts: 259 (0.02%)

Overall (-122,201 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 44,713,084 | -7,637 |
| benchmarks.run.windows.x64.checked.mch | 7,304,732 | -1,266 |
| benchmarks.run_pgo.windows.x64.checked.mch | 27,086,559 | -10,940 |
| benchmarks.run_tiered.windows.x64.checked.mch | 10,597,247 | +44 |
| coreclr_tests.run.windows.x64.checked.mch | 385,500,995 | -3,166 |
| libraries.crossgen2.windows.x64.checked.mch | 15,685,800 | -8,306 |
| libraries.pmi.windows.x64.checked.mch | 59,924,379 | -18,880 |
| libraries_tests.pmi.windows.x64.checked.mch | 107,991,683 | -68,068 |
| realworld.run.windows.x64.checked.mch | 13,795,964 | -3,982 |

MinOpts (+133 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 14,870,376 | +4 |
| benchmarks.run.windows.x64.checked.mch | 374,520 | +0 |
| benchmarks.run_pgo.windows.x64.checked.mch | 10,282,013 | -7 |
| benchmarks.run_tiered.windows.x64.checked.mch | 7,604,758 | -10 |
| coreclr_tests.run.windows.x64.checked.mch | 269,307,950 | +146 |
| libraries.pmi.windows.x64.checked.mch | 1,519,268 | +0 |
| libraries_tests.pmi.windows.x64.checked.mch | 5,507,093 | +0 |
| realworld.run.windows.x64.checked.mch | 1,149,238 | +0 |

FullOpts (-122,334 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 29,842,708 | -7,641 |
| benchmarks.run.windows.x64.checked.mch | 6,930,212 | -1,266 |
| benchmarks.run_pgo.windows.x64.checked.mch | 16,804,546 | -10,933 |
| benchmarks.run_tiered.windows.x64.checked.mch | 2,992,489 | +54 |
| coreclr_tests.run.windows.x64.checked.mch | 116,193,045 | -3,312 |
| libraries.crossgen2.windows.x64.checked.mch | 15,685,046 | -8,306 |
| libraries.pmi.windows.x64.checked.mch | 58,405,111 | -18,880 |
| libraries_tests.pmi.windows.x64.checked.mch | 102,484,590 | -68,068 |
| realworld.run.windows.x64.checked.mch | 12,646,726 | -3,982 |

@tannergooding tannergooding marked this pull request as ready for review June 28, 2023 19:24
@tannergooding
Member Author

CC. @kunalspathak

REG_R8,REG_R9,REG_R10,REG_R11, \
REG_ESI,REG_EDI,REG_EBX,REG_ETW_FRAMED_EBP_LIST \
REG_R14,REG_R15,REG_R12,REG_R13
#define REG_VAR_ORDER_CALLEE_TRASH REG_EAX,REG_ECX,REG_EDX,REG_R8,REG_R9,REG_R10,REG_R11
Member Author

EAX being first gives lots of wins because it's the return register, has specialized smaller encodings, etc.

ECX before EDX does make a substantial difference, namely because EDX gets used by various instructions as a forced register. If we order EDX then ECX, we only see -19,743 overall codegen (so only 1/3 the wins). Despite that, EDX is still better than R8-R11 since it's a 1-byte smaller encoding for every usage, saving us space overall.

REG_ESI,REG_EDI,REG_EBX,REG_ETW_FRAMED_EBP_LIST \
REG_R14,REG_R15,REG_R12,REG_R13
#define REG_VAR_ORDER_CALLEE_TRASH REG_EAX,REG_ECX,REG_EDX,REG_R8,REG_R9,REG_R10,REG_R11
#define REG_VAR_ORDER_CALLEE_SAVED REG_EBX,REG_ESI,REG_EDI,REG_ETW_FRAMED_EBP_LIST REG_R14,REG_R15,REG_R12,REG_R13
Member Author

Likewise, EBX doesn't really have any special usages so it being first gives the most wins. ESI and EDI do have some special usages, but not a ton, so they come next.

EBP, despite taking 4 extra bytes for some addressing modes, is still an overall win over R14-R15 since it's 1-byte smaller per usage for everything else, saving space overall.

R12-R13 need to be last since R12 takes an extra byte for the addressing mode and R13 similarly uses extra space to encode the displacement.

#define REG_VAR_ORDER_FLT_CALLEE_SAVED REG_XMM6,REG_XMM7,REG_XMM8,REG_XMM9,REG_XMM10,REG_XMM11,REG_XMM12, \
REG_XMM13,REG_XMM14,REG_XMM15

#define REG_VAR_ORDER_FLT_EVEX_CALLEE_TRASH REG_XMM0,REG_XMM1,REG_XMM2,REG_XMM3,REG_XMM4,REG_XMM5,REG_XMM16, \
Member Author

We want XMM16-XMM31 to come before the callee saved registers since the spill and load (2 instructions, 8 bytes minimum) are overall more expensive than the extra byte per instruction.
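A hedged sketch of how the EVEX trash ordering presumably continues (the quoted define is cut off at XMM16, so everything past that point is an assumption):

```c
// All callee trash registers, including the EVEX-only XMM16-XMM31, come ahead
// of the callee saved XMM6-XMM15 listed in REG_VAR_ORDER_FLT_CALLEE_SAVED above.
#define REG_VAR_ORDER_FLT_EVEX_CALLEE_TRASH REG_XMM0,REG_XMM1,REG_XMM2,REG_XMM3,REG_XMM4,REG_XMM5,REG_XMM16,   \
                                            REG_XMM17,REG_XMM18,REG_XMM19,REG_XMM20,REG_XMM21,REG_XMM22,       \
                                            REG_XMM23,REG_XMM24,REG_XMM25,REG_XMM26,REG_XMM27,REG_XMM28,       \
                                            REG_XMM29,REG_XMM30,REG_XMM31
```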

@tannergooding tannergooding added the avx512 Related to the AVX-512 architecture label Jun 28, 2023
@kunalspathak kunalspathak left a comment
Member

Do you see any regression at all in code size? Looks like superpmi-diff is still running.

src/coreclr/jit/lsraxarch.cpp (resolved review thread)
#else
// TEMPORARY ORDER TO AVOID CALLEE-SAVES
// TODO-CQ: Review this and set appropriately
// Registers are ordered here to prefer callee trash first, then callee save
Member

Remind me, why do we prefer callee trash first? They need to be saved/restored during the call, right? So in that case, callee save should be better, no?

@tannergooding tannergooding Jun 29, 2023
Member Author

I added a lengthy comment explaining it, but the basic idea is that our callers already assume we have used every callee trash (aka volatile or caller saved) register. We only have to save them if the value they contain is live across a call boundary, so saving them is not applicable at all to leaf methods.

Callee saved (aka non-volatile) registers on the other hand have an inherent overhead regardless of whether we have calls or not. This is because the method consuming them must preserve the existing value (typically as part of the prologue) and restore that same value before returning (typically as part of the epilogue). They are only beneficial to use if we have a value that needs to live across a call boundary.

I think in a more ideal world LSRA would independently track the callee trash vs callee saved orders and pick from one of the two pools based on whether the value needs to live across a call boundary. It would do this based on whether a block is hot vs cold and its weight, since a callee trash register that is spilled once is "equal" to a callee saved one and so callee trash remains "better". But if the value needs to be spilled twice, then callee saved becomes cheaper since it's just 1 spill/restore needed for the whole method.
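As a concrete illustration of the leaf-method point (not code from the PR, just a sketch of the tradeoff):

```cpp
// Illustrative only: a leaf method has no call boundary, so values held in
// callee trash registers never need to be spilled. Picking a callee saved
// register here would force a prologue save and epilogue restore for nothing.
int SumOfSquares(const int* values, int count)
{
    int total = 0;                    // fine in a callee trash register (e.g. eax)
    for (int i = 0; i < count; i++)   // the index and pointer fit in ecx/edx/r8-r11
    {
        total += values[i] * values[i];
    }
    return total;                     // using ebx/esi/etc. instead would add a
                                      // push/pop pair to the prologue/epilogue
}
```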

Member

> callee trash: We only have to save them if the value they contain is live across a call boundary

Right, there is a cost to save/restore them across call boundaries, but callee saved registers are cheaper to allocate in my opinion because they are saved and restored during the prolog/epilog anyway, so no separate store/restore is needed across call boundaries, and hence they should be preferred over callee trash.

Related work: #76904 and #81242

Member Author

> but callee saved registers are cheaper to allocate in my opinion because they are saved and restored during the prolog/epilog anyway, so no separate store/restore is needed across call boundaries

Right, they can be cheaper in the right circumstances, but the inverse is also true:

  • A value x that is live across 0 calls needs 0 (callee trash) or 1 (callee saved) spills
  • A value x that is live across 1 call needs 1 (callee trash or callee saved) spill
  • A value x and value y that are live across the same 1 call needs 2 (callee trash or callee saved) spills
  • A value x that is live across 2 calls needs 2 (callee trash) or 1 (callee saved) spills
  • A value x that is live across call A and a value y that is live across call B needs 2 (callee trash) or 1 (callee saved) spills

Locally, testing SPMI shows that having the total order be callee trash then callee saved vs callee saved then callee trash actually makes no difference. This is because of the combination of preferencing done and the interval->preferCalleeSave tracking we already do, which causes LinearScan::RegisterSelection::select to already filter down to callee trash vs callee saved registers. The ordering within each respective group does still matter and is what this PR touches on (this ordering is used to break ties and comes after other filtering/selection is done).

The overall default state for the LSRA is then callee trash, callee saved since there are overall more callee trash registers, they are frequently preferenced due to their special uses (returns, parameters, required for some instructions, etc), and they are cheaper or the same cost in 3/5 of the scenarios listed above, with those other scenarios already being specially tracked.

Member Author

The preferCalleeSave tracking we do isn't ideal and that's partially where #76904 and #81242 come into play.

We also notably don't do tie breaking very efficiently, as we currently iterate through all candidates each time, comparing against their respective ordering to find the smallest one (when an instruction has no preferences, this is all unused registers, which is often around half of the total register count).

src/coreclr/jit/targetamd64.h (outdated, resolved review thread)
src/coreclr/jit/targetamd64.h (outdated, resolved review thread)
@tannergooding
Member Author

Diffs

We're seeing 307k of codegen improvements on Linux x64, with 42k being for MinOpts. We're likewise seeing 169k of improvements for Windows x64.

Compiling with MSVC shows TP improvements to MinOpts (-0.08%) and a minor improvement to FullOpts (-0.01%). Compiling with Clang shows TP regressions. Clang in general likes to emit more code for certain types of loops and this causes the "regression". The actual code is typically on par or faster.


For Linux, the benchmarks.run and benchmarks.run_tiered do show overall regressions. benchmarks.run_pgo does not and is one of the largest improvements.

For both Windows x64 and Linux x64, there are various individual methods with regressions:

  • Linux x64: -661,741; +354,732
  • Windows x64: -355,571; +186,440

The improvements and regressions are basically just inverses of each other. Improvements come from places where the combination of ordering and preferencing allows for less spilling, less register swapping, etc. Regressions come from places where we see more spilling or register swapping.

Some of this looks to be minor places where LSRA seemingly decides to introduce a move unnecessarily, such as:

        ; gcrRegs -[rcx]
        cmp      ecx, edx
        jbe      SHORT G_M20342_IG04
-       mov      rax, gword ptr [rax+8*rdx+10H]
-						;; size=19 bbWeight=1 PerfScore 9.25
+       mov      ecx, edx
+       mov      rax, gword ptr [rax+8*rcx+10H]
+						;; size=21 bbWeight=1 PerfScore 9.50
 G_M20342_IG03:        ; bbWeight=1, epilog, nogc, extend
        add      rsp, 40
        ret    

It could have kept the value in edx, particularly given it was the last use, but opted not to. Some of this may be low-hanging fruit we can easily clean up. Others will require a bit more investigation.

Overall, the change should be worth taking given it's a win by a significant margin.

@kunalspathak
Member

Any reason why there are huge regressions and minimal improvements for x86?

[image]

> Compiling with MSVC shows TP improvements to MinOpts (-0.08%) and a minor improvement to FullOpts (-0.01%)

What justifies the improvement? Perhaps the allocator is finding a good candidate to allocate sooner rather than checking for more registers? Did you try checking? Because clearly the change in lsrabuild.cpp should have shown regressions.

@tannergooding
Member Author

tannergooding commented Jun 30, 2023

> Any reason why there are huge regressions and minimal improvements for x86?

x86 only has 6 registers available: 3 callee trash and 3 callee saved. This means it requires a lot more spills to do even basic function handling. Locally I had seen a small bit of improvement, but I've just reverted this since it's showing regressions in CI.

> What justifies the improvement? Perhaps the allocator is finding a good candidate to allocate sooner rather than checking for more registers? Did you try checking? Because clearly the change in lsrabuild.cpp should have shown regressions.

Not sure what you mean.

Previously we did:

  • loop from REG_FIRST to AVAILABLE_REG_COUNT calling RegRecord::init(reg)
  • loop through lsraRegOrder setting RegRecord::regOrder
  • loop through lsraRegOrderFlt setting RegRecord::regOrder
  • conditionally loop through lsraRegOrderFltUpper setting RegRecord::regOrder if canUseEvexEncoding

Now we're just doing:

  • loop from REG_FIRST to AVAILABLE_REG_COUNT calling RegRecord::init(reg)
  • loop through lsraRegOrder setting RegRecord::regOrder
  • loop through lsraRegOrderFlt or lsraRegOrderFltEvex, based on canUseEvexEncoding, setting RegRecord::regOrder

The first two steps are identical. The third and fourth steps are now simply combined so we can loop once rather than twice. We're still doing the same number of loop iterations (16 vs 32), but are now able to avoid the "overhead" of a second loop and the locality split between two arrays.
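For illustration, a minimal standalone sketch of that combined pass, using simplified placeholder types rather than the JIT's actual RegRecord and register enums:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for the JIT's RegRecord; illustrative only.
struct RegRecordSketch
{
    int     regNum   = 0;
    uint8_t regOrder = 0;
    void init(int reg) { regNum = reg; }
};

// One pass over the general purpose order, then one pass over whichever float
// order applies, instead of a base float pass plus a conditional EVEX pass.
void initRegOrders(std::vector<RegRecordSketch>& physRegs,
                   const std::vector<int>&       gpOrder,
                   const std::vector<int>&       fltOrder,
                   const std::vector<int>&       fltOrderEvex,
                   bool                          canUseEvexEncoding)
{
    for (size_t reg = 0; reg < physRegs.size(); reg++)
    {
        physRegs[reg].init(static_cast<int>(reg));
    }

    for (size_t i = 0; i < gpOrder.size(); i++)
    {
        physRegs[gpOrder[i]].regOrder = static_cast<uint8_t>(i);
    }

    const std::vector<int>& order = canUseEvexEncoding ? fltOrderEvex : fltOrder;

    for (size_t i = 0; i < order.size(); i++)
    {
        physRegs[order[i]].regOrder = static_cast<uint8_t>(i);
    }
}
```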

We then are also no longer trying to call BuildEvexIncompatibleMask for every GT_CNS_* (INT, LNG, DBL, VEC) which saves some TP as well.

@tannergooding
Member Author

> Perhaps the allocator is finding a good candidate to allocate sooner rather than checking for more registers

The register allocator currently walks all candidates every time to find the one with the lowest order, so changing the order doesn't really provide much in the way of TP wins. We probably should change how the allocator does this so we can do the lookup faster, but that's a separate work item and would likewise be unaffected by ordering.
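Roughly, that tie break is the following kind of walk (a simplified sketch, not the JIT's actual data structures); every remaining candidate is visited regardless of how the order table is arranged, which is why reordering by itself doesn't move TP much:

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch: pick the candidate register with the lowest assigned
// order. The cost is proportional to the number of candidates, not to where
// the winner sits in the order table.
int pickLowestOrder(const std::vector<int>& candidates, const std::vector<uint8_t>& regOrder)
{
    int     best      = -1;
    uint8_t bestOrder = UINT8_MAX;

    for (int reg : candidates)
    {
        if (regOrder[reg] < bestOrder)
        {
            bestOrder = regOrder[reg];
            best      = reg;
        }
    }

    return best;
}
```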

@tannergooding
Member Author

New diffs

Same results for x64, no diffs for x86 now.

@kunalspathak kunalspathak left a comment
Member

LGTM

@tannergooding tannergooding merged commit 453d58e into dotnet:main Jun 30, 2023
145 checks passed
@tannergooding tannergooding deleted the evex-reg-order branch June 30, 2023 15:43
@ghost ghost locked as resolved and limited conversation to collaborators Jul 30, 2023