
Tweak the register order for xarch to better account for callee saved #88151

Merged: 14 commits into dotnet:main from evex-reg-order on Jun 30, 2023

Conversation

tannergooding
Member

With the introduction of EVEX, we have 16 additional callee trash registers available which should be preferenced over the callee saved registers. This change ensures that is taken into account as part of the register ordering.

General purpose registers were already generally following the CALLEE_TRASH, CALLEE_SAVE ordering, but that ordering wasn't taking into account that some registers are more expensive to encode and so should be ordered later within the subgrouping (e.g. r8-r15 should come after the base 8 registers).
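As a rough sketch of the resulting shape on the general purpose side (the exact definitions live in src/coreclr/jit/targetamd64.h and are quoted in the review threads below):

```c
// Callee trash first, with the cheaper-to-encode base registers ahead of the
// REX-prefixed r8-r15; the callee saved group follows the same pattern.
#define REG_VAR_ORDER_CALLEE_TRASH REG_EAX,REG_ECX,REG_EDX,REG_R8,REG_R9,REG_R10,REG_R11
#define REG_VAR_ORDER_CALLEE_SAVED REG_EBX,REG_ESI,REG_EDI,REG_ETW_FRAMED_EBP_LIST REG_R14,REG_R15,REG_R12,REG_R13
```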

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 28, 2023
@ghost ghost assigned tannergooding Jun 28, 2023
@ghost

ghost commented Jun 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@tannergooding tannergooding force-pushed the evex-reg-order branch 2 times, most recently from b826927 to 9b23a15 Compare June 28, 2023 19:03
@tannergooding
Member Author

tannergooding commented Jun 28, 2023

Diffs are based on 1,563,220 contexts (466,721 MinOpts, 1,096,499 FullOpts).

MISSED contexts: 259 (0.02%)

Overall (-122,201 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 44,713,084 | -7,637 |
| benchmarks.run.windows.x64.checked.mch | 7,304,732 | -1,266 |
| benchmarks.run_pgo.windows.x64.checked.mch | 27,086,559 | -10,940 |
| benchmarks.run_tiered.windows.x64.checked.mch | 10,597,247 | +44 |
| coreclr_tests.run.windows.x64.checked.mch | 385,500,995 | -3,166 |
| libraries.crossgen2.windows.x64.checked.mch | 15,685,800 | -8,306 |
| libraries.pmi.windows.x64.checked.mch | 59,924,379 | -18,880 |
| libraries_tests.pmi.windows.x64.checked.mch | 107,991,683 | -68,068 |
| realworld.run.windows.x64.checked.mch | 13,795,964 | -3,982 |

MinOpts (+133 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 14,870,376 | +4 |
| benchmarks.run.windows.x64.checked.mch | 374,520 | +0 |
| benchmarks.run_pgo.windows.x64.checked.mch | 10,282,013 | -7 |
| benchmarks.run_tiered.windows.x64.checked.mch | 7,604,758 | -10 |
| coreclr_tests.run.windows.x64.checked.mch | 269,307,950 | +146 |
| libraries.pmi.windows.x64.checked.mch | 1,519,268 | +0 |
| libraries_tests.pmi.windows.x64.checked.mch | 5,507,093 | +0 |
| realworld.run.windows.x64.checked.mch | 1,149,238 | +0 |

FullOpts (-122,334 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 29,842,708 | -7,641 |
| benchmarks.run.windows.x64.checked.mch | 6,930,212 | -1,266 |
| benchmarks.run_pgo.windows.x64.checked.mch | 16,804,546 | -10,933 |
| benchmarks.run_tiered.windows.x64.checked.mch | 2,992,489 | +54 |
| coreclr_tests.run.windows.x64.checked.mch | 116,193,045 | -3,312 |
| libraries.crossgen2.windows.x64.checked.mch | 15,685,046 | -8,306 |
| libraries.pmi.windows.x64.checked.mch | 58,405,111 | -18,880 |
| libraries_tests.pmi.windows.x64.checked.mch | 102,484,590 | -68,068 |
| realworld.run.windows.x64.checked.mch | 12,646,726 | -3,982 |

@tannergooding tannergooding marked this pull request as ready for review June 28, 2023 19:24
@tannergooding
Member Author

CC. @kunalspathak

REG_R8,REG_R9,REG_R10,REG_R11, \
REG_ESI,REG_EDI,REG_EBX,REG_ETW_FRAMED_EBP_LIST \
REG_R14,REG_R15,REG_R12,REG_R13
#define REG_VAR_ORDER_CALLEE_TRASH REG_EAX,REG_ECX,REG_EDX,REG_R8,REG_R9,REG_R10,REG_R11
Member Author

EAX being first gives lots of wins because it's the return register, has specialized smaller encodings, etc.

ECX before EDX does make a substantial difference, namely because EDX gets used by various instructions as a forced register. If we order EDX then ECX, we only see -19,743 overall codegen (so only 1/3 the wins). Despite that, EDX is still better than R8-R11 since it's a 1-byte smaller encoding for every usage, saving us space overall.

REG_ESI,REG_EDI,REG_EBX,REG_ETW_FRAMED_EBP_LIST \
REG_R14,REG_R15,REG_R12,REG_R13
#define REG_VAR_ORDER_CALLEE_TRASH REG_EAX,REG_ECX,REG_EDX,REG_R8,REG_R9,REG_R10,REG_R11
#define REG_VAR_ORDER_CALLEE_SAVED REG_EBX,REG_ESI,REG_EDI,REG_ETW_FRAMED_EBP_LIST REG_R14,REG_R15,REG_R12,REG_R13
Member Author

Likewise, EBX doesn't really have any special usages so it being first gives the most wins. ESI and EDI do have some special usages, but not a ton, so they come next.

EBP, despite taking 4 extra bytes for some addressing modes, is still an overall win over R14-R15 since it's 1-byte smaller per usage for everything else, saving space overall.

R12-R13 need to be last since R12 takes an extra byte for the addressing mode and R13 similarly uses extra space to encode the displacement.

#define REG_VAR_ORDER_FLT_CALLEE_SAVED REG_XMM6,REG_XMM7,REG_XMM8,REG_XMM9,REG_XMM10,REG_XMM11,REG_XMM12, \
REG_XMM13,REG_XMM14,REG_XMM15

#define REG_VAR_ORDER_FLT_EVEX_CALLEE_TRASH REG_XMM0,REG_XMM1,REG_XMM2,REG_XMM3,REG_XMM4,REG_XMM5,REG_XMM16, \
Member Author

We want XMM16-XMM31 to come before the callee saved registers since the spill and load (2 instructions, 8 bytes minimum) are overall more expensive than the extra byte per instruction.
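A hedged sketch of how the EVEX trash ordering presumably continues (the quoted define is cut off at XMM16, so everything past that point is an assumption):

```c
// All callee trash registers, including the EVEX-only XMM16-XMM31, come ahead
// of the callee saved XMM6-XMM15 listed in REG_VAR_ORDER_FLT_CALLEE_SAVED above.
#define REG_VAR_ORDER_FLT_EVEX_CALLEE_TRASH REG_XMM0,REG_XMM1,REG_XMM2,REG_XMM3,REG_XMM4,REG_XMM5,REG_XMM16,   \
                                            REG_XMM17,REG_XMM18,REG_XMM19,REG_XMM20,REG_XMM21,REG_XMM22,       \
                                            REG_XMM23,REG_XMM24,REG_XMM25,REG_XMM26,REG_XMM27,REG_XMM28,       \
                                            REG_XMM29,REG_XMM30,REG_XMM31
```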

@tannergooding tannergooding added the avx512 Related to the AVX-512 architecture label Jun 28, 2023
@kunalspathak kunalspathak left a comment
Member

Do you see any regression at all in code size? Looks like superpmi-diff is still running.

src/coreclr/jit/lsraxarch.cpp (resolved review thread)
#else
// TEMPORARY ORDER TO AVOID CALLEE-SAVES
// TODO-CQ: Review this and set appropriately
// Registers are ordered here to prefer callee trash first, then callee save
Member

Remind me, why do we prefer callee trash first? They need to be saved/restored during the call, right? So in that case, callee save should be better, no?

@tannergooding tannergooding Jun 29, 2023
Member Author

I added a lengthy comment explaining it, but the basic idea is that our callers already assume we have used every callee trash (aka volatile or caller saved) register. We only have to save them if the value they contain is live across a call boundary, so saving them is not applicable at all to leaf methods.

Callee saved (aka non-volatile) registers on the other hand have an inherent overhead regardless of whether we have calls or not. This is because the method consuming them must preserve the existing value (typically as part of the prologue) and restore that same value before returning (typically as part of the epilogue). They are only beneficial to use if we have a value that needs to live across a call boundary.

I think in a more ideal world LSRA would independently track the callee trash vs callee saved orders and pick from one of the two pools based on whether the value needs to live across a call boundary. It would do this based on whether a block is hot vs cold and its weight, since a callee trash register that is spilled once is "equal" to a callee saved one and so callee trash remains "better". But if the value needs to be spilled twice, then callee saved becomes cheaper since it's just 1 spill/restore needed for the whole method.
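As a concrete illustration of the leaf-method point (not code from the PR, just a sketch of the tradeoff):

```cpp
// Illustrative only: a leaf method has no call boundary, so values held in
// callee trash registers never need to be spilled. Picking a callee saved
// register here would force a prologue save and epilogue restore for nothing.
int SumOfSquares(const int* values, int count)
{
    int total = 0;                    // fine in a callee trash register (e.g. eax)
    for (int i = 0; i < count; i++)   // the index and pointer fit in ecx/edx/r8-r11
    {
        total += values[i] * values[i];
    }
    return total;                     // using ebx/esi/etc. instead would add a
                                      // push/pop pair to the prologue/epilogue
}
```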

Member

> callee trash: We only have to save them if the value they contain is live across a call boundary

Right, there is a cost to save/restore them across call boundaries, but callee saved registers are cheaper to allocate in my opinion because they are saved and restored during the prolog/epilog anyway, so no separate store/restore is needed across call boundaries, and hence they should be preferred over callee trash.

Related work: #76904 and #81242

Member Author

> but callee saved registers are cheaper to allocate in my opinion because they are saved and restored during the prolog/epilog anyway, so no separate store/restore is needed across call boundaries

Right, they can be cheaper in the right circumstances, but the inverse is also true:

  • A value x that is live across 0 calls needs 0 (callee trash) or 1 (callee saved) spills
  • A value x that is live across 1 call needs 1 (callee trash or callee saved) spill
  • A value x and value y that are live across the same 1 call needs 2 (callee trash or callee saved) spills
  • A value x that is live across 2 calls needs 2 (callee trash) or 1 (callee saved) spills
  • A value x that is live across call A and a value y that is live across call B needs 2 (callee trash) or 1 (callee saved) spills

Locally, testing SPMI shows that having the total order be callee trash then callee saved vs callee saved then callee trash actually makes no difference. This is because of the combination of preferencing done and the interval->preferCalleeSave tracking we already do, which causes LinearScan::RegisterSelection::select to already filter down to callee trash vs callee saved registers. The ordering within each respective group does still matter and is what this PR touches on (this ordering is used to break ties and comes after other filtering/selection is done).

The overall default state for the LSRA is then callee trash, callee saved since there are overall more callee trash registers, they are frequently preferenced due to their special uses (returns, parameters, required for some instructions, etc), and they are cheaper or the same cost in 3/5 of the scenarios listed above, with those other scenarios already being specially tracked.

Member Author

The preferCalleeSave tracking we do isn't ideal and that's partially where #76904 and #81242 come into play.

We also notably don't do tie breaking very efficiently, as we currently iterate through all candidates each time, comparing against their respective ordering to find the smallest one (when an instruction has no preferences, this is all unused registers, which is often around half of the total register count).

src/coreclr/jit/targetamd64.h (outdated, resolved review thread)
src/coreclr/jit/targetamd64.h (outdated, resolved review thread)
@tannergooding
Member Author

Diffs

We're seeing 307k of codegen improvements on Linux x64, with 42k being for MinOpts. We're likewise seeing 169k of improvements for Windows x64.

Compiling with MSVC shows TP improvements to MinOpts (-0.08%) and a minor improvement to FullOpts (-0.01%). Compiling with Clang shows TP regressions. Clang in general likes to emit more code for certain types of loops and this causes the "regression". The actual code is typically on par or faster.


For Linux, the benchmarks.run and benchmarks.run_tiered do show overall regressions. benchmarks.run_pgo does not and is one of the largest improvements.

For both Windows x64 and Linux x64, there are various individual methods with regressions:

  • Linux x64: -661,741; +354,732
  • Windows x64: -355,571; +186,440

The improvements and regressions are basically just inverses of each other. Improvements come from places where the combination of ordering and preferencing allows for less spilling, less register swapping, etc. Regressions come from places where we see more spilling or register swapping.

Some of this looks to be minor places where LSRA seemingly decides to introduce a move unnecessarily, such as:

        ; gcrRegs -[rcx]
        cmp      ecx, edx
        jbe      SHORT G_M20342_IG04
-       mov      rax, gword ptr [rax+8*rdx+10H]
-						;; size=19 bbWeight=1 PerfScore 9.25
+       mov      ecx, edx
+       mov      rax, gword ptr [rax+8*rcx+10H]
+						;; size=21 bbWeight=1 PerfScore 9.50
 G_M20342_IG03:        ; bbWeight=1, epilog, nogc, extend
        add      rsp, 40
        ret    

It could have kept the value in edx, particularly given it was the last use, but opted not to. Some of this may be low-hanging fruit we can easily clean up. Others will require a bit more investigation.

Overall, the change should be worth taking given it's a win by a significant margin.

@kunalspathak
Member

Any reason why there are huge regressions and minimal improvements for x86?

[image]

> Compiling with MSVC shows TP improvements to MinOpts (-0.08%) and a minor improvement to FullOpts (-0.01%)

What justifies the improvement? Perhaps the allocator is finding a good candidate to allocate sooner rather than checking for more registers? Did you try checking? Because clearly the change in lsrabuild.cpp should have shown regressions.

@tannergooding
Member Author

tannergooding commented Jun 30, 2023

> Any reason why there are huge regressions and minimal improvements for x86?

x86 only has 6 registers available: 3 callee trash and 3 callee saved. This means it requires a lot more spills to do even basic function handling. Locally I had seen a small bit of improvement, but I've just reverted this since it's showing regressions in CI.

> What justifies the improvement? Perhaps the allocator is finding a good candidate to allocate sooner rather than checking for more registers? Did you try checking? Because clearly the change in lsrabuild.cpp should have shown regressions.

Not sure what you mean.

Previously we did:

  • loop from REG_FIRST to AVAILABLE_REG_COUNT calling RegRecord::init(reg)
  • loop through lsraRegOrder setting RegRecord::regOrder
  • loop through lsraRegOrderFlt setting RegRecord::regOrder
  • conditionally loop through lsraRegOrderFltUpper setting RegRecord::regOrder if canUseEvexEncoding

Now we're just doing:

  • loop from REG_FIRST to AVAILABLE_REG_COUNT calling RegRecord::init(reg)
  • loop through lsraRegOrder setting RegRecord::regOrder
  • loop through lsraRegOrderFlt or lsraRegOrderFltEvex, based on canUseEvexEncoding, setting RegRecord::regOrder

The first two steps are identical. The third and fourth steps are now simply combined so we can loop once rather than twice. We're still doing the same number of loop iterations (16 vs 32), but are now able to avoid the "overhead" of a second loop and the locality split between two arrays.
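For illustration, a minimal standalone sketch of that combined pass, using simplified placeholder types rather than the JIT's actual RegRecord and register enums:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for the JIT's RegRecord; illustrative only.
struct RegRecordSketch
{
    int     regNum   = 0;
    uint8_t regOrder = 0;
    void init(int reg) { regNum = reg; }
};

// One pass over the general purpose order, then one pass over whichever float
// order applies, instead of a base float pass plus a conditional EVEX pass.
void initRegOrders(std::vector<RegRecordSketch>& physRegs,
                   const std::vector<int>&       gpOrder,
                   const std::vector<int>&       fltOrder,
                   const std::vector<int>&       fltOrderEvex,
                   bool                          canUseEvexEncoding)
{
    for (size_t reg = 0; reg < physRegs.size(); reg++)
    {
        physRegs[reg].init(static_cast<int>(reg));
    }

    for (size_t i = 0; i < gpOrder.size(); i++)
    {
        physRegs[gpOrder[i]].regOrder = static_cast<uint8_t>(i);
    }

    const std::vector<int>& order = canUseEvexEncoding ? fltOrderEvex : fltOrder;

    for (size_t i = 0; i < order.size(); i++)
    {
        physRegs[order[i]].regOrder = static_cast<uint8_t>(i);
    }
}
```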

We then are also no longer trying to call BuildEvexIncompatibleMask for every GT_CNS_* (INT, LNG, DBL, VEC) which saves some TP as well.

@tannergooding
Member Author

> Perhaps the allocator is finding a good candidate to allocate sooner rather than checking for more registers

The register allocator currently walks all candidates every time to find the one with the lowest order, so changing the order doesn't really provide much in the way of TP wins. We probably should change how the allocator does this so we can do the lookup faster, but that's a separate work item and would likewise be unaffected by ordering.
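Roughly, that tie break is the following kind of walk (a simplified sketch, not the JIT's actual data structures); every remaining candidate is visited regardless of how the order table is arranged, which is why reordering by itself doesn't move TP much:

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch: pick the candidate register with the lowest assigned
// order. The cost is proportional to the number of candidates, not to where
// the winner sits in the order table.
int pickLowestOrder(const std::vector<int>& candidates, const std::vector<uint8_t>& regOrder)
{
    int     best      = -1;
    uint8_t bestOrder = UINT8_MAX;

    for (int reg : candidates)
    {
        if (regOrder[reg] < bestOrder)
        {
            bestOrder = regOrder[reg];
            best      = reg;
        }
    }

    return best;
}
```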

@tannergooding
Member Author

New diffs

Same results for x64, no diffs for x86 now.

@kunalspathak kunalspathak left a comment
Member

LGTM

@tannergooding tannergooding merged commit 453d58e into dotnet:main Jun 30, 2023
145 checks passed
@tannergooding tannergooding deleted the evex-reg-order branch June 30, 2023 15:43
@ghost ghost locked as resolved and limited conversation to collaborators Jul 30, 2023