-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tweak the register order for xarch to better account for callee saved #88151
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsWith the introduction of EVEX, we have 16 additional callee trash registers available which should be preferenced over the callee save registers. This ensures that is being taken into account as part of register ordering. General purpose registers were already generally following the
|
b826927
to
9b23a15
Compare
9b23a15
to
d2c4375
Compare
Diffs are based on 1,563,220 contexts (466,721 MinOpts, 1,096,499 FullOpts). MISSED contexts: 259 (0.02%) Overall (-122,201 bytes)
MinOpts (+133 bytes)
FullOpts (-122,334 bytes)
|
CC. @kunalspathak |
src/coreclr/jit/targetamd64.h
Outdated
REG_R8,REG_R9,REG_R10,REG_R11, \ | ||
REG_ESI,REG_EDI,REG_EBX,REG_ETW_FRAMED_EBP_LIST \ | ||
REG_R14,REG_R15,REG_R12,REG_R13 | ||
#define REG_VAR_ORDER_CALLEE_TRASH REG_EAX,REG_ECX,REG_EDX,REG_R8,REG_R9,REG_R10,REG_R11 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EAX
being first gives lots of wins because its the return register, has specialized smaller encodings, etc.
ECX
before EDX
does make a substantial difference, namely because EDX
gets used by various instructions as a forced register. If we have EDX
then ECX
we only see -19,743
overall codegen (so only 1/3 the wins). Despite that, EDX
is still better than R8-R11
since its a 1-byte smaller encoding for every usage saving us space overall.
src/coreclr/jit/targetamd64.h
Outdated
REG_ESI,REG_EDI,REG_EBX,REG_ETW_FRAMED_EBP_LIST \ | ||
REG_R14,REG_R15,REG_R12,REG_R13 | ||
#define REG_VAR_ORDER_CALLEE_TRASH REG_EAX,REG_ECX,REG_EDX,REG_R8,REG_R9,REG_R10,REG_R11 | ||
#define REG_VAR_ORDER_CALLEE_SAVED REG_EBX,REG_ESI,REG_EDI,REG_ETW_FRAMED_EBP_LIST REG_R14,REG_R15,REG_R12,REG_R13 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, EBX
doesn't really have any special usages so it being first gives the most wins. ESI
and EDI
do have some special usages, but not a ton, so they come next.
EBP
, despite taking 4 extra bytes for some addressing modes, is still an overall win over R14-R15
since its 1-byte smaller per usage for everything else, saving space overall.
R12-R13
need to be last since R12
takes an extra byte for the addressing mode and R13
similarly uses extra space to encode the displacement.
src/coreclr/jit/targetamd64.h
Outdated
#define REG_VAR_ORDER_FLT_CALLEE_SAVED REG_XMM6,REG_XMM7,REG_XMM8,REG_XMM9,REG_XMM10,REG_XMM11,REG_XMM12, \ | ||
REG_XMM13,REG_XMM14,REG_XMM15 | ||
|
||
#define REG_VAR_ORDER_FLT_EVEX_CALLEE_TRASH REG_XMM0,REG_XMM1,REG_XMM2,REG_XMM3,REG_XMM4,REG_XMM5,REG_XMM16, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want XMM16-XMM31
to come before the callee saved since the spill and load (2 instructions, 8 bytes minimum) are overall more expensive than the extra byte per instruction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you see any regression at all in code size? looks like superpmi-diff is still running.
src/coreclr/jit/targetamd64.h
Outdated
#else | ||
// TEMPORARY ORDER TO AVOID CALLEE-SAVES | ||
// TODO-CQ: Review this and set appropriately | ||
// Registers are ordered here to prefer callee trash first, then callee save |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me, but why we prefer callee trash first? They need to be save/restored during the call, right, so in that case, callee save should be better, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a lengthy comment explaining it, but the basics is that we've already been assumed to have used every callee trash
(aka volatile
or caller saved
) register. We only have to save them if the value they contain is live
across a call boundary, so saving them is not applicable at all to leaf
methods.
Callee saved
(aka non-volatile
) registers on the other hand have an inherent overhead regardless of whether we have calls or not. This is because the method consuming them must preserve the existing value (typically as part of the prologue) and restore that same value before returning (typically as part of the epilogue). They are only beneficial to use if we have a value that needs to live across a call boundary.
I think in a more ideal world LSRA would independently track callee trash
vs callee saved
order and it would pick from one of the two pools based on whether the value needs to live across a call boundary. It would do this based on whether a block is hot vs cold and its weight since a callee trash
register that is spilled once is "equal" to a callee saved
and so callee trash
remains "better". But if the value needs to be spilled twice, then callee saved
becomes cheaper since its just 1 spill/restore needed for the whole method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
callee trash
: We only have to save them if the value they contain is live across a call boundary
Right, there is a cost to save/restore them across call boundaries, but callee save
are cheaper to allocate in my opinion because they are anyway saved and restored during prolog/epilog and hence no separate store/restore needed across call boundaries, and hence should be preferred than callee trash
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but callee save are cheaper to allocate in my opinion because they are anyway saved and restored during prolog/epilog and hence no separate store/restore needed across call boundaries
Right, they can be cheaper in the right circumstances, the inverse is also true:
- A value
x
that is live across 0 calls needs 0 (callee trash) or 1 (callee saved) spills - A value
x
that is live across 1 call needs 1 (callee trash or callee saved) spill - A value
x
and valuey
that are live across the same 1 call needs 2 (callee trash or callee saved) spills - A value
x
that is live across 2 calls needs 2 (callee trash) or 1 (callee saved) spills - A value
x
that is live across callA
and a valuey
that is live across callB
needs 2 (callee trash) or 1 (callee saved) spills
Locally Testing SPMI shows that having the total order as callee trash, callee saved
vs callee saved, callee trash
actually makes no difference. This is because of the combination of preferencing done and interval->preferCalleeSave
tracking we already do which causes LinearScan::RegisterSelection::select
to already filter down to callee trash
vs callee saved
registers. The ordering within each respective group does still matter and is what this PR touches on (this ordering is used to break ties and comes after other filtering/selection is done).
The overall default state for the LSRA is then callee trash, callee saved
since there are overall more callee trash
registers, they are frequently preferenced due to their special uses (returns, parameters, required for some instructions, etc), and they are cheaper or the same cost in 3/5 of the scenarios listed above, with those other scenarios already being specially tracked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The preferCalleeSave
tracking we do isn't ideal and that's partially where #76904 and #81242 come into play.
We also notably don't do tie breaking super efficiently as we currently iterate through all candidates each time comparing against their respective ordering to find the smallest one (when an instruction has no preferences, this is all unused registers which is often around half of the total register count).
3f7331d
to
e433989
Compare
We're seeing 307k codegen improvements in Linux x64, with 42k being for MinOpts. We're likewise seeing 169k improvements for Windows x64. Compiling with MSVC shows tp improvements to minopts ( For Linux, the For both Windows x64 and Linux x64, there are various individual methods with regressions:
The improvements and regressions are basically just inverse of eachother. Improvements come from places where the combination of ordering and preferencing allows for less spilling, less register swapping, etc. Regressions come from places where we see more spilling or register swapping. Some of this looks to be minor places where LSRA seemingly decides to introduce a move unnecessarily, such as: ; gcrRegs -[rcx]
cmp ecx, edx
jbe SHORT G_M20342_IG04
- mov rax, gword ptr [rax+8*rdx+10H]
- ;; size=19 bbWeight=1 PerfScore 9.25
+ mov ecx, edx
+ mov rax, gword ptr [rax+8*rcx+10H]
+ ;; size=21 bbWeight=1 PerfScore 9.50
G_M20342_IG03: ; bbWeight=1, epilog, nogc, extend
add rsp, 40
ret It could have kept the value in Overall the change should be worth taking given its a win by a significant margin. |
x86 only has 6 registers available. 3 callee trash and 3 callee saved. This means it requires a lot more spills to do even basic function handling. Locally I had seen a small bit of improvement, but I've just reverted this since it's showing regressions in CI.
Not sure what you mean. Previously we did:
Now we're just doing:
The first two steps are identical. The third and fourth steps are now simply combined so we can loop once, rather than twice. We're still doing the same amount of loop iterations (16 vs 32) but are now able to avoid the "overhead" of a second loop and the locality split between two arrays. We then are also no longer trying to call |
The register allocator currently walks all candidates every time to find the one with the lowest order, so changing the order doesn't really provide much in the face of TP wins. We probably should change how the allocator does this so we can do the lookup faster, but that's a separate work item and would likewise be unaffected by ordering. |
0113ebf
to
6045683
Compare
New diffs Same results for x64, no diffs for x86 now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
With the introduction of EVEX, we have 16 additional callee trash registers available which should be preferenced over the callee save registers. This ensures that is being taken into account as part of register ordering.
General purpose registers were already generally following the
CALLEE_TRASH, CALLEE_SAVE
ordering. But weren't taking into account that some registers are more expensive to encode and so should be ordered later within the subgrouping (e.g. r8-15 should come after the base 8 registers).