-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation of CSE for GT_CNS_INT benefits ARM64 #39096
Conversation
briansull
commented
Jul 10, 2020
•
edited
Loading
edited
ARM64 codesize diffs:
|
@sandreenko PTAL |
This feature is ready for check in @sandreenko PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good in general, a few questions.
Also, do you have top method improvements by percentage available?
src/coreclr/src/jit/morph.cpp
Outdated
@@ -2707,6 +2707,11 @@ void Compiler::fgInitArgInfo(GenTreeCall* call) | |||
indirectCellAddress->AsIntCon()->gtTargetHandle = (size_t)call->gtCallMethHnd; | |||
#endif | |||
indirectCellAddress->SetRegNum(REG_R2R_INDIRECT_PARAM); | |||
#ifdef TARGET_ARM | |||
// Don't attempt to CSE this constant | |||
// This hits an assert: Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please open an issue with a repro steps about that and place its number instead of line that could change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I will
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant has a specific register requirement, and I think that the same constant is used in another context with a different register requirement. So presumably the CSE value then has a conflict because it needs to be copied to both registers. I would probably restate this as:
"This constant has specific register requirements, and LSRA doesn't currently correctly handle them when the value is in a CSE'd local." or something to that effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that that describe the issue, It only happens on Arm32, with physical register r4, we have a similar thing on Arm64 with a physical register r11 and that one works fortunately.
src/coreclr/src/zap/zapinfo.cpp
Outdated
@@ -502,50 +502,61 @@ void ZapInfo::CompileMethod() | |||
ULONG cCode; | |||
|
|||
#ifdef ALLOW_SXS_JIT_NGEN | |||
bool expectedAltJitFailure = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this fallback after other changes in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever I generate AltJit jit-diffs, we always get two copies of every method.
This fixes that issue. I could move this into it's own checkin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please. I have checked that I see that behavior (crossgen only), but I will need more time to understand your fix, so it will be nice not to block the PR because of this change.
Nice size improvement! Can you update your diffs with the number of improved/regressed methods? Also, can you include an analysis of regressions? |
@erozenfeld A typical regression is where we see a code size increase due to the hoisting of the method addresses from in a loop: |
|
|
f894c39
to
2f4aa60
Compare
@briansull out of curiosity, is the address of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few questions/comments.
@@ -2,7 +2,7 @@ | |||
<PropertyGroup> | |||
<OutputType>Exe</OutputType> | |||
<RestorePackages>true</RestorePackages> | |||
<CLRTestPriority>1</CLRTestPriority> | |||
<CLRTestPriority>0</CLRTestPriority> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you going to revert test changes before merge or want to keep them in pri0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No we need this extra test coverage
src/coreclr/src/jit/optcse.cpp
Outdated
|
||
BasicBlock* blk = lst->tslBlock; | ||
const BasicBlock::weight_t curWeight = blk->getBBWeight(m_pCompiler); | ||
if (isConstCSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it would be nice to shrink this function into smaller in the future, it was already 500+ lines, not it is 800+,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, size diffs are great. Please run jit-stress pipile before merge + a few nits (open an issue about lsra assert instead of referencing a line number, replace magic const 8, see if you can avoid reimplementing sort).
Thanks for the answers and extracting other changes out of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, though it would be great to add constant definitions for the option values, and an updated comment for the LSRA issue.
src/coreclr/src/jit/morph.cpp
Outdated
@@ -2707,6 +2707,11 @@ void Compiler::fgInitArgInfo(GenTreeCall* call) | |||
indirectCellAddress->AsIntCon()->gtTargetHandle = (size_t)call->gtCallMethHnd; | |||
#endif | |||
indirectCellAddress->SetRegNum(REG_R2R_INDIRECT_PARAM); | |||
#ifdef TARGET_ARM | |||
// Don't attempt to CSE this constant | |||
// This hits an assert: Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant has a specific register requirement, and I think that the same constant is used in another context with a different register requirement. So presumably the CSE value then has a conflict because it needs to be copied to both registers. I would probably restate this as:
"This constant has specific register requirements, and LSRA doesn't currently correctly handle them when the value is in a CSE'd local." or something to that effect.
db65611
to
3b6c32b
Compare
a8e5085
to
a6a40ac
Compare
Manually Ran the JitStress job again |
e11667c
to
32abfee
Compare
My JitStress run is failing due to |
32abfee
to
67bbc65
Compare
…ARM64 Implementation of code size optimization, CSE of constant values for ARM64 We will share a single CSE for constants that differ only in their low 12 bits on ARM64 Number of shared constant low bits set in target.h CSE_CONST_SHARED_LOW_BITS we use 12 bits on Arm platforms and 16 bits on XArch platforms Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32 as it hits Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723 Config variable: COMPlus_JitConstCSE // Default 0: enable the CSE of Constants, including nearby offsets. (only for ARM64) // If 1, disable all the CSE of Constants // If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64) // If 3, enable the CSE of Constants including nearby offsets. (all platforms) // If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms) //
67bbc65
to
fb2f8d5
Compare
@BruceForstall It looks like a couple of known errors cause these two failures in my JitStress run. runtime-coreclr jitstress (CoreCLR Pri1 Runtime Tests Run OSX x64 checked) runtime-coreclr jitstress (CoreCLR Pri1 Runtime Tests Run Windows_NT x86 checked) |
@briansull Both of those are known. |
* Change the type of csdHashKey to size_t * Update gtCostSz and gtCostEx for constant nodes * Implementation of code size optimization, CSE of constant values for ARM64 Implementation of code size optimization, CSE of constant values for ARM64 We will share a single CSE for constants that differ only in their low 12 bits on ARM64 Number of shared constant low bits set in target.h CSE_CONST_SHARED_LOW_BITS we use 12 bits on Arm platforms and 16 bits on XArch platforms Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32 as it hits Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723 Config variable: COMPlus_JitConstCSE // Default 0: enable the CSE of Constants, including nearby offsets. (only for ARM64) // If 1, disable all the CSE of Constants // If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64) // If 3, enable the CSE of Constants including nearby offsets. (all platforms) // If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms) // * Added additional Priority 0 test coverage for Floating Point optimizations * Fix for COMPLUS_JitConstCSE=4 * Renamed config variable from COMPlus_JitDisableConstCSE to COMPlus_JitConstCSE * Updated with Codereview feedback, removed sort from Const CSE phase * Fix for assertionProp issue in the refTypesdynamic test
* Change the type of csdHashKey to size_t * Update gtCostSz and gtCostEx for constant nodes * Implementation of code size optimization, CSE of constant values for ARM64 Implementation of code size optimization, CSE of constant values for ARM64 We will share a single CSE for constants that differ only in their low 12 bits on ARM64 Number of shared constant low bits set in target.h CSE_CONST_SHARED_LOW_BITS we use 12 bits on Arm platforms and 16 bits on XArch platforms Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32 as it hits Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723 Config variable: COMPlus_JitConstCSE // Default 0: enable the CSE of Constants, including nearby offsets. (only for ARM64) // If 1, disable all the CSE of Constants // If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64) // If 3, enable the CSE of Constants including nearby offsets. (all platforms) // If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms) // * Added additional Priority 0 test coverage for Floating Point optimizations * Fix for COMPLUS_JitConstCSE=4 * Renamed config variable from COMPlus_JitDisableConstCSE to COMPlus_JitConstCSE * Updated with Codereview feedback, removed sort from Const CSE phase * Fix for assertionProp issue in the refTypesdynamic test
* Change the type of csdHashKey to size_t * Update gtCostSz and gtCostEx for constant nodes * Implementation of code size optimization, CSE of constant values for ARM64 Implementation of code size optimization, CSE of constant values for ARM64 We will share a single CSE for constants that differ only in their low 12 bits on ARM64 Number of shared constant low bits set in target.h CSE_CONST_SHARED_LOW_BITS we use 12 bits on Arm platforms and 16 bits on XArch platforms Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32 as it hits Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723 Config variable: COMPlus_JitConstCSE // Default 0: enable the CSE of Constants, including nearby offsets. (only for ARM64) // If 1, disable all the CSE of Constants // If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64) // If 3, enable the CSE of Constants including nearby offsets. (all platforms) // If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms) // * Added additional Priority 0 test coverage for Floating Point optimizations * Fix for COMPLUS_JitConstCSE=4 * Renamed config variable from COMPlus_JitDisableConstCSE to COMPlus_JitConstCSE * Updated with Codereview feedback, removed sort from Const CSE phase * Fix for assertionProp issue in the refTypesdynamic test
@briansull , I just noticed that we don't CSE the offset calculation, do we? public static uint Compute(uint a)
{
result += (a * 2981231) + 2981235;
a += 2981235;
result += (a * 2981231) + 2981235;
return result;
} Before your CSE change, we generated this: G_M45768_IG02:
528FADE1 movz w1, #0x7d6f
72A005A1 movk w1, #45 LSL #16
1B017C01 mul w1, w0, w1
528FAE62 movz w2, #0x7d73
72A005A2 movk w2, #45 LSL #16
0B020021 add w1, w1, w2
0B020000 add w0, w0, w2
528FADE2 movz w2, #0x7d6f
72A005A2 movk w2, #45 LSL #16
1B027C00 mul w0, w0, w2
0B000020 add w0, w1, w0
528FAE61 movz w1, #0x7d73
72A005A1 movk w1, #45 LSL #16
0B010000 add w0, w0, w1 After this PR, we started generating this which is better but still has potential to optimize away G_M45768_IG02:
528FADE1 movz w1, #0x7d6f
72A005A1 movk w1, #45 LSL #16
1B017C02 mul w2, w0, w1
11001023 add w3, w1, #4
0B030042 add w2, w2, w3
11001023 add w3, w1, #4
0B030000 add w0, w0, w3
1B017C00 mul w0, w0, w1
0B020000 add w0, w0, w2
11001021 add w1, w1, #4
0B010000 add w0, w0, w1 |
@kunalspathak in your example we create one shared constant CSE that covers the five references If your example used different constants that differed by more than 2^12 we would have generated two CSE candidates.
|
Correct. But since the offset calculation is repeated several times, I was expecting that even that calculation gets CSE.
I think it depends on what you choose as the shared constant. Is it on first-seen in the code basis? E.g. in below code, constant public static uint[] Compute()
{
uint[] result = new uint[5];
result[0] = 2981231;
result[1] = 2981235;
result[2] = 2981235;
result[3] = 2981235;
result[4] = 2981235;
return result;
} Before this PR, we would have generated this: G_M30289_IG02:
D2927400 movz x0, #0x93a0
F2BF4700 movk x0, #0xfa38 LSL #16
F2CFFF60 movk x0, #0x7ffb LSL #32
D28000A1 mov x1, #5
97FF3CFA bl CORINFO_HELP_NEWARR_1_VC
528FADE1 movz w1, #0x7d6f
72A005A1 movk w1, #45 LSL #16
B9001001 str w1, [x0,#16]
528FAE61 movz w1, #0x7d73
72A005A1 movk w1, #45 LSL #16
B9001401 str w1, [x0,#20]
B9001801 str w1, [x0,#24]
B9001C01 str w1, [x0,#28]
B9002001 str w1, [x0,#32] However, we now generate this which is longer because of every time we create 2nd constant using G_M30289_IG02:
D2927400 movz x0, #0x93a0
F2BF4720 movk x0, #0xfa39 LSL #16
F2CFFF60 movk x0, #0x7ffb LSL #32
D28000A1 mov x1, #5
97FF3D00 bl CORINFO_HELP_NEWARR_1_VC
528FADE1 movz w1, #0x7d6f
72A005A1 movk w1, #45 LSL #16
B9001001 str w1, [x0,#16]
11001022 add w2, w1, #4
B9001402 str w2, [x0,#20]
11001022 add w2, w1, #4
B9001802 str w2, [x0,#24]
11001022 add w2, w1, #4
B9001C02 str w2, [x0,#28]
11001021 add w1, w1, #4
B9002001 str w1, [x0,#32] A typical use-case I am imagining is where we load 2 or more addresses (each differ from one another such that it falls under the CSE threshold of offset calculation), 1st address is created once, but the other addresses are used more frequently. |
Wonder if there's some residual DONT_CSE on those adds in anticipation of forming address modes on xarch; if so, we should revisit those to see which might need to be relaxed for arm64. |
@kunalspathak The shared const CSE logic selects either the first constant seen or the lowest constant of the shared constants, if the lower constant would need to use a SUB (subtract) instruction. |