ARM64 - Allow double constants to CSE more often #65176

Closed
wants to merge 17 commits

Conversation

TIHan
Contributor

@TIHan TIHan commented Feb 11, 2022

Will try to address: #35257

Description
At the moment on ARM64, we treat double constants as cheap enough not to CSE if the value fits within the range of valid immediate values for fmov. I believe we do this because we expect such constants to be contained in lowering. However, no ARM64 floating-point instruction other than fmov accepts an immediate value, so it is not beneficial to block CSE of double constants this aggressively.

This PR nearly matches the x64 behavior for double constants, simply by increasing the cost by 1 when the value is within the fmov immediate range.
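
For illustration, here is a small standalone sketch of the idea. This is not the JIT's actual code: the encodability test below only mirrors what the ARM64 emitter's fmov-immediate check accepts, and the cost numbers are illustrative.

#include <cstdint>
#include <cstring>

// Sketch only, not the JIT's implementation. An ARM64 fmov immediate can
// encode +/- (16..31)/16 * 2^e for e in [-3, 4]: only the top 4 fraction
// bits of the double may be set and the unbiased exponent must lie in that
// small range. Zero, denormals, NaNs and infinities all fail this test.
static bool IsFmovEncodable(double value)
{
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));

    const uint64_t fraction = bits & ((1ull << 52) - 1);
    const int      exponent = static_cast<int>((bits >> 52) & 0x7FF) - 1023;

    const bool lowFractionBitsClear = (fraction & ((1ull << 48) - 1)) == 0;
    return lowFractionBitsClear && (exponent >= -3) && (exponent <= 4);
}

// Illustrative costing in the spirit of this PR: an fmov-encodable constant
// stays cheap, but no longer so cheap that it can never become a CSE
// candidate; anything else costs roughly like a load from the data section.
static int DoubleConstantCostEx(double value)
{
    return IsFmovEncodable(value) ? 2 : 3;
}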

There are regressions, but when combined with #71174 there are fewer of them, because that change does not allow CSE'ing of double constants that are part of a GT_SIMD Init/InitN intrinsic.

Acceptance Criteria

@ghost ghost assigned TIHan Feb 11, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 11, 2022
@ghost

ghost commented Feb 11, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Will try to address: #64794

Author: TIHan
Assignees: TIHan
Labels:

area-CodeGen-coreclr

Milestone: -

@TIHan TIHan changed the title [WIP] Initial heuristic change for double constant adjustment [WIP] Heuristic change for double constant adjustment Feb 11, 2022
@kunalspathak
Member

Regarding the arm64 build failures, you can repro them on your machine if you run:

build.cmd clr.corelib+Clr.NativeCoreLib -c Release -runtimeconfiguration Checked -arch arm64

@kunalspathak
Member

The builds are still failing with an AV. I'm not sure compCurBB is valid at the point where you are checking it.

@TIHan
Contributor Author

TIHan commented Feb 12, 2022

It is valid, but crossgen2 is failing because it hits an assertion. I think we need to make more adjustments here. I'm not familiar enough with CSE yet, so I would need to spend some time on this.

@kunalspathak
Member

The builds are still failing with an AV. I'm not sure compCurBB is valid at the point where you are checking it.

I verified, and it repros on my local machine: compCurBB is nullptr.

2:051> kc5
 # Call Site
00 clrjit_universal_arm64_x64!Compiler::gtSetEvalOrder
01 clrjit_universal_arm64_x64!Compiler::gtSetEvalOrder
02 clrjit_universal_arm64_x64!Compiler::gtSetEvalOrder
03 clrjit_universal_arm64_x64!Compiler::gtSetStmtInfo
04 clrjit_universal_arm64_x64!Compiler::fgOptimizeBranch
2:051> ?? this->compCurBB
struct BasicBlock * 0x00000000`00000000
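
For context, any block-sensitive costing in gtSetEvalOrder needs a guard along these lines, since the stack above shows a caller (fgOptimizeBranch) that reaches it without a current block set. This is a hypothetical sketch, not the code from this PR:

// Hypothetical sketch, not the PR's code: gtSetEvalOrder is also reached
// from paths such as fgOptimizeBranch where the compiler has not set
// compCurBB, so a costing heuristic can only consult the current block
// when one is actually available.
if (compCurBB != nullptr)
{
    // Block context is available (e.g. loop nesting of compCurBB).
}
else
{
    // No block context: fall back to a block-agnostic cost.
}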

@ghost ghost closed this Mar 18, 2022
@ghost

ghost commented Mar 18, 2022

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 18, 2022
@TIHan TIHan reopened this Jun 17, 2022
@TIHan TIHan added this to the 7.0.0 milestone Jun 23, 2022
@TIHan TIHan changed the title [WIP] Heuristic change for double constant adjustment ARM64 - Allow double constants to be hoisted Jun 23, 2022
@TIHan TIHan changed the title ARM64 - Allow double constants to be hoisted ARM64 - Allow double constants to be CSE more often, matching x64 behavior Jun 23, 2022
@TIHan TIHan changed the title ARM64 - Allow double constants to be CSE more often, matching x64 behavior ARM64 - Allow double constants to CSE more often, matching x64 behavior Jun 23, 2022
@TIHan TIHan marked this pull request as ready for review June 23, 2022 17:18
@TIHan
Contributor Author

TIHan commented Jun 23, 2022

@kunalspathak This is basically done, but we need to merge #71174 to get an accurate assessment of the regressions.

@TIHan
Contributor Author

TIHan commented Jun 23, 2022

I went ahead and merged with #71174.

@EgorBo and I had an offline conversation: we can give fmov a cost of 2 instead of 1, while ldr will remain at a cost of 3.

@TIHan TIHan changed the title ARM64 - Allow double constants to CSE more often, matching x64 behavior ARM64 - Allow double constants to CSE more often Jun 23, 2022
@TIHan
Contributor Author

TIHan commented Jun 23, 2022

Diffs

@dotnet/jit-contrib This is ready.

@kunalspathak
Member

Could you look at the windows/arm64 benchmark diffs below and see why they came up with your change?
15858.dasm: [screenshot]

For this one, just double-check why the fmovs are not hoisted out of the outermost loop.

14301.dasm: [screenshot]

@TIHan
Contributor Author

TIHan commented Jun 28, 2022

@kunalspathak I've been looking at the ubfiz sample. We emit ubfiz when we find the pattern LSH(CAST(X), CNS_INT) in lowering. It is quite complicated, but I think the diff is due to that pattern no longer getting CSE'ed because we have reached the max CSE count:
[screenshot]
Notice that the CAST(X) part of LSH(CAST(X), CNS_INT) is marked as a CSE candidate. Because of that, the cast eventually gets removed later, which breaks the pattern:
[screenshot]
In the screenshot above, on the left (base), cse14 is this:

N013 (  4,  6) CSE #64 (def)[003944] -A------R--                            |                 +--*  ASG       long   $VN.Void
N012 (  1,  1)              [003943] D------N---                            |                 |  +--*  LCL_VAR   long   V55 cse14         $VN.Void
N011 (  4,  6)              [002956] -----------                            |                 |  \--*  LSH       long   $428
N009 (  2,  3) CSE #63 (def)[002957] ---------U-                            |                 |     +--*  CAST      long <- uint $427
N008 (  1,  1)              [002958] -----------                            |                 |     |  \--*  LCL_VAR   int    V10 loc6         u:9 $2d2
N010 (  1,  2)              [002959] -------N---                            |                 |     \--*  CNS_INT   long   3 $381
N014 (  1,  1)              [003945] -----------                            |                 \--*  LCL_VAR   long   V55 cse14 

So basically, the pattern LSH(CAST(X), CNS_INT) used to be CSE'ed as a whole, but now it is not because we reached the cap. This makes sense, since we are allowing double constants to be CSE'ed more.

I think this is probably fine, as it doesn't seem to be common. I did try an experiment marking the CAST(X) as DoNotCSE when it is part of that pattern on ARM64; it seems to work, but I'm not entirely sure of all the ramifications of doing that. Another possibility is costing the pattern LSH(CAST(X), CNS_INT) in a way that gives it more priority in CSE.
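
Roughly, the experiment described above amounts to something like the following on the ARM64 path. This is a hedged sketch, not the actual change; the exact placement and conditions would matter:

// Hedged sketch of the DoNotCSE experiment described above, not code from
// this PR: if a cast feeds a left shift by a constant -- the shape lowering
// turns into ubfiz on ARM64 -- keep the cast glued to the shift so CSE
// cannot pull it out and break the pattern.
if (tree->OperIs(GT_LSH) && tree->gtGetOp2()->IsCnsIntOrI())
{
    GenTree* shiftSrc = tree->gtGetOp1();
    if (shiftSrc->OperIs(GT_CAST))
    {
        shiftSrc->gtFlags |= GTF_DONT_CSE; // exclude the cast from CSE
    }
}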

@tannergooding
Member

What's the current "max CSE" count? Given all the improvements we've made in various places that allow more CSEs, is it worth revisiting this count and upping it (for .NET 8)?

@TIHan
Contributor Author

TIHan commented Jun 28, 2022

What's the current "max CSE" count?

It's 64 at the moment.
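
For context, the cap comes from the CSE availability set being tracked in a 64-bit mask, one bit per candidate; roughly, and with names approximating the JIT sources of the time:

#include <cstdint>

// Rough sketch of where the 64 comes from (names approximate): CSE
// availability is tracked in a single 64-bit mask, one bit per CSE
// candidate, so the number of CSEs considered per method is capped at
// the width of that mask.
typedef uint64_t EXPSET_TP;                           // bit vector, one bit per CSE
const unsigned   EXPSET_SZ   = sizeof(EXPSET_TP) * 8; // 64
const unsigned   MAX_CSE_CNT = EXPSET_SZ;             // the "max CSE" cap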

@kunalspathak I think the change is the best we can do at the moment. I can't quite figure out why we are not hoisting it a second time in 14301.dasm; even with the JitDump it is quite hard to tell, and I didn't want to spend much more time looking into it.

Increasing the costEx values even further tends to give more improvements but also more regressions, so I think the cost where the PR has it now is about as good as it gets without doing anything extreme.

@kunalspathak
Member

kunalspathak commented Jul 1, 2022

Looking at some diffs of BenchF.LLoops:Main1 (33211.dasm), there is something odd going on. We CSE the constant, but then decide to store it on the stack and reload it inside the loop, which is not something we would like to see.

[screenshot]

We use ldr instead of fmov even though the comment says that fmov is cheaper:

[screenshot]

and then we store it on the stack again (not needed):

[screenshot]

It might not be because of this PR, but either way some odd decisions are happening in LSRA that we should understand before merging this. Can you get me a before/after JitDump so I can understand why LSRA is behaving this way?

@kunalspathak
Member

I see this in other benchmarks too, e.g. MDBenchF.MDLLoops:Main1 (14305.dasm):

[screenshot]

@kunalspathak
Member

There are odd diffs in 218913 on windows-arm64:

[screenshot]

@kunalspathak
Member

I went through the other diffs and they look good, except for the weird LSRA issue and some unwanted diffs in tests. Please double-check why those are showing up.

@TIHan
Contributor Author

TIHan commented Jul 5, 2022

Diffs in 218913 look strange. I will look at that.

As for the diffs in 14305.dasm, this function is probably hitting cost limits.

@TIHan
Contributor Author

TIHan commented Jul 7, 2022

@kunalspathak I've been looking deeply at 218913:

With this PR, it creates a CSE for a double constant:

[screenshot]

But because constant propagation does not happen for CSE temps, constant folding does not occur, and an OAK_EQUAL assertion prop gets missed:
[screenshot]

I did some tinkering with assertion prop where I could get it to recognize that the CSE variable represents a constant, but I don't know if we want to do that. It didn't produce a lot of diffs, though it would fix this regression when combined with this PR.

But this is part of coreclr_tests, which does crazy things; I probably wouldn't go further than we already have.
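
For what it's worth, the tinkering described above goes roughly in this direction. This is a hypothetical sketch, not the PR's code, and GetSingleDefDoubleConstant is a made-up helper rather than a JIT API:

// Hypothetical sketch of the assertion-prop tinkering described above: when
// generating assertions for a comparison, look through a CSE temp whose only
// definition is a double constant, so the OAK_EQUAL assertion (and the
// constant folding that depends on it) is not lost.
GenTree* op = compare->gtGetOp1();
if (op->OperIs(GT_LCL_VAR))
{
    // GetSingleDefDoubleConstant is a made-up helper for this sketch: it
    // would return the GT_CNS_DBL node if the local is a CSE temp with a
    // single constant definition, and nullptr otherwise.
    GenTree* cns = GetSingleDefDoubleConstant(op->AsLclVar());
    if (cns != nullptr)
    {
        op = cns; // generate the assertion against the constant itself
    }
}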

@kunalspathak
Member

kunalspathak commented Jul 7, 2022

I probably wouldn't go further than we already have.

Sounds good to me.

As for the diffs in 14305.dasm, this function is probably hitting cost limits.

Do you mean CSE limits?

I would still be interested in knowing the reasoning behind #65176 (comment), because those diffs do show up in some benchmarks (in hot loops), especially the ones where we spill the constant to the stack.

@TIHan TIHan modified the milestones: 7.0.0, 8.0.0 Jul 11, 2022
@TIHan
Contributor Author

TIHan commented Jul 11, 2022

Pushing this to .NET 8, as we need to investigate the regressions more.

@JulieLeeMSFT
Member

@TIHan please move this to draft until you are ready to work on it.

@TIHan TIHan marked this pull request as draft September 6, 2022 22:53
@ghost ghost closed this Oct 6, 2022
This pull request was closed.