ARM64 - Allow double constants to CSE more often #65176

Closed
wants to merge 17 commits

Conversation

TIHan
Contributor

@TIHan TIHan commented Feb 11, 2022

Will try to address: #35257

Description
At the moment on ARM64, we treat double constants as cheap enough not to CSE if the value fits within the range of valid immediate values for fmov. I believe we do this because we expect such constants to be contained in lowering. However, no ARM64 floating-point instruction other than fmov accepts an immediate value, so it is not beneficial to block CSE of double constants this aggressively.

This PR nearly matches the x64 behavior for double constants, simply by increasing the cost by 1 when the value is within the fmov immediate range.
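
For illustration, here is a small standalone sketch of the idea. This is not the JIT's actual code: the encodability test below only mirrors what the ARM64 emitter's fmov-immediate check accepts, and the cost numbers are illustrative.

#include <cstdint>
#include <cstring>

// Sketch only, not the JIT's implementation. An ARM64 fmov immediate can
// encode +/- (16..31)/16 * 2^e for e in [-3, 4]: only the top 4 fraction
// bits of the double may be set and the unbiased exponent must lie in that
// small range. Zero, denormals, NaNs and infinities all fail this test.
static bool IsFmovEncodable(double value)
{
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));

    const uint64_t fraction = bits & ((1ull << 52) - 1);
    const int      exponent = static_cast<int>((bits >> 52) & 0x7FF) - 1023;

    const bool lowFractionBitsClear = (fraction & ((1ull << 48) - 1)) == 0;
    return lowFractionBitsClear && (exponent >= -3) && (exponent <= 4);
}

// Illustrative costing in the spirit of this PR: an fmov-encodable constant
// stays cheap, but no longer so cheap that it can never become a CSE
// candidate; anything else costs roughly like a load from the data section.
static int DoubleConstantCostEx(double value)
{
    return IsFmovEncodable(value) ? 2 : 3;
}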

There are regressions, but when combined with #71174 there are fewer of them, because that change does not allow CSE'ing of double constants that are part of a GT_SIMD Init/InitN intrinsic.

Acceptance Criteria

@ghost ghost assigned TIHan Feb 11, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 11, 2022
@ghost

ghost commented Feb 11, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Will try to address: #64794

Author: TIHan
Assignees: TIHan
Labels:

area-CodeGen-coreclr

Milestone: -

@TIHan TIHan changed the title [WIP] Initial heuristic change for double constant adjustment [WIP] Heuristic change for double constant adjustment Feb 11, 2022
@kunalspathak
Member

Regarding the arm64 build failures, you can repro them on your machine if you run:

build.cmd clr.corelib+Clr.NativeCoreLib -c Release -runtimeconfiguration Checked -arch arm64

@kunalspathak
Member

The builds are still failing with an AV. I'm not sure compCurBB is valid at the point where you are checking it.

@TIHan
Contributor Author

TIHan commented Feb 12, 2022

It is valid, but crossgen2 is failing because it hits an assertion. I think we need to make more adjustments here. I'm not familiar enough with CSE yet, so I would need to spend some time on this.

@kunalspathak
Member

The builds are still failing with an AV. I'm not sure compCurBB is valid at the point where you are checking it.

I verified, and it repros on my local machine: compCurBB is nullptr.

2:051> kc5
 # Call Site
00 clrjit_universal_arm64_x64!Compiler::gtSetEvalOrder
01 clrjit_universal_arm64_x64!Compiler::gtSetEvalOrder
02 clrjit_universal_arm64_x64!Compiler::gtSetEvalOrder
03 clrjit_universal_arm64_x64!Compiler::gtSetStmtInfo
04 clrjit_universal_arm64_x64!Compiler::fgOptimizeBranch
2:051> ?? this->compCurBB
struct BasicBlock * 0x00000000`00000000
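
For context, any block-sensitive costing in gtSetEvalOrder needs a guard along these lines, since the stack above shows a caller (fgOptimizeBranch) that reaches it without a current block set. This is a hypothetical sketch, not the code from this PR:

// Hypothetical sketch, not the PR's code: gtSetEvalOrder is also reached
// from paths such as fgOptimizeBranch where the compiler has not set
// compCurBB, so a costing heuristic can only consult the current block
// when one is actually available.
if (compCurBB != nullptr)
{
    // Block context is available (e.g. loop nesting of compCurBB).
}
else
{
    // No block context: fall back to a block-agnostic cost.
}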

@ghost ghost closed this Mar 18, 2022
@ghost

ghost commented Mar 18, 2022

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 18, 2022
@TIHan TIHan reopened this Jun 17, 2022
@TIHan TIHan added this to the 7.0.0 milestone Jun 23, 2022
@TIHan TIHan changed the title [WIP] Heuristic change for double constant adjustment ARM64 - Allow double constants to be hoisted Jun 23, 2022
@TIHan TIHan changed the title ARM64 - Allow double constants to be hoisted ARM64 - Allow double constants to be CSE more often, matching x64 behavior Jun 23, 2022
@TIHan TIHan changed the title ARM64 - Allow double constants to be CSE more often, matching x64 behavior ARM64 - Allow double constants to CSE more often, matching x64 behavior Jun 23, 2022
@TIHan TIHan marked this pull request as ready for review June 23, 2022 17:18
@TIHan
Contributor Author

TIHan commented Jun 23, 2022

@kunalspathak This is basically done, but we need to merge #71174 to get an accurate assessment of the regressions.

@TIHan
Contributor Author

TIHan commented Jun 23, 2022

I went ahead and merged with #71174.

@EgorBo and I had an offline conversation: we can give fmov a cost of 2 instead of 1, while ldr will remain at a cost of 3.

@TIHan TIHan changed the title ARM64 - Allow double constants to CSE more often, matching x64 behavior ARM64 - Allow double constants to CSE more often Jun 23, 2022
@TIHan
Contributor Author

TIHan commented Jun 23, 2022

Diffs

@dotnet/jit-contrib This is ready.

@kunalspathak
Member

Could you look at the windows/arm64 benchmark diffs below and see why they came up with your change?
15858.dasm: [screenshot]

For this one, just double-check why the fmovs are not hoisted out of the outermost loop.

14301.dasm: [screenshot]

@TIHan
Contributor Author

TIHan commented Jun 28, 2022

@kunalspathak I've been looking at the ubfiz sample. We emit ubfiz when we find the pattern LSH(CAST(X), CNS_INT) in lowering. It is quite complicated, but I think the diff is due to that pattern no longer getting CSE'ed because we have reached the max CSE count:
[screenshot]
Notice that the CAST(X) part of LSH(CAST(X), CNS_INT) is marked as a CSE candidate. Because of that, the cast eventually gets removed later, which breaks the pattern:
[screenshot]
In the screenshot above, on the left (base), cse14 is this:

N013 (  4,  6) CSE #64 (def)[003944] -A------R--                            |                 +--*  ASG       long   $VN.Void
N012 (  1,  1)              [003943] D------N---                            |                 |  +--*  LCL_VAR   long   V55 cse14         $VN.Void
N011 (  4,  6)              [002956] -----------                            |                 |  \--*  LSH       long   $428
N009 (  2,  3) CSE #63 (def)[002957] ---------U-                            |                 |     +--*  CAST      long <- uint $427
N008 (  1,  1)              [002958] -----------                            |                 |     |  \--*  LCL_VAR   int    V10 loc6         u:9 $2d2
N010 (  1,  2)              [002959] -------N---                            |                 |     \--*  CNS_INT   long   3 $381
N014 (  1,  1)              [003945] -----------                            |                 \--*  LCL_VAR   long   V55 cse14 

So basically, the pattern LSH(CAST(X), CNS_INT) used to be CSE'ed as a whole, but now it is not because we reached the cap. This makes sense, since we are allowing double constants to be CSE'ed more.

I think this is probably fine, as it doesn't seem to be common. I did try an experiment marking the CAST(X) as DoNotCSE when it is part of that pattern on ARM64; it seems to work, but I'm not entirely sure of all the ramifications of doing that. Another possibility is costing the pattern LSH(CAST(X), CNS_INT) in a way that gives it more priority in CSE.
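
Roughly, the experiment described above amounts to something like the following on the ARM64 path. This is a hedged sketch, not the actual change; the exact placement and conditions would matter:

// Hedged sketch of the DoNotCSE experiment described above, not code from
// this PR: if a cast feeds a left shift by a constant -- the shape lowering
// turns into ubfiz on ARM64 -- keep the cast glued to the shift so CSE
// cannot pull it out and break the pattern.
if (tree->OperIs(GT_LSH) && tree->gtGetOp2()->IsCnsIntOrI())
{
    GenTree* shiftSrc = tree->gtGetOp1();
    if (shiftSrc->OperIs(GT_CAST))
    {
        shiftSrc->gtFlags |= GTF_DONT_CSE; // exclude the cast from CSE
    }
}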

@tannergooding
Member

What's the current "max CSE" count? Given all the improvements we've made in various places that allow more CSEs, is it worth revisiting this count and upping it (for .NET 8)?

@TIHan
Contributor Author

TIHan commented Jun 28, 2022

What's the current "max CSE" count?

It's 64 at the moment.
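
For context, the cap comes from the CSE availability set being tracked in a 64-bit mask, one bit per candidate; roughly, and with names approximating the JIT sources of the time:

#include <cstdint>

// Rough sketch of where the 64 comes from (names approximate): CSE
// availability is tracked in a single 64-bit mask, one bit per CSE
// candidate, so the number of CSEs considered per method is capped at
// the width of that mask.
typedef uint64_t EXPSET_TP;                           // bit vector, one bit per CSE
const unsigned   EXPSET_SZ   = sizeof(EXPSET_TP) * 8; // 64
const unsigned   MAX_CSE_CNT = EXPSET_SZ;             // the "max CSE" cap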

@kunalspathak I think the change is the best we can do at the moment. I can't quite figure out why we are not hoisting it a second time in 14301.dasm; even with the JitDump it is quite hard to tell, and I didn't want to spend much more time looking into it.

Increasing the costEx values even further tends to give more improvements but also more regressions, so I think the cost where the PR has it now is about as good as it gets without doing anything extreme.

@kunalspathak
Member

kunalspathak commented Jul 1, 2022

Looking at some diffs of BenchF.LLoops:Main1 (33211.dasm), there is something odd going on. We CSE the constant, but then decide to store it on the stack and reload it inside the loop, which is not something we would like to see.

[screenshot]

We use ldr instead of fmov even though the comment says that fmov is cheaper:

[screenshot]

and then we store it on the stack again (not needed):

[screenshot]

It might not be because of this PR, but either way some odd decisions are happening in LSRA that we should understand before merging this. Can you get me a before/after JitDump so I can understand why LSRA is behaving this way?

@kunalspathak
Member

I see this in other benchmarks too, e.g. MDBenchF.MDLLoops:Main1 (14305.dasm):

[screenshot]

@kunalspathak
Member

There are odd diffs in 218913 on windows-arm64:

[screenshot]

@kunalspathak
Member

I went through the other diffs and they look good, except for the weird LSRA issue and some unwanted diffs in tests. Please double-check why those are showing up.

@TIHan
Contributor Author

TIHan commented Jul 5, 2022

Diffs in 218913 look strange. I will look at that.

As for the diffs in 14305.dasm, this function is probably hitting cost limits.

@TIHan
Contributor Author

TIHan commented Jul 7, 2022

@kunalspathak I've been looking deeply at 218913:

With this PR, it creates a CSE for a double constant:

[screenshot]

But because constant propagation does not happen for CSE temps, constant folding does not occur, and an OAK_EQUAL assertion prop gets missed:
[screenshot]

I did some tinkering with assertion prop where I could get it to recognize that the CSE variable represents a constant, but I don't know if we want to do that. It didn't produce a lot of diffs, though it would fix this regression when combined with this PR.

But this is part of coreclr_tests, which does crazy things; I probably wouldn't go further than we already have.
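
For what it's worth, the tinkering described above goes roughly in this direction. This is a hypothetical sketch, not the PR's code, and GetSingleDefDoubleConstant is a made-up helper rather than a JIT API:

// Hypothetical sketch of the assertion-prop tinkering described above: when
// generating assertions for a comparison, look through a CSE temp whose only
// definition is a double constant, so the OAK_EQUAL assertion (and the
// constant folding that depends on it) is not lost.
GenTree* op = compare->gtGetOp1();
if (op->OperIs(GT_LCL_VAR))
{
    // GetSingleDefDoubleConstant is a made-up helper for this sketch: it
    // would return the GT_CNS_DBL node if the local is a CSE temp with a
    // single constant definition, and nullptr otherwise.
    GenTree* cns = GetSingleDefDoubleConstant(op->AsLclVar());
    if (cns != nullptr)
    {
        op = cns; // generate the assertion against the constant itself
    }
}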

@kunalspathak
Member

kunalspathak commented Jul 7, 2022

I probably wouldn't go further than we already have.

Sounds good to me.

As for the diffs in 14305.dasm, this function is probably hitting cost limits.

Do you mean CSE limits?

I would still be interested in knowing the reasoning behind #65176 (comment), because those diffs do show up in some benchmarks (in hot loops), especially the ones where we spill the constant to the stack.

@TIHan TIHan modified the milestones: 7.0.0, 8.0.0 Jul 11, 2022
@TIHan
Contributor Author

TIHan commented Jul 11, 2022

Pushing this to .NET 8, as we need to investigate the regressions more.

@JulieLeeMSFT
Member

@TIHan please move this to draft until you are ready to work on it.

@TIHan TIHan marked this pull request as draft September 6, 2022 22:53
@ghost ghost closed this Oct 6, 2022
This pull request was closed.