
Initial support for zmm in .NET #80960

Merged (15 commits into dotnet:main) on Mar 6, 2023

Conversation

@DeepakRajendrakumaran (Contributor) commented Jan 21, 2023

This commit includes the following:

  1. Changes required to support zmm registers
  2. Changes required to accelerate a subset of Vector512 methods (Create() and Zero) using AVX-512
  3. Changes required to save/restore zmm registers when needed.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jan 21, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 21, 2023
@ghost commented Jan 21, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@DeepakRajendrakumaran (Contributor, Author) commented Jan 21, 2023

@anthonycanino @tannergooding @BruceForstall

One open question I have is regarding calculating 'cost'. See the change here: 47a7424. I'm not sure how the heuristic works or whether '+2' for the movup* makes sense.

@teo-tsirpanis teo-tsirpanis added the avx512 Related to the AVX-512 architecture label Jan 21, 2023
@BruceForstall (Member)

One open question I have is regarding calculating 'cost'. See the change here: 47a7424. I'm not sure how the heuristic works or whether '+2' for the movup* makes sense.

I don't know about the cost details of the function you reference (maybe @AndyAyersMS could advise). However, don't SIMD64 and SIMD32 require similar kinds of codegen w.r.t. saving/restoring registers around a call?

I don't know whether there are cases where SIMD64 should be given higher (more expensive) weights than SIMD32, just because there are twice as many bits in play (so presumably loads/stores are more expensive).

@BruceForstall (Member) left a comment:

Looks like good progress.

Review threads (resolved): src/coreclr/inc/corjit.h, src/coreclr/jit/compiler.h, src/coreclr/jit/emit.h, src/coreclr/jit/gentree.cpp
@@ -62,6 +62,7 @@ DEF_TP(SIMD8 ,"simd8" , TYP_SIMD8, TI_STRUCT, 8, 8, 8, 2, 8, VTF_S)
DEF_TP(SIMD12 ,"simd12" , TYP_SIMD12, TI_STRUCT,12,16, 16, 4,16, VTF_S)
DEF_TP(SIMD16 ,"simd16" , TYP_SIMD16, TI_STRUCT,16,16, 16, 4,16, VTF_S)
DEF_TP(SIMD32 ,"simd32" , TYP_SIMD32, TI_STRUCT,32,32, 32, 8,16, VTF_S)
DEF_TP(SIMD64 ,"simd64" , TYP_SIMD64, TI_STRUCT,64,64, 64, 16,16, VTF_S)
A Member commented:

@tannergooding It looks like TYP_SIMD32 is enabled for Arm64, and so are a lot of code paths related to it, e.g., in GenTreeVecCon. How much of the Arm64 compiler actually uses it, given that the maximum vector size is 16 bytes? I'm worried that TYP_SIMD64 will impose an even larger burden on Arm64 for no benefit.

Maybe there's not really much of a burden, except GenTreeVecCon is now a large size node, plus a bunch of unused code paths?

A Member commented:

It's also a little surprising to me that SIMD32 alignment is 16, not 32, and SIMD64 is also 16, not 64, but the comment above indicates there's a reason for that (maybe it should point at code that changes the alignment dynamically).

A Member commented:

We don't actually use it, to my knowledge, and any handling should be superfluous.

It likely just was never ifdef'd out when Arm64 was brought online.

The PR author (Contributor) replied:

I don't fully understand the question here:

  1. Do you mean the types SIMD32 and SIMD64 should not be defined for Arm64 at all? The following made sense in my head, since both Arm and x86 use the same calls to create GenTreeVecCon etc., and if GenTreeVecCon is using simd64Val now (since it's the largest), we need to do the same here:
    https://github.com/dotnet/runtime/pull/80960/files#diff-547b75c8f8dd006693805c30d7a0e2ddcdd6b10691383dfbb16e3ed0c50f236aL1431-R1431 . When you say it doesn't need to be handled, do you mean the changes inside Lowering::LowerHWIntrinsicCreate above are not needed?
  2. I'm not sure about the alignment being 16 there. That was another of my open questions.

A Member replied:

Just saying that TYP_SIMD32/64 paths aren't ever used on arm64 and so we could ifdef them out and only have them on xarch.

I don't remember all the things that the alignment: 16 impacts. I believe it primarily impacts stack alignment and I don't believe we have the necessary support to change it to 32/64 today.

A Member commented:

I think @DeepakRajendrakumaran doesn't need to worry about the impact of TYP_SIMD64 on Arm64 too much, but we'll see what the throughput (TP) measurements say about the actual impact. Perhaps we do need to go back and make sure Arm64 isn't unduly impacted in TP or JIT code size by unused functionality.

One example: a lot of this PR is adjusting simd32_t to simd64_t in many places. Maybe there should be a "union" type of simd_t (or simd_data_t?) that is a union of simd64_t, simd32_t, simd16_t, etc. Except on Arm64 it wouldn't include simd32_t or simd64_t, since they aren't needed/used.

A Member commented:

One example: a lot of this PR is adjusting simd32_t to simd64_t in many places. Maybe there should be a "union" type of simd_t (or simd_data_t?) that is a union of simd64_t, simd32_t, simd16_t, etc. Except on Arm64 it wouldn't include simd32_t or simd64_t, since they aren't needed/used.

Each type is functionally that already.

That is, simd16_t is a union of each primitive type + 2x simd8_t; simd32_t is likewise a union of each primitive type + 4x simd8_t and 2x simd16_t.

So by using simd64_t you have efficient access to just the simd8/16/32 data if necessary. The downside is that it declares 48 bytes of unused data on Arm64, and 32 bytes of unused data on hardware without AVX-512 support.

Review threads (resolved): src/coreclr/jit/hwintrinsicxarch.cpp, src/coreclr/jit/lsra.cpp, src/coreclr/jit/valuenum.h
@BruceForstall BruceForstall mentioned this pull request Jan 23, 2023
@DeepakRajendrakumaran (Contributor, Author)

While we wait for @AndyAyersMS to chime in, another question is regarding JITEEVersionIdentifier. I assume this needs to be updated as well? No 'new' methods were added, but some changes were made.

@BruceForstall (Member)

another question is re JITEEVersionIdentifier

Yes, the change merits a JITEEVersion change.

However, I would not include that in the PR, if possible, until after all other testing has occurred. That's because it will make it impossible to run superpmi-diffs and superpmi-replay pipelines and see their results.

@BruceForstall (Member)

Given the asserts I see in superpmi pipelines:

[02:38:39] ISSUE: <ASSERT> #1 D:\a\_work\1\s\src\coreclr\jit\compiler.cpp (2294) - Assertion failed 'instructionSetFlags.Equals(EnsureInstructionSetFlagsAreValid(instructionSetFlags))' in 'System.Runtime.CompilerServices.CastHelpers:StelemRef(System.Array,long,System.Object)' during 'Pre-import' (IL size 88; hash 0xdcdda805; Optimization-Level-Not-Yet-Set)

Maybe the existing SuperPMI collections (using the old JIT-EE interface, e.g., InstructionSet* definitions) aren't going to work.

I wonder if there's some way you can "hack" the change so it basically works with the "old" JIT-EE interface, just so we can leverage the superpmi pipelines (with existing superpmi collections) to measure throughput impact, especially on Arm64 or x64 without AVX-512 enabled, and also to get asmdiffs on those platforms.

@DeepakRajendrakumaran (Contributor, Author)

Given the asserts I see in superpmi pipelines:

[02:38:39] ISSUE: <ASSERT> #1 D:\a\_work\1\s\src\coreclr\jit\compiler.cpp (2294) - Assertion failed 'instructionSetFlags.Equals(EnsureInstructionSetFlagsAreValid(instructionSetFlags))' in 'System.Runtime.CompilerServices.CastHelpers:StelemRef(System.Array,long,System.Object)' during 'Pre-import' (IL size 88; hash 0xdcdda805; Optimization-Level-Not-Yet-Set)

Maybe the existing SuperPMI collections (using the old JIT-EE interface, e.g., InstructionSet* definitions) aren't going to work.

I wonder if there's some way you can "hack" the change so it basically works with the "old" JIT-EE interface, just so we can leverage the superpmi pipelines (with existing superpmi collections) to measure throughput impact, especially on Arm64 or x64 without AVX-512 enabled, and also to get asmdiffs on those platforms.

Appreciate the suggestion. I'm struggling with some of these failures. Will try this out :)

@BruceForstall (Member)

Another alternative is that I could create a new SuperPMI collection using just the minimum required changes to the JIT-EE interface, including the JIT-EE GUID change. Then, you could include in this PR the very same GUID change, and it would pick up the new collection. This is a sometimes time-consuming option, but let me know if you need to go this direction.

(This might be a little tricky, as the replay job currently explicitly prevents being run if the JIT-EE GUID changes, but it looks like the diffs job doesn't, for some reason.)

@tannergooding (Member)

and it would pick up the new collection

It would be great if we had an opt-in job (azp run superpmi-diffs-full or something) which did spend the time to create a new collection and run diffs. This comes up just often enough that having such a job would save us time and resources for these kinds of changes.

@BruceForstall (Member)

It would be great if we had an opt-in job (azp run superpmi-diffs-full or something) which did spend the time to create a new collection and run diffs. This comes up just often enough that having such a job would save us time and resources for these kinds of changes.

Due to security reasons, we can't do superpmi collections on PRs and update our Azure Storage location where these are usually stored. We could possibly put the generated MCH files in Azure Storage PR artifact storage and pull them from there, but that requires a bigger change in how our scripting works. Probably doable, but even more work.

@tannergooding (Member)

I was thinking just a local collection that is thrown away after the PR. I don't think it happens frequently enough that needing storage would be a concern.

@DeepakRajendrakumaran (Contributor, Author)

@tannergooding All requested changes should be there now.

@tannergooding (Member) commented Feb 24, 2023

Throughput numbers are a lot better now...

However, given the numbers also are impacting Arm64, I'd guess this is due to changes in the common code paths, maybe the TYP_SIMD64 and simd64_t handling logic.

@BruceForstall, what are your thoughts on taking this "as-is" and me getting up a PR that moves TYP_SIMD32/64 to be xarch only (as discussed here: #80960 (comment)) as a follow up?

I think the alternative is doing some more profiling "now" (maybe with Intel VTune or AMD uProf over crossgen2 of corelib) and trying to identify where the new hot spots are.

At a glance, we might want to make varTypeIsSIMD be return (TypeGet(vt) >= TYP_SIMD8) && (TypeGet(vt) <= TYP_SIMD64) (or similar) for example. We could alternatively have it do a varTypeClassification check like is done for Integral types.

Windows x64

MinOpts (+0.07% to +0.34%)
FullOpts (+0.13% to +0.19%)

Windows arm64

MinOpts (+0.11% to +0.25%)
FullOpts (+0.12% to +0.14%)

@tannergooding (Member)

Put up #82616 to test the throughput impact of varTypeIsSIMD and having it use a varTypeClassification rather than a switch.

@tannergooding (Member)

Put up #82624 to test the throughput impact of handling TYP_SIMD32 on Arm64

@tannergooding (Member) commented Feb 24, 2023

Restricting TYP_SIMD32 to xarch only represents an approx. -0.09% to -0.19% improvement on Arm64.

Fixing varTypeIsSIMD is a -0.31% to -0.92% improvement across all architectures.

For Arm64 we'd likely see some additional gains for GT_CNS_VEC if we made it simd16_t only, but I opted not to do that yet, given it would be much simpler with the template methods this PR added.

We'll still "regress" some things on xarch compared to not having TYP_SIMD64 at all, but it will overall still be a TP improvement compared to .NET 7.

@DeepakRajendrakumaran (Contributor, Author) commented Feb 24, 2023

Restricting TYP_SIMD32 to xarch only represents an approx. -0.09% to -0.19% improvement on Arm64.

Fixing varTypeIsSIMD is a -0.31% to -0.92% improvement across all architectures.

For Arm64 we'd likely see some additional gains for GT_CNS_VEC if we made it simd16_t only, but I opted not to do that yet, given it would be much simpler with the template methods this PR added.

We'll still "regress" some things on xarch compared to not having TYP_SIMD64 at all, but it will overall still be a TP improvement compared to .NET 7.

So, I assume you want both changes to be merged? I imagine this will need to be modified again as I expect some merge conflicts if the other 2 PRs go in first

@tannergooding (Member)

So, I assume you want both changes to be merged? I imagine this will need to be modified again as I expect some merge conflicts if the other 2 PRs go in first

Right. I think it's worth doing so, at least so we can confirm no negative impact to Arm64. I can take care of resolving the conflicts; they should be relatively straightforward.

@DeepakRajendrakumaran (Contributor, Author)

So, I assume you want both changes to be merged? I imagine this will need to be modified again as I expect some merge conflicts if the other 2 PRs go in first

Right. I think it's worth doing so, at least so we can confirm no negative impact to Arm64. I can take care of resolving the conflicts; they should be relatively straightforward.

Sounds good. Would be nice to finally have this merged :)

@tannergooding (Member) commented Feb 24, 2023

With just the varTypeIsSIMD fix, we now have the below and so are already looking a lot better.

It would be interesting to see where the TP improvement is coming from, likely one of the suggested refactorings (maybe around using genLog2 or similar).

It does look like there are some x64 diffs around Vector512 methods. Looks like we're getting a call emitted where previously we weren't...

Windows x64

MinOpts (-0.16% to -0.01%)
FullOpts (-0.01% to +0.02%)

Windows Arm64

MinOpts (-0.05% to -0.00%)
FullOpts (+0.00% to +0.01%)

@tannergooding (Member) commented Feb 25, 2023

Diffs

TP issues have been resolved. There is still a small TP regression of up to +0.03% in FullOpts and up to +0.01% in MinOpts, but inversely there is also a TP improvement of up to -0.16% in MinOpts.

The small diffs that have cropped up in Vector512.WithElement are a side effect of us now recognizing Vector512 as a valid ISA class, exposing a pre-existing "bug" in that get_Count isn't getting recognized as an intrinsic when the respective Vector64/128/256/512 is "valid" but "unsupported" (the runtime behavior is still correct; it's just an inlined call and not a constant node as intended).

I plan on giving this one more review pass myself, but it should generally be good and mergeable after it gets sign-off from someone in @dotnet/jit-contrib

@tannergooding (Member) left a comment:

Changes overall LGTM.

I had left a few comments about some semi-related potential post PR cleanup that could happen. I'll work through those and get them completed.

@DeepakRajendrakumaran (Contributor, Author)

Changes overall LGTM.

I had left a few comments about some semi-related potential post PR cleanup that could happen. I'll work through those and get them completed.

Are we waiting on an additional review before merging this?

@tannergooding (Member)

Are we waiting on an additional review before merging this?

Yes, it needs sign-off from someone who works on the JIT team. CC @dotnet/jit-contrib, @BruceForstall

@DeepakRajendrakumaran (Contributor, Author)

Bumping this since #82731 depends on it.

CC https://github.com/orgs/dotnet/teams/jit-contrib, @BruceForstall

@kunalspathak (Member)

I lost track of the discussion on this PR, but have we already tracked down why there is a TP impact for Arm64?

[image: throughput results for Arm64]

@tannergooding (Member)

There were various optimization suggestions that apply to common code paths, which give the improvements (namely the cases where we're using genLog2 rather than memory lookups that span multiple cache lines).

There also remain a couple of common code paths handling EA32/64BYTE which aren't ifdef'd yet (but could be) and which can have a minor negative impact on TP.

@BruceForstall BruceForstall merged commit 6123cb0 into dotnet:main Mar 6, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Apr 5, 2023