Recognize the Vector128/256.AsVector* methods as intrinsic #1280
case NI_Vector128_AsVector2:
case NI_Vector128_AsVector3:
case NI_Vector128_AsVector4:
AsVector4 is the simplest: we have a TYP_SIMD16 and are going to a TYP_SIMD16.

AsVector2 and AsVector3 should also be simple. We have a TYP_SIMD16 and we want a TYP_SIMD8 or TYP_SIMD12. If they are already in register, there is no change needed. If they are in memory, we have 16 bytes and are correctly aligned, so it should likewise be safe to just access.
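The narrowing described above can be modeled in plain C++. This is only an illustrative sketch, not the actual JIT or BCL code; `Vec2`/`Vec3`/`Vec4` are stand-ins for the real vector types:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical model: a 16-byte, 16-byte-aligned value, as TYP_SIMD16
// locals are. Narrowing to Vector2/Vector3 only shrinks the *view* of the
// storage; no data movement is required.
struct alignas(16) Vec4 { float x, y, z, w; };
struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// "AsVector2": read just the low 8 bytes of the aligned 16-byte storage.
inline Vec2 AsVector2(const Vec4& v) {
    Vec2 r;
    std::memcpy(&r, &v, sizeof(Vec2)); // in-register this would be a no-op
    return r;
}

// "AsVector3": read just the low 12 bytes.
inline Vec3 AsVector3(const Vec4& v) {
    Vec3 r;
    std::memcpy(&r, &v, sizeof(Vec3));
    return r;
}
```

Because the 16-byte source is fully owned and aligned, reading only its prefix is always in bounds, which is why the memory case is as safe as the register case.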
> If they are already in register, there is no change needed.

Right, but we should ensure that they get the right class layout.
> If they are already in register, there is no change needed

I'm not so sure. Some SIMD8/12 operations intentionally zero out the upper unused elements (SIMDIntrinsicDiv and maybe others). Either that zeroing is unnecessary or AsVector2/3 should also zero out.
I believe that's just to avoid the penalty around x / 0, though, rather than a functional requirement.

We may want to have both options: a "simple, but correct" API, that is, one which ensures that the upper bits are correctly zeroed, and a "fast, but unsafe" API, that is, one which leaves clearing/setting the upper bits to the user; that way the conversion is as cheap as possible (generally free).
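The contrast between the two shapes can be sketched in C++. The names and types here are illustrative only (not the real System.Runtime.Intrinsics surface), and the sentinel value models garbage bits:

```cpp
#include <cassert>

// Illustrative stand-ins for Vector2 and Vector128<float>.
struct Vec2 { float x, y; };
struct Vec128f { float e[4]; };

// "Simple, but correct": upper elements are deterministically zeroed.
inline Vec128f WidenSafe(Vec2 v) {
    return Vec128f{{v.x, v.y, 0.0f, 0.0f}};
}

// "Fast, but unsafe": upper elements are unspecified. The 99.0f sentinel
// models the leftover register bits a no-op conversion would expose.
inline Vec128f WidenUnsafe(Vec2 v) {
    return Vec128f{{v.x, v.y, 99.0f, 99.0f}};
}
```

With the safe shape the consumer can rely on the upper lanes; with the unsafe shape only the low lanes are meaningful and the conversion can compile to nothing.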
@CarolEidt, do you have an opinion here?

When converting Vector128<float> to Vector2/3, should I explicitly zero the untracked z/w fields, or should that responsibility be left up to the consumer if they feel it is important? As best as I can tell, it doesn't matter from a functional perspective, and at worst it could cause a minor perf hit if those fields end up containing a subnormal value or NaN.
No, dpps allows you to specify which elements are consumed and which are ignored as part of the computation.
Looks like we might need to update the codegen for Vector2.Dot, however. It currently generates the same code as Vector4, using 0xF1 as the control word, and it should be 0x3F or 0x31. Vector3.Dot already correctly uses 0x71.
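To make the control words concrete, here is a scalar C++ emulation of the dpps immediate's semantics (an illustration, not codegen output). The high nibble masks which source lanes enter the dot product; the low nibble masks which destination lanes receive it, which is why 0x31 and 0x3F compute the same Vector2 result in lane 0 while 0xF1 sums all four lanes:

```cpp
#include <cassert>

// Scalar emulation of the SSE4.1 dpps control byte, returning lane 0 of
// the result: high nibble = input lane mask, low nibble = output lane mask.
inline float DppsLane0(const float a[4], const float b[4], unsigned imm) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm & (0x10u << i))       // high nibble: does lane i participate?
            sum += a[i] * b[i];
    return (imm & 0x1u) ? sum : 0.0f; // low nibble, bit 0: is lane 0 written?
}
```

So a Vector2 dot emitted with 0xF1 silently folds the garbage z/w lanes into the result, which is the bug described above.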
Yeah, just saw that. The SSE2 fallback for Vector2 is also using the Vector4 code, from what I can tell.
Sounds like some room for improvement there then. I'll take a look at a few other cases and will log bugs to start fixing them.
I reverted the change treating AsVector2/AsVector3 as intrinsic for right now, since it requires some fixes to the SIMD code and since it looks like it requires a few fixes elsewhere in the JIT as well.
break;
}

case NI_Vector128_AsVector128:
AsVector128 is a bit more complex.

For Vector4 to Vector128 it should be simple, as it is TYP_SIMD16 to TYP_SIMD16.

The other two (which are currently unhandled) I'd like some input on how to handle. If they are in register, it can be a nop. Likewise, if they were spilled to the stack, they can be a nop. However, if it is a field or an array element, we have to read 8 or 12 bytes exactly (I probably need to propose Unsafe versions so we don't have to zero-init the upper element as well).

Should we carry the NI_Vector128_AsVector128 through to codegen and decide what to do there (nop vs. actual read, depending on location), or is there something clever we can do earlier in the compilation?
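The conservative field/array-element path can be sketched as follows. The types are illustrative stand-ins, not BCL code; the point is that exactly 12 bytes are read and the upper lane is zero-initialized:

```cpp
#include <cassert>
#include <cstring>

// Illustrative stand-ins for Vector3 and Vector128<float>.
struct Vec3 { float x, y, z; };
struct Vec128f { float e[4]; };

// When a Vector3 lives in a field or array element, only 12 bytes may
// legally be read; the upper lane is filled by zero-initializing first.
inline Vec128f AsVector128(const Vec3& v) {
    Vec128f r{};                       // zero-init; an Unsafe variant could skip this
    std::memcpy(&r, &v, sizeof(Vec3)); // read exactly 12 bytes, never 16
    return r;
}
```

Only when the JIT can prove the value is in a register (or a full 16-byte spill slot) can the copy and the zeroing both disappear.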
CC. @CarolEidt for input here
It would be nice to be able to handle this in a "generic" way - that is, using a CAST from one type to another that does the appropriate widening. However, I don't think the current handling of casts is "generic" in a good way. So, I'd probably vote for carrying the conversion through to Lowering and codegen.
> that is, using a CAST from one type to another that does the appropriate widening.

Such a cast shouldn't be needed, much like we don't need a cast from short to int after a short indir. A SIMD12 indir really produces a SIMD16 value.
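The short-to-int analogy, restated in C++ terms (a sketch of the idea, not compiler code):

```cpp
#include <cassert>
#include <cstdint>

// A 16-bit load already yields a widened value in one operation (a
// movsx-style load); no separate cast node is needed, just as a SIMD12
// indir naturally fills a full SIMD16 register.
inline int32_t LoadShortWidened(const int16_t* p) {
    return *p; // the load and the sign-extension happen together
}
```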
I'm going to leave this to a future PR for now. Vector2/3 to Vector128<float> is going to be left as a managed implementation which explicitly zeroes the unused upper elements. In particular, I want to get some more testing and samples done for this and don't want to block the other changes.
break;
}

case NI_Vector256_AsVector:
AsVector is also fairly simple.

If we don't support AVX, we just go to the software fallback.

If we do support AVX, but not AVX2 (meaning Vector<T> is 16 bytes), then we just need to GetLower (which is generally a nop, as we are either in register or in a memory location that is correctly aligned and at least 16 bytes).
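The AVX-without-AVX2 case can be sketched in C++ (illustrative types, not the real API):

```cpp
#include <cassert>
#include <cstring>

// Illustrative stand-ins for Vector256<float> and a 16-byte Vector<T>.
struct Vec256f { float e[8]; };
struct Vec128f { float e[4]; };

// "GetLower": take the low 16 bytes of the 32-byte value. In codegen this
// is a no-op when the value is already in a register.
inline Vec128f GetLower(const Vec256f& v) {
    Vec128f r;
    std::memcpy(&r, &v, sizeof(Vec128f)); // low 128 bits only
    return r;
}
```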
}

case NI_Vector256_AsVector:
case NI_Vector256_AsVector256:
AsVector256 (which is Vector<T> to Vector256<T>) is similar.

If we don't support AVX, we just go to the software fallback.

If we support AVX2 (meaning Vector<T> is 32 bytes), then it is a nop.

If we only support AVX, we just set the upper 128 bits to zero via ToVector256 (likewise, we probably want an Unsafe API so users can get non-deterministic bits, as we've done for other APIs).
> users can get non-deterministic bits

What does that mean? Who needs non-deterministic bits? This isn't a random number generator.
The intent is to provide an API which provides fast conversion and requires the user to manage the upper bits themselves, similar to how we have both ToVector256 and ToVector256Unsafe: the former explicitly zeros the upper 128 bits, and the latter is an effective nop that just allows access to the upper 128 bits of the register.
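The two shapes can be contrasted with a small C++ sketch (illustrative types; the 99.0f sentinel stands in for the nondeterministic upper bits the Unsafe variant would expose):

```cpp
#include <cassert>

// Illustrative stand-ins for Vector128<float> and Vector256<float>.
struct Vec128f { float e[4]; };
struct Vec256f { float e[8]; };

// ToVector256: the upper 128 bits are deterministically zero.
inline Vec256f ToVector256(Vec128f v) {
    Vec256f r{};
    for (int i = 0; i < 4; ++i) r.e[i] = v.e[i];
    return r;
}

// ToVector256Unsafe: only the low lanes are defined; the sentinel models
// whatever bits happened to be in the upper half of the register.
inline Vec256f ToVector256Unsafe(Vec128f v) {
    Vec256f r{{0.0f, 0.0f, 0.0f, 0.0f, 99.0f, 99.0f, 99.0f, 99.0f}};
    for (int i = 0; i < 4; ++i) r.e[i] = v.e[i];
    return r;
}
```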
Force-pushed from e0cda97 to 4dc66de.
CC. @dotnet/jit-contrib
Force-pushed from f4455fb to 861a68f.
Just rebased onto current head. Working on resolving feedback given so far.
The current changes result in the following diff (slimmed down to just the modified tests; there were no changes to other tests or framework/runtime assemblies). The diffs are essentially all cases like the following:

- vmovaps qword ptr [rsp+A0H], xmm6
- xor rax, rax
- mov qword ptr [rsp+80H], rax
- mov qword ptr [rsp+88H], rax
- mov qword ptr [rsp+90H], rax
- mov qword ptr [rsp+98H], rax
+ vmovaps qword ptr [rsp+70H], xmm6
+ vmovaps qword ptr [rsp+60H], xmm7
+ vextractf128 ymm7, ymm6, 1
call VectorAs__AsVectorUInt64:ValidateResult(Vector`1,Vector128`1,String):this
- vmovupd ymm0, ymmword ptr[rsp+80H]
- vmovupd ymmword ptr[rsp+60H], ymm0
- vmovupd xmm6, xmmword ptr [rsp+60H]
- vmovapd xmmword ptr [rsp+50H], xmm6
- vmovupd ymm0, ymmword ptr[rsp+80H]
- vmovupd ymmword ptr[rsp+20H], ymm0
+ vinsertf128 ymm6, ymm6, ymm7, 1
+ vmovaps ymm0, ymm6
+ vmovapd xmmword ptr [rsp+20H], xmm0
+ vmovupd ymmword ptr[rsp+30H], ymm6
Force-pushed from 7cf5f92 to ff37586.
This should be ready for review now. I did not treat the currently problematic cases (Vector2/3 to/from Vector128) as intrinsic for now. The other cases (Vector to/from Vector128/256 and Vector4 to/from Vector128) are handled as intrinsics.
I think it's best to avoid the unreached asserts here, but I see that they're all over the place, even in places where there might be a reasonable default action. I'll submit a separate issue to review this.
{
    // TYP_SIMD8 and TYP_SIMD12 currently only expose "safe" versions
    // which zero the upper elements and so are implemented in managed.
    unreached();
This will expand to a noway_assert, I believe. I think it's best to avoid these where there's a valid default action to take, especially in a non-optimizing phase, since it will cause the compile to be re-tried with MinOpts. I think the better option for these cases is to have an assert such as assert(!"Unexpected intrinsic in impBaseIntrinsic") followed by a break, which I believe will cause it to return nullptr as for an unsupported or unrecognized intrinsic.
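The suggested importer shape can be sketched as follows. This is a hypothetical simplification, not the real RyuJIT code or types; the point is that the debug assert is followed by a break that falls through to returning nullptr, rather than an unreached()/noway_assert:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the real JIT enums/types.
enum NamedIntrinsic { NI_Vector128_AsVector2, NI_SomethingElse };
struct GenTree {};

// Returning nullptr signals "not recognized as an intrinsic", so the
// importer falls back to treating the call as an ordinary managed call
// instead of failing the compile and retrying under MinOpts.
GenTree* ImportIntrinsicSketch(NamedIntrinsic id) {
    GenTree* retNode = nullptr;
    switch (id) {
        case NI_Vector128_AsVector2:
            assert(!"Unexpected intrinsic in impBaseIntrinsic");
            break; // debug-only diagnostic, then graceful fallback
        default:
            break;
    }
    return retNode;
}
```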
Seems reasonable. I'm going to go ahead and merge the PR as is and then will follow it up with a PR that adjusts the unreached() calls I can find in the hwintrinsic* files.
This updates the Vector128.AsVector* and Vector256.AsVector* methods to be intrinsic so they can be low/zero cost where applicable.