Recognize the Vector128/256.AsVector* methods as intrinsic #1280
case NI_Vector128_AsVector2:
case NI_Vector128_AsVector3:
case NI_Vector128_AsVector4:
AsVector4 is the simplest: we have a TYP_SIMD16 and are going to a TYP_SIMD16.

AsVector2 and AsVector3 should also be simple. We have a TYP_SIMD16 and we want a TYP_SIMD8 or TYP_SIMD12. If they are already in register, there is no change needed. If they are in memory, we have 16 bytes and are correctly aligned, so it should likewise be safe to just access.
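The narrowing described above can be modeled in plain C++. This is only an illustrative sketch, not the actual JIT or BCL code; `Vec2`/`Vec3`/`Vec4` are stand-ins for the real vector types:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical model: a 16-byte, 16-byte-aligned value, as TYP_SIMD16
// locals are. Narrowing to Vector2/Vector3 only shrinks the *view* of the
// storage; no data movement is required.
struct alignas(16) Vec4 { float x, y, z, w; };
struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// "AsVector2": read just the low 8 bytes of the aligned 16-byte storage.
inline Vec2 AsVector2(const Vec4& v) {
    Vec2 r;
    std::memcpy(&r, &v, sizeof(Vec2)); // in-register this would be a no-op
    return r;
}

// "AsVector3": read just the low 12 bytes.
inline Vec3 AsVector3(const Vec4& v) {
    Vec3 r;
    std::memcpy(&r, &v, sizeof(Vec3));
    return r;
}
```

Because the 16-byte source is fully owned and aligned, reading only its prefix is always in bounds, which is why the memory case is as safe as the register case.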
> If they are already in register, there is no change needed.

Right, but we should ensure that they get the right class layout.
> If they are already in register, there is no change needed

I'm not so sure. Some SIMD8/12 operations intentionally zero out the upper unused elements (SIMDIntrinsicDiv and maybe others). Either that zeroing is unnecessary or AsVector2/3 should also zero out.
I believe that's just to avoid the penalty around x / 0, though, rather than a functional requirement.

We may want to have both options: a "simple, but correct" API, that is, one which ensures that the upper bits are correctly zeroed, and a "fast, but unsafe" API, that is, one which leaves clearing/setting the upper bits to the user; that way the conversion is as cheap as possible (generally free).
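The contrast between the two shapes can be sketched in C++. The names and types here are illustrative only (not the real System.Runtime.Intrinsics surface), and the sentinel value models garbage bits:

```cpp
#include <cassert>

// Illustrative stand-ins for Vector2 and Vector128<float>.
struct Vec2 { float x, y; };
struct Vec128f { float e[4]; };

// "Simple, but correct": upper elements are deterministically zeroed.
inline Vec128f WidenSafe(Vec2 v) {
    return Vec128f{{v.x, v.y, 0.0f, 0.0f}};
}

// "Fast, but unsafe": upper elements are unspecified. The 99.0f sentinel
// models the leftover register bits a no-op conversion would expose.
inline Vec128f WidenUnsafe(Vec2 v) {
    return Vec128f{{v.x, v.y, 99.0f, 99.0f}};
}
```

With the safe shape the consumer can rely on the upper lanes; with the unsafe shape only the low lanes are meaningful and the conversion can compile to nothing.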
@CarolEidt, do you have an opinion here?

When converting Vector128<float> to Vector2/3, should I explicitly zero the untracked z/w fields, or should that responsibility be left up to the consumer if they feel it is important? As best as I can tell, it doesn't matter from a functional perspective, and at worst it could cause a minor perf hit if those fields end up containing a subnormal value or NaN.
No, dpps allows you to specify which elements are consumed and which are ignored as part of the computation.
Looks like we might need to update the codegen for Vector2.Dot, however. It currently generates the same code as Vector4, using 0xF1 as the control word, and it should be 0x3F or 0x31. Vector3.Dot already correctly uses 0x71.
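To make the control words concrete, here is a scalar C++ emulation of the dpps immediate's semantics (an illustration, not codegen output). The high nibble masks which source lanes enter the dot product; the low nibble masks which destination lanes receive it, which is why 0x31 and 0x3F compute the same Vector2 result in lane 0 while 0xF1 sums all four lanes:

```cpp
#include <cassert>

// Scalar emulation of the SSE4.1 dpps control byte, returning lane 0 of
// the result: high nibble = input lane mask, low nibble = output lane mask.
inline float DppsLane0(const float a[4], const float b[4], unsigned imm) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm & (0x10u << i))       // high nibble: does lane i participate?
            sum += a[i] * b[i];
    return (imm & 0x1u) ? sum : 0.0f; // low nibble, bit 0: is lane 0 written?
}
```

So a Vector2 dot emitted with 0xF1 silently folds the garbage z/w lanes into the result, which is the bug described above.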
Yeah, just saw that. The SSE2 fallback for Vector2 is also using the Vector4 code, from what I can tell.
Sounds like some room for improvement there then. I'll take a look at a few other cases and will log bugs to start fixing them.
I reverted the change treating AsVector2/AsVector3 as intrinsic for right now, since it requires some fixes to the SIMD code and since it looks like it requires a few fixes elsewhere in the JIT as well.
break;
}

case NI_Vector128_AsVector128:
AsVector128 is a bit more complex.

For Vector4 to Vector128 it should be simple, as it is TYP_SIMD16 to TYP_SIMD16.

The other two (which are currently unhandled) I'd like some input on how to handle. If they are in register, it can be a nop. Likewise, if they were spilled to the stack, they can be a nop. However, if it is a field or an array element, we have to read 8 or 12 bytes exactly (I probably need to propose Unsafe versions so we don't have to zero-init the upper element as well).

Should we carry the NI_Vector128_AsVector128 through to codegen and decide what to do there (nop vs. actual read, depending on location), or is there something clever we can do earlier in the compilation?
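The conservative field/array-element path can be sketched as follows. The types are illustrative stand-ins, not BCL code; the point is that exactly 12 bytes are read and the upper lane is zero-initialized:

```cpp
#include <cassert>
#include <cstring>

// Illustrative stand-ins for Vector3 and Vector128<float>.
struct Vec3 { float x, y, z; };
struct Vec128f { float e[4]; };

// When a Vector3 lives in a field or array element, only 12 bytes may
// legally be read; the upper lane is filled by zero-initializing first.
inline Vec128f AsVector128(const Vec3& v) {
    Vec128f r{};                       // zero-init; an Unsafe variant could skip this
    std::memcpy(&r, &v, sizeof(Vec3)); // read exactly 12 bytes, never 16
    return r;
}
```

Only when the JIT can prove the value is in a register (or a full 16-byte spill slot) can the copy and the zeroing both disappear.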
CC. @CarolEidt for input here
It would be nice to be able to handle this in a "generic" way - that is, using a CAST from one type to another that does the appropriate widening. However, I don't think the current handling of casts is "generic" in a good way. So, I'd probably vote for carrying the conversion through to Lowering and codegen.
> that is, using a CAST from one type to another that does the appropriate widening.

Such a cast shouldn't be needed, much like we don't need a cast from short to int after a short indir. A SIMD12 indir really produces a SIMD16 value.
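The short-to-int analogy, restated in C++ terms (a sketch of the idea, not compiler code):

```cpp
#include <cassert>
#include <cstdint>

// A 16-bit load already yields a widened value in one operation (a
// movsx-style load); no separate cast node is needed, just as a SIMD12
// indir naturally fills a full SIMD16 register.
inline int32_t LoadShortWidened(const int16_t* p) {
    return *p; // the load and the sign-extension happen together
}
```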
I'm going to leave this to a future PR for now. Vector2/3 to Vector128<float> is going to be left as a managed implementation which explicitly zeroes the unused upper elements. In particular, I want to get some more testing and samples done for this and don't want to block the other changes.
break;
}

case NI_Vector256_AsVector:
AsVector is also fairly simple.

If we don't support AVX, we just go to the software fallback.

If we do support AVX, but not AVX2 (meaning Vector<T> is 16 bytes), then we just need to GetLower (which is generally a nop, as we are either in register or in a memory location that is correctly aligned and at least 16 bytes).
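The AVX-without-AVX2 case can be sketched in C++ (illustrative types, not the real API):

```cpp
#include <cassert>
#include <cstring>

// Illustrative stand-ins for Vector256<float> and a 16-byte Vector<T>.
struct Vec256f { float e[8]; };
struct Vec128f { float e[4]; };

// "GetLower": take the low 16 bytes of the 32-byte value. In codegen this
// is a no-op when the value is already in a register.
inline Vec128f GetLower(const Vec256f& v) {
    Vec128f r;
    std::memcpy(&r, &v, sizeof(Vec128f)); // low 128 bits only
    return r;
}
```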
}

case NI_Vector256_AsVector:
case NI_Vector256_AsVector256:
AsVector256 (which is Vector<T> to Vector256<T>) is similar.

If we don't support AVX, we just go to the software fallback.

If we support AVX2 (meaning Vector<T> is 32 bytes), then it is a nop.

If we only support AVX, we just set the upper 128 bits to zero via ToVector256 (likewise, we probably want an Unsafe API so users can get non-deterministic bits, as we've done for other APIs).
> users can get non-deterministic bits

What does that mean? Who needs non-deterministic bits? This isn't a random number generator.
The intent is to provide an API which provides fast conversion and requires the user to manage the upper bits themselves, similar to how we have both ToVector256 and ToVector256Unsafe: the former explicitly zeros the upper 128 bits, and the latter is an effective nop that just allows access to the upper 128 bits of the register.
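The two shapes can be contrasted with a small C++ sketch (illustrative types; the 99.0f sentinel stands in for the nondeterministic upper bits the Unsafe variant would expose):

```cpp
#include <cassert>

// Illustrative stand-ins for Vector128<float> and Vector256<float>.
struct Vec128f { float e[4]; };
struct Vec256f { float e[8]; };

// ToVector256: the upper 128 bits are deterministically zero.
inline Vec256f ToVector256(Vec128f v) {
    Vec256f r{};
    for (int i = 0; i < 4; ++i) r.e[i] = v.e[i];
    return r;
}

// ToVector256Unsafe: only the low lanes are defined; the sentinel models
// whatever bits happened to be in the upper half of the register.
inline Vec256f ToVector256Unsafe(Vec128f v) {
    Vec256f r{{0.0f, 0.0f, 0.0f, 0.0f, 99.0f, 99.0f, 99.0f, 99.0f}};
    for (int i = 0; i < 4; ++i) r.e[i] = v.e[i];
    return r;
}
```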
Force-pushed from e0cda97 to 4dc66de.
CC. @dotnet/jit-contrib
Force-pushed from f4455fb to 861a68f.
Just rebased onto current head. Working on resolving feedback given so far.
The current changes result in the following diff (slimmed down to just the modified tests; there were no changes to other tests or framework/runtime assemblies). The diffs are essentially all cases like the following:

- vmovaps qword ptr [rsp+A0H], xmm6
- xor rax, rax
- mov qword ptr [rsp+80H], rax
- mov qword ptr [rsp+88H], rax
- mov qword ptr [rsp+90H], rax
- mov qword ptr [rsp+98H], rax
+ vmovaps qword ptr [rsp+70H], xmm6
+ vmovaps qword ptr [rsp+60H], xmm7
+ vextractf128 ymm7, ymm6, 1
call VectorAs__AsVectorUInt64:ValidateResult(Vector`1,Vector128`1,String):this
- vmovupd ymm0, ymmword ptr[rsp+80H]
- vmovupd ymmword ptr[rsp+60H], ymm0
- vmovupd xmm6, xmmword ptr [rsp+60H]
- vmovapd xmmword ptr [rsp+50H], xmm6
- vmovupd ymm0, ymmword ptr[rsp+80H]
- vmovupd ymmword ptr[rsp+20H], ymm0
+ vinsertf128 ymm6, ymm6, ymm7, 1
+ vmovaps ymm0, ymm6
+ vmovapd xmmword ptr [rsp+20H], xmm0
+ vmovupd ymmword ptr[rsp+30H], ymm6
Force-pushed from 7cf5f92 to ff37586.
This should be ready for review now. I did not treat the currently problematic cases (Vector2/3 to/from Vector128) as intrinsic for now. The other cases (Vector to/from Vector128/256 and Vector4 to/from Vector128) are handled as intrinsics.
I think it's best to avoid the unreached asserts here, but I see that they're all over the place, even in places where there might be a reasonable default action. I'll submit a separate issue to review this.
{
    // TYP_SIMD8 and TYP_SIMD12 currently only expose "safe" versions
    // which zero the upper elements and so are implemented in managed.
    unreached();
This will expand to a noway_assert, I believe. I think it's best to avoid these where there's a valid default action to take, especially in a non-optimizing phase, since it will cause the compile to be re-tried with MinOpts. I think the better option for these cases is to have an assert such as assert(!"Unexpected intrinsic in impBaseIntrinsic") followed by a break, which I believe will cause it to return nullptr as for an unsupported or unrecognized intrinsic.
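The suggested importer shape can be sketched as follows. This is a hypothetical simplification, not the real RyuJIT code or types; the point is that the debug assert is followed by a break that falls through to returning nullptr, rather than an unreached()/noway_assert:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the real JIT enums/types.
enum NamedIntrinsic { NI_Vector128_AsVector2, NI_SomethingElse };
struct GenTree {};

// Returning nullptr signals "not recognized as an intrinsic", so the
// importer falls back to treating the call as an ordinary managed call
// instead of failing the compile and retrying under MinOpts.
GenTree* ImportIntrinsicSketch(NamedIntrinsic id) {
    GenTree* retNode = nullptr;
    switch (id) {
        case NI_Vector128_AsVector2:
            assert(!"Unexpected intrinsic in impBaseIntrinsic");
            break; // debug-only diagnostic, then graceful fallback
        default:
            break;
    }
    return retNode;
}
```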
Seems reasonable. I'm going to go ahead and merge the PR as is and then will follow it up with a PR that adjusts the unreached() calls I can find in the hwintrinsic* files.
This updates the Vector128.AsVector* and Vector256.AsVector* methods to be intrinsic so they can be low/zero cost where applicable.