-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure Vector<T>.op_Multiply is handled as an intrinsic in appropriate cases #49503
Conversation
CC. @echesakovMSFT |
CC. @dotnet/jit-contrib |
GenTree** broadcastOp = nullptr; | ||
|
||
if (varTypeIsArithmetic(op1->TypeGet())) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Arm64 you don't neet broadcast operation when either op1
or op2
are floating-point.
Since the operands will be in SIMD registers already you should use MultiplyByElement
(where elementIndex
is 0
) instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't think you need to broadcast an integer value to the whole SIMD register.
You can use ins Vd.T[0], Xd
followed by mul Vd, Vn, Vd.T[0]
instead.
In this case, you can share some of the implementation for floating-point and integer types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should need a CreateScalarUnsafe
in this case instead then. I'll update to do the right thing.
- Notably this wasn't possible before the ARM HWIntrinsics were brought online, which is likely why Vector wasn't doing that already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notably, MultiplyByScalar
and MultiplyBySelectedScalar
don't work for byte
/sbyte
and so those still need a broadcast.
case TYP_INT: | ||
case TYP_UINT: | ||
{ | ||
hwIntrinsic = NI_AVX2_MultiplyLow; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouildn't this be guarded with compOpportunisticallyDependsOn(InstructionSet_AVX2)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, its already asserted at the top of the method (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L392) as well as properly guarded further up in the stack.
We should never encounter SimdAsHWIntrinsicClassId::VectorT256
if AVX2 is not supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why at least AVX2 is required for VectorT256
? Shouldn't AVX be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector<T>
supports all 10 primitive types and can only properly accelerate the integer types with AVX2 so it was decided it is only 32-bytes if AVX2 is supported and is 16-bytes otherwise, for the best overall perf/experience.
47e1038
to
a79a1cb
Compare
@echesakovMSFT, any other feedback here or is it good to merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good.
Thank you
This resolves #49071 and resolves #30923 by ensuring
Vector<T> * T
is treated as an intrinsic so it generates the ideal codegen and not a for loop:.NET 5
Disassembly .NET 5.0.4 (5.0.421.11614), X64 RyuJIT
.NET 6
Disassembly .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
Notably it is still slightly slower. This appears to be largely due to additional
movaps
inserted before thebroadcast
instructions:This appears to be because we insert
Avx2.BroadcastScalarToVector256(Vector128.CreateScalarUnsafe(value))
during lowering. The register allocator then assigns a different register forCreateScalarUnsafe
even though its marked astgtPrefUse
(this appears to be related to this happening inside a loop).I'm investigating further to see if there is a trivial fix here (other than expanding
Vector256_Create
earlier, such as in the importer).