JIT: Faster vector == Vector128.Zero on arm64 #65632
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details
Closes #63829
static bool IsZero(Vector128<int> vec) => vec == Vector128<int>.Zero;
Codegen diff:
; Assembly listing for method IsZero(System.Runtime.Intrinsics.Vector128`1[Int32]):bool
stp fp, lr, [sp,#-16]!
mov fp, sp
- cmeq v16.4s, v0.4s, #0
- uminv b16, v16.16b
+ umaxv b16, v0.16b
umov w0, v16.b[0]
cmp w0, #0
- cset x0, ne
+ cset x0, eq
ldp fp, lr, [sp],#16
ret lr
-; Total bytes of code 36
+; Total bytes of code 32
This is needed for faster IndexOf from #63285. Also, #65288 relies on it.
Perf_Regex_Industry_RustLang_Sherlock Benchmark:
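The codegen diff above can be read as replacing "compare each lane with zero, then take the minimum across bytes" with "take the maximum across bytes and compare it with zero". The following is a minimal scalar sketch in portable C++ that models both sequences to show they give the same answer. It is not the JIT's code: `Vec128`, `is_zero_old`, `is_zero_new`, and `check_equivalence` are hypothetical names introduced only for illustration, and the loops stand in for the `cmeq`, `uminv`, and `umaxv` instructions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 128-bit vector register, viewed as 16 bytes.
using Vec128 = std::array<std::uint8_t, 16>;

// Models the old sequence: cmeq v16.4s, v0.4s, #0 ; uminv b16, v16.16b ; cset ne
bool is_zero_old(const Vec128& v)
{
    Vec128 mask{};
    for (int lane = 0; lane < 4; ++lane)
    {
        std::uint32_t x;
        std::memcpy(&x, v.data() + lane * 4, sizeof x);
        // cmeq writes all-ones into a lane that equals zero, all-zeros otherwise.
        std::memset(mask.data() + lane * 4, x == 0 ? 0xFF : 0x00, 4);
    }
    // uminv: the minimum mask byte is non-zero only if every lane compared equal.
    return *std::min_element(mask.begin(), mask.end()) != 0;
}

// Models the new sequence: umaxv b16, v0.16b ; cset eq
bool is_zero_new(const Vec128& v)
{
    // The largest byte is zero exactly when every byte is zero.
    return *std::max_element(v.begin(), v.end()) == 0;
}

// Small self-check: both models agree on an all-zero vector and on a vector
// with a single non-zero byte.
void check_equivalence()
{
    Vec128 v{};
    assert(is_zero_old(v) && is_zero_new(v));
    v[5] = 1;
    assert(!is_zero_old(v) && !is_zero_new(v));
}
```

The new form needs no per-lane compare, which matches the one-instruction saving (36 to 32 bytes) shown in the diff.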
PTAL @echesakovMSFT @TIHan
cc @vargaz @fanyang-mono the Test I added in this PR crashes on
This will fix it:
Looks good! Only a few comments.
I had created an issue to clean up the code for type checks of vector elements. Haven't gotten to it yet. (#65318)
Left some comments
src/coreclr/jit/lowerarmarch.cpp
if (!varTypeIsFloating(simdBaseType) && (op != nullptr))
{
    GenTree* cmp =
        comp->gtNewSimdHWIntrinsicNode(simdType, op, NI_AdvSimd_Arm64_MaxAcross, CORINFO_TYPE_UBYTE, simdSize);
According to the Arm® Cortex®-A76 Software Optimization Guide, UMAXV, 16B has Exec latency 6 and Execution throughput 1/2, while UMAXV, 4H/4S has Exec latency 3 and Execution throughput 1.
Do we want CORINFO_TYPE_USHORT/CORINFO_TYPE_UINT as a base type instead?
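The suggestion works because the zero test does not depend on lane width: the maximum across 16 byte lanes is zero exactly when the maximum across four 32-bit lanes of the same bits is zero. A minimal scalar sketch of that property in portable C++ follows, assuming a plain byte array as the vector; `Bytes16`, `max_across_u8`, `max_across_u32`, `zero_tests_agree`, and `check_lane_widths` are hypothetical names for illustration and are not part of the JIT.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the 128-bit operand, viewed as 16 bytes.
using Bytes16 = std::array<std::uint8_t, 16>;

// Models a max-across reduction over 16 byte lanes (the CORINFO_TYPE_UBYTE form).
std::uint8_t max_across_u8(const Bytes16& v)
{
    return *std::max_element(v.begin(), v.end());
}

// Models a max-across reduction over four 32-bit lanes (the CORINFO_TYPE_UINT form).
std::uint32_t max_across_u32(const Bytes16& v)
{
    std::array<std::uint32_t, 4> lanes;
    std::memcpy(lanes.data(), v.data(), sizeof lanes);
    return *std::max_element(lanes.begin(), lanes.end());
}

// True when both lane widths give the same answer to "is the vector all zero?".
bool zero_tests_agree(const Bytes16& v)
{
    return (max_across_u8(v) == 0) == (max_across_u32(v) == 0);
}

// Small self-check on an all-zero vector and on one with a single set byte.
void check_lane_widths()
{
    Bytes16 v{};
    assert(zero_tests_agree(v) && max_across_u32(v) == 0);
    v[9] = 0x40;
    assert(zero_tests_agree(v) && max_across_u32(v) != 0);
}
```

Only the comparison against zero is preserved across lane widths; the reduced values themselves differ, so this substitution is safe for the equality test discussed here and not for a general maximum.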
Improvements in dotnet/perf-autofiling-issues#3833 and dotnet/perf-autofiling-issues#3829
wow, it's more than I expected