JIT: Faster vector == Vector128.Zero on arm64 #65632

EgorBo · 2022-02-20T13:52:48Z

static bool IsZero(Vector128<int> vec) => vec == Vector128<int>.Zero;

Codegen diff:

; Assembly listing for method IsZero(System.Runtime.Intrinsics.Vector128`1[Int32]):bool
    stp     fp, lr, [sp,#-16]!
    mov     fp, sp
-   cmeq    v16.4s, v0.4s, #0
-   uminv   b16, v16.16b
-   umov    w0, v16.b[0]
+   umaxv   b16, v0.16b
+   umov    w0, v16.s[0]
    cmp     w0, #0
-   cset    x0, ne
+   cset    x0, eq
    ldp     fp, lr, [sp],#16
    ret     lr
-; Total bytes of code 36
+; Total bytes of code 32
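
The diff above swaps a compare-then-reduce sequence for a single reduction. As a rough illustration of why the two agree for integer lanes, here is a portable scalar model in C++. It is a sketch, not the JIT's code: `Lanes4`, `is_zero_old`, `is_zero_new`, and `check_sequences_agree` are hypothetical names introduced only for this example, and the loops merely imitate what `cmeq`, `uminv`, and `umaxv` compute.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for Vector128<int>: four 32-bit lanes.
using Lanes4 = std::array<std::uint32_t, 4>;

// Models the old sequence: cmeq builds an all-ones mask for every lane that
// equals zero, uminv takes the minimum over the 16 mask bytes, cset ne.
bool is_zero_old(const Lanes4& v) {
    std::array<std::uint8_t, 16> mask{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        const std::uint8_t m = (v[lane] == 0) ? 0xFF : 0x00;
        for (std::size_t b = 0; b < 4; ++b) mask[lane * 4 + b] = m;
    }
    const std::uint8_t min_byte = *std::min_element(mask.begin(), mask.end());
    return min_byte != 0; // every lane produced an all-ones mask
}

// Models the new sequence: umaxv takes the unsigned maximum over the 16 raw
// bytes of the input, cset eq. The maximum is zero only if every byte is zero.
bool is_zero_new(const Lanes4& v) {
    std::uint8_t max_byte = 0;
    for (std::uint32_t lane : v) {
        for (int shift = 0; shift < 32; shift += 8) {
            const auto byte = static_cast<std::uint8_t>((lane >> shift) & 0xFFu);
            max_byte = std::max(max_byte, byte);
        }
    }
    return max_byte == 0;
}

// Spot-checks that both models give the same answer on a few inputs.
inline void check_sequences_agree() {
    const Lanes4 samples[] = {
        Lanes4{0, 0, 0, 0},
        Lanes4{0, 0, 0, 1},
        Lanes4{0x80000000u, 0, 0, 0},
        Lanes4{0, 0x00010000u, 0, 0},
        Lanes4{1, 2, 3, 4},
    };
    for (const Lanes4& v : samples) assert(is_zero_old(v) == is_zero_new(v));
}
```

Calling `check_sequences_agree()` from a `main` exercises both models. The new form needs no mask because "all bits are zero" can be tested directly on the input.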

This is needed for faster IndexOf from #63285. Also, #65288 relies on it.

Perf_Regex_Industry_RustLang_Sherlock Benchmark:

Method	Toolchain	Pattern	Mean	Error	StdDev	Ratio
Count	/Core_Root_PR/corerun	Sherlock Holmes	68.51 us	0.331 us	0.294 us	1.00
Count	/Core_Root_base/corerun	Sherlock Holmes	72.90 us	1.352 us	1.265 us	1.06

Count	/Core_Root_PR/corerun	sherlock	56.52 us	0.226 us	0.200 us	1.00
Count	/Core_Root_base/corerun	sherlock	59.63 us	0.211 us	0.198 us	1.06

Count	/Core_Root_PR/corerun	zqj	54.03 us	0.257 us	0.241 us	1.00
Count	/Core_Root_base/corerun	zqj	57.08 us	0.188 us	0.176 us	1.06

Diffs

ghost · 2022-02-20T13:53:00Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #63829

static bool IsZero(Vector128<int> vec) => vec == Vector128<int>.Zero;

Codegen diff:

; Assembly listing for method IsZero(System.Runtime.Intrinsics.Vector128`1[Int32]):bool
    stp     fp, lr, [sp,#-16]!
    mov     fp, sp
-   cmeq    v16.4s, v0.4s, #0
-   uminv   b16, v16.16b
+   umaxv   b16, v0.16b
    umov    w0, v16.b[0]
    cmp     w0, #0
-   cset    x0, ne
+   cset    x0, eq
    ldp     fp, lr, [sp],#16
    ret     lr
-; Total bytes of code 36
+; Total bytes of code 32


Author:	EgorBo
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

EgorBo · 2022-02-20T18:01:29Z

PTAL @echesakovMSFT @TIHan

EgorBo · 2022-02-20T18:24:03Z

cc @vargaz @fanyang-mono: the test I added in this PR crashes on the "Mono llvmaot Pri0 Runtime Tests Run Linux x64 release" leg:

aot-compile: compiling /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/General/HwiOp/CompareVectorWithZero/CompareVectorWithZero.dll; MONO_PATH: /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/General/HwiOp/CompareVectorWithZero:/__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root
2022-02-20T17:19:30.2982792Z   Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/General/HwiOp/CompareVectorWithZero/CompareVectorWithZero.dll
2022-02-20T17:19:30.2986165Z   AOTID A5710361-7321-642D-C172-A16090734D6F
2022-02-20T17:19:30.2988419Z   * Assertion: should not be reached at /__w/1/s/src/mono/mono/mini/simd-intrinsics.c:344
2022-02-20T17:19:30.2990056Z   
2022-02-20T17:19:30.3025150Z   
2022-02-20T17:19:30.3026584Z   =================================================================
2022-02-20T17:19:30.3061763Z   	Native Crash Reporting
2022-02-20T17:19:30.3083915Z   =================================================================
2022-02-20T17:19:30.3104269Z   Got a SIGABRT while executing native code. This usually indicates
2022-02-20T17:19:30.3910283Z   a fatal error in the mono runtime or one of the native libraries 
2022-02-20T17:19:30.3912329Z   
2022-02-20T17:19:30.3913499Z   =================================================================
2022-02-20T17:19:30.3914698Z   	External Debugger Dump:
2022-02-20T17:19:30.3916100Z   =================================================================
2022-02-20T17:19:30.4109204Z   used by your application.
2022-02-20T17:19:30.4111015Z   =================================================================
2022-02-20T17:19:30.4112230Z   
2022-02-20T17:19:30.4113797Z   =================================================================
2022-02-20T17:19:30.4115082Z   	Native stacktrace:
2022-02-20T17:19:30.4116339Z   =================================================================
2022-02-20T17:19:30.4118693Z   	0x7ff0567b89c2 - Unknown
2022-02-20T17:19:30.4120408Z   	0x7ff05675936e - Unknown
2022-02-20T17:19:30.4122083Z   	0x7ff0567b8298 - Unknown
2022-02-20T17:19:30.4138832Z   	0x7ff058a31630 - Unknown
2022-02-20T17:19:30.4140970Z   	0x7ff057e6a387 - Unknown
2022-02-20T17:19:30.4142772Z   	0x7ff057e6ba78 - Unknown
2022-02-20T17:19:30.4144476Z   	0x7ff056834875 - Unknown
2022-02-20T17:19:30.4146183Z   	0x7ff056646a33 - Unknown
2022-02-20T17:19:30.4147862Z   	0x7ff056834cdd - Unknown
2022-02-20T17:19:30.4149610Z   	0x7ff056834e45 - Unknown
2022-02-20T17:19:30.4151197Z   	0x7ff056834ea4 - Unknown
2022-02-20T17:19:30.4152791Z   	0x7ff05676c466 - Unknown
2022-02-20T17:19:30.4154346Z   	0x7ff05678bda3 - Unknown
2022-02-20T17:19:30.4156012Z   	0x7ff0566ddace - Unknown
2022-02-20T17:19:30.4159329Z   	0x7ff0566afec1 - Unknown
2022-02-20T17:19:30.4161158Z   	0x7ff056736e39 - Unknown
2022-02-20T17:19:30.4162645Z   	0x7ff056727b72 - Unknown
2022-02-20T17:19:30.4164267Z   	0x7ff0567195d2 - Unknown
2022-02-20T17:19:30.4165995Z   	0x7ff05679480e - Unknown
2022-02-20T17:19:30.4167524Z   	0x55dcb433a5aa - Unknown
2022-02-20T17:19:30.4169123Z   	0x7ff057e56555 - Unknown
2022-02-20T17:19:30.4170801Z   	0x55dcb4338029 - Unknown
2022-02-20T17:19:30.6293373Z   [New LWP 10203]
2022-02-20T17:19:30.6295390Z   [Thread debugging using libthread_db enabled]
2022-02-20T17:19:30.6296826Z   Using host libthread_db library "/lib64/libthread_db.so.1".
2022-02-20T17:19:31.2218749Z EXEC : warning : the debug information found in "/__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so.dbg" does not match "/__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so" (CRC mismatch). [/__w/1/s/src/mono/msbuild/aot-compile.proj]
2022-02-20T17:19:31.2222043Z   
2022-02-20T17:19:31.2239353Z   Missing separate debuginfo for /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2242226Z   Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/92/5d8bab89f15c6990ba944db4dfd44d746cfdb8.debug
2022-02-20T17:19:31.2395130Z   0x00007ff058a311d9 in waitpid () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2412985Z     Id   Target Id         Frame 
2022-02-20T17:19:31.2414838Z     2    Thread 0x7ff054fff700 (LWP 10203) "SGen worker" 0x00007ff058a2da35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2432285Z   * 1    Thread 0x7ff059056740 (LWP 10202) "corerun" 0x00007ff058a311d9 in waitpid () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2579202Z   
2022-02-20T17:19:31.2590000Z   Thread 2 (Thread 0x7ff054fff700 (LWP 10203)):
2022-02-20T17:19:31.2591755Z   #0  0x00007ff058a2da35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2593635Z   #1  0x00007ff05669f7f3 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2609971Z   #2  0x00007ff058a29ea5 in start_thread () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2611736Z   #3  0x00007ff057f329fd in clone () from /lib64/libc.so.6
2022-02-20T17:19:31.2623338Z   
2022-02-20T17:19:31.2633759Z   Thread 1 (Thread 0x7ff059056740 (LWP 10202)):
2022-02-20T17:19:31.2635970Z   #0  0x00007ff058a311d9 in waitpid () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2637754Z   #1  0x00007ff0567b8b07 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2649895Z   #2  0x00007ff05675936e in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2659790Z   #3  0x00007ff0567b8298 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2661758Z   #4  <signal handler called>
2022-02-20T17:19:31.2663461Z   #5  0x00007ff057e6a387 in raise () from /lib64/libc.so.6
2022-02-20T17:19:31.2664790Z   #6  0x00007ff057e6ba78 in abort () from /lib64/libc.so.6
2022-02-20T17:19:31.2666396Z   #7  0x00007ff056834875 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2678513Z   #8  0x00007ff056646a33 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2685693Z   #9  0x00007ff056834cdd in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2698988Z   #10 0x00007ff056834e45 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2714534Z   #11 0x00007ff056834ea4 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2728485Z   #12 0x00007ff05676c466 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2742435Z   #13 0x00007ff05678bda3 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2755997Z   #14 0x00007ff0566ddace in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2769258Z   #15 0x00007ff0566afec1 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2783243Z   #16 0x00007ff056736e39 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2796491Z   #17 0x00007ff056727b72 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2809466Z   #18 0x00007ff0567195d2 in mono_main () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2823406Z   #19 0x00007ff05679480e in monovm_execute_assembly () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.3348964Z   #20 0x000055dcb433a5aa in run (config=...) at /__w/1/s/src/coreclr/hosts/corerun/corerun.cpp:368
2022-02-20T17:19:31.3363557Z   #21 main (argc=<optimized out>, argv=<optimized out>) at /__w/1/s/src/coreclr/hosts/corerun/corerun.cpp:563
2022-02-20T17:19:31.3425311Z   [Inferior 1 (process 10202) detached]
2022-02-20T17:19:31.3507046Z   
2022-02-20T17:19:31.3508333Z   =================================================================
2022-02-20T17:19:31.3509529Z   	Basic Fault Address Reporting
2022-02-20T17:19:31.3510661Z   =================================================================
2022-02-20T17:19:31.3512250Z   Memory around native instruction pointer (0x7ff057e6a387):0x7ff057e6a377  48 63 d7 48 63 f6 48 63 f9 b8 ea 00 00 00 0f 05  Hc.Hc.Hc........
2022-02-20T17:19:31.3513887Z   0x7ff057e6a387  48 3d 00 f0 ff ff 77 1e f3 c3 0f 1f 80 00 00 00  H=....w.........
2022-02-20T17:19:31.3515279Z   0x7ff057e6a397  00 85 c9 7f db 89 c8 f7 d8 81 e1 ff ff ff 7f 0f  ................
2022-02-20T17:19:31.3519503Z   0x7ff057e6a3a7  44 c6 89 c1 eb ca 48 8b 15 9c 0a 39 00 f7 d8 64  D.....H....9...d
2022-02-20T17:19:31.9097999Z   Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/Arm/Rdm/Rdm_ro/Rdm_ro.dll
2022-02-20T17:19:31.9101723Z   AOTID 496F54B8-0F3D-AA06-C32F-9EB82E567870
2022-02-20T17:19:31.9104767Z   Executing opt: "/__w/1/s/artifacts/bin/mono/Linux.x64.Release/opt" -f -O2 -disable-tail-calls -place-safepoints -spp-all-backedges -mattr=sse4.2,popcnt,lzcnt,bmi,bmi2,pclmul,aes -o "mono_aot_6QfcWC/temp.opt.bc" "mono_aot_6QfcWC/temp.bc"
2022-02-20T17:19:31.9109761Z   Executing llc: "/__w/1/s/artifacts/bin/mono/Linux.x64.Release/llc"  -march=x86-64 -mcpu=generic -enable-implicit-null-checks -disable-fault-maps -asm-verbose=false -disable-gnu-eh-frame -enable-mono-eh-frame -mono-eh-frame-symbol=mono_aot_Rdm_ro_eh_frame -disable-tail-calls -no-x86-call-frame-opt -relocation-model=pic -filetype=obj -mattr=sse4.2,popcnt,lzcnt,bmi,bmi2,pclmul,aes -o "mono_aot_6QfcWC/temp-llvm.o" "mono_aot_6QfcWC/temp.opt.bc"
2022-02-20T17:19:31.9113118Z   Compiled: 2033/2033
2022-02-20T17:19:31.9115090Z   Executing the native assembler: "as" --64  -o /tmp/mono_aot_Af32J0.o /tmp/mono_aot_Af32J0
2022-02-20T17:19:31.9118024Z   Executing the native linker: "ld" -shared -o /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/Arm/Rdm/Rdm_ro/Rdm_ro.dll.so.tmp "mono_aot_6QfcWC/temp-llvm.o" /tmp/mono_aot_Af32J0.o

vargaz · 2022-02-20T20:09:55Z

This will fix it:
https://gist.github.com/vargaz/78f0b5d0710c5de7a0131b0f9a6ea5d3

src/coreclr/jit/lowerarmarch.cpp

TIHan

Looks good! Only a few comments.

fanyang-mono · 2022-02-22T15:32:42Z

I created an issue (#65318) to clean up the code for type checks of vector elements. I haven't gotten to it yet.

echesakov

Left some comments

src/coreclr/jit/lowerarmarch.cpp

echesakov · 2022-02-22T18:56:41Z

src/coreclr/jit/lowerarmarch.cpp

+    if (!varTypeIsFloating(simdBaseType) && (op != nullptr))
+    {
+        GenTree* cmp =
+            comp->gtNewSimdHWIntrinsicNode(simdType, op, NI_AdvSimd_Arm64_MaxAcross, CORINFO_TYPE_UBYTE, simdSize);
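
The hunk above skips floating-point base types. The PR does not spell out the reason, but a likely one is that a bitwise "all zero" test and an elementwise floating-point comparison against zero disagree on negative zero. A minimal scalar sketch in C++, assuming IEEE 754 single precision; the helper names are hypothetical and exist only for this illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// What an elementwise `lane == 0` comparison means for a float lane.
bool lane_equals_zero_as_float(float x) { return x == 0.0f; }

// What an integer max-across reduction effectively tests: are all bits zero?
bool lane_bits_are_zero(float x) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &x, sizeof bits); // reinterpret without aliasing issues
    return bits == 0;
}

// -0.0f compares equal to zero, yet its sign bit is set, so the two tests differ.
inline void check_negative_zero_mismatch() {
    assert(lane_equals_zero_as_float(0.0f) && lane_bits_are_zero(0.0f));
    assert(lane_equals_zero_as_float(-0.0f) && !lane_bits_are_zero(-0.0f));
}
```

For integer lanes the two tests always coincide, which is why the shortcut is safe there.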


According to the Arm® Cortex®-A76 Software Optimization Guide, UMAXV with the 16B arrangement has an execution latency of 6 and an execution throughput of 1/2, while UMAXV with the 4H/4S arrangements has an execution latency of 3 and an execution throughput of 1.

Do we want CORINFO_TYPE_USHORT/CORINFO_TYPE_UINT as a base type instead?
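
Switching the base type is safe for this check because the answer does not depend on how the 128 bits are split into lanes: the unsigned maximum is zero exactly when every bit is zero. A scalar C++ sketch of that point follows; `Bits128`, `max_across_u8`, and `max_across_u32` are hypothetical names that only imitate the 16B and 4S arrangements of UMAXV.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 128-bit register as raw bytes.
using Bits128 = std::array<std::uint8_t, 16>;

// Imitates UMAXV with the 16B arrangement: maximum over sixteen 8-bit lanes.
std::uint8_t max_across_u8(const Bits128& v) {
    return *std::max_element(v.begin(), v.end());
}

// Imitates UMAXV with the 4S arrangement: maximum over four 32-bit lanes.
std::uint32_t max_across_u32(const Bits128& v) {
    std::uint32_t best = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        std::uint32_t lane = 0;
        std::memcpy(&lane, v.data() + i * 4, sizeof lane);
        best = std::max(best, lane);
    }
    return best;
}

// Both arrangements answer "is the whole register zero?" identically.
inline void check_lane_width_independence() {
    Bits128 zero{};
    assert(max_across_u8(zero) == 0 && max_across_u32(zero) == 0);
    for (std::size_t i = 0; i < zero.size(); ++i) {
        Bits128 one_bit = zero;
        one_bit[i] = 0x01; // a single set bit anywhere in the register
        assert(max_across_u8(one_bit) != 0 && max_across_u32(one_bit) != 0);
    }
}
```

The maxima themselves differ between arrangements; only the comparison against zero is shared, and that comparison is all this lowering needs.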

EgorBo · 2022-02-22

Sure, let's change it. I had this in mind while benchmarking and saw zero difference, but now I see why:

(Apple M1, Firestorm core)

kunalspathak · 2022-03-03T17:26:16Z

Improvements in dotnet/perf-autofiling-issues#3833 and dotnet/perf-autofiling-issues#3829

EgorBo · 2022-03-03T17:56:51Z

wow, it's more than I expected

EgorBo added 2 commits February 20, 2022 14:48

Optimize vec == Vector.Zero for arm64

466afe6

Clean up

3139218

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 20, 2022

ghost assigned EgorBo Feb 20, 2022

EgorBo added 3 commits February 20, 2022 16:54

Update lowerarmarch.cpp

a4143c4

Update lowerarmarch.cpp

f9d45e0

Update lowerarmarch.cpp

51f727f

EgorBo added 2 commits February 21, 2022 15:21

Apply Zoltan's patch

70e18b7

Merge branch 'main' of https://github.com/dotnet/runtime into arm-fas…

1550865

…t-cmp-zero-vec

EgorBo requested review from vargaz, lambdageek, SamMonoRT and imhameed as code owners February 21, 2022 12:21

TIHan reviewed Feb 21, 2022

TIHan reviewed Feb 21, 2022

TIHan approved these changes Feb 21, 2022

Address feedback

c1c7831

echesakov reviewed Feb 22, 2022

EgorBo added 3 commits February 22, 2022 23:41

Merge branch 'main' of https://github.com/dotnet/runtime into arm-fas…

5c1a77e

…t-cmp-zero-vec

Address feedback

59609c4

use UINT for V128

bc9220b

echesakov approved these changes Feb 22, 2022

EgorBo merged commit 3ef6660 into dotnet:main Feb 23, 2022

This was referenced Feb 23, 2022

Arm64: Improve code generation for Vector<T> comparision #31685

Closed

[Perf] Changes at 2/23/2022 8:21:20 AM dotnet/perf-autofiling-issues#3698

Closed

EgorBo deleted the arm-fast-cmp-zero-vec branch March 3, 2022 17:56

JulieLeeMSFT mentioned this pull request Apr 1, 2022

What's new in .NET 7 Preview 3 [WIP] dotnet/core#7108

Closed

ghost locked as resolved and limited conversation to collaborators Apr 2, 2022
