JIT: Faster vector == Vector128.Zero on arm64 #65632

Merged (11 commits) on Feb 23, 2022

Conversation

EgorBo
Member

@EgorBo EgorBo commented Feb 20, 2022

Closes #63829

static bool IsZero(Vector128<int> vec) => vec == Vector128<int>.Zero;

Codegen diff:

; Assembly listing for method IsZero(System.Runtime.Intrinsics.Vector128`1[Int32]):bool
    stp     fp, lr, [sp,#-16]!
    mov     fp, sp
-   cmeq    v16.4s, v0.4s, #0
-   uminv   b16, v16.16b
-   umov    w0, v16.b[0]
+   umaxv   b16, v0.16b
+   umov    w0, v16.s[0]
    cmp     w0, #0
-   cset    x0, ne
+   cset    x0, eq
    ldp     fp, lr, [sp],#16
    ret     lr
-; Total bytes of code 36
+; Total bytes of code 32
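
Why the shorter sequence gives the same answer can be checked with a scalar model. The sketch below is plain C++, not the JIT's code: `Bytes16`, `old_is_zero` and `new_is_zero` are hypothetical names that mimic the semantics of `cmeq` + `uminv` and of `umaxv` on a 16-byte vector, assuming the register is viewed as unsigned bytes.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>

using Bytes16 = std::array<std::uint8_t, 16>;

// Hypothetical model of the old sequence: cmeq compares each 32-bit lane
// with 0 and writes an all-ones or all-zeros lane; uminv over the bytes of
// that mask is non-zero only if every lane matched.
bool old_is_zero(const Bytes16& v) {
    Bytes16 mask{};
    for (int lane = 0; lane < 4; ++lane) {
        std::uint32_t x;
        std::memcpy(&x, v.data() + 4 * lane, sizeof x);
        std::uint32_t m = (x == 0) ? 0xFFFFFFFFu : 0u;
        std::memcpy(mask.data() + 4 * lane, &m, sizeof m);
    }
    std::uint8_t min_byte = *std::min_element(mask.begin(), mask.end());
    return min_byte != 0;  // corresponds to: cmp w0, #0 ; cset ne
}

// Hypothetical model of the new sequence: the maximum over all bytes is 0
// only if every byte, and therefore every lane, is 0.
bool new_is_zero(const Bytes16& v) {
    std::uint8_t max_byte = *std::max_element(v.begin(), v.end());
    return max_byte == 0;  // corresponds to: cmp w0, #0 ; cset eq
}
```

Both helpers agree on every input, which is why the compare against zero can be dropped and the final condition flips from `ne` to `eq`.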

This is needed for faster IndexOf from #63285. Also, #65288 relies on it.

Perf_Regex_Industry_RustLang_Sherlock Benchmark:

| Method | Toolchain | Pattern | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|---|
| Count | /Core_Root_PR/corerun | Sherlock Holmes | 68.51 us | 0.331 us | 0.294 us | 1.00 |
| Count | /Core_Root_base/corerun | Sherlock Holmes | 72.90 us | 1.352 us | 1.265 us | 1.06 |
| Count | /Core_Root_PR/corerun | sherlock | 56.52 us | 0.226 us | 0.200 us | 1.00 |
| Count | /Core_Root_base/corerun | sherlock | 59.63 us | 0.211 us | 0.198 us | 1.06 |
| Count | /Core_Root_PR/corerun | zqj | 54.03 us | 0.257 us | 0.241 us | 1.00 |
| Count | /Core_Root_base/corerun | zqj | 57.08 us | 0.188 us | 0.176 us | 1.06 |
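
As a sanity check, the Ratio column can be approximated from the Mean column alone. The sketch below assumes the ratio is simply the baseline mean divided by the PR mean, rounded to two decimals; the benchmark tool may compute it slightly differently, and `ratio` is a hypothetical helper name.

```cpp
#include <cmath>

// Baseline mean divided by PR mean, rounded to two decimals as in the table.
// Assumption: this is a simplification of how the tool derives the column.
double ratio(double base_us, double pr_us) {
    return std::round(base_us / pr_us * 100.0) / 100.0;
}
```

With the means reported above, `ratio(72.90, 68.51)`, `ratio(59.63, 56.52)` and `ratio(57.08, 54.03)` all land on the 1.06 shown for the baseline rows.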

Diffs

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 20, 2022
@ghost ghost assigned EgorBo Feb 20, 2022
@EgorBo
Member Author

EgorBo commented Feb 20, 2022

PTAL @echesakovMSFT @TIHan

@EgorBo
Member Author

EgorBo commented Feb 20, 2022

cc @vargaz @fanyang-mono: the test I added in this PR crashes in the Mono llvmaot Pri0 Runtime Tests Run (Linux x64 release) job:

aot-compile: compiling /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/General/HwiOp/CompareVectorWithZero/CompareVectorWithZero.dll; MONO_PATH: /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/General/HwiOp/CompareVectorWithZero:/__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root
2022-02-20T17:19:30.2982792Z   Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/General/HwiOp/CompareVectorWithZero/CompareVectorWithZero.dll
2022-02-20T17:19:30.2986165Z   AOTID A5710361-7321-642D-C172-A16090734D6F
2022-02-20T17:19:30.2988419Z   * Assertion: should not be reached at /__w/1/s/src/mono/mono/mini/simd-intrinsics.c:344
2022-02-20T17:19:30.2990056Z   
2022-02-20T17:19:30.3025150Z   
2022-02-20T17:19:30.3026584Z   =================================================================
2022-02-20T17:19:30.3061763Z   	Native Crash Reporting
2022-02-20T17:19:30.3083915Z   =================================================================
2022-02-20T17:19:30.3104269Z   Got a SIGABRT while executing native code. This usually indicates
2022-02-20T17:19:30.3910283Z   a fatal error in the mono runtime or one of the native libraries 
2022-02-20T17:19:30.3912329Z   
2022-02-20T17:19:30.3913499Z   =================================================================
2022-02-20T17:19:30.3914698Z   	External Debugger Dump:
2022-02-20T17:19:30.3916100Z   =================================================================
2022-02-20T17:19:30.4109204Z   used by your application.
2022-02-20T17:19:30.4111015Z   =================================================================
2022-02-20T17:19:30.4112230Z   
2022-02-20T17:19:30.4113797Z   =================================================================
2022-02-20T17:19:30.4115082Z   	Native stacktrace:
2022-02-20T17:19:30.4116339Z   =================================================================
2022-02-20T17:19:30.4118693Z   	0x7ff0567b89c2 - Unknown
2022-02-20T17:19:30.4120408Z   	0x7ff05675936e - Unknown
2022-02-20T17:19:30.4122083Z   	0x7ff0567b8298 - Unknown
2022-02-20T17:19:30.4138832Z   	0x7ff058a31630 - Unknown
2022-02-20T17:19:30.4140970Z   	0x7ff057e6a387 - Unknown
2022-02-20T17:19:30.4142772Z   	0x7ff057e6ba78 - Unknown
2022-02-20T17:19:30.4144476Z   	0x7ff056834875 - Unknown
2022-02-20T17:19:30.4146183Z   	0x7ff056646a33 - Unknown
2022-02-20T17:19:30.4147862Z   	0x7ff056834cdd - Unknown
2022-02-20T17:19:30.4149610Z   	0x7ff056834e45 - Unknown
2022-02-20T17:19:30.4151197Z   	0x7ff056834ea4 - Unknown
2022-02-20T17:19:30.4152791Z   	0x7ff05676c466 - Unknown
2022-02-20T17:19:30.4154346Z   	0x7ff05678bda3 - Unknown
2022-02-20T17:19:30.4156012Z   	0x7ff0566ddace - Unknown
2022-02-20T17:19:30.4159329Z   	0x7ff0566afec1 - Unknown
2022-02-20T17:19:30.4161158Z   	0x7ff056736e39 - Unknown
2022-02-20T17:19:30.4162645Z   	0x7ff056727b72 - Unknown
2022-02-20T17:19:30.4164267Z   	0x7ff0567195d2 - Unknown
2022-02-20T17:19:30.4165995Z   	0x7ff05679480e - Unknown
2022-02-20T17:19:30.4167524Z   	0x55dcb433a5aa - Unknown
2022-02-20T17:19:30.4169123Z   	0x7ff057e56555 - Unknown
2022-02-20T17:19:30.4170801Z   	0x55dcb4338029 - Unknown
2022-02-20T17:19:30.6293373Z   [New LWP 10203]
2022-02-20T17:19:30.6295390Z   [Thread debugging using libthread_db enabled]
2022-02-20T17:19:30.6296826Z   Using host libthread_db library "/lib64/libthread_db.so.1".
2022-02-20T17:19:31.2218749Z EXEC : warning : the debug information found in "/__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so.dbg" does not match "/__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so" (CRC mismatch). [/__w/1/s/src/mono/msbuild/aot-compile.proj]
2022-02-20T17:19:31.2222043Z   
2022-02-20T17:19:31.2239353Z   Missing separate debuginfo for /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2242226Z   Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/92/5d8bab89f15c6990ba944db4dfd44d746cfdb8.debug
2022-02-20T17:19:31.2395130Z   0x00007ff058a311d9 in waitpid () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2412985Z     Id   Target Id         Frame 
2022-02-20T17:19:31.2414838Z     2    Thread 0x7ff054fff700 (LWP 10203) "SGen worker" 0x00007ff058a2da35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2432285Z   * 1    Thread 0x7ff059056740 (LWP 10202) "corerun" 0x00007ff058a311d9 in waitpid () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2579202Z   
2022-02-20T17:19:31.2590000Z   Thread 2 (Thread 0x7ff054fff700 (LWP 10203)):
2022-02-20T17:19:31.2591755Z   #0  0x00007ff058a2da35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2593635Z   #1  0x00007ff05669f7f3 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2609971Z   #2  0x00007ff058a29ea5 in start_thread () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2611736Z   #3  0x00007ff057f329fd in clone () from /lib64/libc.so.6
2022-02-20T17:19:31.2623338Z   
2022-02-20T17:19:31.2633759Z   Thread 1 (Thread 0x7ff059056740 (LWP 10202)):
2022-02-20T17:19:31.2635970Z   #0  0x00007ff058a311d9 in waitpid () from /lib64/libpthread.so.0
2022-02-20T17:19:31.2637754Z   #1  0x00007ff0567b8b07 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2649895Z   #2  0x00007ff05675936e in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2659790Z   #3  0x00007ff0567b8298 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2661758Z   #4  <signal handler called>
2022-02-20T17:19:31.2663461Z   #5  0x00007ff057e6a387 in raise () from /lib64/libc.so.6
2022-02-20T17:19:31.2664790Z   #6  0x00007ff057e6ba78 in abort () from /lib64/libc.so.6
2022-02-20T17:19:31.2666396Z   #7  0x00007ff056834875 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2678513Z   #8  0x00007ff056646a33 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2685693Z   #9  0x00007ff056834cdd in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2698988Z   #10 0x00007ff056834e45 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2714534Z   #11 0x00007ff056834ea4 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2728485Z   #12 0x00007ff05676c466 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2742435Z   #13 0x00007ff05678bda3 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2755997Z   #14 0x00007ff0566ddace in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2769258Z   #15 0x00007ff0566afec1 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2783243Z   #16 0x00007ff056736e39 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2796491Z   #17 0x00007ff056727b72 in ?? () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2809466Z   #18 0x00007ff0567195d2 in mono_main () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.2823406Z   #19 0x00007ff05679480e in monovm_execute_assembly () from /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/Tests/Core_Root/libcoreclr.so
2022-02-20T17:19:31.3348964Z   #20 0x000055dcb433a5aa in run (config=...) at /__w/1/s/src/coreclr/hosts/corerun/corerun.cpp:368
2022-02-20T17:19:31.3363557Z   #21 main (argc=<optimized out>, argv=<optimized out>) at /__w/1/s/src/coreclr/hosts/corerun/corerun.cpp:563
2022-02-20T17:19:31.3425311Z   [Inferior 1 (process 10202) detached]
2022-02-20T17:19:31.3507046Z   
2022-02-20T17:19:31.3508333Z   =================================================================
2022-02-20T17:19:31.3509529Z   	Basic Fault Address Reporting
2022-02-20T17:19:31.3510661Z   =================================================================
2022-02-20T17:19:31.3512250Z   Memory around native instruction pointer (0x7ff057e6a387):0x7ff057e6a377  48 63 d7 48 63 f6 48 63 f9 b8 ea 00 00 00 0f 05  Hc.Hc.Hc........
2022-02-20T17:19:31.3513887Z   0x7ff057e6a387  48 3d 00 f0 ff ff 77 1e f3 c3 0f 1f 80 00 00 00  H=....w.........
2022-02-20T17:19:31.3515279Z   0x7ff057e6a397  00 85 c9 7f db 89 c8 f7 d8 81 e1 ff ff ff 7f 0f  ................
2022-02-20T17:19:31.3519503Z   0x7ff057e6a3a7  44 c6 89 c1 eb ca 48 8b 15 9c 0a 39 00 f7 d8 64  D.....H....9...d
2022-02-20T17:19:31.9097999Z   Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/Arm/Rdm/Rdm_ro/Rdm_ro.dll
2022-02-20T17:19:31.9101723Z   AOTID 496F54B8-0F3D-AA06-C32F-9EB82E567870
2022-02-20T17:19:31.9104767Z   Executing opt: "/__w/1/s/artifacts/bin/mono/Linux.x64.Release/opt" -f -O2 -disable-tail-calls -place-safepoints -spp-all-backedges -mattr=sse4.2,popcnt,lzcnt,bmi,bmi2,pclmul,aes -o "mono_aot_6QfcWC/temp.opt.bc" "mono_aot_6QfcWC/temp.bc"
2022-02-20T17:19:31.9109761Z   Executing llc: "/__w/1/s/artifacts/bin/mono/Linux.x64.Release/llc"  -march=x86-64 -mcpu=generic -enable-implicit-null-checks -disable-fault-maps -asm-verbose=false -disable-gnu-eh-frame -enable-mono-eh-frame -mono-eh-frame-symbol=mono_aot_Rdm_ro_eh_frame -disable-tail-calls -no-x86-call-frame-opt -relocation-model=pic -filetype=obj -mattr=sse4.2,popcnt,lzcnt,bmi,bmi2,pclmul,aes -o "mono_aot_6QfcWC/temp-llvm.o" "mono_aot_6QfcWC/temp.opt.bc"
2022-02-20T17:19:31.9113118Z   Compiled: 2033/2033
2022-02-20T17:19:31.9115090Z   Executing the native assembler: "as" --64  -o /tmp/mono_aot_Af32J0.o /tmp/mono_aot_Af32J0
2022-02-20T17:19:31.9118024Z   Executing the native linker: "ld" -shared -o /__w/1/s/artifacts/tests/coreclr/Linux.x64.Release/JIT/HardwareIntrinsics/Arm/Rdm/Rdm_ro/Rdm_ro.dll.so.tmp "mono_aot_6QfcWC/temp-llvm.o" /tmp/mono_aot_Af32J0.o 

@vargaz
Contributor

vargaz commented Feb 20, 2022

This will fix it:
https://gist.github.com/vargaz/78f0b5d0710c5de7a0131b0f9a6ea5d3

Contributor

@TIHan TIHan left a comment

Looks good! Only a few comments.

@fanyang-mono
Member

fanyang-mono commented Feb 22, 2022

I created an issue to clean up the code for type checks of vector elements (#65318), but haven't gotten to it yet.

Contributor

@echesakov echesakov left a comment

Left some comments

src/coreclr/jit/lowerarmarch.cpp (outdated)
    if (!varTypeIsFloating(simdBaseType) && (op != nullptr))
    {
        GenTree* cmp =
            comp->gtNewSimdHWIntrinsicNode(simdType, op, NI_AdvSimd_Arm64_MaxAcross, CORINFO_TYPE_UBYTE, simdSize);
Contributor

According to the Arm® Cortex®-A76 Software Optimization Guide, UMAXV with the 16B arrangement has an execution latency of 6 and an execution throughput of 1/2, while UMAXV with the 4H/4S arrangement has an execution latency of 3 and an execution throughput of 1.

Do we want CORINFO_TYPE_USHORT/CORINFO_TYPE_UINT as the base type instead?

Member Author

Sure, let's change it. I had this in mind when I was benchmarking and saw zero difference, but now I see why:
[image] (Apple M1, Firestorm core)
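
Switching the base type is safe for this check because the zero test does not depend on lane width. The sketch below is a scalar C++ model, not the JIT change itself: `max_across_u8` and `max_across_u32` are hypothetical helpers that stand in for UMAXV with the 16B and 4S arrangements.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>

// Max across sixteen 8-bit lanes (models UMAXV with the 16B arrangement).
std::uint8_t max_across_u8(const std::array<std::uint8_t, 16>& v) {
    return *std::max_element(v.begin(), v.end());
}

// Max across four 32-bit lanes (models UMAXV with the 4S arrangement).
std::uint32_t max_across_u32(const std::array<std::uint8_t, 16>& v) {
    std::array<std::uint32_t, 4> lanes;
    std::memcpy(lanes.data(), v.data(), sizeof lanes);  // reinterpret the 16 bytes
    return *std::max_element(lanes.begin(), lanes.end());
}
```

Either maximum is zero exactly when all 128 bits are zero, so the wider arrangement can be chosen purely on latency and throughput grounds.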

@kunalspathak
Member

kunalspathak commented Mar 3, 2022

Improvements in dotnet/perf-autofiling-issues#3833 and dotnet/perf-autofiling-issues#3829

@EgorBo
Member Author

EgorBo commented Mar 3, 2022

wow, it's more than I expected

@EgorBo EgorBo deleted the arm-fast-cmp-zero-vec branch March 3, 2022 17:56
@ghost ghost locked as resolved and limited conversation to collaborators Apr 2, 2022
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: Faster comparison against Vector128<>.Zero
6 participants