Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intrinsify Interlocked.And and Interlocked.Or on XARCH #96258

Merged
merged 4 commits into from
Jan 5, 2024

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Dec 21, 2023

Closes #32239

These were only implemented on ARM64 (and RISC-V). Unlike other Interlocked APIs, these return original value so it makes them impossible to optimize on XARCH when that original value is used. Looks like all usages of it in BCL don't need return value so we can optimize this case. Example:

Interlocked.And(ref m_stateFlags, ~(int)TaskStateFlags.WaitCompletionNotification);

void Test1(ref int x, int y) => Interlocked.Or(ref x, y); // return value is not used
int  Test2(ref int x, int y) => Interlocked.Or(ref x, y); // return value is used

Was:

; Method Tests:Test1(byref,int):this (FullOpts)
G_M7530_IG01:
       push     rax
G_M7530_IG02:
       mov      eax, dword ptr [rdx]
       jmp      SHORT G_M7530_IG03
       align    [0 bytes for IG03]
G_M7530_IG03:
       mov      ecx, eax
       or       ecx, r8d
       mov      dword ptr [rsp+0x04], eax
       lock     
       cmpxchg  dword ptr [rdx], ecx
       mov      ecx, dword ptr [rsp+0x04]
       cmp      eax, ecx
       je       SHORT G_M7530_IG05
G_M7530_IG04:
       mov      ecx, eax
       mov      eax, ecx
       jmp      SHORT G_M7530_IG03
G_M7530_IG05:
       add      rsp, 8
       ret      
; Total bytes of code: 37


; Method Tests:Test2(byref,int):int:this (FullOpts)
G_M30720_IG01:
       push     rax
G_M30720_IG02:
       mov      eax, dword ptr [rdx]
       jmp      SHORT G_M30720_IG03
       align    [0 bytes for IG03]
G_M30720_IG03:
       mov      ecx, eax
       or       ecx, r8d
       mov      dword ptr [rsp+0x04], eax
       lock     
       cmpxchg  dword ptr [rdx], ecx
       mov      ecx, dword ptr [rsp+0x04]
       cmp      eax, ecx
       je       SHORT G_M30720_IG05
G_M30720_IG04:
       mov      ecx, eax
       mov      eax, ecx
       jmp      SHORT G_M30720_IG03
G_M30720_IG05:
       mov      eax, ecx
G_M30720_IG06:
       add      rsp, 8
       ret      
; Total bytes of code: 39

Now:

; Method Tests:Test1(byref,int):this (FullOpts)
G_M7530_IG01:
G_M7530_IG02:
       lock     
       or       dword ptr [rdx], r8d
G_M7530_IG03:
       ret      
; Total bytes of code: 5


; Method Tests:Test2(byref,int):int:this (FullOpts)
G_M30720_IG01:
G_M30720_IG02:
       mov      eax, dword ptr [rdx]
G_M30720_IG03:
       mov      ecx, eax
       or       ecx, r8d
       lock     
       cmpxchg  dword ptr [rdx], ecx
       jne      SHORT G_M30720_IG03
G_M30720_IG04:
       ret      
; Total bytes of code: 14

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 21, 2023
@ghost ghost assigned EgorBo Dec 21, 2023
@ghost
Copy link

ghost commented Dec 21, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #32239

These were only implemented on ARM64 (and RISC-V). Unlike other Interlocked APIs, these return original value so it makes them impossible to optimize on XARCH when that original value is used. Looks like all usages of it in BCL don't need return value so we can optimize this case.

void Test(ref int x, int y) => Interlocked.Or(ref x, y);

Was:

; Method Program:Test(byref,int):this (FullOpts)
G_M55114_IG01:  ;; offset=0x0000
       push     rax
						;; size=1 bbWeight=1 PerfScore 1.00

G_M55114_IG02:  ;; offset=0x0001
       mov      eax, dword ptr [rdx]
       jmp      SHORT G_M55114_IG03
       align    [0 bytes for IG03]
						;; size=4 bbWeight=1 PerfScore 4.00

G_M55114_IG03:  ;; offset=0x0005
       mov      ecx, eax
       or       ecx, r8d
       mov      dword ptr [rsp+0x04], eax
       lock     
       cmpxchg  dword ptr [rdx], ecx
       mov      ecx, dword ptr [rsp+0x04]
       cmp      eax, ecx
       je       SHORT G_M55114_IG05
						;; size=21 bbWeight=8 PerfScore 174.00

G_M55114_IG04:  ;; offset=0x001A
       mov      ecx, eax
       mov      eax, ecx
       jmp      SHORT G_M55114_IG03
						;; size=6 bbWeight=4 PerfScore 10.00

G_M55114_IG05:  ;; offset=0x0020
       add      rsp, 8
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 37

Now:

; Method Program:Test(byref,int):this (FullOpts)
G_M55114_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M55114_IG02:  ;; offset=0x0000
       lock     
       or       dword ptr [rdx], r8d
						;; size=4 bbWeight=1 PerfScore 16.00

G_M55114_IG03:  ;; offset=0x0004
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 5
Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo EgorBo force-pushed the intrinsify-interlocked-and-or branch from 62cd091 to 4759e86 Compare January 4, 2024 21:11
@EgorBo EgorBo force-pushed the intrinsify-interlocked-and-or branch from bebdc5c to 988c8b7 Compare January 4, 2024 22:04
@EgorBo
Copy link
Member Author

EgorBo commented Jan 5, 2024

@dotnet/jit-contrib @BruceForstall PTAL, simple PR, optimizes Interlocked.And/Add the way native compilers do: https://godbolt.org/z/csfd3nG89

Diffs

A few size and TP regressions for the cases when we previously didn't inline Interlocked.And/Or and now it's inlined as intrinsic in both Tier0 (non-debug and non-minopts) and Tier1

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a comment in lsra.

src/coreclr/jit/importercalls.cpp Show resolved Hide resolved
src/coreclr/jit/lsraxarch.cpp Show resolved Hide resolved
@EgorBo EgorBo merged commit a5ff7e6 into dotnet:main Jan 5, 2024
129 checks passed
@EgorBo EgorBo deleted the intrinsify-interlocked-and-or branch January 5, 2024 21:33
@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Jan 5, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Feb 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider making Interlocked.And and friends into JIT intrinsics
4 participants