Enable storing pinned locals in registers #63397

Open
Sergio0694 opened this issue Jan 5, 2022 · 10 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone: Future

Comments

Sergio0694 (Contributor) commented Jan 5, 2022

This is a follow-up to (RIP) #55273, specifically this comment from @jkotas.
Opening this issue for tracking so the idea doesn't get lost.

Overview

Consider this snippet:

static unsafe void A(ref byte r)
{
    fixed (void* p = &r)
    {
        B(p);
    }
}

[MethodImpl(MethodImplOptions.NoInlining)]
static unsafe void B(void* p)
{
}

This currently results in:

sub rsp, 0x28
mov [rsp+0x20], rcx
call 0x00007ffaecb70028
xor eax, eax
mov [rsp+0x20], rax
add rsp, 0x28
ret

Currently, pinned locals are always stored on the stack, which makes pinning less than ideal on hot paths.
It would be nice if the JIT supported storing pinned locals in registers when possible.
As mentioned by @tannergooding, the register would need to be cleared when the pinned local goes out of scope, so the GC stops tracking it.
Method A above could then become something like this:

mov rbx, rcx
call 0x00007ffaecb70028
xor rbx, rbx
ret

Here I just used rbx to store the pinned local (I simply picked the first callee-saved register). I realize there's plenty of work needed to make this happen, and that the various GC data structures would need to be updated accordingly to enable tracking; this is just the general idea.

It goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code 😄
Additionally, it could be used to restore the RuntimeHelpers.GetHashCode optimization by porting the happy path to C# (as I did in #55273, but hopefully without the GC hole ahah), which would automatically speed up virtually every dictionary out there using random reference types as keys, as well as any other data structure that ends up calling GetHashCode on an object that didn't override the default object.GetHashCode implementation.

cc @EgorBo @SingleAccretion

category:cq
theme:pinning

Sergio0694 added the tenet-performance (Performance related issue) label on Jan 5, 2022
The dotnet-issue-labeler bot added the area-CodeGen-coreclr and untriaged labels on Jan 5, 2022

ghost commented Jan 5, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


tannergooding (Member) commented Jan 5, 2022

> It goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code

I'm not super convinced of the wins here myself.

I think that being able to elide the pin in the case where the pinned value is coming from the stack would be better: #40553.

The "friendly signature" for something like BOOL QueryPerformanceFrequency(LARGE_INTEGER* lpFrequency) is bool QueryPerformanceFrequency(out long frequency), in which case the typical usage is likely if (QueryPerformanceFrequency(out var frequency)). In that scenario, the out has to be pinned, even though the P/Invoke wrapper, once inlined, could see that pinning a stack local isn't required.

Regular pinning is effectively a stack spill, however, and in terms of "typical" method calls it isn't going to be much more expensive than the spilling or register shuffling that already happens to satisfy the callee-/caller-saved register conventions. Further, in the case of P/Invoke, this spill is nothing compared to the transition stub, which already spills basically everything to ensure that no GC-tracked data is left in a register (see #54107 (comment), which shows the disassembly for such a transition).

Sergio0694 (Contributor, Author) commented

I can still think of many cases where the stack spill for the pinning would be the only one. For instance, that's the case for RuntimeHelpers.GetHashCode, which is otherwise entirely inlined with all locals enregistered. I'd imagine that would be nice to optimize, given the speedup it could give downstream to everyone indirectly relying on it for hashcode-based data structures. There are also plenty of other cases where the pinning is the biggest overhead in a method (eg. ComPtr<T>.GetPinnableReference() comes to mind, which would otherwise be basically free if the pinned local were enregistered; see the sketch below). All in all, I guess I'm just saying that #40553 would sure be nice to have too, but I don't see why the two would be mutually exclusive. Wouldn't it be nice to have both? 😄
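
For context, here's a trimmed-down sketch of the kind of pattern I mean (the ComPtr<T> below is a hypothetical stand-in for the real interop wrapper, and [UnscopedRef] assumes .NET 7+):

using System.Diagnostics.CodeAnalysis;

// Hypothetical stand-in for an interop smart pointer like TerraFX's ComPtr<T>;
// only the pinning pattern matters here.
unsafe struct ComPtr<T> where T : unmanaged
{
    private T* ptr;

    // Enables `fixed (T** pp = comPtr)` at the call site.
    [UnscopedRef]
    public ref T* GetPinnableReference() => ref ptr;
}

static unsafe class Example
{
    struct Device { } // placeholder unmanaged type

    static void Use(ref ComPtr<Device> com)
    {
        // The fixed statement exists only to hand native code the address of
        // 'ptr', yet today it always forces the pinned pointer into a stack
        // slot, so this otherwise trivial method pays for a stack frame.
        fixed (Device** pp = com)
        {
            // NativeCreateDevice(pp); // placeholder for the real native call
        }
    }
}

If pp could live in a register, Use would need no stack frame at all.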

JulieLeeMSFT (Member) commented

cc @kunalspathak.

JulieLeeMSFT removed the untriaged label and added this issue to the Future milestone on Jan 25, 2022
Sergio0694 (Contributor, Author) commented Jun 3, 2022

For reference, I did try using fixed as a follow-up to #55273, i.e.:

public static unsafe int GetHashCode(object? o)
{
    if (o is not null)
    {
        uint syncBlockValue;

        fixed (byte* pData = &o.GetRawData())
        {
            syncBlockValue = ((ObjectHeader*)&((nint*)pData)[-2])->SyncBlockValue;
        }

        const uint BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX = 0x08000000;
        const uint BIT_SBLK_IS_HASHCODE = 0x04000000;
        const uint BITS_IS_VALID_HASHCODE = BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX | BIT_SBLK_IS_HASHCODE;
        const int HASHCODE_BITS = 26;
        const uint MASK_HASHCODE = (1u << HASHCODE_BITS) - 1u;

        if ((syncBlockValue & BITS_IS_VALID_HASHCODE) == BITS_IS_VALID_HASHCODE)
        {
            return unchecked((int)(syncBlockValue & MASK_HASHCODE));
        }
    }

    return InternalGetHashCode(o);
}

The performance, though, was much worse than the current implementation:

| Method      | Branch | Mean      | Error     | StdDev    | Ratio | RatioSD | Code Size |
|-------------|--------|----------:|----------:|----------:|------:|--------:|----------:|
| GetHashCode | PR     | 0.6446 ns | 0.0021 ns | 0.0018 ns |  1.45 |    0.03 |      78 B |
| GetHashCode | main   | 0.4462 ns | 0.0102 ns | 0.0091 ns |  1.00 |    0.00 |       9 B |

The asm in particular was... interesting:

; ObjectGetHashCodeBenchmark.GetHashCode()
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+20],rax
       mov       rcx,[rcx+8]
       test      rcx,rcx
       je        short M00_L00
       lea       rax,[rcx+8]
       mov       [rsp+20],rax
       mov       rax,[rsp+20]
       mov       eax,[rax+0FFF4]
       xor       edx,edx
       mov       [rsp+20],rdx
       mov       edx,eax
       and       edx,0C000000
       cmp       edx,0C000000
       jne       short M00_L00
       and       eax,3FFFFFF
       jmp       short M00_L01
M00_L00:
       call      System.Runtime.CompilerServices.RuntimeHelpers.InternalGetHashCode(System.Object)
M00_L01:
       nop
       add       rsp,28
       ret
; Total bytes of code 78

Looks like there's lots of room for improvements to the codegen in this case? 🤔

I will also say: I still think an Unsafe.AtomicAddByteOffsetAndRead intrinsic would be useful.
It would both solve this case (no need to use fixed at all) and potentially be useful in other scenarios as well.

As in, the code above could then just be:

public static unsafe int GetHashCode(object? o)
{
    if (o is not null)
    {
        ref byte dataRef = ref o.GetRawData();
        nint headerData = Unsafe.AtomicAddByteOffsetAndRead<nint>(ref dataRef, -8);
        uint syncBlockValue = ((ObjectHeader*)&headerData)->SyncBlockValue;

        // Rest of the code (with no GC holes)
    }

    return InternalGetHashCode(o);
}

SupinePandora43 commented

@Sergio0694 more like Unsafe.AtomicSubtractByteOffsetAndRead

Sergio0694 (Contributor, Author) commented Jun 3, 2022

I mean, yes, that was just an example (plus you'd get the same result anyway with a negative offset).
My point was just "some API that can atomically subtract and read with GC tracking and no pinning" 🙂

Of course, such an API would only be allowed to read primitives, or perhaps just int/nuint; the exact shape doesn't really matter.
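
Shape-wise, something like this (entirely hypothetical, matching the snippet in my earlier comment):

namespace System.Runtime.CompilerServices
{
    public static partial class Unsafe
    {
        // Hypothetical intrinsic: reads a T at (source + byteOffset) while the
        // JIT keeps 'source' GC-tracked across the operation, so no pinning is
        // needed and no GC hole can form.
        public static T AtomicAddByteOffsetAndRead<T>(ref byte source, nint byteOffset)
            where T : unmanaged
            => throw new System.NotImplementedException(); // placeholder; would be a JIT intrinsic
    }
}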

jakobbotsch (Member) commented

This would likely fix #35748 too.

VSadov (Member) commented Feb 28, 2024

I'd like to +1 this.

In NativeAOT we do have GetHashCode implemented in managed code, as well as thin locks.
All operations on the ObjectHeader need to pin, and some of those code paths are otherwise fairly simple: releasing a thin lock, for example, is just a few checks for rare cases and then setting a bit in the header.
The fact that pinning introduces stack locals hurts these scenarios somewhat.
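
To make that concrete, here is a rough sketch of such a release path. Everything below is illustrative only: the ObjectHeader layout mirrors the GetHashCode snippet earlier in this thread, the mask and fast-path logic are hypothetical, and a real release would need appropriate memory ordering.

// Illustrative sketch; not the actual NativeAOT code. GetRawData() is the
// corelib-internal helper used in the GetHashCode snippet above.
struct ObjectHeader
{
    public uint Padding;        // hypothetical layout
    public uint SyncBlockValue; // the thin lock / hash / sync block index word
}

static unsafe void ReleaseThinLock(object obj)
{
    const uint MASK_THIN_LOCK_THREAD_ID = 0x0000FFFF; // hypothetical bit layout

    fixed (byte* pData = &obj.GetRawData())
    {
        // Header just before the MethodTable pointer, addressed the same way
        // as in the GetHashCode snippet above.
        ObjectHeader* pHeader = (ObjectHeader*)&((nint*)pData)[-2];
        uint bits = pHeader->SyncBlockValue;

        // Fast path: the current thread owns the thin lock, no recursion, not
        // inflated; just update the bits in the header. (A real implementation
        // would use an interlocked or volatile write here.)
        if ((bits & MASK_THIN_LOCK_THREAD_ID) == (uint)System.Environment.CurrentManagedThreadId)
        {
            pHeader->SyncBlockValue = bits & ~MASK_THIN_LOCK_THREAD_ID;
            return;
        }

        // Rare cases (inflated lock, recursion, contention) take a slow path.
    }
}

Today the fixed statement forces the pinned pointer into a stack slot, so even this trivial fast path pays for a stack frame; with enregistered pinned locals it wouldn't.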

VSadov (Member) commented Feb 28, 2024

Also see #97997, which reminded me of this issue.
