Enable storing pinned locals in registers #63397

Open
Sergio0694 opened this issue Jan 5, 2022 · 10 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone: Future

Comments

Sergio0694 (Contributor) commented Jan 5, 2022

This is a follow-up to (RIP) #55273, specifically this comment from @jkotas.
Opening this issue for tracking so the idea doesn't get lost.

Overview

Consider this snippet:

static unsafe void A(ref byte r)
{
    fixed (void* p = &r)
    {
        B(p);
    }
}

[MethodImpl(MethodImplOptions.NoInlining)]
static unsafe void B(void* p)
{
}

This currently results in:

sub rsp, 0x28
mov [rsp+0x20], rcx
call 0x00007ffaecb70028
xor eax, eax
mov [rsp+0x20], rax
add rsp, 0x28
ret

Currently, pinned locals are always stored on the stack, which makes pinning less than ideal on hot paths.
It would be nice if the JIT supported storing pinned locals in registers when possible.
As mentioned by @tannergooding, the register would need to be cleared when the pinned local goes out of scope, so the GC stops tracking it.
Method A above could then become something like this:

mov rbx, rcx
call 0x00007ffaecb70028
xor rbx, rbx
ret

Here I just used rbx to store the pinned local (I simply picked the first callee-saved register). I realize there's plenty of work needed to make this happen, and that the various GC data structures would need to be updated accordingly to enable tracking; this is just the general idea.

It goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code 😄
Additionally, it could be used to restore the RuntimeHelpers.GetHashCode optimization by porting the happy path to C# (as I did in #55273, but hopefully without the GC hole ahah), which would automatically speed up virtually every dictionary out there using random reference types as keys, as well as any other data structure that ends up calling GetHashCode on an object that didn't override the default object.GetHashCode implementation.

cc @EgorBo @SingleAccretion

category:cq
theme:pinning

Sergio0694 added the tenet-performance (Performance related issue) label on Jan 5, 2022
The dotnet-issue-labeler bot added the area-CodeGen-coreclr and untriaged labels on Jan 5, 2022

ghost commented Jan 5, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


tannergooding (Member) commented Jan 5, 2022

> It goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code

I'm not super convinced of the wins here myself.

I think that being able to elide the pin in the case where the pinned value is coming from the stack would be better: #40553.

The "friendly signature" for something like BOOL QueryPerformanceFrequency(LARGE_INTEGER* lpFrequency) is bool QueryPerformanceFrequency(out long frequency), in which case the typical usage is likely if (QueryPerformanceFrequency(out var frequency)). In that scenario, the out has to be pinned, even though the P/Invoke wrapper, once inlined, could see that pinning a stack local isn't required.

Regular pinning is effectively a stack spill, however, and in terms of "typical" method calls it isn't going to be much more expensive than the spilling or register shuffling that already happens to satisfy the callee-/caller-saved register conventions. Further, in the case of P/Invoke, this spill is nothing compared to the transition stub, which already spills basically everything to ensure that no GC-tracked data is left in a register (see #54107 (comment), which shows the disassembly for such a transition).

Sergio0694 (Contributor, Author) commented

I can still think of many cases where the stack spill for the pinning would be the only one. For instance, that's the case for RuntimeHelpers.GetHashCode, which is otherwise entirely inlined with all locals enregistered. I'd imagine that would be nice to optimize, given the speedup it could give downstream to everyone indirectly relying on it for hashcode-based data structures. There are also plenty of other cases where the pinning is the biggest overhead in a method (eg. ComPtr<T>.GetPinnableReference() comes to mind, which would otherwise be basically free if the pinned local were enregistered; see the sketch below). All in all, I guess I'm just saying that #40553 would sure be nice to have too, but I don't see why the two would be mutually exclusive. Wouldn't it be nice to have both? 😄
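
For context, here's a trimmed-down sketch of the kind of pattern I mean (the ComPtr<T> below is a hypothetical stand-in for the real interop wrapper, and [UnscopedRef] assumes .NET 7+):

using System.Diagnostics.CodeAnalysis;

// Hypothetical stand-in for an interop smart pointer like TerraFX's ComPtr<T>;
// only the pinning pattern matters here.
unsafe struct ComPtr<T> where T : unmanaged
{
    private T* ptr;

    // Enables `fixed (T** pp = comPtr)` at the call site.
    [UnscopedRef]
    public ref T* GetPinnableReference() => ref ptr;
}

static unsafe class Example
{
    struct Device { } // placeholder unmanaged type

    static void Use(ref ComPtr<Device> com)
    {
        // The fixed statement exists only to hand native code the address of
        // 'ptr', yet today it always forces the pinned pointer into a stack
        // slot, so this otherwise trivial method pays for a stack frame.
        fixed (Device** pp = com)
        {
            // NativeCreateDevice(pp); // placeholder for the real native call
        }
    }
}

If pp could live in a register, Use would need no stack frame at all.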

JulieLeeMSFT (Member) commented

cc @kunalspathak.

JulieLeeMSFT removed the untriaged label and added this issue to the Future milestone on Jan 25, 2022
Sergio0694 (Contributor, Author) commented Jun 3, 2022

For reference, I did try using fixed as a follow-up to #55273, i.e.:

public static unsafe int GetHashCode(object? o)
{
    if (o is not null)
    {
        uint syncBlockValue;

        fixed (byte* pData = &o.GetRawData())
        {
            syncBlockValue = ((ObjectHeader*)&((nint*)pData)[-2])->SyncBlockValue;
        }

        const uint BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX = 0x08000000;
        const uint BIT_SBLK_IS_HASHCODE = 0x04000000;
        const uint BITS_IS_VALID_HASHCODE = BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX | BIT_SBLK_IS_HASHCODE;
        const int HASHCODE_BITS = 26;
        const uint MASK_HASHCODE = (1u << HASHCODE_BITS) - 1u;

        if ((syncBlockValue & BITS_IS_VALID_HASHCODE) == BITS_IS_VALID_HASHCODE)
        {
            return unchecked((int)(syncBlockValue & MASK_HASHCODE));
        }
    }

    return InternalGetHashCode(o);
}

The performance, though, was much worse than the current implementation:

| Method      | Branch | Mean      | Error     | StdDev    | Ratio | RatioSD | Code Size |
|-------------|--------|----------:|----------:|----------:|------:|--------:|----------:|
| GetHashCode | PR     | 0.6446 ns | 0.0021 ns | 0.0018 ns |  1.45 |    0.03 |      78 B |
| GetHashCode | main   | 0.4462 ns | 0.0102 ns | 0.0091 ns |  1.00 |    0.00 |       9 B |

The asm in particular was... interesting:

; ObjectGetHashCodeBenchmark.GetHashCode()
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+20],rax
       mov       rcx,[rcx+8]
       test      rcx,rcx
       je        short M00_L00
       lea       rax,[rcx+8]
       mov       [rsp+20],rax
       mov       rax,[rsp+20]
       mov       eax,[rax+0FFF4]
       xor       edx,edx
       mov       [rsp+20],rdx
       mov       edx,eax
       and       edx,0C000000
       cmp       edx,0C000000
       jne       short M00_L00
       and       eax,3FFFFFF
       jmp       short M00_L01
M00_L00:
       call      System.Runtime.CompilerServices.RuntimeHelpers.InternalGetHashCode(System.Object)
M00_L01:
       nop
       add       rsp,28
       ret
; Total bytes of code 78

Looks like there's lots of room for improvements to the codegen in this case? 🤔

I will also say: I still think an Unsafe.AtomicAddByteOffsetAndRead intrinsic would be useful.
It would both solve this case (no need to use fixed at all) and potentially be useful in other scenarios as well.

As in, the code above could then just be:

public static unsafe int GetHashCode(object? o)
{
    if (o is not null)
    {
        ref byte dataRef = ref o.GetRawData();
        nint headerData = Unsafe.AtomicAddByteOffsetAndRead<nint>(ref dataRef, -8);
        uint syncBlockValue = ((ObjectHeader*)&headerData)->SyncBlockValue;

        // Rest of the code (with no GC holes)
    }

    return InternalGetHashCode(o);
}

SupinePandora43 commented

@Sergio0694 more like Unsafe.AtomicSubtractByteOffsetAndRead

Sergio0694 (Contributor, Author) commented Jun 3, 2022

I mean, yes, that was just an example (plus you'd get the same result anyway with a negative offset).
My point was just "some API that can atomically subtract and read with GC tracking and no pinning" 🙂

Of course, such an API would only be allowed to read primitives, or perhaps just int/nuint; the exact shape doesn't really matter.
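
Shape-wise, something like this (entirely hypothetical, matching the snippet in my earlier comment):

namespace System.Runtime.CompilerServices
{
    public static partial class Unsafe
    {
        // Hypothetical intrinsic: reads a T at (source + byteOffset) while the
        // JIT keeps 'source' GC-tracked across the operation, so no pinning is
        // needed and no GC hole can form.
        public static T AtomicAddByteOffsetAndRead<T>(ref byte source, nint byteOffset)
            where T : unmanaged
            => throw new System.NotImplementedException(); // placeholder; would be a JIT intrinsic
    }
}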

jakobbotsch (Member) commented

This would likely fix #35748 too.

VSadov (Member) commented Feb 28, 2024

I'd like to +1 this.

In NativeAOT we do have GetHashCode implemented in managed code, as well as thin locks.
All operations on the ObjectHeader need to pin, and some of those code paths are otherwise fairly simple: releasing a thin lock, for example, is just a few checks for rare cases and then setting a bit in the header.
The fact that pinning introduces stack locals hurts these scenarios somewhat.
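
To make that concrete, here is a rough sketch of such a release path. Everything below is illustrative only: the ObjectHeader layout mirrors the GetHashCode snippet earlier in this thread, the mask and fast-path logic are hypothetical, and a real release would need appropriate memory ordering.

// Illustrative sketch; not the actual NativeAOT code. GetRawData() is the
// corelib-internal helper used in the GetHashCode snippet above.
struct ObjectHeader
{
    public uint Padding;        // hypothetical layout
    public uint SyncBlockValue; // the thin lock / hash / sync block index word
}

static unsafe void ReleaseThinLock(object obj)
{
    const uint MASK_THIN_LOCK_THREAD_ID = 0x0000FFFF; // hypothetical bit layout

    fixed (byte* pData = &obj.GetRawData())
    {
        // Header just before the MethodTable pointer, addressed the same way
        // as in the GetHashCode snippet above.
        ObjectHeader* pHeader = (ObjectHeader*)&((nint*)pData)[-2];
        uint bits = pHeader->SyncBlockValue;

        // Fast path: the current thread owns the thin lock, no recursion, not
        // inflated; just update the bits in the header. (A real implementation
        // would use an interlocked or volatile write here.)
        if ((bits & MASK_THIN_LOCK_THREAD_ID) == (uint)System.Environment.CurrentManagedThreadId)
        {
            pHeader->SyncBlockValue = bits & ~MASK_THIN_LOCK_THREAD_ID;
            return;
        }

        // Rare cases (inflated lock, recursion, contention) take a slow path.
    }
}

Today the fixed statement forces the pinned pointer into a stack slot, so even this trivial fast path pays for a stack frame; with enregistered pinned locals it wouldn't.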

VSadov (Member) commented Feb 28, 2024

Also see #97997, which reminded me of this issue.
