[NativeAOT] Thin locks #79519

Merged: 25 commits merged into dotnet:main from VSadov:thinLock on Dec 15, 2022

Conversation

@VSadov VSadov (Member) commented Dec 12, 2022

This change implements "thin" locks - lightweight spinlocks that use object header bits for mutual exclusion on NativeAOT.
In more complex cases, such as blocking waits, a thin lock transparently transitions into a "fat" lock (implemented by System.Threading.Lock) that supports the full range of scenarios.
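For intuition, here is a minimal sketch of the thin-lock fast path (hypothetical names and bit layout, not the actual runtime code): acquire is a single compare-and-swap that stamps the current thread's small ID into free object-header bits, and anything the CAS cannot handle (contention, recursion, header bits already used for other state) falls back to the fat lock.

using System.Threading;

static class ThinLockSketch
{
    // Assumed layout for illustration: the low 16 header bits hold the owner
    // thread ID, with 0 meaning "unowned". Real header bit assignments differ.
    private const int OwnerMask = 0xFFFF;

    // Fast path: one CAS. Returns false when the caller must take the fat-lock
    // path (contention, recursion, or header bits used for something else).
    public static bool TryAcquireThin(ref int header, int smallThreadId)
    {
        int unowned = header & ~OwnerMask;            // expected value: no owner
        return Interlocked.CompareExchange(
            ref header, unowned | smallThreadId, unowned) == unowned;
    }

    public static bool TryReleaseThin(ref int header, int smallThreadId)
    {
        int owned = header;
        if ((owned & OwnerMask) != smallThreadId)
            return false;                             // not thin-owned by us
        return Interlocked.CompareExchange(
            ref header, owned & ~OwnerMask, owned) == owned;
    }
}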

Fixes: #75108
Fixes: #68891

After this change, the performance of the C# lock statement is significantly better in both scenarios: Thin (76% faster) and Fat (30% faster). For NativeAOT this is a considerable improvement. The thin lock is also allocation-free.

CoreCLR is still a bit faster: about 20% faster in the thin-lock scenario and 50% faster in the fat-lock scenario.
The managed implementation may eventually match or beat the native one, but we need to address a few codegen inefficiencies first. Fixing those should happen separately from this PR; see the referenced issues.
The main difference is accessing CurrentManagedThreadId. Accessing a thread static requires a method call, while the C++ implementation of locks in CoreCLR simply inlines the access to a native TLS variable. The cost of this call really stands out, and we must make it on both the acquire and the release paths.
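For context, the C# lock statement lowers to a Monitor.Enter/Exit pair, which is why the thread-ID lookup is paid on both ends of every cycle:

using System.Threading;

object obj = new object();

// What `lock (obj) { /* body */ }` compiles to, per the C# language spec:
bool lockTaken = false;
try
{
    Monitor.Enter(obj, ref lockTaken);  // acquire: needs the current thread ID
    /* body */
}
finally
{
    if (lockTaken)
        Monitor.Exit(obj);              // release: needs the thread ID again
}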

These locks are fast. An uncontended lock/unlock cycle takes ~5-7 nsec (~1 billion cycles in 5-7 seconds), so even small inefficiencies have a noticeable effect.

I had to make some changes to the fat Lock as well:

  • a few tweaks to make the no-contention code path faster
  • excessive hardcoded spinning replaced with adaptive spinning
  • exponential backoff on contended accesses to the lock state (see the sketch after this list)
  • fairness fixes, which became more important once the excessive spinning was addressed (the lock is unfair by design, but we should not allow one thread to starve everyone else forever)
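A minimal sketch of the adaptive-spinning-with-backoff idea (constants are assumptions for illustration, not the actual Lock internals): each failed attempt to CAS the lock state doubles the spin before the next attempt, which reduces cache-line contention among waiters.

using System;
using System.Threading;

static class BackoffSketch
{
    private const int MinSpins = 16;    // assumed starting spin budget
    private const int MaxSpins = 1024;  // assumed cap, so backoff stays bounded

    public static void Acquire(ref int state)
    {
        int spins = MinSpins;
        while (Interlocked.CompareExchange(ref state, 1, 0) != 0)
        {
            // Spin without writing, so other cores keep the cache line shared.
            Thread.SpinWait(spins);
            // Exponential backoff: contended threads retry progressively less often.
            spins = Math.Min(spins * 2, MaxSpins);
        }
    }

    public static void Release(ref int state) => Volatile.Write(ref state, 0);
}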

Related issues:

@ghost ghost commented Dec 12, 2022

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.


@VSadov VSadov (Member Author) commented Dec 12, 2022

Results from the same benchmark as in #68891.
cc: @adamsitnik

ConcurrentDictionary CreateAddAndClear<Int32>

Before the change:

|               Method |       Runtime | Size |      Mean |    Error |   StdDev |    Median |       Min |       Max | Ratio | RatioSD |    Gen0 |    Gen1 |   Gen2 | Allocated | Alloc Ratio |
|--------------------- |-------------- |----- |----------:|---------:|---------:|----------:|----------:|----------:|------:|--------:|--------:|--------:|-------:|----------:|------------:|
| ConcurrentDictionary |      .NET 8.0 |  512 |  32.76 us | 0.339 us | 0.301 us |  32.76 us |  32.24 us |  33.27 us |  1.00 |    0.00 |  6.3073 |  1.4335 |      - | 107.08 KB |        1.00 |
| ConcurrentDictionary | NativeAOT 8.0 |  512 | 339.93 us | 5.609 us | 4.972 us | 339.32 us | 334.65 us | 349.35 us | 10.38 |    0.12 | 11.7546 | 11.4679 | 1.7202 | 171.08 KB |        1.60 |

After:
|               Method |       Runtime | Size |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | RatioSD |   Gen0 |   Gen1 | Allocated | Alloc Ratio |
|--------------------- |-------------- |----- |---------:|---------:|---------:|---------:|---------:|---------:|------:|--------:|-------:|-------:|----------:|------------:|
| ConcurrentDictionary |      .NET 8.0 |  512 | 33.99 us | 0.336 us | 0.298 us | 33.98 us | 33.49 us | 34.54 us |  1.00 |    0.00 | 6.3073 | 1.4335 | 107.08 KB |        1.00 |
| ConcurrentDictionary | NativeAOT 8.0 |  512 | 36.77 us | 0.449 us | 0.398 us | 36.88 us | 36.23 us | 37.58 us |  1.08 |    0.02 | 6.3073 | 1.4335 | 107.08 KB |        1.00 |

We see big improvements in throughput and in allocations too (the thin lock is an allocation-free path).
We are closer, but still not on par with CoreCLR. As I mentioned, some codegen issues stand in the way and should be addressed separately.

Perf_Monitor benchmarks

Before the change:
|       Method |       Runtime |     Mean |     Error |    StdDev |   Median |      Min |      Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------- |-------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
|    EnterExit |      .NET 8.0 | 4.458 ns | 0.0293 ns | 0.0229 ns | 4.458 ns | 4.415 ns | 4.501 ns |  1.00 |    0.00 |         - |          NA |
|    EnterExit | NativeAOT 8.0 | 8.048 ns | 0.0152 ns | 0.0134 ns | 8.056 ns | 8.028 ns | 8.056 ns |  1.80 |    0.01 |         - |          NA |
|              |               |          |           |           |          |          |          |       |         |           |             |
| TryEnterExit |      .NET 8.0 | 5.861 ns | 0.3099 ns | 0.3445 ns | 5.677 ns | 5.591 ns | 6.508 ns |  1.00 |    0.00 |         - |          NA |
| TryEnterExit | NativeAOT 8.0 | 8.028 ns | 0.0000 ns | 0.0000 ns | 8.028 ns | 8.028 ns | 8.028 ns |  1.35 |    0.08 |         - |          NA |

After:
|       Method |       Runtime |     Mean |     Error |    StdDev |   Median |      Min |      Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------- |-------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
|    EnterExit |      .NET 8.0 | 4.439 ns | 0.0236 ns | 0.0197 ns | 4.444 ns | 4.415 ns | 4.472 ns |  1.00 |    0.00 |         - |          NA |
|    EnterExit | NativeAOT 8.0 | 5.019 ns | 0.0392 ns | 0.0367 ns | 5.017 ns | 4.989 ns | 5.103 ns |  1.13 |    0.01 |         - |          NA |
|              |               |          |           |           |          |          |          |       |         |           |             |
| TryEnterExit |      .NET 8.0 | 6.083 ns | 0.3853 ns | 0.4283 ns | 5.849 ns | 5.619 ns | 6.651 ns |  1.00 |    0.00 |         - |          NA |
| TryEnterExit | NativeAOT 8.0 | 5.164 ns | 0.0382 ns | 0.0357 ns | 5.161 ns | 5.103 ns | 5.218 ns |  0.85 |    0.06 |         - |          NA |

@VSadov VSadov (Member Author) commented Dec 12, 2022

Results from my microbenchmark: 1 billion lock/unlock iterations, reporting time in milliseconds (smaller is better):

=== New:
Fat:
7596
Thin:
5581
Fat:
7586
Thin:
5588
Fat:
7601

=== Old (there was really no thin/fat difference; all locks were fat):
Fat:
9861
Thin:
9844
Fat:
9857
Thin:
9826
Fat:
9848

=== CoreCLR:
Fat:
5287
Thin:
4772
Fat:
5332
Thin:
4794
Fat:
5279

Code for reference:

using System.Diagnostics;


namespace ConsoleApp12
{
    internal class Program
    {
        static Program l1 = new Program();          // gets a hash code below, so lock(l1) takes the fat path
        static readonly Program l2 = new Program(); // never hashed, so lock(l2) stays thin

        private const int iters = 1000000000;

        static void Main(string[] args)
        {
            l1 = new Program();
            // Storing a hash code in the object header forces subsequent locks
            // on l1 to use the fat lock.
            l1.GetHashCode();
            for (; ; )
            {

                System.Console.WriteLine("Fat:");
                FatLock();

                System.Console.WriteLine("Thin:");
                ThinLock();
            }

            //Fairness();
        }

        static void FatLock()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iters; i++)
            {
                lock (l1)
                {

                }
            }
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void ThinLock()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iters; i++)
            {
                lock (l2)
                {

                }
            }
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void Fairness()
        {
            object l = new object();

            const int n = 32;
            Task[] workers = new Task[n];

            for (int i = 0; i < n; i++)
            {
                workers[i] = Task.Run(() =>
                {
                    for (; ; )
                    {
                        lock (l)
                        {
                            Console.Write("{00} ", Environment.CurrentManagedThreadId);

                            int ticks = Environment.TickCount;
                            while (Environment.TickCount - ticks < 50)
                                Thread.SpinWait(1);
                        }
                    }
                });
            }

            Task.WaitAll(workers);
        }

    }
}

@@ -1,7 +1,12 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

EXTERN_C DECLSPEC_THREAD ThreadBuffer tls_CurrentThread;
#ifdef _MSC_VER
// a workaround to prevent tls_CurrentThread from becoming dynamically checked/initialized.
Member:

Can we keep extern "C" on this workaround, to avoid introducing name mangling into the asm code?

@VSadov VSadov (Member Author) Dec 12, 2022:

I just ported the workaround from a similar change in CoreClr. I am not sure I completely understand why it is needed.
If removing extern "C" is not a part of the fix, I guess, why not. I will try.

@VSadov VSadov (Member Author) Dec 12, 2022:

The fix works fine with extern "C", so I will add it and revert the mangling in asm

// a workaround to prevent tls_CurrentThread from becoming dynamically checked/initialized.
__declspec(selectany) __declspec(thread) ThreadBuffer tls_CurrentThread;
#else
EXTERN_C __thread ThreadBuffer tls_CurrentThread;
Member:

Does non-Windows pay the penalty for TLS init checks? (i.e., is a similar workaround not possible on non-Windows?)

@VSadov VSadov (Member Author) Dec 12, 2022:

From the discussion of the original fix (#33347), I gathered that even on Windows the workaround was not needed until some version of MSVC.
It is probably more a function of the C runtime (glibc, musl, etc.) than of the OS. It should be fairly easy to check what happens on glibc.
I also wonder if there are differences on arm64.

Member Author:

Linux/glibc does not require a similar fix. There are no checks when accessing TLS.

Here is the disassembly for reference.

Linux/glibc x64

System.Collections.Tests`::RhGetCurrentManagedThreadId():
    0x5555554d5db0 <+0>:  pushq  %rbp
    0x5555554d5db1 <+1>:  movq   %rsp, %rbp
    0x5555554d5db4 <+4>:  movq   %fs:0x0, %rax
    0x5555554d5dbd <+13>: leaq   -0x100(%rax), %rax
    0x5555554d5dc4 <+20>: movl   0xc0(%rax), %eax
    0x5555554d5dca <+26>: popq   %rbp
    0x5555554d5dcb <+27>: retq

Linux/glibc arm64

System.Collections.Tests`::RhGetCurrentManagedThreadId():
    0xaaaaaab87400 <+0>:  stp    x29, x30, [sp, #-0x10]!
    0xaaaaaab87404 <+4>:  mov    x29, sp
    0xaaaaaab87408 <+8>:  adrp   x0, 4173
    0xaaaaaab8740c <+12>: ldr    x0, [x0, #0xa30]
    0xaaaaaab87410 <+16>: nop
    0xaaaaaab87414 <+20>: nop
    0xaaaaaab87418 <+24>: mrs    x8, TPIDR_EL0
    0xaaaaaab8741c <+28>: add    x8, x8, x0
    0xaaaaaab87420 <+32>: ldr    w0, [x8, #0xc0]
    0xaaaaaab87424 <+36>: ldp    x29, x30, [sp], #0x10
    0xaaaaaab87428 <+40>: ret

MSVC (with the fix)

    return ThreadStore::RawGetCurrentThread()->GetManagedThreadId();
00007FF6A99D7AE0  mov         ecx,dword ptr [_tls_index (07FF6A9CB3A28h)]  
00007FF6A99D7AE6  mov         rax,qword ptr gs:[58h]  
00007FF6A99D7AEF  mov         edx,10h  
00007FF6A99D7AF4  mov         rax,qword ptr [rax+rcx*8]  
00007FF6A99D7AF8  mov         eax,dword ptr [rax+rdx+0C0h]  
}
00007FF6A99D7AFF  ret  

@@ -51,11 +51,11 @@ EXTERN_C NATIVEAOT_API void __cdecl RhSpinWait(int32_t iterations)
ASSERT(iterations > 0);

// limit the spin count in coop mode.
ASSERT_MSG(iterations <= 10000 || !ThreadStore::GetCurrentThread()->IsCurrentThreadInCooperativeMode(),
ASSERT_MSG(iterations <= 1024 || !ThreadStore::GetCurrentThread()->IsCurrentThreadInCooperativeMode(),
@VSadov VSadov (Member Author) Dec 12, 2022:

This change is just because we switched to SpinWait, which is calibrated to ~35 nsec per iteration.
The previous numbers were assumed in terms of the pre-Skylake "pause", or, simply put, were divided by 8 in the underlying APIs.
I also made the limit a power of 2; it is more convenient.

@@ -51,7 +52,7 @@ internal static class SyncTable
/// </summary>
#if DEBUG
// Exercise table expansion more frequently in debug builds
private const int InitialSize = 1;
private const int InitialSize = 2;
@VSadov VSadov (Member Author) Dec 12, 2022:

Valid indices start at 2 so that we can return 1 and 0 as true/false, with everything else meaning a sync block index for the slow path to retry with.
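A hypothetical sketch to make that encoding concrete (not the actual runtime signature):

internal static class AcquireResultSketch
{
    // One int encodes three outcomes of the fast path:
    //   0      => not acquired (false)
    //   1      => acquired (true)
    //   i >= 2 => a sync table index; the slow path retries against that entry
    public const int False = 0;
    public const int True = 1;

    public static bool IsSyncTableIndex(int result) => result >= 2;
}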

Member:

Nit: We can use -1 and 0 to return the result, and still start at 1.

private static int s_unusedEntryIndex = 2;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static ref Entry EntryRef(int i)
Member:

Are the bounds checks that expensive everywhere? I would rename it to UnsafeEntryRef and only use it in the two perf-critical places.

Member Author:

The bounds check adds some code to the caller and reads memory at an arbitrary distance from the location we need.
It makes a difference only in the two critical places, though. We can use the indexer elsewhere.
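A sketch of what such an unchecked accessor can look like (assuming the entries live in a managed array; Entry's fields here are illustrative, not the actual SyncTable layout):

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal struct Entry
{
    public object? Owner;   // illustrative fields only
    public int HashCode;
}

internal static class SyncTableSketch
{
    private static Entry[] s_entries = new Entry[64];

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static ref Entry UnsafeEntryRef(int i)
    {
        // No bounds check is emitted here; the caller must guarantee i is valid.
        return ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(s_entries), i);
    }
}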


fixed (byte* pRawData = &o.GetRawData())
fixed (MethodTable** ppMethodTable = &o.GetMethodTableRef())
Member:

Pinning the object using GetMethodTableRef is not generally ok, but it is ok here since it is very low-level ObjectHeader code.

@VSadov VSadov (Member Author) Dec 12, 2022:

Pinning the method table pointer saves computing the offset to the data; we are not interested in the object's data here. Yes, this is a very unusual scenario. As long as the byref is within the range of the object, it is all the same to the GC: it will "align" the byref to the containing object and pin that object.

I wish the object header belonged to the object, though. Then we would not need pinning at all.
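A sketch of the pattern being discussed, under the assumption (as in the NativeAOT ObjectHeader code) that the 4-byte header sits immediately before the MethodTable pointer at the start of the object; the RawData class is an illustrative stand-in for the internal GetRawData accessor:

using System.Runtime.CompilerServices;

internal static unsafe class HeaderSketch
{
    private sealed class RawData { public byte Data; }

    public static int ReadSyncBlockValue(object o)
    {
        // Any interior byref pins the whole containing object, so fixing the
        // first field (or the MethodTable slot) keeps the object from moving.
        fixed (byte* pRawData = &Unsafe.As<RawData>(o).Data)
        {
            // The object reference points at the MethodTable slot; the first
            // field starts one pointer after it.
            byte* pObject = pRawData - sizeof(nint);
            // Assumption: the 4-byte header immediately precedes the object.
            return *((int*)pObject - 1);
        }
    }
}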

@@ -1346,6 +1361,16 @@ COOP_PINVOKE_HELPER(uint64_t, RhCurrentOSThreadId, ())
return PalGetCurrentThreadIdForLogging();
}

COOP_PINVOKE_HELPER(int32_t, RhGetCurrentManagedThreadId, ())
Member:

Once/if we do the codegen work to make thread statics fast, this change will be a de-optimization and we will go back to managed thread static.

Member Author:

Yes, once threadstatics are faster, we can just use a threadstatic.

Member:

I am thinking that we should keep using regular threadstatics for this even if it means that we are leaving some perf on the table for now.

Member Author:

Using threadstatics makes the lock noticeably slower.

The straightforward way, with IdNone == 0 and without CurrentManagedThreadIdUnchecked, is 36% slower.
The -1 adjusted way, with IdNone == -1 and CurrentManagedThreadIdUnchecked, is 33% slower.

=== current:
Fat:
8031
Thin:
5684
Fat:
8034
Thin:
5680
Fat:
8000
Thin:
5810

=== use threadstatic without CurrentManagedThreadIdUnchecked

thin lock gets ~36% slower

Fat:
9122
Thin:
7796
Fat:
9135
Thin:
7766
Fat:
9119
Thin:
7813

=== use threadstatic with CurrentManagedThreadIdUnchecked and -1 
adjustment so that uninitialized looks like -1 

thin lock gets 33% slower

Fat:
8759
Thin:
7567
Fat:
8747
Thin:
7583
Fat:
8713
Thin:
7574

@VSadov VSadov (Member Author) Dec 15, 2022:

I am going to switch to the "threadstatic with CurrentManagedThreadIdUnchecked and -1" variant then.

It is still better than before this PR, and a possible gain of at least 33% should make improving threadstatics more interesting.

@jkotas jkotas (Member) commented Dec 12, 2022

Related issues:

One more codegen issue that can help this a bit: #63397

@VSadov VSadov (Member Author) commented Dec 12, 2022

One more codegen issue that can help this a bit: #63397

Right, there is no reason for a pinned local to be on the stack; it is just a matter of root reporting.

I experimented with using unpinned refs. That is obviously incorrect, but failures are rare enough to measure how much we would gain. It was not much - within noise.
It may matter more once we deal with the other issues, so I figured: let's push for the other issues first.
Since this came up and we already have an issue, I will add it to the list.

@@ -104,7 +104,7 @@ internal static int GetNewHashCode()
// where two threads consistently give out the same hash codes.
// Choice of multiplier guarantees period of 2**32 - see Knuth Vol 2 p16 (3.2.1.2 Theorem A).
t_hashSeed = t_hashSeed * multiplier + 1;
return t_hashSeed;
return t_hashSeed + Environment.TickCount;
Member:

Instead of all this, we can just call Random.Shared.Next().

private static int s_unusedEntryIndex = 2;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static ref Entry UnsafeEntryRef(int i)
Member:

I think it would be fine to use UnsafeEntryRef for GetHashCode too.

Member Author:

Reverted one too many.

@VSadov VSadov (Member Author) commented Dec 13, 2022

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment on lines 352 to 353
if (acquired)
Debug.Assert((_state & Locked) != 0);
Member:

Suggested change
if (acquired)
Debug.Assert((_state & Locked) != 0);
Debug.Assert(!acquired || (_state & Locked) != 0);

@@ -174,7 +176,7 @@ public ImmutableIdDispenser AllocateId(out int id)
}
}

public const int IdNone = 0;
public const int IdNone = -1;
Member:

Was there any actual problem with 0?

Zero is a tiny bit more efficient - the instructions to check for zero tend to have smaller encodings.

Member Author:

-1 is so that I could just fetch the ID unchecked on the critical paths and it would automatically go on a slow path if not yet initialized, since it would not fit into 16 bits.

Member:

Where is this trick used? I am not able to find it.

@VSadov VSadov (Member Author) Dec 14, 2022:

CurrentManagedThreadIdUnchecked does not check for initialization.

Not supposed to, unless I mixed something up.

Member:

OK, I see that now. Using -1 to signal the "not yet initialized" state will make it harder to switch back to thread statics once we get them optimized.

@VSadov VSadov (Member Author) Dec 14, 2022:

Depending on how thread statics end up working, I think we may just initialize the ID to the right value, if that does not entail checking "is initialized" all the time.
Or we could initialize the ID early enough, before it can possibly be used in any lock. That may already be the case, but I am not 100% sure and did not want to force such an invariant just yet.

Alternatively, we may apply -1 when accessing the value; then an uninitialized 0 will be seen as -1.

Member:

Thread statics are initialized to 0.

The current initialization to -1 in the unmanaged runtime is piggybacking on the "is initialized" check done by reverse pinvoke, which is the only way to enter managed code in NativeAOT. Once this is switched back to a managed thread static, the initialization to -1 will need the "is initialized" check all the time.

Alternatively we may do -1 when accessing the value.

Yes, I think it would be better.

{
// Use RhGetProcessCpuCount directly to avoid Environment.ProcessorCount->ClassConstructorRunner->Lock->Environment.ProcessorCount cycle
s_maxSpinCount = (RuntimeImports.RhGetProcessCpuCount() > 1) ? MaxSpinningValue : SpinningDisabled;
s_processorCount = RuntimeImports.RhGetProcessCpuCount();
Member:

Suggested change
s_processorCount = RuntimeImports.RhGetProcessCpuCount();
if (s_processorCount == 0)
s_processorCount = RuntimeImports.RhGetProcessCpuCount();

Avoid unnecessary writes of global state?

Member Author:

It would make sense if RhGetProcessCpuCount() were expensive. I guess we do not know, so we can do the check.

Member:

The potentially most expensive part is the cache thrashing caused by repeated writes.

Member Author:

Ah, I see, we also do not want to overwrite and invalidate the shared state every time another lock initializes its spin count.

@jkotas jkotas (Member) left a comment

Nice work!

@@ -246,11 +248,16 @@ public static void RecycleId(int id)
}
}

// We do -1 adjustment so that uninitialized result would be -1 and
Member:

The -1 trick can be local to the few places in ObjectHeader.cs that need it, like: if ((uint)(currentThreadID - 1) < (uint)SBLK_MASK_LOCK_THREADID). It does not need to be spread over multiple files like this.

Member Author:

Hmm, that would add an extra condition, though. Otherwise Release cannot tell whether the lock is owned by nobody or by a thread with an uninitialized thread ID.

@VSadov VSadov (Member Author) Dec 15, 2022:

When the uninitialized ID is -1, it is too big to own a thin lock, so we do not need to check whether the ID is initialized.
0 would match the unowned state.

Member:

ok, it can stay as is then.

@VSadov VSadov (Member Author) Dec 15, 2022:

I think Release can apply currentThreadID |= (currentThreadID - 1) >> 31 to get the same effect as if the default were -1.
If we store the ID in off-by-one form, we still need to do the "-1", so this would be just 2 extra operations, but the default can be 0.

Acquire can do if ((uint)(currentThreadID - 1) < (uint)SBLK_MASK_LOCK_THREADID) as you suggested.
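A small self-contained illustration of both tricks (the mask value and names are assumptions for the example): the unsigned compare rejects 0 (uninitialized) and out-of-range IDs in one branch, and the shift maps 0 to -1 so it can never match a stored owner ID.

internal static class ThreadIdTricks
{
    private const uint SBLK_MASK_LOCK_THREADID = 0xFFFF; // assumed 16-bit owner field

    public static bool FitsInThinLock(int currentThreadID)
    {
        // One unsigned compare covers two checks: ID != 0 (initialized) and
        // ID small enough to fit in the header's owner bits.
        return (uint)(currentThreadID - 1) < SBLK_MASK_LOCK_THREADID;
    }

    public static int NormalizeForRelease(int currentThreadID)
    {
        // (x - 1) >> 31 is all ones when x == 0 and zero when x > 0, so an
        // uninitialized ID of 0 becomes -1 and can never equal an owner ID.
        return currentThreadID | ((currentThreadID - 1) >> 31);
    }
}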

@VSadov VSadov (Member Author) commented Dec 15, 2022

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov VSadov (Member Author) commented Dec 15, 2022

Thanks!!!

@VSadov VSadov merged commit d7924cc into dotnet:main Dec 15, 2022
@VSadov VSadov deleted the thinLock branch December 15, 2022 18:56
@ghost ghost locked as resolved and limited conversation to collaborators Jan 14, 2023