
Revamp caching scheme in PoolingAsyncValueTaskMethodBuilder #55955

Merged (2 commits) Jul 20, 2021

Conversation

@stephentoub (Member) commented Jul 19, 2021

The current scheme caches one instance per thread in a ThreadStatic field, and then has a locked stack that all threads contend on. To avoid blocking a thread while accessing the cache, locking is done with TryEnter rather than Enter, simply skipping the cache if there is any contention. The locked stack is capped by default at ProcessorCount * 4 objects.
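To make the old scheme concrete, here is a minimal, hypothetical sketch of that shape of cache: a per-thread slot backed by a lock-protected stack, where TryEnter skips the shared stack entirely under contention. All names here are illustrative, not the actual runtime implementation.

```csharp
using System;
using System.Threading;

// Sketch of the OLD scheme: one [ThreadStatic] slot per thread, plus a
// global stack (capped at ProcessorCount * 4) guarded by a lock that is
// only acquired with TryEnter, so contention means skipping the cache.
internal sealed class OldStyleCache<T> where T : class, new()
{
    [ThreadStatic]
    private static T? t_instance;                 // one instance per thread

    private readonly T?[] _stack = new T?[Environment.ProcessorCount * 4];
    private int _count;
    private readonly object _lock = new();

    public T Rent()
    {
        T? inst = t_instance;
        if (inst != null) { t_instance = null; return inst; }

        // TryEnter rather than Enter: if another thread holds the lock,
        // don't block; just allocate instead.
        if (Monitor.TryEnter(_lock))
        {
            try
            {
                if (_count > 0)
                {
                    inst = _stack[--_count];
                    _stack[_count] = null;        // don't keep it alive
                    return inst!;
                }
            }
            finally { Monitor.Exit(_lock); }
        }
        return new T();                           // contention => allocate
    }

    public void Return(T inst)
    {
        if (t_instance == null) { t_instance = inst; return; }

        if (Monitor.TryEnter(_lock))
        {
            try
            {
                if (_count < _stack.Length) _stack[_count++] = inst;
            }
            finally { Monitor.Exit(_lock); }
        }
        // contention or a full stack => drop the object for the GC
    }
}
```

Under heavy load, many threads hit `Monitor.TryEnter` at once, so most of them fail to take the lock and fall through to allocation, which is exactly the behavior the PR removes.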

The new scheme is simpler: one instance per thread, one instance per core. This ends up meaning fewer objects may be cached, but it also almost entirely eliminates contention between threads trying to rent/return objects. As a result, under heavy load it can actually do a better job of using pooled objects as it doesn't bail on using the cache in the face of contention. It also reduces concerns about larger machines being more negatively impacted by the caching. Under lighter load, since we don't cache as many objects, it does mean we may end up allocating a bit more, but generally not much more (and the size of the object we do allocate is a reference-field smaller).
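The new shape can be sketched in the same style: a per-thread slot plus one slot per core, with no lock at all. This is a simplified illustration under my own assumptions (e.g. the use of `Thread.GetCurrentProcessorId()` and a plain write on return), not the exact runtime code.

```csharp
using System;
using System.Threading;

// Sketch of the NEW scheme: one [ThreadStatic] slot per thread, plus one
// slot per core, with lock-free rent/return instead of a locked stack.
internal sealed class NewStyleCache<T> where T : class, new()
{
    [ThreadStatic]
    private static T? t_instance;                          // per-thread slot

    private readonly T?[] _perCore = new T?[Environment.ProcessorCount];

    public T Rent()
    {
        T? inst = t_instance;
        if (inst != null) { t_instance = null; return inst; }

        // Fall back to the slot for the core we're currently running on.
        // Interlocked.Exchange makes a racing rent safe without a lock.
        int core = Thread.GetCurrentProcessorId() % _perCore.Length;
        return Interlocked.Exchange(ref _perCore[core], null) ?? new T();
    }

    public void Return(T inst)
    {
        if (t_instance == null) { t_instance = inst; return; }

        // A plain write suffices for a cache: if two threads race, one
        // returned object is simply dropped and collected.
        int core = Thread.GetCurrentProcessorId() % _perCore.Length;
        _perCore[core] = inst;
    }
}
```

Note the trade-off the description calls out: at most `ThreadCount + ProcessorCount` objects are retained instead of `ThreadCount + ProcessorCount * 4`, but no rent or return ever blocks or bails out due to contention.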

These results are from my 12-logical-core box:

| Method | Toolchain | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Allocated |
|---|---|---|---|---|---|---|---|---|
| NonPooling | \main\CoreRun.exe | 4.314 s | 0.0795 s | 0.1005 s | 1.00 | 1933000.0000 | 483000.0000 | 11,800,056 KB |
| NonPooling | \pr\corerun.exe | 4.284 s | 0.0188 s | 0.0167 s | 0.99 | 1933000.0000 | 483000.0000 | 11,800,063 KB |
| Pooling | \main\CoreRun.exe | 3.010 s | 0.0452 s | 0.0423 s | 1.00 | - | - | 323 KB |
| Pooling | \pr\corerun.exe | 2.874 s | 0.0452 s | 0.0423 s | 0.95 | - | - | 203 KB |
```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private const int Concurrency = 256;
    private const int Iters = 100_000;

    [Benchmark]
    public Task NonPooling()
    {
        return Task.WhenAll(from i in Enumerable.Range(0, Concurrency)
                            select Task.Run(async delegate
                            {
                                // j rather than i: redeclaring i here would
                                // conflict with the query's range variable.
                                for (int j = 0; j < Iters; j++)
                                    await A().ConfigureAwait(false);
                            }));

        static async ValueTask A() => await B().ConfigureAwait(false);

        static async ValueTask B() => await C().ConfigureAwait(false);

        static async ValueTask C() => await D().ConfigureAwait(false);

        static async ValueTask D() => await Task.Yield();
    }

    [Benchmark]
    public Task Pooling()
    {
        return Task.WhenAll(from i in Enumerable.Range(0, Concurrency)
                            select Task.Run(async delegate
                            {
                                for (int j = 0; j < Iters; j++)
                                    await A().ConfigureAwait(false);
                            }));

        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask A() => await B().ConfigureAwait(false);

        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask B() => await C().ConfigureAwait(false);

        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask C() => await D().ConfigureAwait(false);

        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask D() => await Task.Yield();
    }
}
```

@stephentoub stephentoub added this to the 6.0.0 milestone Jul 19, 2021
@ghost commented Jul 19, 2021

Tagging subscribers to this area: @dotnet/area-system-threading-tasks
See info in area-owners.md if you want to be subscribed.

Author: stephentoub · Labels: area-System.Threading.Tasks, tenet-performance · Milestone: 6.0.0
@adamsitnik (Member) left a comment
LGTM!

> It also reduces concerns about larger machines being more negatively impacted by the caching

To validate that, you could use this template, modify it, and run the benchmarks with and without your changes on the AMD (32-core), ARM (48-core), and Mono (56-core) machines.

@stephentoub stephentoub merged commit 776053f into dotnet:main Jul 20, 2021
@stephentoub stephentoub deleted the tlsprocpool branch July 20, 2021 22:06
@ghost ghost locked as resolved and limited conversation to collaborators Aug 19, 2021