Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Make "async ValueTask/ValueTask<T>" methods ammortized allocation-free #26310

Merged
merged 1 commit into from
Oct 24, 2019

Conversation

stephentoub
Copy link
Member

@stephentoub stephentoub commented Aug 22, 2019

Today async ValueTask/ValueTask<T> methods use builders that special-case the synchronously completing case (to just return a default(ValueTask) or new ValueTask<T>(result)) but defer to the equivalent of async Task/Task<T> for when the operation completes asynchronously. This, however, doesn't take advantage of the fact that value tasks can wrap arbitrary IValueTaskSource/IValueTaskSource<T> implementations.

I've had this sitting on the shelf for a while, but finally cleaned it up. The first three commits here are just moving around existing code. The last two commits are the meat of this change. This changes AsyncValueTaskMethodBuilder and AsyncValueTaskMethodBuilder<T> to use pooled IValueTaskSource/IValueTaskSource<T> instances, such that calls to an async ValueTask/ValueTask<T> method incur 0 allocations as long as there's a cached object available.

I've marked this as a draft and work-in-progress for a few reasons:

  1. There's a breaking change here, in that while we say/document that ValueTask/ValueTask<T>s are more limited in what they can be used for, nothing in the implementation actually stops a ValueTask that was wrapping a Task from being used as permissively as Task, e.g. if you were to await such a ValueTask multiple times, it would happen to work, even though we say "never do that". This change means that anyone who was relying on such undocumented behaviors will now be broken. I think this is a reasonable thing to do in a major release, but I also want feedback and a lot of runway on it. There's the potential to make it opt-in (or opt-out) as well, but that will also non-trivially complicate the implementation.
  2. Policy around pooling. This is always a difficult thing to tune. Right now I've chosen a policy that limits the number of pooled objects per state machine type to an arbitrary multiple of the processor count, and that tries to strike a balance between contention and garbage by using a spin lock and if there's any contention while trying to get or return a pooled object, the cache is ignored. We will need to think hard about what policy to use here. It's also possible it could be tuned per state machine, e.g. by having an attribute that's looked up via reflection when creating the cache for a state machine, but that'll add a lot of first-access overhead to any async ValueTask/ValueTask<T> method.
  3. Perf validation.

cc: @kouvel, @tarekgh, @benaadams

@jkotas
Copy link
Member

jkotas commented Aug 22, 2019

What do we expect the performance gains from this to be - for primary perf metrics like requests per second for something real looking? (I understand that this will reduce the number of GCs by moving the cost elsewhere.)

@stephentoub
Copy link
Member Author

What do we expect the performance gains from this to be - for primary perf metrics like requests per second for something real looking?

Don't know yet. I'm hoping by putting this up as a draft folks will try it out :) I'm also going to try it on a few workloads. I won't be merging it until it proves valuable.

@benaadams
Copy link
Member

Will give it a whirl :)

@benaadams
Copy link
Member

Isn't very happy

Unhandled exception. System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.OnCompleted(Action`1 continuation, Object state, Int16 token, ValueTaskSourceOnCompletedFlags flags) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\Tasks\Sources\ManualResetValueTaskSourceCore.cs:line 163
   at System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1.ValueTaskStateMachineBox.OnCompleted(Action`1 continuation, Object state, Int16 token, ValueTaskSourceOnCompletedFlags flags) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\AsyncValueTaskMethodBuilderT.cs:line 309
   at System.Runtime.CompilerServices.ValueTaskAwaiter`1.System.Runtime.CompilerServices.IStateMachineBoxAwareAwaiter.AwaitUnsafeOnCompleted(IAsyncStateMachineBox box) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\ValueTaskAwaiter.cs:line 188
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted[TAwaiter,TStateMachine](TAwaiter& awaiter, TStateMachine& stateMachine) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\AsyncTaskMethodBuilderT.cs:line 88
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\Tasks\Task.cs:line 1910
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute() in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\ThreadPool.cs:line 880
   at System.Threading.ThreadPoolWorkQueue.Dispatch() in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\ThreadPool.cs:line 700

@stephentoub
Copy link
Member Author

Isn't very happy

What is this running?

@benaadams
Copy link
Member

Isn't very happy

What is this running?

Just trying to debug into it as the stack trace clearly isn't very helpful :)

@halter73
Copy link
Member

e.g. if you were to await such a ValueTask multiple times, it would happen to work, even though we say "never do that". This change means that anyone who was relying on such undocumented behaviors will now be broken.

I say it's better to make this change sooner rather than later. I like that this change might teach people the hard way not to await the same ValueTask multiple times since it will no longer just Pipes and other IValueTaskSource-based ValueTasks that break.

Do you expect there will be any common workloads where pooling the boxes hurts perf?

@stephentoub
Copy link
Member Author

Do you expect there will be any common workloads where pooling the boxes hurts perf?

It's hard to say.

Using pooling is based on the assumption that the overheads associated with pooling (e.g. synchronization to access the pool, any cache effects that come from using the pool, etc.) are less than the overheads associated with allocation+GC. That is not always the case, but it's also hard to predict.

There's also the fact that this is changing the underlying implementation, so it's not just pooling vs non-pooling, but ValueTask wrapping Task vs wrapping IValueTaskSource, different code paths being taken internally, different synchronization used to hook up a continuation, etc. Some of that is just tuning, but some of it will be trade-offs as well. In theory this approach could have less overhead even if pooling weren't used at all (just using a custom IValueTaskSource) because it doesn't need to accomodate things like multiple continuations, but we're also talking minutia at that point.

In the end, I'll want us to vet this on real workloads that demonstrate real value before moving ahead with it.

@mburbea
Copy link

mburbea commented Aug 22, 2019

@mgravell has a pretty interesting library that does something very similar to this
https://github.com/mgravell/PooledAwait

@benaadams
Copy link
Member

Revisiting https://github.com/dotnet/coreclr/issues/21612 again 😉

Since ValueTaskStateMachineBox is a subtype would just calling it StateMachineBox work (as is done for many Enumerators)?

So rather than
AsyncValueTaskMethodBuilder'1.ValueTaskStateMachineBox.OnCompleted
It is
AsyncValueTaskMethodBuilder'1.StateMachineBox.OnCompleted

However that's not too important 😄

@benaadams
Copy link
Member

benaadams commented Aug 22, 2019

Issue I'm having I think is in websockets; as its specifically commented about it https://github.com/dotnet/corefx/blob/ba9f27788bd99a29e5d832531f0ee86d8eae4959/src/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.netcoreapp.cs#L30-L31

// WARNING: This code is only valid because ReceiveAsyncPrivate returns a ValueTask that wraps a T or a Task.
// If that ever changes where ReceiveAsyncPrivate could wrap an IValueTaskSource, this must also change.

None of the other ValueTasks seem to be having issues

@stephentoub
Copy link
Member Author

issue I'm having I think is in websockets

Yup, I wrote both that code and comment :) Shame on me. Will change it.

@mgravell
Copy link
Member

mgravell commented Aug 22, 2019

@mburbea I would be genuinely delighted if I could nuke that lib in entirety because the feature has been implemented in the core lib by folks who know better than me what they're doing.

@benaadams
Copy link
Member

256 websocket connections ~30sec

Pre
image

Post
image

Might still want to do "Amortize WebSocket.EnsureBufferContainsAsync" dotnet/corefx#39455 not sure why its not really effected by this change; however it does address the harder to amortize async methods.

Will do some benchmarks

@stephentoub
Copy link
Member Author

not sure why its not really effected by this change

Oh, simpler explanation:

private async Task EnsureBufferContainsAsync

;-)

@benaadams
Copy link
Member

Yep, changing entirely fixes it
image

@benaadams
Copy link
Member

benaadams commented Aug 22, 2019

I'm liking this very much. Even with its low(ish) pool limit the allocation growth is low

200 websocket connections ~30sec

image

1000 websocket connections ~30sec

image

And the cost for renting and returning boxes seems fairly minimal

@stephentoub
Copy link
Member Author

I'm liking this very much

Thanks. Does it have a measurable impact on your throughout?

@benaadams
Copy link
Member

benaadams commented Aug 22, 2019

The change in before and after allocations is huge 😄

1000 websocket connections ~30sec

Pre: 1.36M objects allocated (245MB)

image

Post: 209K objects allocated (20MB)

image

@benaadams
Copy link
Member

Does it have a measurable impact on your throughout?

Yes(ish); unfortunately (or fortunately), it moves it into a realm beyond what I can load test maximums accurately with my current setup as I can't generate enough load. Its also highlighted some other hot spots in our own code.

While I can't currently quantify that (though will look into it more); more interestingly; it also means for the same number of GCs and RAM usage I can pack the containers at least x4 tighter when they aren't under maximum load.

@benaadams
Copy link
Member

benaadams commented Aug 23, 2019

The overhead of GetOrCapture is less than capturing the ExecutionContext

image

Well some are more than others; but is a fairly insignificant cost

image

ReturnBox is a little more; but fairly small

image

@benaadams
Copy link
Member

@mgravell are you in a position to give this a try for your workloads? Note it only applies to async ValueTask and ValueTask (not async Task)

@benaadams
Copy link
Member

benaadams commented Aug 23, 2019

@jkotas
Copy link
Member

jkotas commented Oct 21, 2019

Similar caches were source of troubles in the unmanaged threadpool in the past. We ended up with per-processor caches, but they are still sub-optimal. The small memory blocks will lose their affinity to the right core over time and so one ends up with false sharing or cross-NUMA node memory access. The GC allocated memory avoids these issues.

Having said this, I am fine with experimenting with this to see whether somebody can demonstrate measurable throughput improvement thanks to the caching.

@jkotas
Copy link
Member

jkotas commented Oct 21, 2019

whether somebody can demonstrate measurable throughput improvement thanks to the caching

If there is a workload that shows measurable throughput improvement, I would like the GC team to take a look at the GC traces to validate that it is not hitting a GC tuning bug.

@jkotas
Copy link
Member

jkotas commented Oct 21, 2019

Another interesting metric to look at is code size. Benchmark: https://gist.githubusercontent.com/jkotas/c3913ac6a98b3474b888926aa9066dd6/raw/95c548546d3e7d183e028348a96a50181c1546ed/manytasks.cs. Native code size bytes from perfview JIT stats:

baseline: 2,603,033 bytes
new / off: 2,659,316 bytes (2% regression)
new / on: 2,810,236 bytes (8% regression)

@stephentoub
Copy link
Member Author

stephentoub commented Oct 21, 2019

~5.5% throughput regression when the cache is off

That case should be unaffected by parallelism. If you see opportunities in this PR to address that, I'd love to hear about it.

I also put your example into Benchmark.NET form:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
using System;
using System.Threading.Tasks;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    [Benchmark]
    public void Parallel()
    {
        var tasks = new Task[Environment.ProcessorCount * 4];
        for (int i = 0; i < tasks.Length; i++) tasks[i] = Work();
        Task.WaitAll(tasks);
    }

    static async ValueTask<int> Step(int i)
    {
        if (i >= 999) await Task.Delay(1);
        else await Task.Yield();
        return i + 1;
    }

    static async Task Work()
    {
        var r = new Random();
        for (int i = 0; i < 100000; i++)
        {
            await Step(r.Next(1000));
        }
    }
}

and I get:

Method Toolchain Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
Parallel \new\corerun.exe 1.685 s 0.0161 s 0.0150 s 1.00 99000.0000 2000.0000 - 586.7 MB
Parallel \old\corerun.exe 1.692 s 0.0218 s 0.0204 s 1.00 99000.0000 2000.0000 - 586.7 MB

Maybe there's a machine-specific aspect to it? (There's also both randomness and "saturate the machine" parallelism here, so there's going to be variability.)

Note that the goal is for there to be 0 perf impact on an example like yours when the cache is off: it's meant to use exactly the same infrastructure as is used today with as little ceremony as we can muster to enable switching on the cache optionally, though there are a few tweaks to try to improve ValueTask perf even in the case where the cache is off.

Also note that a large portion of the time spent in your example is in the Task.Delay(1), which on Windows is going to incur an ~15ms delay. If I delete that part, I get:

Method Toolchain Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Parallel \new\corerun.exe 541.3 ms 8.09 ms 7.57 ms 0.96 0.02 92000.0000 2000.0000 - 549.34 MB
Parallel \old\corerun.exe 562.8 ms 4.99 ms 4.67 ms 1.00 0.00 92000.0000 2000.0000 - 549.34 MB

~2% throughput regression and more volatile performance when the cache is on

My machine isn't as quiet as I'd like for a parallelism test, but I just ran it three times (without the Task.Delay, which dominates) and got:

Method Toolchain Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Parallel \new\corerun.exe 538.8 ms 10.68 ms 14.26 ms 0.97 0.03 46000.0000 1000.0000 - 275.13 MB
Parallel \old\corerun.exe 556.5 ms 8.63 ms 8.08 ms 1.00 0.00 92000.0000 2000.0000 - 549.34 MB
Method Toolchain Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Parallel \new\corerun.exe 547.6 ms 10.70 ms 16.97 ms 0.95 0.02 47000.0000 1000.0000 - 280.52 MB
Parallel \old\corerun.exe 571.8 ms 2.94 ms 2.75 ms 1.00 0.00 92000.0000 2000.0000 - 549.34 MB
Method Toolchain Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Parallel \new\corerun.exe 547.6 ms 10.83 ms 22.13 ms 0.98 0.03 49000.0000 1000.0000 - 296.21 MB
Parallel \old\corerun.exe 556.9 ms 5.75 ms 5.38 ms 1.00 0.00 92000.0000 2000.0000 - 549.34 MB

Putting the Task.Delay back in I again saw basically no difference (in throughput... obviously the allocation is much different):

Method Toolchain Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Parallel \new\corerun.exe 1.922 s 0.0383 s 0.0790 s 1.00 0.05 43000.0000 2000.0000 - 257.68 MB
Parallel \old\corerun.exe 1.924 s 0.0379 s 0.0729 s 1.00 0.00 99000.0000 3000.0000 - 586.73 MB

new / off: 2,659,316 bytes (2% regression)

There's a layer of helpers this is going through in order to enable the opt-in switch. If we were to switch to an always-on model, I'd expect this to go away.

new / on: 2,810,236 bytes (8% regression)

Every async ValueTask or async ValueTask<T> method is now referencing two different sets of builders, and every async ValueTask{<T>}'s builder now has the cache stuff as well. If we were to switch to an always-on model, I'd hope a good chunk of this would go away.

There's another option as well. Right now the switch flips between doing exactly what we do today (async completions are backed by Task instances) and using the cache (async completions are backed by pooled IValueTaskSource instances). I could also make it so that the switch only controls whether the cache is used, i.e. always use the IValueTaskSource instances, but if the switch is off, behave as if no instances could ever be retrieved from the cache (and don't try to return anything to it). The upside of that is async ValueTask{<T>} methods wouldn't pull in as much as they do from Task<T>, since they don't need to utilize it. The downside is the visible there-but-not-supported behaviors that Task ends up surfacing through ValueTask (e.g. .GetAwaiter().GetResult() working on incomplete ValueTasks even though we say that's not allowed) wouldn't be configurable with the switch.

Having said this, I am fine with experimenting with this to see whether somebody can demonstrate measurable throughput improvement thanks to the caching.

I'm having trouble rationalizing that with your stats and comments about GC tuning, as it sounds like your opinion is that a) this shouldn't be necessary, and b) if it is it's because of a GC perf bug that should be fixed... and then see (a). 😉

Is your preference I just close this PR?

@jkotas
Copy link
Member

jkotas commented Oct 21, 2019

The occasional Task.Delay is important in my benchmarks because of it triggers thread movement from one core to a different core that happens in real workloads. There are other ways to achieve the same thing. I have not seen a significant time spent in the Task.Delay on my dev box (32 hyperthreading / 16 physical core Xeon E5). Most of the time is spent in the threadpool enqueue. Commenting out the Task.Delay changes the overall performance by a few percent at most on my machine.

I know there is a lot of interest in building caches like these because of single-minded focus on GC allocations. I would like to see a more data on whether they are worth it. It is why I have said that I am fine with experimenting with this. This interest in building caches like these won't go away until we have this data.

Personally, I believe that this cache should not be needed as long as the GC is working properly (and the cache likely hurts performance on average in real workloads). But I do not have data to prove it.

@benaadams
Copy link
Member

In netcoreapp3.0 AsyncStateMachineBoxes are the only real source of allocations in our app:

image

For a very light load (0% CPU), they allocate 127MB every 1.5minutes and cause a GC

image

With the app stabilising around 170MB working set, 420MB Commit size (though still creeping up slowly)

image

Will report back changes, in this PR

@benaadams
Copy link
Member

Distracted by trying to find out what all the native allocations are

@benaadams
Copy link
Member

benaadams commented Oct 22, 2019

Raised question on the the 12MB of Jit allocations hanging around: https://github.com/dotnet/coreclr/issues/27357

syncblocks and threads (and threadstatics for those threads) it looks like later in the execution

image

image

private static int GetPoolAsyncValueTasksLimitValue() =>
int.TryParse(Environment.GetEnvironmentVariable("DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKSLIMIT"), out int result) && result > 0 ?
result :
Environment.ProcessorCount * 4; // arbitrary default value
Copy link
Member

@benaadams benaadams Oct 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit low
suggestion~~ ~~ Environment.ProcessorCount * 32; // arbitrary default value~~ ~~

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, have same allocation profile with 4 and 32.

@benaadams
Copy link
Member

Roughly same period with same load

image

image

97MB working set, 355MB Commit size though it keeps going up (likely due to the ValueTaskSourceAsTask in Websocket.

image

@benaadams
Copy link
Member

benaadams commented Oct 22, 2019

High load (~1.5 mins 4 cores 8 HT)

set DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=0

image

image

set DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=1
set DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKSLIMIT=256

image

image

@benaadams
Copy link
Member

Assuming the ValueTask+ValueTaskSourceAsTask<ValueWebSocketReceiveResult>* in Websocket can be avoided; perhaps by inverting the dependency of Receive and Close (and making Close a Tcs or similar)... (@stephentoub could work?)

With this change off, running under load I quickly hit 1GB working set (less than 30 secs):

image

after which it doesn't increase but will start cycling through GCs. Can cap it lower with the new options in .NET 3.0+ but that will mean it GCs faster.

With DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=1 I'm somewhere very different:

image

Both have around ~275MB in actual use

image

But the original has 1GB allocated to the program and is incurring continuous GCs; the other aside from the point at the top* only has 291MB allocated to the program to do it and has no GCs; and yet both are doing heavy async network work.

The big win here is the ability not just to reduce GC allocations but to move to a zero alloc pauseless stable state; while using the standard framework methods, were the GC never needs to run and the working set remains small. One aspect of this is it allows tighter packing of processes/containers etc.

Another aspect is: like intrinsics it opens up a new set of problems for .NET Core to be applicable for (not disrespecting the amazing work the GC does); as there are areas where people reject the idea of a GC needing to periodically run.

@jkotas
Copy link
Member

jkotas commented Oct 22, 2019

Are you able to measure peak RPS that your server can handle? If yes, do the extra GCs make measurable impact on the peak RPS (GC limited to 300MB vs. this thing)?

Environment.ProcessorCount * 32

This looks like a very large limit. It means you can easily have 50kB - 100kB in caches for each ValueTask used in your program.

people reject the idea of a GC needing to periodically run.

These people should use Rust. .NET is not going to be a success story for them.

@GSPP
Copy link

GSPP commented Oct 22, 2019

@benaadams what kind of pause targets do your workloads have? How rare do pauses have to be? Essentially zero?

Games have a lot of trouble with GC pauses as well. Attaining no pauses requires an awkward coding style. A 60Hz framerate means a 16ms budget per frame. I understand you need a small heap to survive a single G0 collection with this.

My main point: Maybe we could best solve this pause issue with low-pause GC features instead of running after every possible source of garbage. G2 pauses are (in my experience) very bearable these days because it's all background.

I have a production web site with a 1GB heap. I see 10ms G2 pauses but I see 100ms G0 and G1 pauses. So those have become more of a problem now. @Maoni0 said something similar in her recent talk.

@benaadams would you be satisfied if the GC was able to cap all pauses at say 10ms? Would there still be a need to limit their frequency?

@benaadams
Copy link
Member

Environment.ProcessorCount * 32

This looks like a very large limit. It means you can easily have 50kB - 100kB in caches for each ValueTask used in your program.

Retested under high load, end up with same allocation profile with 4 or 32 so retract my increase; for lower loads end up with more allocations (maybe double); which is likely due to more async methods being simultaneously in a wait state. Which is why I originally suggested the increase. Sticking with 4 and allowing it to be configurable (as now) is probably the best of both worlds.

Games have a lot of trouble with GC pauses as well. Attaining no pauses requires an awkward coding style.

Its not very awkward anymore with .NET Core 3.0 and C#8; certainly much less awkward than with rust's borrow checker. The aim is not for the purity of no allocations, which is indeed more problem than its worth, but long sustained periods of no allocations; and this is also the approach many games take where when and where allocations happen is constrained and controlled.

This works quite well in .NET Core 3.0; except for async calls; which is intensely awkward e.g. see code changes in dotnet/corefx#39455, dotnet/aspnetcore#11942 which are fairly straightforward async methods and all the methods in the BCL need to be changed for this to work or require re-implementation externally to make non-allocating versions.

Async is also problematic as the statemachines are sat on for a while; in that they generally take some time and form long chains of statemachines, so when one propergates a GC generation it is many.

The advantage of this change is no special code is required by the implementer and it works for their methods too.

would you be satisfied if the GC was able to cap all pauses at say 10ms? Would there still be a need to limit their frequency?

That would definitely be a useful tool; much like TryStartNoGCRegion but the GC work still needs to be done and with I/O becoming heavily async focused it would be very useful if these unavoidable allocations could be avoided so GC wasn't even involved for the most part (e.g. trying to maintain a 16ms/11ms tick rate while network I/O is on going)

@benaadams
Copy link
Member

If yes, do the extra GCs make measurable impact on the peak RPS (GC limited to 300MB vs. this thing)?

Will investigate further

This looks like a very large limit

Current limit looks ok

in caches for each ValueTask used in your program.

Could drop the caches on a GC as per Gen2GcCallback? So infrequent async statemachines don't get cached forever?

@benaadams
Copy link
Member

Another interesting metric to look at is code size.

baseline: 2,603,033 bytes
new / off: 2,659,316 bytes (2% regression)
new / on: 2,810,236 bytes (8% regression)

Could guard it with something that's dropped from R2R like Avx2.IsSupported (though something better formalised as not being included for R2R would be better); then the caches would only come into play at Jit Tier1 time?

Today `async ValueTask/ValueTask<T>` methods use builders that special-case the synchronously completing case (to just return a `default(ValueTask)` or `new ValueTask<T>(result))` but defer to the equivalent of `async Task/Task<T>` for when the operation completes asynchronously. This, however, doesn't take advantage of the fact that value tasks can wrap arbitrary `IValueTaskSource/IValueTaskSource<T>` implementations.

This commit gives `AsyncValueTaskMethodBuilder` and `AsyncValueTaskMethodBuilder<T>` the ability to use pooled `IValueTaskSource/IValueTaskSource<T>` instances, such that calls to an `async ValueTask/ValueTask<T>` method incur 0 allocations (ammortized) as long as there's a cached object available.

Currently the pooling is behind a feature flag, requiring opt-in via the DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS environment variable (setting it to "true" or "1").  This is done for a few reasons:
- There's a breaking change here, in that while we say/document that `ValueTask/ValueTask<T>`s are more limited in what they can be used for, nothing in the implementation actually stops a `ValueTask` that was wrapping a `Task` from being used as permissively as `Task`, e.g. if you were to `await` such a `ValueTask` multiple times, it would happen to work, even though we say "never do that". This change means that anyone who was relying on such undocumented behaviors will now be broken. I think this is a reasonable thing to do in a major release, but I also want feedback and a lot of runway on it.
- Policy around pooling. Pooling is always a difficult thing to tune. Right now I've chosen a policy that limits the number of pooled objects per state machine type to an arbitrary multiple of the processor count, and that tries to strike a balance between contention and garbage by using a spin lock and if there's any contention while trying to get or return a pooled object, the cache is ignored. We will need to think hard about what policy to use here. It's also possible it could be tuned per state machine, e.g. by having an attribute that's looked up via reflection when creating the cache for a state machine, but that'll add a lot of first-access overhead to any `async ValueTask/ValueTask<T>` method.  For now, it's tunable via the `DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKSLIMIT` environment variable, which may be set to the maximum number of pooled objects desired per state machine.
- Performance validation.  Preliminary numbers show that this accomplishes its goal, having on-par throughput with the current implementation but with significantly less allocation.  That needs to be validated at scale and across a variety of workloads.
- Diagnostics.  There are several diagnostics-related abilities available for `async Task/Task<T>` methods that are not included here when using `async ValueTask/ValueTask<T>` when pooling.  We need to evaluate these (e.g. tracing) and determine which pieces need to be enabled and which we're fine omitting.

Before shipping .NET 5, we could choose to flip the polarity of the switch (making it opt-out rather than opt-in), remove the fallback altogether and just have it be always on, or revert this change, all based on experimentation and data we receive between now and then.
@stephentoub
Copy link
Member Author

stephentoub commented Oct 24, 2019

Could guard it with something that's dropped from R2R

Maybe, but that would make certain aspects of this harder and potentially more expensive. Right now I get away with things like Unsafe.As because I know everything will be either type A or type B, per the readonly static that never changes. A fallback like I think you're suggesting would mean I could no longer trust that and would need to come up with an alternative.

It might be reasonable to still do something like you suggest, but use it as a mechanism to impact cache size, e.g. only start caching boxes after N calls to the method. That would help avoid heap object bloat for rarely used methods, but it wouldn't help with the aforementioned code size issues (and could be addressed in other ways, like dropping all boxes on gen2 gcs). The best thing we can do for code size I think is decide to either delete this change or delete the fallback, so that we're not paying double the code for every such method.

@stephentoub
Copy link
Member Author

Given the feedback, the desire for more data, and the agreement that this is a reasonable way to get it and experiment more, I'm going to go ahead and merge this. I'll open an issue up track follow-ups for .NET 5. Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.