Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine what to do with "async ValueTask" caching #13633

Closed
stephentoub opened this issue Oct 24, 2019 · 20 comments · Fixed by #50116
Closed

Determine what to do with "async ValueTask" caching #13633

stephentoub opened this issue Oct 24, 2019 · 20 comments · Fixed by #50116

Comments

@stephentoub
Copy link
Member

PR dotnet/coreclr#26310 added in an experimental feature, guarded by an environment variable feature flag, that uses cached objects behind the scenes to implement async ValueTask and async ValueTask<T> methods. For .NET 5, we need to decide how to proceed with this feature:

  1. Delete it
  2. Keep it as opt-in
  3. Keep it as opt-out
  4. Always on (delete the fallback)

We need more data to decide how to proceed, and in particular:

  • Are there workloads where it yields a significant performance benefit when it's enabled?
  • Are there workloads where it measurably harms performance when it's enabled?
  • What kind of impact does it have on code size in a representative app?
  • What is a good strategy to use for the employed cache?

If we decide to keep it, and especially if we decide to keep it as opt-out or always-on, we also need to validate diagnostics, in particular tracing to ensure we're not regressing anything impactful.

Functionally, there's also a behavior change associated with it. While ValueTask/ValueTask<T> are documented to only be used in very specific ways, the fact that instances produced by async methods before .NET 5 were backed by Task meant that you could get away with violating the contract. This change generally ends up requiring that the contract be enforced. We would want to ensure we had good analyzers in place to help catch any misuse: https://github.com/dotnet/corefx/issues/40739.

CALL TO ACTION:

Please try out the bits with your app and share your results: throughput, memory load, etc.

To enable the feature, set the DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS environment variable to either 1 or true.

This will only impact async ValueTask and async ValueTask<T> methods, so you may also want to look through your code to switch some internal and private async Task/async Task<T> methods to instead use ValueTask/ValueTask<T>... just make sure to only do so when the consumers abide by the rules, which is that an instance must only be consumed once: if callers are doing anything other than directly awaiting the result of the method, be careful.

@stephentoub
Copy link
Member Author

cc: @benaadams, @mgravell, @mjsabby

@benaadams
Copy link
Member

If the choice ends up as

`4 Always on (delete the fallback)

Would there be any disadvantage to keeping the parameter DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKSLIMIT ?

@stephentoub
Copy link
Member Author

Would there be any disadvantage to keeping the parameter DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKSLIMIT ?

Mainly a little more code to execute to read the environment variable. And we'd want to ensure it was clearly "unsupported" to enable us to change the algorithms employed in the future, assuming it was still relevant anyway by the time we shipped.

@mgravell
Copy link
Member

mgravell commented Oct 24, 2019

I'll have a go at getting the PR working locally.

Just to mention another option re making the on/off decision; potentially this could also be done in code via the builder; the current API for picking the builder is awkward, but: (pure imaginary code):

[Pooled]
async ValueTask<...> Foo() {...}

This does, however, obviously hugely limit where it would apply. For better or for worse.

@mgravell
Copy link
Member

I see the PR is merged; presumably that means I can use the nightly for this; sorry if this is a silly question, but is that the 5.0.x? the 3.1.x? neither? both?

@stephentoub
Copy link
Member Author

stephentoub commented Oct 24, 2019

Just to mention another option re making the on/off decision; potentially this could also be done in code via the builder

The two options I've explored here are:

  1. Additional language/compiler support that would enable putting an attribute on methods to change the builder that's used at compile-time by Roslyn: Proposal: Allow [AsyncMethodBuilder(...)] on methods csharplang#1407.
  2. Use reflection in the single builder to see whether an attribute is applied and use it (and potentially arguments on it) to configure behavior. This adds reflection on the first-use of any such async method, which is a non-trivial cost, especially on a startup path.

And as you say, it hugely limits applicability. If we decide that enabling this for all async ValueTask{<T>} methods is untenable, then I think we'd need (1) above to proceed with this (cc: @MadsTorgersen).

Regardless, yeah, it's another option. Thanks.

I see the PR is merged; presumably that means I can use the nightly for this; sorry if this is a silly question, but is that the 5.0.x? the 3.1.x? neither? both?

5.0.x. I just checked the latest from https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-win-x64.exe and it's not there yet, but I expect it should be soon.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 28, 2019

@mgravell
Copy link
Member

mgravell commented Nov 4, 2019

finally had some time to try and look at this, but it looks like the installer is internally inconsistent at the moment - the SDK is installing 5.0.100-alpha1-015573, but the runtime is complaining that

The framework 'Microsoft.NETCore.App', version '5.0.0-alpha1.19521.2' was not found.

reporting that 5.0.0-alpha.1.19554.1 at [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] is available. I'll try again with tomorrow's daily. Sorry.

@stephentoub
Copy link
Member Author

Thanks, @mgravell. Did you configure a nuget.config ala the example from https://github.com/dotnet/core-sdk/blob/master/README.md#installers-and-binaries?

@benaadams
Copy link
Member

Initial test (code) for HttpClient with http; its quite light on async depth, Tasks aren't the dominant form of allocation and they can't be completely removed (public api); however it does make a difference.

On of the things I like about this pooling is the lifetime of async statemachines is very indeterminate compared to a lot of other allocations. They aren't long life in terms of cpu, but are in terms of elapsed time which means they end up moving to higher GC generations perhaps more than the "sync" style of async programming would suggest.

master (Task)

Threads: 1, Request/s: 11,478.0, Time: 87,122 ms, Allocated/Request: 3,200
Threads: 2, Request/s: 24,753.7, Time: 40,398 ms, Allocated/Request: 3,200
Threads: 3, Request/s: 40,130.3, Time: 24,918 ms, Allocated/Request: 3,200
Threads: 4, Request/s: 50,418.6, Time: 19,833 ms, Allocated/Request: 3,199

PR #31623 (ValueTask) SET DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=0

Threads: 1, Request/s: 11,308.8, Time: 88,426 ms, Allocated/Request: 3,200
Threads: 2, Request/s: 24,235.0, Time: 41,262 ms, Allocated/Request: 3,200
Threads: 3, Request/s: 39,305.8, Time: 25,441 ms, Allocated/Request: 3,200
Threads: 4, Request/s: 50,888.5, Time: 19,650 ms, Allocated/Request: 3,199

PR #31623 (ValueTask) SET DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=1

Threads: 1, Request/s: 11,424.4, Time: 87,531 ms, Allocated/Request: 3,024
Threads: 2, Request/s: 25,417.1, Time: 39,343 ms, Allocated/Request: 3,024
Threads: 3, Request/s: 41,144.7, Time: 24,304 ms, Allocated/Request: 3,024
Threads: 4, Request/s: 52,013.2, Time: 19,225 ms, Allocated/Request: 3,025
"configProperties": {
  "System.GC.Server": true,
  "System.GC.HeapHardLimit": 20971520
}

@mgravell
Copy link
Member

mgravell commented Feb 6, 2020

I wasn't able to get my previous bits updated to .NET 5 (too many moving parts), however: here's an update that I did using RESP using @davidfowl's "Bedrock" pieces (with some minor tweaks to make it run on the .NET 5 version of the Kestrel API - just the new KestrelServer(...) signature); the test is to send and receive 25k PING/PONG message with a local redis server:

Baseline (environment variable 0):

Name Total (Allocations)

  • System.Runtime.CompilerServices.AsyncTaskMethodBuilder1.AsyncStateMachineBox1 73808
  • System.Threading.QueueUserWorkItemCallbackDefaultContext`1 24601
  • (and then small-hundreds-level things from general app startup)
c:\Code\Bedrock.Resp\tests\SimpleClient>dotnet run -c Release
Enabled: 0
ExecuteBedrock: time for 25000 ops (val-type): 2162ms

c:\Code\Bedrock.Resp\tests\SimpleClient>dotnet run -c Release
Enabled: 0
ExecuteBedrock: time for 25000 ops (val-type): 2171ms

Comparison (environment variable 1):

Name Total (Allocations)

  • System.Threading.QueueUserWorkItemCallbackDefaultContext`1 24909
  • (and then small-hundreds-level things from general app startup)
c:\Code\Bedrock.Resp\tests\SimpleClient>dotnet run -c Release
Enabled: 1
ExecuteBedrock: time for 25000 ops (val-type): 2166ms

c:\Code\Bedrock.Resp\tests\SimpleClient>dotnet run -c Release
Enabled: 1
ExecuteBedrock: time for 25000 ops (val-type): 2171ms

So: no noticeable negative impact on perf, but a big reduction in allocs. The QueueUserWorkItemCallbackDefaultContext is a known thing that is being handled separately. Basically, this makes things zero-alloc as expected. Nice work, if we can keep it!

(code is available from https://github.com/mgravell/Bedrock.Resp/tree/NET5)


One unrelated thing that is confusing me; I get significantly better performance if I Ctrl+F5 in the IDE (in release mode) vs dotnet run -c Release - down to 1750ish milliseconds; no idea why!

@stephentoub
Copy link
Member Author

stephentoub commented Feb 6, 2020

Thanks, @mgravell!

One of the issues here is that allocations are for the most part a secondary indication of perf, in that the hope is that by reducing allocation, you improve throughput. But here to reduce allocation, we're pooling, which has its own costs. So a decrease in allocations that doesn't improve throughput at all isn't necessarily a win, especially if it brings with it other downsides.

It'd be great if you can try out real workloads when you get a chance to see if there's any meaningful gain.

@davidfowl
Copy link
Member

Does the working set get better as a result? You should be able to run dotnet counters before and after to get a glimpse of what the metics look like.

@stephentoub
Copy link
Member Author

(Working set is good to look at it, but working set improvements could also instead suggest tweaks to GC configuration, as such pooling should primarily help working set if the GC has determined it's not worth doing a collection yet. Even so, as David suggests, it's a good data point.)

@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Feb 28, 2020
@YairHalberstadt
Copy link
Contributor

Thought I'd add a link to your blogpost here @stephentoub:

https://devblogs.microsoft.com/dotnet/async-valuetask-pooling-in-net-5/

@stephentoub
Copy link
Member Author

Thanks, @YairHalberstadt.

@aL3891
Copy link

aL3891 commented Mar 17, 2020

Coming from the blog post, this is really really cool :)

I haven't done any measurements but I can see this being beneficial for some places in our code, but I'd be worried to enable it globally. Perhaps it could be enabled/disabled with an attribute, that way you could choose the scope (method, class,assembly) and then down the line if we want it do be the default it could be a default msbuild parameter that works the same as the other assembly info attributes

@stephentoub
Copy link
Member Author

stephentoub commented Mar 17, 2020

Thanks, @aL3891. Technically it could be done in a manner you suggest, and I considered something like that. The main problem I hit is it adds a non-trivial amount of expensive reflection on first use of every async method (whether it uses the attribute or not), and that can contribute non-trivially to things like start-up time costs. If we could figure out how to avoid that, it might be a very reasonable opt-in path to consider.

@aL3891
Copy link

aL3891 commented Mar 17, 2020

I was thinking the attribute could be checked/used at compile time similar to CallerMemberNameAttribute, burning it into the generated code (atleast that's how I think that attribute works, apologies if I'm incorrect)
Perhaps that is not possible/practical though

Still, a very cool feature for micro service apps with very chatty APIs

Perhaps another option is adding a new PooledValueTask that is essentially just a marker for saying this task uses pooling, that would also sidestep backcompat issues, but that's also another type to keep track of :)

@stephentoub
Copy link
Member Author

For .NET 5, the plan is to leave the code for this in but off by default.

@stephentoub stephentoub modified the milestones: 5.0.0, 6.0.0 Jun 28, 2020
jkotas added a commit to jkotas/runtimelab that referenced this issue Oct 26, 2020
This is experimental feature that adds about 1.7kB in binary footprint per method returning ValueTask.

Ifdef it out for native AOT until we figure out what to do with it. See dotnet/runtime#13633.
MichalStrehovsky pushed a commit to dotnet/runtimelab that referenced this issue Oct 27, 2020
This is experimental feature that adds about 1.7kB in binary footprint per method returning ValueTask.

Ifdef it out for native AOT until we figure out what to do with it. See dotnet/runtime#13633.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 23, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 31, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Apr 30, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants