Avoid boxing allocations for async in Tier0 #22984
Conversation
Do we care about allocations in tier 0? If so, wouldn't it be better to address this by restoring some of the boxing optimizations to apply to tier 0? |
Do they disappear after the JIT compiles the method to Tier1? |
These allocations show up in all of our benchmarking scenarios. I do care about them but I don’t really understand when the optimizations that would remove them would kick in |
That would seem to suggest either:
There are lots of places allocations show up with optimizations disabled; I'm missing why we're focusing just on this method. What's special here? |
cc: @kouvel |
Yes; but as shown in the example that can be after 5MB of allocations (per await)
Because it isn't really one method, it's every async method/awaiter
i.e. it's these tests causing the allocations in Tier0
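The shape of the problem can be sketched with a minimal program (a hypothetical example, not code from this PR): every distinct async method produces a distinct state-machine struct, and therefore a distinct generic instantiation of the builder's AwaitUnsafeOnCompleted, each of which starts life in Tier0.

```csharp
using System;
using System.Threading.Tasks;

class Demo
{
    // Each async method compiles to its own state-machine struct, so
    // AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted<TAwaiter, TStateMachine>
    // gets a fresh generic instantiation per method. Every instantiation
    // starts in Tier0, where the awaiter struct is boxed on each await
    // until the method is promoted to Tier1.
    static async Task<int> First()  { await Task.Yield(); return 1; }
    static async Task<int> Second() { await Task.Yield(); return 2; }

    public static async Task<int> RunAsync() => await First() + await Second();
}
```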
Which also means this is forcing every such instantiation in the whole program into tier 1. This cuts both ways. I would still like to understand:
Probably fine as you are going down an async pathway?
Tier0 makes particularly bad code for this method (676 bytes + allocations)
Whereas Tier1 makes quite good code (43 bytes, no allocations)
and that difference will be repeated for every async method until it hits the tiering threshold. Outputting only this method and running a single request into the plaintext app (so as not to hit both Tier0 and Tier1) outputs 293KB of asm for Tier0 and 60KB of asm with tiering off, which is an 80% reduction in code generated; and it's a fairly simple app with minimal async. Tier1 looks like this:

; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this
; Lcl frame size = 32
G_M14525_IG01:
56 push rsi
4883EC20 sub rsp, 32
488BF2 mov rsi, rdx
G_M14525_IG02:
498BD0 mov rdx, r8
E820D4BFFF call AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this
488BD0 mov rdx, rax
4C0FBE06 movsx r8, byte ptr [rsi]
440FB64608 movzx r8, byte ptr [rsi+8]
488B0E mov rcx, gword ptr [rsi]
E8EC9CBF5D call TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
90 nop
G_M14526_IG03:
4883C420 add rsp, 32
5E pop rsi
C3 ret
; Total bytes of code 43, prolog size 5 for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

Whereas Tier0 looks like this:

; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this
; Lcl frame size = 160
G_M14508_IG01:
55 push rbp
57 push rdi
56 push rsi
4881ECA0000000 sub rsp, 160
C5F877 vzeroupper
488DAC24B0000000 lea rbp, [rsp+B0H]
488BF1 mov rsi, rcx
488DBD78FFFFFF lea rdi, [rbp-88H]
B91E000000 mov ecx, 30
33C0 xor rax, rax
F3AB rep stosd
488BCE mov rcx, rsi
4889A570FFFFFF mov qword ptr [rbp-90H], rsp
48894D10 mov bword ptr [rbp+10H], rcx
48895518 mov bword ptr [rbp+18H], rdx
4C894520 mov bword ptr [rbp+20H], r8
G_M14508_IG02:
488B4D10 mov rcx, bword ptr [rbp+10H]
488B5520 mov rdx, bword ptr [rbp+20H]
E88576FFFF call AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this
488945E8 mov gword ptr [rbp-18H], rax
488D55D8 lea rdx, bword ptr [rbp-28H]
G_M14508_IG03:
C5F857C0 vxorps xmm0, xmm0
C5FA7F02 vmovdqu qword ptr [rdx], xmm0
G_M14508_IG04:
488D55D8 lea rdx, bword ptr [rbp-28H]
48B9D0C358CEF77F0000 mov rcx, 0x7FF7CE58C3D0
E8821B6C5F call CORINFO_HELP_BOX
4885C0 test rax, rax
745B je SHORT G_M14508_IG06
488B5518 mov rdx, bword ptr [rbp+18H]
48B9D0C358CEF77F0000 mov rcx, 0x7FF7CE58C3D0
E86A1B6C5F call CORINFO_HELP_BOX
488945A0 mov gword ptr [rbp-60H], rax
488B55A0 mov rdx, gword ptr [rbp-60H]
48B978BE17CEF77F0000 mov rcx, 0x7FF7CE17BE78
E8B3056C5F call CORINFO_HELP_ISINSTANCEOFINTERFACE
4885C0 test rax, rax
742C je SHORT G_M14508_IG06
488B4D18 mov rcx, bword ptr [rbp+18H]
E82DEFFFFF call Unsafe:As(byref):byref
488945D0 mov bword ptr [rbp-30H], rax
488B4DD0 mov rcx, bword ptr [rbp-30H]
488B09 mov rcx, gword ptr [rcx]
488B55E8 mov rdx, gword ptr [rbp-18H]
41B801000000 mov r8d, 1
E8036DBAFF call TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
90 nop
G_M14508_IG05:
488D65F0 lea rsp, [rbp-10H]
5E pop rsi
5F pop rdi
5D pop rbp
C3 ret
G_M14508_IG06:
488D55D8 lea rdx, bword ptr [rbp-28H]
G_M14508_IG07:
C5F857C0 vxorps xmm0, xmm0
C5FA7F02 vmovdqu qword ptr [rdx], xmm0
G_M14508_IG08:
488D55D8 lea rdx, bword ptr [rbp-28H]
48B9D0C358CEF77F0000 mov rcx, 0x7FF7CE58C3D0
E8031B6C5F call CORINFO_HELP_BOX
4885C0 test rax, rax
745E je SHORT G_M14508_IG10
488B5518 mov rdx, bword ptr [rbp+18H]
48B9D0C358CEF77F0000 mov rcx, 0x7FF7CE58C3D0
E8EB1A6C5F call CORINFO_HELP_BOX
48894598 mov gword ptr [rbp-68H], rax
488B5598 mov rdx, gword ptr [rbp-68H]
48B958453ECEF77F0000 mov rcx, 0x7FF7CE3E4558
E834056C5F call CORINFO_HELP_ISINSTANCEOFINTERFACE
4885C0 test rax, rax
742F je SHORT G_M14508_IG10
488B4D18 mov rcx, bword ptr [rbp+18H]
E89EEEFFFF call Unsafe:As(byref):byref
488945C8 mov bword ptr [rbp-38H], rax
4C8B45C8 mov r8, bword ptr [rbp-38H]
450FB64008 movzx r8, byte ptr [r8+8]
488B4DC8 mov rcx, bword ptr [rbp-38H]
488B09 mov rcx, gword ptr [rcx]
488B55E8 mov rdx, gword ptr [rbp-18H]
E8816CBAFF call TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
90 nop
G_M14508_IG09:
488D65F0 lea rsp, [rbp-10H]
5E pop rsi
5F pop rdi
5D pop rbp
C3 ret
G_M14508_IG10:
488D55D8 lea rdx, bword ptr [rbp-28H]
G_M14508_IG11:
C5F857C0 vxorps xmm0, xmm0
C5FA7F02 vmovdqu qword ptr [rdx], xmm0
G_M14508_IG12:
488D55D8 lea rdx, bword ptr [rbp-28H]
48B9D0C358CEF77F0000 mov rcx, 0x7FF7CE58C3D0
E8811A6C5F call CORINFO_HELP_BOX
4885C0 test rax, rax
7479 je SHORT G_M14508_IG14
488B5518 mov rdx, bword ptr [rbp+18H]
48B9D0C358CEF77F0000 mov rcx, 0x7FF7CE58C3D0
E8691A6C5F call CORINFO_HELP_BOX
48894590 mov gword ptr [rbp-70H], rax
488B5590 mov rdx, gword ptr [rbp-70H]
48B950363ECEF77F0000 mov rcx, 0x7FF7CE3E3650
E8B2046C5F call CORINFO_HELP_ISINSTANCEOFINTERFACE
4885C0 test rax, rax
744A je SHORT G_M14508_IG14
G_M14508_IG13:
488B5518 mov rdx, bword ptr [rbp+18H]
48B9D0C358CEF77F0000 mov rcx, 0x7FF7CE58C3D0
E83A1A6C5F call CORINFO_HELP_BOX
48894588 mov gword ptr [rbp-78H], rax
488B5588 mov rdx, gword ptr [rbp-78H]
48B950363ECEF77F0000 mov rcx, 0x7FF7CE3E3650
E843056C5F call CORINFO_HELP_CHKCASTINTERFACE
48894580 mov gword ptr [rbp-80H], rax
488B4D80 mov rcx, gword ptr [rbp-80H]
488B55E8 mov rdx, gword ptr [rbp-18H]
49BB0810FACDF77F0000 mov r11, 0x7FF7CDFA1008
3909 cmp dword ptr [rcx], ecx
FF155D5CA9FF call [IStateMachineBoxAwareAwaiter:AwaitUnsafeOnCompleted(ref):this]
EB2F jmp SHORT G_M14508_IG15
G_M14508_IG14:
488B4DE8 mov rcx, gword ptr [rbp-18H]
49BB0010FACDF77F0000 mov r11, 0x7FF7CDFA1000
3909 cmp dword ptr [rcx], ecx
FF153D5CA9FF call [IAsyncStateMachineBox:get_MoveNextAction():ref:this]
48898578FFFFFF mov gword ptr [rbp-88H], rax
488B9578FFFFFF mov rdx, gword ptr [rbp-88H]
488B4D18 mov rcx, bword ptr [rbp+18H]
E83E3DFFFF call ConfiguredTaskAwaiter:UnsafeOnCompleted(ref):this
EB00 jmp SHORT G_M14508_IG15
G_M14508_IG15:
488D65F0 lea rsp, [rbp-10H]
5E pop rsi
5F pop rdi
5D pop rbp
C3 ret
G_M14508_IG16:
55 push rbp
57 push rdi
56 push rsi
4883EC30 sub rsp, 48
C5F877 vzeroupper
488B6920 mov rbp, qword ptr [rcx+32]
48896C2420 mov qword ptr [rsp+20H], rbp
488DADB0000000 lea rbp, [rbp+B0H]
G_M14508_IG17:
488955A8 mov gword ptr [rbp-58H], rdx
488B4DA8 mov rcx, gword ptr [rbp-58H]
48894DC0 mov gword ptr [rbp-40H], rcx
488B4DC0 mov rcx, gword ptr [rbp-40H]
33D2 xor rdx, rdx
E8D36ABAFF call Task:ThrowAsync(ref,ref)
488D05C0FFFFFF lea rax, G_M14508_IG15
G_M14508_IG18:
4883C430 add rsp, 48
5E pop rsi
5F pop rdi
5D pop rbp
C3 ret
G_M14508_IG19:
55 push rbp
57 push rdi
56 push rsi
4883EC30 sub rsp, 48
C5F877 vzeroupper
488B6920 mov rbp, qword ptr [rcx+32]
48896C2420 mov qword ptr [rsp+20H], rbp
488DADB0000000 lea rbp, [rbp+B0H]
G_M14508_IG20:
488955B0 mov gword ptr [rbp-50H], rdx
488B4DB0 mov rcx, gword ptr [rbp-50H]
48894DB8 mov gword ptr [rbp-48H], rcx
488B4DB8 mov rcx, gword ptr [rbp-48H]
33D2 xor rdx, rdx
E8936ABAFF call Task:ThrowAsync(ref,ref)
488D0580FFFFFF lea rax, G_M14508_IG15
G_M14508_IG21:
4883C430 add rsp, 48
5E pop rsi
5F pop rdi
5D pop rbp
C3 ret
; Total bytes of code 676, prolog size 62 for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this |
The whole point of tier 0 is to be fast; it's expected to often produce poor CQ. |
Every little bit counts. We're also trying to figure out how to reduce allocations generally across the stack for other reasons (like more interesting GC modes). We've spent lots of time on reducing allocations in the core and it's paid off massively in terms of performance. We're almost at 1M RPS more than we were in .NET Core 2.2, and that's not in the platform-level benchmark, that's the real stack with dependency injection and the HttpContext etc. I also dislike the fact that these show up more than string allocations when doing "the obvious" benchmark.
We can flip the question then, does enabling this optimization have a huge impact on startup time? |
Tier 0? Yes. That's pretty much its whole reason to exist.
In tier 0? Why are we measuring tier 0 throughput/allocations? Why are we piecemeal optimizing tier 0 throughput/allocations by opting out specific methods? If throughput/allocations there matter, why not disable tier 0 entirely rather than piecemeal? |
Tier0 doesn't particularly increase allocations in the framework? It's just this method where it does. As precedent |
I assume @davidfowl was referring to specifically this change, rather than switching off Tiering for everything? |
What's the user advice? "When using dotTrace (or other tool) to look at your program's allocations under load ignore the 76k boxed ValueTaskAwaiters as they go away after a while" |
Turn off tiered compilation, or don't measure the first N requests. |
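For the first of those options, disabling tiering is a one-line environment setting. Shown for a POSIX shell; `COMPlus_` is the prefix the .NET Core 3.0 runtime reads:

```shell
# Disable tiered compilation so an allocation profile reflects fully
# optimized code rather than transient Tier0 output.
export COMPlus_TieredCompilation=0

# Then run the app under the profiler as usual, e.g.:
#   dotnet run -c Release
```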
@jkotas, what is the meaning of |
I’m currently turning it off to get a valid representation of what’s actually being allocated and I think that’s the problem. I was referring to this specific optimization related to ValueTask with an IValueTaskSource, not anything more generic than that. |
The problem being that you have to? Then let me reiterate my statement from earlier: "wouldn't it be better to address this by restoring some of the boxing optimizations to apply to tier 0?" This is not the only method that incurs boxing allocations in tier 0 that are eliminated in tier 1. We shouldn't be sprinkling these attributes on every such method. If the concern is the ability to measure allocations and setting an environment variable is too onerous, then we should address the root problem. Further, developers doing benchmarking need to avoid measuring startup costs, regardless of tiering. |
I don't have a problem with that, what needs to happen to do that?
This is the one that shows up in the default ASP.NET Core scenario and any scenario using pipelines really, or network stream or anything that uses an IValueTaskSource-backed ValueTask. I really dislike the fact that this will show up by default when benchmarking your ASP.NET Core application, and I don't think it's reasonable to tell customers to turn off tier0 if you want to reduce these allocations. Maybe there are a ton of places allocations show up in tier0; we should make sure those are all fine, but this is the one that is at the top of every ASP.NET Core project by default. That said, I have no idea what research went into deciding which optimizations to turn on or off, nor do I know what applications were used or what stats were observed. |
Yes; but that is "Future" |
It is likely that AggressiveOptimization is not going to mean Tier1 in future: https://github.com/dotnet/corefx/issues/32235#issuecomment-424068621
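For reference, opting a method out of tiering is done with the attribute below (the attribute and enum value are the real .NET Core 3.0 API; the method it decorates here is an invented illustration):

```csharp
using System.Runtime.CompilerServices;

static class Hot
{
    // MethodImplOptions.AggressiveOptimization (added in .NET Core 3.0) asks
    // the runtime to skip Tier0 and compile this method optimized up front.
    // As noted in the linked issue, its meaning may change in the future.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static int Sum(int[] values)
    {
        int total = 0;
        foreach (int v in values) total += v;
        return total;
    }
}
```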
This is not a good precedent. #22191 enabled AggressiveOptimization for a small number of methods only. This change is enabling AggressiveOptimization for many methods (generic code stamping).
Yes, the startup time is what needs to be measured for optimizations like this. The allocation rate is a secondary metric.
Agree that this would be a better way to fix this. If it helps, we may consider introducing a helper method (can be internal for now) to hint the JIT that it is important to devirtualize and unbox-eliminate certain callsites, e.g. |
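The example that followed "e.g." is cut off in this thread. One can imagine such a helper looking roughly like this; this is entirely hypothetical, no such API exists in CoreCLR, and the name and shape are invented for illustration:

```csharp
internal static class JitHints
{
    // Entirely hypothetical: an identity wrapper the JIT could recognize as
    // an intrinsic meaning "devirtualize and eliminate the box for this
    // value even in unoptimized code". No such API exists in CoreCLR.
    internal static T AssumeDevirtualized<T>(T value) => value;
}
```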
I will look into what it would take to enable the box optimizations for Tier0. I'm not sure what the dependence closure for this is just yet. Seems like we will also need to enable ref class tracking, type intrinsic expansions, and type test optimization. These are all plausible candidates for enabling in Tier0 by default anyway: they happen early, do not require a lot of in-depth analysis, and should generally make the jit run faster (or at least, no slower). And when they fire, the jit will produce smaller and faster jitted code. If we also need inlining to get at these cases, then perhaps some form of hinting will be needed. |
For this particular method:
Inlines into 06003583 AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this
[0 IL=0002 TR=000003 06003584] [FAILED: too many il bytes] AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this
[1 IL=0043 TR=000052 06005608] [aggressive inline attribute] Unsafe:As(byref):byref
[0 IL=0057 TR=000063 0600365D] [FAILED: unprofitable inline] TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this |
I have something that seems to work passably well. Preview here: EnableBoxingOptsTier0. Enabling this in tier 0 requires
Inlining is not required to optimize the async state machine box cases. On the example code from #14698:
Broader Tier0 PMI (using the new
The regression in
Based on partial PMI results, it looks roughly like a ~1.5% aggregate code size decrease (biased somewhat by large size decreases for Vector and HW intrinsic code). So something along these lines should produce smaller and faster code, jit a bit faster, and allocate less on the heap. Will attempt some throughput measurements soon. |
Issue was that somehow in lower we retype
worked around it for now by also enabling "tier0" pmi diffs:
Looked into a few of the regressions and they're all from the inline box expansion; in the worst cases the box has a lot of GC refs and we expand into a series of helper calls to copy these over. |
I have looked at the code size and jit throughput impact of always generating Tier1 code for
Average increase in jit time per method from Tier1 is 25%, but is small in absolute terms, around 70 microseconds per method. The Tier1 codegen is considerably more compact, around 100 bytes on average versus 660 bytes. The prototype Tier0 + opts referred to above would likely jit even faster (~25%?) than the current Tier0 and would produce somewhere around 300 bytes of code per method.

For the entire 3.0 cycle, Tier0 and minopts have generated the same code, and we have done a lot of testing on this combined mode. Adding optimizations to Tier0 would make Tier0 diverge from minopts, and we would have to increase the scope of our testing considerably to feel confident the now-divergent Tier0 jit codegen mode was high quality (basically forcing Tier0 codegen for all jitting, running through all our stress matrix, etc). So the general consensus on the jit team is that it is too late in the cycle to add optimizations to Tier0.

My suggestion is that we consider taking the change in this PR for 3.0 as a targeted fix for this set of problematic boxes, and that we pursue the more general fix of enabling Tier0 box-removal optimizations -- and perhaps more -- in a subsequent release where we have more time to evaluate and test. |
Right, there are several different patterns we need to optimize here:
The first two could be handled by lookahead. The third is more complicated. So we could reduce allocations somewhat via simple pattern matches, but could not entirely eliminate them. |
This one is handled by pattern match already.
I think the pattern match is worth doing for the first two cases. I will think about the third one. |
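For context, the box-plus-type-test shape under discussion looks roughly like this in source. This is a simplified sketch modeled on the pattern in AsyncTaskMethodBuilder, not the exact framework code:

```csharp
using System.Runtime.CompilerServices;

static class Patterns
{
    // A simplified sketch of the shape the JIT pattern-matches: boxing a
    // struct only to perform a type test. With optimizations enabled the box
    // is never materialized; in Tier0 (before this work) each test emitted a
    // CORINFO_HELP_BOX call, producing the garbage allocations shown above.
    public static bool IsTaskAwaiter<TAwaiter>(ref TAwaiter awaiter)
        where TAwaiter : ICriticalNotifyCompletion
    {
        // box + null check: always true for a struct TAwaiter, folds away
        // entirely under optimization.
        // box + isinst: answerable from the static type alone.
        return (null != (object)default(TAwaiter)) && (awaiter is TaskAwaiter);
    }
}
```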
How do we prove there's no regression? |
Should also turn off the DiagnosticSource provider. Note you can also force PerfView to only give what you really need (rather than subtracting what you don't like from the defaults). For example:
Should give you just CPU information and some coarse (every 100K, low overhead) memory allocation information. (you can reduce it to just CPU by removing the GC+Stack in the /clrEvents). You can try to add /dotnetalloc back and it should work but I have not tried that combination. |
I would like to summarize the issue to this point (if only for my own benefit).

First, we now have some good data that Ben collected, namely https://aoa.blob.core.windows.net/aspnet/PerfViewData3.etl.zip, which shows the whole end-to-end (with the JITting and tiering, and allocations, etc), so it does tell us quite a bit. Note that this data should NOT be used to judge CPU costs (only memory allocations and general flow), as we KNOW that you have to turn off MANY of PerfView's defaults to get unperturbed CPU traces (see above). However, I don't think we need that.

The scenario is PlatformBenchmarks.exe (a very simple ASP.NET Core app). It runs for about 34 seconds, and by about 7 seconds after process launch it has Tier-1 jitted enough methods that the allocations that started this are optimized away and life is good. The concern is that 7 seconds is a lot of time, and that we don't like all the allocations in the first 7 seconds. Note that in this example the amount of allocation you can avoid adds up to about 5 to 6MB (not that big really). This is dominated by a single type, ValueTaskAwaiter (thus a single optimization will fix this instance).

OK, that is the data. Why do we care, what is surprising? What should we do?
So what to do?
And frankly we can decide that this just is not a big enough problem (but I would argue we should still do (1)). For what it is worth... |
There's also the aspect that every async method in a .NET Core 3.0 app will allocate these boxes the first N times it is called. It isn't just at startup, it depends on the flow of the program and when the methods are called. |
The 100 ms is a window, as long as a new method is called in the last 100 ms it will keep delaying tier 1 activities. So the total delay will last as long as new methods are being called frequently. The reason for this is that call counting (especially) and background tier 1 jitting (to some degree) have significant overhead together, and allowing tier 1 jitting to happen during those startup-like activities was slowing down startup time considerably. It should be possible to decrease the delay some amount by making call counting cheaper, but it will probably take more informed heuristics to eliminate the delay without sacrificing startup time, for example jitting a method that is totally worth it during startup should be ok, as the JIT time would be very quickly compensated for, but tier 1 jitting a method that does not give much perf gain though it is called frequently during startup, may not be worth the overhead during startup. There is an environment config option |
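(The specific option referenced above is cut off in this thread. As an illustration of the kind of knob involved, the call-count threshold itself is configurable; the option name below is per the 3.0-era coreclr clrconfigvalues.h and is worth verifying against your runtime version, and note that `COMPlus_` numeric values are parsed as hex:)

```shell
# Promote methods to Tier1 after a single call instead of the default ~30.
# Name as of the .NET Core 3.0-era runtime; check coreclr's
# clrconfigvalues.h if your version differs.
export COMPlus_TC_CallCountThreshold=1
```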
It is not as simple as a resettable 100 msec window. If you look at https://aoa.blob.core.windows.net/aspnet/PerfViewData3.etl.zip you will see a (Tier 0) JIT event at 4,949.143 and then one at 5,596.865 (over 500 msec later) but no Tier1 jitting happening. Then there is a Tier 1 JIT at 6,909.603 (over a second later!) and then again nothing until a Tier 0 JIT at 8,815.401 (1.8 seconds later!). Now this behavior may be explainable based on app behavior, but it is not obvious on the surface, and my understanding of this app is that it is pretty simple (it does a small 'hello world'-like operation MANY times), which suggests it is not the app that is causing JITs to 'spread out' like this. It seems worthy of an investigation. |
@vancem Thank you for writing down the summary. We may want to move the discussion about the heuristic problem and the boxing problem into separate issues.
I agree in general, but I have a problem with this attribute being applied to this specific method. This specific method is among the most frequently instantiated generic methods ever. It feels like the wrong trade-off to disable Tier0 for all instantiations of this method when only very few frequently called instantiations really need it. If this was a regular non-generic method, I would be perfectly fine with this PR.
This pattern-matching-based optimization does exist for the unoptimized code already for the box+unbox case, and it is very simple. I am proposing to extend this pattern match for a few more cases to handle the first two cases above. |
For Ben's V3 trace I think the initial pause in jitting may just be the way he ran things -- per the instructions above, after launching the app one must switch to a new window to run
Small vertical lines are at 100ms intervals. Not sure yet how to account for the gaps in the tier1 method production rate; would be nice to have some Tier1 queue events: start, stop, and for stop, whether it was because of queue empty or worker thread yield. |
Sounds likely 😄 |
What does Tier1 use to jit? Does it use the regular threadpool? If so it's quite heavily loaded, so it might just be the number of items in the threadpool queue before it. Or does it use its own threads? |
It uses a thread pool thread, and yields it after at least 50 ms of work has been done and queues another work item. It's a native work item though, so it doesn't necessarily fall behind the managed work queue. I'll take a closer look at what's causing those gaps; if that's the issue it should be fairly easy to switch to a separate thread. |
Here is some raw data from instrumenting tiering. In this particular run, I waited a while after server start before starting the load. The timings are similar when starting the load immediately after server start. Legend for column "Tiering event span (ms)"
Legend for column "Tier 1 jit event span (ms)"
Notes:
I'm not able to reproduce the latency issue locally: Tier=off
Tier=on
There is some error in RPS but these are somewhat representative, and the latency and latency standard deviation are fairly stable and reproducible.
I am not able to reproduce this result. I see up to ~2x improvement in jit time with this change, the worst cases are shared generics that run up to 10% slower in extreme cases. I have created a benchmark to allow me to measure the JIT time for different combinations in isolation: https://gist.github.com/jkotas/102dc708cca8d2c85002cb47bdd49870 . Here are some of the results:
As far as I can tell, this change is an improvement for all performance metrics, including startup time with the Tier0 JIT enabled. The jit speed improvement from optimizing out the large amount of dead code that is always present in instantiations of this method dwarfs the cost of the optimizations. Also, I have discovered other issues related to this while looking into this that I will submit PRs for separately. |
Thank you all for a great discussion! |
Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
#14178 removed allocations for async method delegates; however the reduced optimization in Tier0 reintroduces allocations due to boxing and non-inlining:
Adding AggressiveOptimization to AwaitUnsafeOnCompleted addresses this.

Resolves: https://github.com/dotnet/coreclr/issues/20937
/cc @stephentoub @davidfowl @AndyAyersMS @jkotas