Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Avoid boxing allocations for async in Tier0 #22984

Merged
merged 2 commits into from
Mar 29, 2019

Conversation

benaadams
Copy link
Member

#14178 removed allocations for async method delgates; however the reduced optimization in Tier0 reintroduces allocations due to boxing and non-inlining:

image

Adding AggressiveOptimization to AwaitUnsafeOnCompleted addresses this.

Resolves: https://github.com/dotnet/coreclr/issues/20937

/cc @stephentoub @davidfowl @AndyAyersMS @jkotas

@stephentoub
Copy link
Member

stephentoub commented Mar 3, 2019

Do we care about allocations in tier 0? If so, wouldn't it be better to address this by restoring some of the boxing optimizations to apply to tier 0?
https://github.com/dotnet/coreclr/issues/20937#issuecomment-437668306
https://github.com/dotnet/coreclr/issues/14474

@omariom
Copy link

omariom commented Mar 3, 2019

allocations due to boxing and non-inlining

Do they disappear after JIT compiles the method to Tier1?

@davidfowl
Copy link
Member

These allocations show up in all of our benchmarking scenarios. I do care about them but I don’t really understand when the optimizations that would remove them would kick in

@stephentoub
Copy link
Member

These allocations show up in all of our benchmarking scenarios.

That would seem to suggest either:

  • our benchmarks are measuring the wrong thing (i.e. they should be measuring tier 1, not tier 0)
  • tiering is broken or tuned incorrectly

There are lots of places allocations show up with optimizations disabled; I'm missing why we're focusing just on this method.

What's special here?

@stephentoub
Copy link
Member

cc: @kouvel

@benaadams
Copy link
Member Author

Do they disappear after JIT compiles the method to Tier1?

Yes; but as shown in the example that can be after 5MB of allocations (per await)

I'm missing why we're focusing just on this method. What's special here?

Because it isn't really one method, its every async method/awaiter

@benaadams
Copy link
Member Author

i.e. its these tests causing the allocations in Tier0 (awaiter is ITaskAwaiter) and (awaiter is IConfiguredTaskAwaiter)

@stephentoub
Copy link
Member

Because it isn't really one method, its every async method/awaiter

Which also means this is forcing every such instantiation in the whole program into tier 1. This cuts both ways.

I would still like to understand:

  • does this actually make a throughout difference in the ASP.NET benchmarks?
  • why are the benchmarks measuring tier 0 at all? Surely this one allocation isn't the only thing that would affect perf.

@benaadams
Copy link
Member Author

Which also means this is forcing every such instantiation in the whole program into tier 1. This cuts both ways.

Probably fine as you are going down an async pathway?

; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

Tier0 makes particularly bad code for this method (676 bytes + allocations)

; Total bytes of code 676, prolog size 62 for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

Whereas Tier1 makes quite good code (43 bytes, no allocations)

; Total bytes of code 43, prolog size 5 for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

and that difference will be repeated for every async method; until it hits the Tiering threshold.

Outputting only this method
set COMPlus_JitDisasm=AwaitUnsafeOnCompleted

And running a single request into the plaintext app (so as not to hit Tier0 and Tier1) outputs 293KB of asm for Tier0 and 60KB of asm with tiering off which is an 80% reduction in code generated; and its a fairly simple app with minimal async.

Tier1 looks like this

; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this
; Lcl frame size = 32

G_M14525_IG01:
       56                   push     rsi
       4883EC20             sub      rsp, 32
       488BF2               mov      rsi, rdx

G_M14525_IG02:
       498BD0               mov      rdx, r8
       E820D4BFFF           call     AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this
       488BD0               mov      rdx, rax
       4C0FBE06             movsx    r8, byte  ptr [rsi]
       440FB64608           movzx    r8, byte  ptr [rsi+8]
       488B0E               mov      rcx, gword ptr [rsi]
       E8EC9CBF5D           call     TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
       90                   nop      

G_M14526_IG03:
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      

; Total bytes of code 43, prolog size 5 for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

Whereas Tier0 looks like this

; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this
; Lcl frame size = 160

G_M14508_IG01:
       55                   push     rbp
       57                   push     rdi
       56                   push     rsi
       4881ECA0000000       sub      rsp, 160
       C5F877               vzeroupper 
       488DAC24B0000000     lea      rbp, [rsp+B0H]
       488BF1               mov      rsi, rcx
       488DBD78FFFFFF       lea      rdi, [rbp-88H]
       B91E000000           mov      ecx, 30
       33C0                 xor      rax, rax
       F3AB                 rep stosd 
       488BCE               mov      rcx, rsi
       4889A570FFFFFF       mov      qword ptr [rbp-90H], rsp
       48894D10             mov      bword ptr [rbp+10H], rcx
       48895518             mov      bword ptr [rbp+18H], rdx
       4C894520             mov      bword ptr [rbp+20H], r8

G_M14508_IG02:
       488B4D10             mov      rcx, bword ptr [rbp+10H]
       488B5520             mov      rdx, bword ptr [rbp+20H]
       E88576FFFF           call     AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this
       488945E8             mov      gword ptr [rbp-18H], rax
       488D55D8             lea      rdx, bword ptr [rbp-28H]

G_M14508_IG03:
       C5F857C0             vxorps   xmm0, xmm0
       C5FA7F02             vmovdqu  qword ptr [rdx], xmm0

G_M14508_IG04:
       488D55D8             lea      rdx, bword ptr [rbp-28H]
       48B9D0C358CEF77F0000 mov      rcx, 0x7FF7CE58C3D0
       E8821B6C5F           call     CORINFO_HELP_BOX
       4885C0               test     rax, rax
       745B                 je       SHORT G_M14508_IG06
       488B5518             mov      rdx, bword ptr [rbp+18H]
       48B9D0C358CEF77F0000 mov      rcx, 0x7FF7CE58C3D0
       E86A1B6C5F           call     CORINFO_HELP_BOX
       488945A0             mov      gword ptr [rbp-60H], rax
       488B55A0             mov      rdx, gword ptr [rbp-60H]
       48B978BE17CEF77F0000 mov      rcx, 0x7FF7CE17BE78
       E8B3056C5F           call     CORINFO_HELP_ISINSTANCEOFINTERFACE
       4885C0               test     rax, rax
       742C                 je       SHORT G_M14508_IG06
       488B4D18             mov      rcx, bword ptr [rbp+18H]
       E82DEFFFFF           call     Unsafe:As(byref):byref
       488945D0             mov      bword ptr [rbp-30H], rax
       488B4DD0             mov      rcx, bword ptr [rbp-30H]
       488B09               mov      rcx, gword ptr [rcx]
       488B55E8             mov      rdx, gword ptr [rbp-18H]
       41B801000000         mov      r8d, 1
       E8036DBAFF           call     TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
       90                   nop      

G_M14508_IG05:
       488D65F0             lea      rsp, [rbp-10H]
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

G_M14508_IG06:
       488D55D8             lea      rdx, bword ptr [rbp-28H]

G_M14508_IG07:
       C5F857C0             vxorps   xmm0, xmm0
       C5FA7F02             vmovdqu  qword ptr [rdx], xmm0

G_M14508_IG08:
       488D55D8             lea      rdx, bword ptr [rbp-28H]
       48B9D0C358CEF77F0000 mov      rcx, 0x7FF7CE58C3D0
       E8031B6C5F           call     CORINFO_HELP_BOX
       4885C0               test     rax, rax
       745E                 je       SHORT G_M14508_IG10
       488B5518             mov      rdx, bword ptr [rbp+18H]
       48B9D0C358CEF77F0000 mov      rcx, 0x7FF7CE58C3D0
       E8EB1A6C5F           call     CORINFO_HELP_BOX
       48894598             mov      gword ptr [rbp-68H], rax
       488B5598             mov      rdx, gword ptr [rbp-68H]
       48B958453ECEF77F0000 mov      rcx, 0x7FF7CE3E4558
       E834056C5F           call     CORINFO_HELP_ISINSTANCEOFINTERFACE
       4885C0               test     rax, rax
       742F                 je       SHORT G_M14508_IG10
       488B4D18             mov      rcx, bword ptr [rbp+18H]
       E89EEEFFFF           call     Unsafe:As(byref):byref
       488945C8             mov      bword ptr [rbp-38H], rax
       4C8B45C8             mov      r8, bword ptr [rbp-38H]
       450FB64008           movzx    r8, byte  ptr [r8+8]
       488B4DC8             mov      rcx, bword ptr [rbp-38H]
       488B09               mov      rcx, gword ptr [rcx]
       488B55E8             mov      rdx, gword ptr [rbp-18H]
       E8816CBAFF           call     TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
       90                   nop      

G_M14508_IG09:
       488D65F0             lea      rsp, [rbp-10H]
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

G_M14508_IG10:
       488D55D8             lea      rdx, bword ptr [rbp-28H]

G_M14508_IG11:
       C5F857C0             vxorps   xmm0, xmm0
       C5FA7F02             vmovdqu  qword ptr [rdx], xmm0

G_M14508_IG12:
       488D55D8             lea      rdx, bword ptr [rbp-28H]
       48B9D0C358CEF77F0000 mov      rcx, 0x7FF7CE58C3D0
       E8811A6C5F           call     CORINFO_HELP_BOX
       4885C0               test     rax, rax
       7479                 je       SHORT G_M14508_IG14
       488B5518             mov      rdx, bword ptr [rbp+18H]
       48B9D0C358CEF77F0000 mov      rcx, 0x7FF7CE58C3D0
       E8691A6C5F           call     CORINFO_HELP_BOX
       48894590             mov      gword ptr [rbp-70H], rax
       488B5590             mov      rdx, gword ptr [rbp-70H]
       48B950363ECEF77F0000 mov      rcx, 0x7FF7CE3E3650
       E8B2046C5F           call     CORINFO_HELP_ISINSTANCEOFINTERFACE
       4885C0               test     rax, rax
       744A                 je       SHORT G_M14508_IG14

G_M14508_IG13:
       488B5518             mov      rdx, bword ptr [rbp+18H]
       48B9D0C358CEF77F0000 mov      rcx, 0x7FF7CE58C3D0
       E83A1A6C5F           call     CORINFO_HELP_BOX
       48894588             mov      gword ptr [rbp-78H], rax
       488B5588             mov      rdx, gword ptr [rbp-78H]
       48B950363ECEF77F0000 mov      rcx, 0x7FF7CE3E3650
       E843056C5F           call     CORINFO_HELP_CHKCASTINTERFACE
       48894580             mov      gword ptr [rbp-80H], rax
       488B4D80             mov      rcx, gword ptr [rbp-80H]
       488B55E8             mov      rdx, gword ptr [rbp-18H]
       49BB0810FACDF77F0000 mov      r11, 0x7FF7CDFA1008
       3909                 cmp      dword ptr [rcx], ecx
       FF155D5CA9FF         call     [IStateMachineBoxAwareAwaiter:AwaitUnsafeOnCompleted(ref):this]
       EB2F                 jmp      SHORT G_M14508_IG15

G_M14508_IG14:
       488B4DE8             mov      rcx, gword ptr [rbp-18H]
       49BB0010FACDF77F0000 mov      r11, 0x7FF7CDFA1000
       3909                 cmp      dword ptr [rcx], ecx
       FF153D5CA9FF         call     [IAsyncStateMachineBox:get_MoveNextAction():ref:this]
       48898578FFFFFF       mov      gword ptr [rbp-88H], rax
       488B9578FFFFFF       mov      rdx, gword ptr [rbp-88H]
       488B4D18             mov      rcx, bword ptr [rbp+18H]
       E83E3DFFFF           call     ConfiguredTaskAwaiter:UnsafeOnCompleted(ref):this
       EB00                 jmp      SHORT G_M14508_IG15

G_M14508_IG15:
       488D65F0             lea      rsp, [rbp-10H]
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

G_M14508_IG16:
       55                   push     rbp
       57                   push     rdi
       56                   push     rsi
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper 
       488B6920             mov      rbp, qword ptr [rcx+32]
       48896C2420           mov      qword ptr [rsp+20H], rbp
       488DADB0000000       lea      rbp, [rbp+B0H]

G_M14508_IG17:
       488955A8             mov      gword ptr [rbp-58H], rdx
       488B4DA8             mov      rcx, gword ptr [rbp-58H]
       48894DC0             mov      gword ptr [rbp-40H], rcx
       488B4DC0             mov      rcx, gword ptr [rbp-40H]
       33D2                 xor      rdx, rdx
       E8D36ABAFF           call     Task:ThrowAsync(ref,ref)
       488D05C0FFFFFF       lea      rax, G_M14508_IG15

G_M14508_IG18:
       4883C430             add      rsp, 48
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

G_M14508_IG19:
       55                   push     rbp
       57                   push     rdi
       56                   push     rsi
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper 
       488B6920             mov      rbp, qword ptr [rcx+32]
       48896C2420           mov      qword ptr [rsp+20H], rbp
       488DADB0000000       lea      rbp, [rbp+B0H]

G_M14508_IG20:
       488955B0             mov      gword ptr [rbp-50H], rdx
       488B4DB0             mov      rcx, gword ptr [rbp-50H]
       48894DB8             mov      gword ptr [rbp-48H], rcx
       488B4DB8             mov      rcx, gword ptr [rbp-48H]
       33D2                 xor      rdx, rdx
       E8936ABAFF           call     Task:ThrowAsync(ref,ref)
       488D0580FFFFFF       lea      rax, G_M14508_IG15

G_M14508_IG21:
       4883C430             add      rsp, 48
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

; Total bytes of code 676, prolog size 62 for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

@stephentoub
Copy link
Member

Tier0 makes particularly bad code for this method whereas Tier1 makes quite good code

The whole point of tier 0 is to be fast; it's expected to often produce poor CQ.

@davidfowl
Copy link
Member

does this actually make a throughout difference in the ASP.NET benchmarks?

Every little bit counts. We're also trying to figure out how to reduce allocations generally across the stack for other reasons (like more interesting GC modes). We've spent lots of time on reducing allocations in the core and it's paid off massively in-terms of performance. We're almost at 1M RPS more than we were in .NET Core 2.2 and that's not in the platform level benchmark, that's the real stack with dependency injection and the HttpContext etc.

I also dislike the fact that these show up more than string allocations when doing "the obvious" benchmark.

The whole point of tier 0 is to be fast; it's expected to often produce poor CQ.

We can flip the question then, does enabling this optimization have a huge impact on startup time?

@stephentoub
Copy link
Member

stephentoub commented Mar 3, 2019

We can flip the question then, does enabling this optimization have a huge impact on startup time?

Tier 0? Yes. That's pretty much its whole reason to exist.

We're also trying to figure out how to reduce allocations generally across the stack for other reasons (like more interesting GC modes). We've spent lots of time on reducing allocations in the core and it's paid off massively in-terms of performance.

In tier 0? Why area we measuring tier 0 throughput/allocations? Why are we piecemeal optimizing tier 0 throughput/allocations by opting out specific methods? If throughput/allocations there matter, why not disable tier 0 entirely rather than piecemeal?

@benaadams
Copy link
Member Author

If throughput/allocations there matter, why not disable tier 0 entirely rather than piecemeal?

Tier0 doesn't particularly increase allocations in the framework? Its just this method where it does.

As precedent AggressiveOptimization was added to the span methods that use intrinsics #22191 because Tier0 also generates dreadful code for them as they are also based around branch elimination based on Jit constants as is this method.

@benaadams
Copy link
Member Author

We can flip the question then, does enabling this optimization have a huge impact on startup time?

Tier 0? Yes. That's pretty much its whole reason to exist.

I assume @davidfowl was referring to specifically this change, rather than switching off Tiering for everything?

@benaadams
Copy link
Member Author

Why area we measuring tier 0 throughput/allocations?

What's the user advice?

"When using dotTrace (or other tool) to look at your program's allocations under load ignore the 76k boxed ValueTaskAwaiters as they go away after a while"

@stephentoub
Copy link
Member

stephentoub commented Mar 3, 2019

What's the user advice? "When using dotTrace (or other tool) to look at your program's allocations under load ignore the 76k boxed ValueTaskAwaiters as they go away after a while"

Turn off tiered compilation, or don't measure the first N requests.

@omariom
Copy link

omariom commented Mar 3, 2019

@jkotas, what is the meaning of AggressiveOptimization ? "Use Tier1" or "Use MaxTier"?
I mean if in the future we get Tier2.

@davidfowl
Copy link
Member

I’m currently turning it off to get a valid representation of what’s actually being allocated and I think that’s the problem.

I was referring to this specific optimization related to ValueTask with an IValueTaskSource, not anything more generic than that.

@stephentoub
Copy link
Member

stephentoub commented Mar 3, 2019

I’m currently turning it off to get a valid representation of what’s actually being allocated and I think that’s the problem.

The problem being that you have to? Then let me reiterate my statement from earlier: "wouldn't it be better to address this by restoring some of the boxing optimizations to apply to tier 0?
#20937 (comment)
#14474"

This is not the only method that incurs boxing allocations in tier 0 that are eliminated in tier 1. We shouldn't be sprinkling these attributes on every such method. If the concern is the ability to measure allocations and setting an environment variable is too onerous, then we should address the root problem. Further, developers doing benchmarking need to avoid measuring startup costs, regardless of tiering.

@davidfowl
Copy link
Member

The problem being that you have to? Then let me reiterate my statement from earlier: "wouldn't it be better to address this by restoring some of the boxing optimizations to apply to tier 0?

I don't have a problem with that, what needs to happen to do that?

This is not the only method that incurs boxing allocations in tier 0 that are eliminated in tier 1. We shouldn't be sprinkling these attributes on every such method. If the concern is the ability to measure allocations and setting an environment variable is too onerous, then we should address the root problem.

This is the one that shows up in the default ASP.NET Core scenario and any scenario using pipelines really, or network stream or anything that uses an IValueTaskSource backed ValueTask. I really dislike the fact that this will show up by default when bench marking your ASP.NET Core application and I don't think it's reasonable to tell customers to turn off tier0 if you want to reduce these allocations. Maybe there are a ton of places allocations show up in tier0, we should make sure those are all fine but this is the one that is at the top of every ASP.NET Core project by default.

That said, I have no idea what research went into deciding which optimizations on or off nor do I know what applications were used or what stats were observed.

@benaadams
Copy link
Member Author

"wouldn't it be better to address this by restoring some of the boxing optimizations to apply to tier 0?

Yes; but that is "Future"

@benaadams
Copy link
Member Author

benaadams commented Mar 4, 2019

Its about 37MB before Tier1 kicks in under load

image

Though that's across several different methods (~18MB for a hot method)

image

@jkotas
Copy link
Member

jkotas commented Mar 4, 2019

@jkotas, what is the meaning of AggressiveOptimization ? "Use Tier1" or "Use MaxTier"?

It is likely that AggressiveOptimization is not going to mean Tier1 in future: https://github.com/dotnet/corefx/issues/32235#issuecomment-424068621

As precedent AggressiveOptimization was added to the span methods that use intrinsics #22191

This is not a good precedent. #22191 enabled AggressiveOptimization for a small number of methods only. This change is enabling AggressiveOptimization for many methods (generic code stamping).

impact on startup time

Yes, the startup time is what needs to be measured for optimizations like this. The allocation rate is a secondary metric.

restoring some of the boxing optimizations to apply to tier 0

Agree that this would be a better way to fix this.

If it helps, we may consider introducing a helper method (can be internal for now) to hint the JIT that it is important to devirtualize and unbox-eliminate certain callsites, e.g. RuntimeHelpers.DevirtualizeHint<TAwaiter, IStateMachineBoxAwareAwaiter>(awaiter).AwaitUnsafeOnCompleted(box). It would help to document places in the code that depend on devirtualization for reasonable performance.

@AndyAyersMS
Copy link
Member

I will look into what it would take to enable the box optimizations for Tier0.

I'm not sure what the dependence closure for this is just yet. Seems like we will also need to enable ref class tracking, type intrinsic expansions, and type test optimization. These are all plausible candidates for enabling in Tier0 by default anyways: they happen early, do not require a lot of in-depth analysis, and should generally make the jit run faster (or at least, no slower), And when they fire, the the the jit will produce smaller and faster jitted code.

If we also need inlining to get at these cases, then perhaps some form of hinting will be needed.

@benaadams
Copy link
Member Author

If we also need inlining to get at these cases, then perhaps some form of hinting will be needed.

For this particular method Unsafe:As(byref):byref seems to be the main one

Inlines into 06003583 AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this
  [0 IL=0002 TR=000003 06003584] [FAILED: too many il bytes] AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this
  [1 IL=0043 TR=000052 06005608] [aggressive inline attribute] Unsafe:As(byref):byref
  [0 IL=0057 TR=000063 0600365D] [FAILED: unprofitable inline] TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Mar 7, 2019

I have something that seems to work passably well. Preview here: EnableBoxingOptsTier0.

Enabling this in tier 0 requires

  • intrinsic expansion
  • inline box expansion and use of a unique box temp
  • importer cast optimization
  • expression folding
  • devirtualization
  • branch optimizations (as a workaround, but probably not a bad idea in general)

Inlining is not required to optimize the async state machine box cases.

On the example code from #14698:

Top method improvements by size (bytes):
       -2436 (-50.08% of base) : ex.dasm - System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[Int64][System.Int64]:AwaitUnsafeOnCompleted(byref,byref):this (7 methods)
        -168 (-7.42% of base) : ex.dasm - System.Runtime.CompilerServices.AsyncMethodBuilderCore:Start(byref) (8 methods)
          -4 (-2.06% of base) : ex.dasm - System.Threading.Tasks.Task`1[Int64][System.Int64]:TrySetResult(long):bool:this

Top method regressions by size (percentage):
         297 (10.54% of base) : ex.dasm - System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[Int64][System.Int64]:GetStateMachineBox(byref):ref:this (7 methods)

Top method improvements by size (percentage):
       -2436 (-50.08% of base) : ex.dasm - System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[Int64][System.Int64]:AwaitUnsafeOnCompleted(byref,byref):this (7 methods)
        -168 (-7.42% of base) : ex.dasm - System.Runtime.CompilerServices.AsyncMethodBuilderCore:Start(byref) (8 methods)
          -4 (-2.06% of base) : ex.dasm - System.Threading.Tasks.Task`1[Int64][System.Int64]:TrySetResult(long):bool:this

Broader Tier0 PMI (using the new COMPlus_JitTier0) shows at least one failure, so this is not fully baked yet.

The regression in GetStateMachineBox comes from the inline box expansion -- ideally we'd only do inline box expansion when it leads to optimization, but at the time we process the box we can't easily anticipate if optimization will happen, and we don't have any kind of undo mechanism. That is the only really expansive transformation enabled here; all the others should generally reduce and simplify the jit IR.

Based on partial PMI results, it looks roughly like a ~1.5% aggregate code size decrease (biased somewhat by large size decreases for in Vector and HW intrinsic code). So something along these lines should produce smaller and faster code, jit a bit faster, and allocate less on the heap.

Will attempt some throughput measurements soon.

@AndyAyersMS
Copy link
Member

Issue was that somehow in lower we retype int to void in degenerate branch forms like:

N001 (  1,  1) [000314] ------------       t314 =    CNS_INT   int    1
                                                 /--*  t314   int    
N002 (  3,  3) [000112] ----G-------              *  JTRUE     void  

worked around it for now by also enabling fgFoldConditional to get rid of these before lower can get tripped up.

"tier0" pmi diffs:

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -724138 (-1.59% of base)
    diff is an improvement.
Top file regressions by size (bytes):
         817 : System.Runtime.Numerics.dasm (0.85% of base)
         774 : System.Text.RegularExpressions.dasm (0.30% of base)
         390 : System.IO.FileSystem.dasm (0.33% of base)
         384 : System.Net.Primitives.dasm (0.46% of base)
         337 : System.Private.Uri.dasm (0.27% of base)
Top file improvements by size (bytes):
     -325320 : System.Private.CoreLib.dasm (-5.44% of base)
      -66455 : System.Private.Xml.dasm (-1.56% of base)
      -41043 : Microsoft.CodeAnalysis.CSharp.dasm (-0.82% of base)
      -33919 : CommandLine.dasm (-6.15% of base)
      -26907 : System.Data.Common.dasm (-1.99% of base)
119 total files with size differences (100 improved, 19 regressed), 10 unchanged.
Top method regressions by size (bytes):
        1184 (21.94% of base) : System.Private.Xml.dasm - XmlSchemaInference:InferSimpleType(ref,byref):int
         619 ( 7.23% of base) : System.Collections.dasm - Enumerator:System.Collections.IEnumerator.get_Current():ref:this (45 methods)
         615 ( 6.85% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:FoldNeverOverflowBinaryOperators(int,ref,ref):ref
         472 (53.82% of base) : System.Private.CoreLib.dasm - Path:Populate83FileNameFromRandomBytes(long,int,struct)
         453 ( 8.28% of base) : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeArgument(byref,struct):struct:this (5 methods)
Top method improvements by size (bytes):
      -12274 (-31.96% of base) : System.Private.CoreLib.dasm - Vector`1:SquareRoot(struct):struct (4 methods)
       -8064 (-22.69% of base) : System.Private.CoreLib.dasm - Vector`1:CopyTo(ref,int):this (4 methods)
       -7983 (-12.46% of base) : System.Private.CoreLib.dasm - Vector`1:Min(struct,struct):struct (4 methods)
       -7983 (-12.46% of base) : System.Private.CoreLib.dasm - Vector`1:Max(struct,struct):struct (4 methods)
       -7956 (-14.39% of base) : System.Private.CoreLib.dasm - Vector`1:Equals(struct,struct):struct (4 methods)
Top method regressions by size (percentage):
         145 (94.16% of base) : Microsoft.CodeAnalysis.CSharp.dasm - MethodTypeInferrer:DependsTransitivelyOn(int,int):bool:this
          46 (92.00% of base) : Microsoft.DotNet.ProjectModel.dasm - <GetRangesWithSourceLocations>d__8:System.Collections.IEnumerator.get_Current():ref:this
          40 (88.89% of base) : System.Private.CoreLib.dasm - DTSubString:get_Item(int):ushort:this
         141 (88.12% of base) : Microsoft.CodeAnalysis.CSharp.dasm - MethodTypeInferrer:UpdateDependenciesAfterFix(int):this
          42 (87.50% of base) : Microsoft.CodeAnalysis.dasm - SyntaxTriviaList:get_Nodes():ref:this
Top method improvements by size (percentage):
       -3879 (-93.16% of base) : System.Private.CoreLib.dasm - Vector64`1:get_IsSupported():bool (4 methods)
       -3879 (-93.16% of base) : System.Private.CoreLib.dasm - Vector128`1:get_IsSupported():bool (4 methods)
       -3879 (-93.16% of base) : System.Private.CoreLib.dasm - Vector256`1:get_IsSupported():bool (4 methods)
       -4734 (-90.78% of base) : System.Private.CoreLib.dasm - ThrowHelper:ThrowForUnsupportedVectorBaseType() (5 methods)
       -6113 (-80.81% of base) : System.Private.CoreLib.dasm - MemoryExtensions:IsTypeComparableAsBytes(byref):bool (7 methods)
16057 total methods with size differences (13146 improved, 2911 regressed), 177877 unchanged.

Looked into a few of the regressions and they're all from the inline box expansion; the worst cases the box has a lot of GC refs and we expand into a series of helper calls to copy these over.

@AndyAyersMS
Copy link
Member

I have looked at the code size and jit throughput impact of always generating Tier1 code for AwaitUnsafeOnCompleted as is proposed by this PR. Using a sample of 25 different instantiations of different awaiters, measuring jit time and code size with perfview on a 3.0 preview SDK:

Jit Mode Total Time (ms) Total Size Avg Time (ms) Avg Size Note
Tier0 6.3 16509 0.25 660 current behavior
Tier1 7.9 2553 0.32 102 this PR

Average increase in jit time per method from Tier1 is 25%, but is small in absolute terms, around 70 microseconds per method. The Tier1 codegen is considerably more compact, around 100 bytes on average versus 660 bytes.

The prototype Tier0 + opts referred to above would likely jit even faster (~25%?) than the current Tier0 and would produce somewhere around 300 bytes of code per method.

For the entire 3.0 cycle, Tier0 and minopts have generated the same code, and we have done a lot of testing on this combined mode. Adding optimizations to Tier0 would make Tier0 diverge from minopts, and we would have to increase the scope of our testing considerably to feel confident the now divergent Tier0 jit codegen mode was high quality (basically forcing Tier0 codegen for all jitting, running through all our stress matrix, etc). So general consensus on the jit team is that it is too late in the cycle to add optimizations to Tier0.

My suggestion is that we consider taking the change in this PR for 3.0 as a targeted fix for this set of problematic boxes, and that we pursue the more general fix of enabling Tier0 box removal optimizations -- and perhaps more -- in a subsequent release where we have more time to evaluate and test.

@AndyAyersMS
Copy link
Member

Right, there are several different patterns we need to optimize here:

  • boxing for value class check
  • boxing for type tests
  • boxing for interface call

The first two could be handled by lookahead. The third is more complicated. So we could reduce allocations somewhat via simple pattern matches, but could not entirely eliminate them.

@jkotas
Copy link
Member

jkotas commented Mar 14, 2019

Other problematic boxing patterns with generics is cast to T which has to go via object for C#

This one is handled by pattern match already.

The first two could be handled by lookahead.

I think the pattern match is worth doing for the first two cases. I will think about the third one.

@davidfowl
Copy link
Member

However, I believe it is very likely regression for more realistic apps that hit hundreds or even thousands of these instantiations, and that are not immediately hammered by peak request load.

How do we prove there's no regression?

@vancem
Copy link

vancem commented Mar 14, 2019

One can probably disable these similar to disabling other events, though I don't know the magic syntax.

perfVIew /CircularMB=4000 /kernelEvents=default-NetworkTcpIP /providers:.NETTasks:0:0,.NETFramework:0:0,Microsoft-Diagnostics-DiagnosticSource:0:0 /dotnetalloc collect

Should also turn off the DiagnosticSource provider. Note you can also force PerfView to only give what you really need (rather than subtracting what you don't like from the defaults. For exampl

    perfVIew /CircularMB=4000 /kernelEvents=default-NetworkTcpIP /clrEvents=JitSymbols+GC+Stack 

Should give you just CPU information and some coarse (every 100K, low overhead) memory allocation information. (you can reduce it to just CPU by removing the GC+Stack in the /clrEvents). You can try to add /dotnetalloc back and it should work but I have not tried that combination.

@vancem
Copy link

vancem commented Mar 14, 2019

I would like to summarize the issue to this point (if only for my own benefit). First we now have some good data that Ben collected,, namely https://aoa.blob.core.windows.net/aspnet/PerfViewData3.etl.zip which shows the whole end-to-end (with the JITTing and tiering, and allocations, etc), so it does tell us quite a bit. Note that this data should NOT be used to judge CPU costs (only memory allocations and general flow), as we KNOW that you have to turn off MANY of PerfVIew's defaults to get unperturbed CPU traces (see above). However I don't think we need that.

The scenario is PlatformBenchmarks.exe (very simple ASP.NET Core app). It runs for about 34 seconds, and by about 7 seconds after process launch it has Tier-1 Jitted enough methods that the allocations that started this are optimized away and life is good.

The concern is that 7 seconds is alot of time, and that we don't like all the allocations in the first 7 seconds. Note that in this example the amount of allocation you can avoid adds up to about 5 to 6MB (not that big really), This is dominated by a single type ValueTaskAwaiter (thus a single optimization will fix this instance).

OK, that is the data. Why do we Care, what is surprising? What should we do?

  • First, you can argue that this is not a big deal. 5MB in the first 7 seconds and then stops from the on is probably not a big deal (our GC is very good at cleaning these very short lived objects up efficiently). Mostly it is about the 'noise' when running benchmarks (they want to be measuring steady state, but they are not). It probably does not affect CPU or wall clock time much. Certainly in terms of averages over a run of 30 seconds (benchmarks tend to run at least that long) expecially with a 'warm up (again over 10 seconds of warmup is reasonably common). It probably does not add up to much.

  • Aruably, our heuristics are NOT working well in this case. 7 Seconds is a long time, and if you look at the trace, most Tier 1 JITTing is done 0.5 seconds into the trace, For some reason (and it is not the 100 msec reset that Andy talks about, it waits and additional 3.7 seconds before doing Tier1 Jit. Then it does it in fits and starts for the next 2.3 seconds (there are 500 msec gaps in this JItting). This cearly could be improved. It seems reasonable that tier-1 JITTing could start well before second 1, and could be done 1 second after that (I am being pretty generous). Thus something like 2 seconds seems achievable. Now maybe I am being unrealistic, but it seems like we should at least investigate to determine why my 2 second goal is unrealistic.

  • One way of attacking the problem is to make the JIT smarter on Tier 0. Jan was suggesting this. I think this is interesting, but only if it is easy and maintainable. We don't have reason (so far) to believe that these particular optimizations are so importnat that we should include them in Tier 0.

  • Another approach is to decorate 'critical' methods with the [AgressiveOptimziation] flag. This is what this PR suggests. Note that scnearios like this are WHAT THAT ATTRIBUTE ARE FOR. THe biggest problem with this approach is 'where to draw the line' (and whether other approaches (e.g. the JIT or shortening the heuristics), obviate the need for this. What I will say, I dont' have a problem with this PR (that is adding the attribute), because it really is saying 'I know this part of the runtime is VERY HOT' as long as are talking about a dozen or two of these, not hundreds or more, I just don't see it as a big problem.

So what to do?

  1. I think we SHOULD investigate the rejit heurisitcs, as I think we can reduce the time (which is goodness everywhere).
  2. I am OK with this PR, but I could also live with waiting to see if (1) solves the problem.
  3. If the JIT team is HAPPY to have these optimization in its 'unoptimzied' code, than sure, do that. If they would rather not, I would wait until we see if (1) or (2) is sufficient.

And frankly we can decide that this just is not a big enough problem (but I would argue we should still do (1)).

For what it is worth...

@benaadams
Copy link
Member Author

There's also the aspect that every async method in a .NET Core 3.0 app will allocate these boxes the first N times it is called. It isn't just at startup, it depends on the flow of the program and when the methods are called.

@kouvel
Copy link
Member

kouvel commented Mar 14, 2019

For some reason (and it is not the 100 msec reset that Andy talks about, it waits and additional 3.7 seconds before doing Tier1 Jit. Then it does it in fits and starts for the next 2.3 seconds (there are 500 msec gaps in this JItting). This cearly could be improved. It seems reasonable that tier-1 JITTing could start well before second 1, and could be done 1 second after that (I am being pretty generous). Thus something like 2 seconds seems achievable. Now maybe I am being unrealistic, but it seems like we should at least investigate to determine why my 2 second goal is unrealistic.

The 100 ms is a window, as long as a new method is called in the last 100 ms it will keep delaying tier 1 activities. So the total delay will last as long as new methods are being called frequently. The reason for this is that call counting (especially) and background tier 1 jitting (to some degree) have significant overhead together, and allowing tier 1 jitting to happen during those startup-like activities was slowing down startup time considerably. It should be possible to decrease the delay some amount by making call counting cheaper, but it will probably take more informed heuristics to eliminate the delay without sacrificing startup time, for example jitting a method that is totally worth it during startup should be ok, as the JIT time would be very quickly compensated for, but tier 1 jitting a method that does not give much perf gain though it is called frequently during startup, may not be worth the overhead during startup. There is an environment config option COMPlus_TieredCompilation_Tier1CallCountingDelayMs to configure the delay window (default is 0x64 and it can be set to 0 to disable the delay).

@vancem
Copy link

vancem commented Mar 14, 2019

For some reason (and it is not the 100 msec reset that Andy talks about, it waits and additional 3.7 seconds before doing Tier1 Jit.
the 100 ms is a window, as long as a new method is called in the last 100 ms it will keep delaying tier 1 activities

It is not as simple as a resettable 100 msec window. If you look at https://aoa.blob.core.windows.net/aspnet/PerfViewData3.etl.zip you will see a (Tier 0) JIT event at 4,949.143 and then one at 5,596.865 (over 500 msec later) but not Tier1 JItting happening. Then there is a Tier 1 JIT at 6,909.603 (over a second later!) and then again nothing until a Tier 0 JIT at 8,815.401 (1.8 seconds later!).

Now this behavior may be explainable based on App Behavior, but it is not obvious on the surface, and my understanding of this app is that it is pretty simple (it does a small 'hello world' like operation MANY times), which suggests it is not the app that is causing JIT's to 'spread out' like this. It seems worthy of an investigation.

@jkotas
Copy link
Member

jkotas commented Mar 14, 2019

@vancem Thank you for writing down the summary. We may want to move the discussion about the heuristic problem and the boxing problem into separate issues.

Note that scenearios like this are WHAT THAT ATTRIBUTE ARE FOR.

I agree with general, but I have problem with this attribute being applied to this specific method. This specific method is among the top most frequently instantiated generic method ever. It feels like a wrong trade-off to disable Tier0 for all instantiations of this method when only very few frequently called instantiations really need it.

If this was a regular non-generic method, I would be perfectly fine with this PR.

these optimization in its 'unoptimzied' code

This pattern-matching based optimization does exist for the unoptimized code already for the box+unbox case and it is very simple. I am proposing to extend this pattern match for few more cases to handle the first two cases above.

@AndyAyersMS
Copy link
Member

For Ben's V3 trace I think the initial pause in jitting may just be the way he ran things -- per instructions above after launching the app one must switch to a new window to run wrk to apply the load, and this might take a few seconds. During that time the app is largely idle and so maybe few if any methods reach the 30 count threshold to trigger tier1. Once he load was applied we went back into Tier0 mode to jit the newly required methods, and when we made it through they quickly hit the 30 count threshold and we spent quite a bit of time catching up on the tier1 backlog.

image

Small vertical lines are at 100ms intervals.

Not sure yet how to account for the gaps in the tier1 method production rate; ParseHeaders takes 50ms but those three gaps are much larger. See detail plot below...

image

Would be nice to have some Tier1 queue events: start, stop, and for stop, because of quue empty or worker thread yield.

@benaadams
Copy link
Member Author

may just be the way he ran things -- per instructions above after launching the app one must switch to a new window to run wrk to apply the load, and this might take a few seconds.

Sounds likely 😄

@benaadams
Copy link
Member Author

What does Tier1 use to Jit? Does it use the regular threadpool?

If so its quite heavily loaded, so might just be the number of items in the threadpool queue before it. Or does it use its own threads?

@kouvel
Copy link
Member

kouvel commented Mar 15, 2019

It uses a thread pool thread, and yields it after at least 50 ms of work has been done and queues another work item. It's a native work item though, so it doesn't necessarily fall behind the managed work queue. I'll take a closer look at what's causing those gaps, if that's the issue it should be fairly easy to switch to a use a separate thread.

@AndyAyersMS
Copy link
Member

Green bars below shows ranges of Tier1 jit events that all happen on the same thread. So indeed those other thread event gaps correspond to switching which thread handles Tier1 jitting.

image

@kouvel
Copy link
Member

kouvel commented Mar 17, 2019

Here is some raw data from instrumenting tiering. In this particular run, I waited a while after server start before starting the load. The timings are similar when starting the load immediately after server start.

Legend for column "Tiering event span (ms)"

  • Orange - Span of time during which tiering was paused due to startup-like activities
  • Black - Span of time during which tiering was not paused

Legend for column "Tier 1 jit event span (ms)"

  • Orange - Span of time during which tier 1 jitting was paused due to tiering being paused (due to startup-like activities)
  • Red - Span of time during which tier 1 jitting was not happening due to yielding the thread pool thread and waiting for another unmanaged thread pool work item to run to continue tier 1 jitting
  • Green - Tier 1 jit queue is empty
  • Black - Span of time during which tier 1 jitting was happening

image

Notes:

  • The largest gaps in tier 1 jitting (Green) are because the tier 1 JIT queue is empty. During those gaps, call counting is happening and methods have not reached tier 1 yet.
  • The second-largest gaps in tier 1 jitting (Orange) are because tiering is paused due to startup-like activities (new methods have been called for the first time recently)
  • Gaps in tier 1 jitting due to yielding the thread pool thread (Red) are very short
    • The largest gap I was able to get is a bit less than 6 ms, once when starting the load immediately after server start
  • I don't see anything abnormal happening here

I'm not able to reproduce the latency issue locally:

Tier=off

  4 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.94ms    4.11ms  93.10ms   94.84%
    Req/Sec   211.90k    21.25k  308.85k    80.37%
  12684736 requests in 15.08s, 1.08GB read
Requests/sec: 841303.51
Transfer/sec:     73.01MB

Tier=on

  4 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.71ms    2.86ms  80.39ms   91.42%
    Req/Sec   215.33k    21.14k  270.38k    78.36%
  12869904 requests in 15.04s, 1.09GB read
Requests/sec: 855721.77
Transfer/sec:     74.26MB

There is some error in RPS but these are somewhat representative, and the latency and latency standard deviation are fairly stable and reproducible.

  • Latency standard deviation is consistently high, and consistently much higher when tiering is off
  • RPS is roughly the same considering error in the 6 samples each that I took
  • I got very similar results when delaying the load following server start, and when starting the load immediately following server start
  • I wasn't able to reproduce any socket errors in my local runs
  • I used wrk on an Ubuntu VM running on the same machine to generate the load
  • It appears as though the latency issue seen above is not related to tiering, maybe has something to do with socket errors or something else?

@jkotas
Copy link
Member

jkotas commented Mar 29, 2019

Average increase in jit time per method from Tier1 is 25%

I am not able to reproduce this result. I see up to ~2x improvement in jit time with this change, the worst cases are shared generics that run up to 10% slower in extreme cases.

I have created a benchmark to allow me to measure the JIT time for different combinations in isolation: https://gist.github.com/jkotas/102dc708cca8d2c85002cb47bdd49870 . Here are some of the results:

AsyncTaskMethodBuilder<int> + TaskAwaiter<int> before: 2782ms, after: 1562ms
AsyncTaskMethodBuilder<string> + TaskAwaiter<string> before: 4156ms , after: 3046ms
AsyncTaskMethodBuilder<int> + ValueTaskAwaiter<int> before: 2781ms , after: 2391ms
AsyncTaskMethodBuilder<(string,int)> + ValueTaskAwaiter<(int,int,string)> before: 4219ms, after: 4500ms

As far as I can tell, this change is improvement for all performance metrics, including startup time with Tier0 JIT enabled. The jit speed improvement from optimizing out large amount of dead code that is always present in instantiations of this method dwarfs the cost of the optimizations.

Also, I have discovered other issues related to this while looking into this that I will submit PRs for separately.

@jkotas
Copy link
Member

jkotas commented Mar 29, 2019

Thank you all for a great discussion!

@jkotas jkotas merged commit dc0d072 into dotnet:master Mar 29, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Mar 29, 2019
Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
marek-safar pushed a commit to mono/mono that referenced this pull request Mar 29, 2019
Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Mar 29, 2019
Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@benaadams benaadams deleted the allocs-AwaitUnsafeOnCompleted branch March 29, 2019 09:33
jkotas pushed a commit to dotnet/corert that referenced this pull request Mar 29, 2019
Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Mar 29, 2019
Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
stephentoub pushed a commit to dotnet/corefx that referenced this pull request Mar 30, 2019
Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants