Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Do not inline methods that never return #6103

Merged
merged 2 commits into from
Aug 5, 2016
Merged

Conversation

mikedn
Copy link

@mikedn mikedn commented Jul 2, 2016

Methods that do not contain return blocks can't ever exit normally, they either throw or loop indefinitely. Inlining is not beneficial in such cases as it increases the code size without providing any speed benefits. In the particular case of throws the inlined code can easily be 10 times larger than the call site.

The call to fgMoreThanOneReturnBlock has been replaced with code that does the same thing but also detects the existence of at least one return block. This avoids walking the basic block list twice.

Note that BBJ_RETURN blocks are also generated for CEE_JMP. Methods exiting via CEE_JMP instead of CEE_RET will continue to be inlined (assuming they were inlined before this change).

@omariom
Copy link

omariom commented Jul 2, 2016

@mikedn
Will it be like throw in terms of optimizations around of callsites to such methods?

@mikedn
Copy link
Author

mikedn commented Jul 2, 2016

Will it be like throw in terms of optimizations around of the callsites to such methods?

Yes. I'll post an example later.

Anyway, note the WIP title 😄. This very likely needs more work. I'm just trying to see if it works and show something to the JIT team to get some feedback.

@omariom
Copy link

omariom commented Jul 2, 2016

Here is an especially bad example of inlining such methods:
It is from FindEntry of Dictionary<TKey,TValue> when TKey is a ref type.
Turtles all the way down 😄

private int FindEntry(TKey key) {
    if( key == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
    }
...
       test     rdi, rdi
       jne      G_M14733_IG03
       mov      rcx, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      rdi, rax
       mov      ecx, 5
       call     System.ThrowHelper:GetArgumentName(int):ref
       mov      rbx, rax
       mov      ecx, 0x1C4ED
       mov      rdx, 0xD1FFAB1E
       call     CORINFO_HELP_STRCNS
       mov      rcx, rax
       call     System.Environment:GetResourceStringLocal(ref):ref
       mov      rbp, rax
       mov      rcx, rdi
       call     System.Exception:Init():this
       lea      rcx, bword ptr [rdi+32]
       mov      rdx, rbp
       call     CORINFO_HELP_ASSIGN_REF
       mov      dword ptr [rdi+132], 0xD1FFAB1E
       lea      rcx, bword ptr [rdi+144]
       mov      rdx, rbx
       call     CORINFO_HELP_ASSIGN_REF
       mov      dword ptr [rdi+132], 0xD1FFAB1E
       mov      dword ptr [rdi+132], 0xD1FFAB1E
       mov      rcx, rdi
       call     CORINFO_HELP_THROW

G_M14733_IG03:

@mikedn
Copy link
Author

mikedn commented Jul 2, 2016

Yeah, with this change that FindEntry code becomes:

G_M14720_IG02:
       test     rdi, rdi
       jne      SHORT G_M14720_IG03
       mov      ecx, 5
       call     System.ThrowHelper:ThrowArgumentNullException(int)
G_M14720_IG03:

It's still not quite right, I was expecting the conditional jump to be inverted and the call be moved to a cold block. Seems like fgRemoveRestOfBlock doesn't always do what I need.

PS: After workaround for BAD_INLINEE FindEntry looks fine.

@mikedn
Copy link
Author

mikedn commented Jul 2, 2016

Here's an example that works as expected:

static void ThrowArgumentNullException() {
    throw new ArgumentNullException();
}
static int Main(string[] args) {
    if (args == null) {
        ThrowArgumentNullException();
        Console.WriteLine("bad");
    }
    return args.Length;
}

Without this change the generated code is:

G_M45812_IG01:
       56                   push     rsi
       4883EC20             sub      rsp, 32
G_M45812_IG02:
       4885C9               test     rcx, rcx
       7522                 jne      SHORT G_M45812_IG03
       48B94806C64CFE7F0000 mov      rcx, 0x7FFE4CC60648
       E857EC415F           call     CORINFO_HELP_NEWSFAST
       488BF0               mov      rsi, rax
       488BCE               mov      rcx, rsi
       E87422DD5D           call     System.ArgumentNullException:.ctor():this
       488BCE               mov      rcx, rsi
       E8A4AAE95E           call     CORINFO_HELP_THROW
G_M45812_IG03:
       8B4108               mov      eax, dword ptr [rcx+8]
G_M45812_IG04:
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret

With this change the generated code is:

G_M45811_IG01:
       4883EC28             sub      rsp, 40
G_M45811_IG02:
       4885C9               test     rcx, rcx
       7408                 je       SHORT G_M45811_IG05
G_M45811_IG03:
       8B4108               mov      eax, dword ptr [rcx+8]
G_M45811_IG04:
       4883C428             add      rsp, 40
       C3                   ret
G_M45811_IG05:
       E8E2FBFFFF           call     Program:ThrowArgumentNullException()
       CC                   int3

If fgRemoveRestOfBlock is not set the generated code is:

G_M45811_IG01:
       56                   push     rsi
       4883EC20             sub      rsp, 32
       488BF1               mov      rsi, rcx
G_M45811_IG02:
       4885F6               test     rsi, rsi
       7517                 jne      SHORT G_M45811_IG03
       E8E6FBFFFF           call     Program:ThrowArgumentNullException()
       48B9683057D239020000 mov      rcx, 0x239D2573068
       488B09               mov      rcx, gword ptr [rcx]
       E86465DD5D           call     System.Console:WriteLine(ref)
G_M45811_IG03:
       8B4608               mov      eax, dword ptr [rsi+8]
G_M45811_IG04:
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret

The throw method is not inlined but there are various shortcomings:

  • the throw call is not moved to a cold block
  • ecx is killed by the call so it is moved to a call preserved register and that requires that register to be preserved
  • dead code that follows the call isn't removed

@mikedn
Copy link
Author

mikedn commented Jul 2, 2016

Diff summary:

Total bytes of diff: -105613
    diff is an improvement.
Top file improvements by size (bytes):
      -97812 : System.Private.CoreLib.dasm
       -3523 : System.Net.Http.dasm
       -1843 : System.Collections.Immutable.dasm
       -1726 : System.Runtime.Extensions.dasm
        -348 : System.Xml.ReaderWriter.dasm
7 total files with size differences.
Top method improvements by size (bytes):
       -3345 : System.Private.CoreLib.dasm - System.ThrowHelper:IfNullAndNullsAreIllegalThenThrow(ref,int)
       -1549 : System.Private.CoreLib.dasm - System.Diagnostics.Tracing.PropertyValue:GetFactory(ref):ref
        -780 : System.Private.CoreLib.dasm - System.Text.StringBuilder:AppendFormatHelper(ref,ref,struct):ref:this
        -303 : System.Net.Http.dasm - System.Net.Http.WinHttpHandler:SetRequestHandleOptions(ref):this
        -287 : System.Private.CoreLib.dasm - System.Threading.Tasks.GenericDelegateCache`2[__Canon,VoidTaskResult][System.__Canon,System.Threading.Tasks.VoidTaskResult]:.cctor()
1037 total methods with size differences.

Most of the improvements are in System.Private.CoreLib due to its high usage of throw helpers.

It may be worth noting that this works only if the throw helper wasn't marked as no-inline. If it's no-inline the JIT won't figure out that it does not return.

@omariom
Copy link

omariom commented Jul 2, 2016

@mikedn
Another case to check. ThrowHelper has methods that don't throw themselves but call other methods that throw. List's indexer calls such helper.
Will the change work for this case?

@mikedn
Copy link
Author

mikedn commented Jul 3, 2016

Another case to check. ThrowHelper has methods that don't throw themselves but call other methods that throw. List's indexer calls such helper. Will the change work for this case?

Sort of. The parameterless helper is inlined and the other one is not. Neither should be inlined but it may be problematic to treat the parameterless helper as no-return as well.

I'll see if something can be done about this once I figure out what's going on with FindEntry.

@mikedn
Copy link
Author

mikedn commented Jul 3, 2016

The FindEntry issue (and many similar ones) was caused by a lack of no-return information on the second attempt to inline ThrowArgumentNullException. The first time the CORINFO_FLG_BAD_INLINEE is set on the inline and the second time the JIT doesn't attempt to inline anymore and thus doesn't know that the inline is no-return. There needs to be a CORINFO_FLG_NO_RETURN flag to deal with this.

I'll update the diff summary, now there are no size regressions.

@BruceForstall
Copy link
Member

cc @AndyAyersMS

else if ((call->gtFlags & GTF_CALL_NO_RETURN) != 0)
{
szFailReason = "No return";
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what motivated this change? Is there some downside to tail calling a no-return method (eg larger calling sequence?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tail call code is part of the epilog but with my change no epilog is generated anymore because the BB is converted to BBJ_THROW. The net result is that a test failed because an exception was no longer being thrown.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly the kind of thing that I'd like to see in a comment. A pointer to the test case that failed w/o this exclusion would be useful here too.

@AndyAyersMS
Copy link
Member

@mikedn Thanks for this, it looks very promising.

Can you work up some micro-benchmarks for these cases? You can use the pattern from the inline microbenchmarks I checked in recently -- they are very sensitive to small changes in codegen.

Also a general perf run would be nice. This is something that should be available soon from automation but in the meantime I can do some runs for you or show you how to run it -- however I won't get to it until later in the day.

@mikedn
Copy link
Author

mikedn commented Jul 5, 2016

Can you work up some micro-benchmarks for these cases?

I'll try. I doubt that we'll see anything significant in terms of performance, this is first and foremost a size optimization.

@mikedn
Copy link
Author

mikedn commented Jul 5, 2016

I doubt that we'll see anything significant in terms of performance, this is first and foremost a size optimization.

Well, they are right when they say that micro-benchmarks can prove anything. This little test runs in 500ms without my fix and 375ms with my fix. And if someone has the bad idea of applying [MethodImpl(MethodImplOptions.NoInlining)] to ThrowArgumentNullException then it's 575ms.

I'll see if I can make a JIT benchmark out of it. Hmm, project.json... shudder...

@AndyAyersMS
Copy link
Member

Please don't create a new project.json. Re-use one of the common ones we have set up for the JIT, like this one.

And here's a template for what the test should look like....

@mikedn
Copy link
Author

mikedn commented Jul 5, 2016

Please don't create a new project.json. Re-use one of the common ones we have set up for the JIT, like this one.

I know, I know, it's just that there are certain aspects of project.json and the build system that are annoying. Like the warnings generated by those test project.json...

I added a benchmark but I haven't been able to run it yet.

@omariom
Copy link

omariom commented Jul 5, 2016

@mikedn

And if someone has the bad idea of applying [MethodImpl(MethodImplOptions.NoInlining)] to ThrowArgumentNullException..

👇

System.Convert::ThrowByteOverflowException 
System.Convert::ThrowCharOverflowException 
System.Convert::ThrowInt16OverflowException 
System.Convert::ThrowInt32OverflowException 
System.Convert::ThrowInt64OverflowException 
System.Convert::ThrowSByteOverflowException 
System.Convert::ThrowUInt16OverflowException 
System.Convert::ThrowUInt32OverflowException 
System.Convert::ThrowUInt64OverflowException 

System.IO.Compression.DeflateStream::ThrowCannotReadFromDeflateStreamException 
System.IO.Compression.DeflateStream::ThrowCannotWriteToDeflateStreamException 
System.IO.Compression.DeflateStream::ThrowStreamClosedException 
System.IO.Compression.GZipStream::ThrowStreamClosedException 

👀

@AndyAyersMS
Copy link
Member

Thanks for the benchmark.

Since FindEntry was one of the motivating cases, can we measure any performance improvement there?

@mikedn
Copy link
Author

mikedn commented Jul 8, 2016

Since FindEntry was one of the motivating cases, can we measure any performance improvement there?

I tried a few things but the performance of FindEntry doesn't seem to be affected in any way by this change. I also played with a list indexer and that one is slightly faster with this change - 100,000,000 get_Item take 80ms without this change and 70ms with this change.

However, I don't think that the performance improvement shown by the previous benchmark and get_Item is directly related to this change. It seems that the JIT has an unrelated problem - sometimes throw blocks aren't moved to the cold region and this change happens to avoid that problem. If the throw block is already in the cold region then not inlining has no measurable performance impact..

@omariom
Copy link

omariom commented Jul 8, 2016

take 80ms without this change and 70ms with this change.

With your change it is 12.5% faster 👍

@mikedn
Copy link
Author

mikedn commented Jul 8, 2016

With your change it is 12.5% faster

12.5% makes it sound like it is a lot. But it really isn't, at 3.10GHz that my CPU has the difference is less than half a cycle per get_Item call. I'm not even sure how to explain such a difference. The number of instructions is identical in both cases. The branch is always taken so it shouldn't be a branch prediction issue. Most likely the throw code being in the fall-through path has a small impact on instruction fetching & decoding.

@AndyAyersMS
Copy link
Member

Smaller code is generally preferable even if it's not faster. So I'm happy with the CS benefits you see from this change, and, barring surprises elsewhere, think it is a nice improvement.

Here's my take on the work remaining to actually enable it:

  • The LegacyPolicy is intended to embody Desktop 4.6.1 / CoreCLR 1.0 behavior, and we don't want it to act on this new information. So this change needs to be policy specific. That means:
    • The observation needs to be informational and the policy needs to decide if the observation causes inlining to fail.
    • LegacyPolicy needs to ignore this new observation (which I believe it will do by default)
    • We'll need a new policy that is based on LegacyPolicy that acts on this information.
    • Propose this new policy as the default policy for post 1.0, and update the InlinePolicy factory method to make the new policy the default and allow fallback (via a COMPlus setting) to the LegacyPolicy.
  • I still think we want some kind of size limit, say a basic block limit. If the new policy is based on the LegacyPolicy you'll get this for free. When I look at handling this observation for new policies I can decide what limit might be appropriate.
  • Comments about why we don't propagate noinline back to the runtime for this case.
  • Comments about why we can't handle methods with return values.
  • Extend the test case with an example that will break if the return value restriction is lifted.
  • Extend the test case with an example showing how earlier noinline deductions can prevent this from kicking in. For instance if the inlinee is must throw and contains a STARG, we won't deduce must throw.
  • Verification that LegacyPolicy codegen is unaffected by this change (eg no jit diffs)
  • Some assessment of throughput (TP) impact. Expectation is that this will make the jit faster, but I'd like to be sure. We really should update the jitutils to support this kind of measurement (cc @russellhadley)
  • Some assessment of overall code quality impact (CQ). We can use the CoreCLR perf tests for this. These are not as easy to run as they should be, unfortunately.

I'd be happy to help out, especially with the TP/CQ assessments, since our tooling there is not yet what it should be.

@mikedn
Copy link
Author

mikedn commented Jul 9, 2016

Thanks for the detailed work plan! For a start I'll tweak the matching code to see how it affects the jit diff results so we can get an idea about what needs to go in the new policy.

@mikedn mikedn force-pushed the nothrowinl branch 2 times, most recently from a89781c to 245a8e0 Compare July 16, 2016 18:40
@russellhadley
Copy link

@AndyAyersMS, Yes the tool will fold all overloads together. This is some legacy behavior from an internal tool which was kept for convenience. The data that there were multiple instances is captured in the data structures in the tool so we could add some kind of notation to mark multiple instance entries.

@benaadams
Copy link
Member

@AndyAyersMS with #6634 which has a couple more I'm seeing a 68,608 byte reduction of filesize

@AndyAyersMS
Copy link
Member

@russellhadley I don't find folding to be a particularly useful default -- seems like instead we should report actual per-method data and provide aggregate views (by method, by class, by assembly, etc) to summarize.

@AndyAyersMS
Copy link
Member

@russellhadley Hmm, I suppose we're stuck with some degree of folding since the jit dump has simplified types like ref, struct, etc. I'll update the reporting to at least show the number of folded instances so it's obvious there is more than one method being aggregated.

@briansull
Copy link

Ideally we would like to be able to quickly find the top 10 methods that improved and the top 10 methods that regressed. It seems like aggregating multiple copies of a method together makes that task difficult. Is LastIndexOf really the method with the best improvement if there were 200 copies of that method aggregated, probably not.

@AndyAyersMS
Copy link
Member

Here's the result with a bit of tweaking to jit-analyze. May not be 100% accurate if you add or remove overloads, but better than nothing.

Note I am still looking at CHK builds of corlib since we still need to sort out how to get the tooling to do the right jit swaps to measure the REL corlib (see dotnet/jitutils#14).

Top method improvements by size (bytes):
       -2801 : System.Private.CoreLib.dasm - System.Array:LastIndexOf(ref,struct,int,int):int (9 methods)
       -2328 : System.Private.CoreLib.dasm - System.Array:Sort(ref,int,int,ref) (25 methods)
       -1996 : System.Private.CoreLib.dasm - System.Array:IndexOf(ref,struct,int,int):int (11 methods)
       -1979 : System.Private.CoreLib.dasm - System.Internal:SZArrayHelper(ref) (19 methods)
       -1800 : System.Private.CoreLib.dasm - System.SZArrayHelper:GetEnumerator():ref:this (19 methods)

@AndyAyersMS
Copy link
Member

PR for tool update is dotnet/jitutils#22

@mikedn
Copy link
Author

mikedn commented Aug 6, 2016

I belive jit-diff is actually folding overloads together and reporting results under the name of the first overload it sees -- which is why the improvment numbers for the methods are so high --

Note that this also happens for generic instantiations. That's how IfNullAndNullsAreIllegalThenThrow ended up being 3345 bytes smaller even if the method itself is quite small.

@mikedn mikedn deleted the nothrowinl branch April 10, 2017 05:50
@erozenfeld erozenfeld mentioned this pull request Apr 11, 2018
erozenfeld added a commit to erozenfeld/coreclr that referenced this pull request Apr 13, 2018
This is a follow-up to dotnet#17501 that fixed #17398.

#17398 was caused by a break in implicit contract between codegen and
gc pointer reporting in fully-interruptible mode: the latter assumed that
register gc pointer liveness doesn't change across calls while dotnet#6103 introduced
codegen where it wasn't true.

dotnet#17501 changed gc pointer reporting not to expect that register gc pointer liveness
doesn't change across calls.

This change inserts int3 after non-returning calls at the end of basic blocks
so that gc pointer liveness doesn't change across calls. This is additional
insurance in case any other place in runtime was dependent on that contract.
erozenfeld added a commit to erozenfeld/coreclr that referenced this pull request Apr 13, 2018
This is a follow-up to dotnet#17501 that fixed #17398.

gc pointer reporting in fully-interruptible mode: the latter assumed that
register gc pointer liveness doesn't change across calls while dotnet#6103 introduced
codegen where it wasn't true.

doesn't change across calls.

This change inserts int3 after non-returning calls at the end of basic blocks
so that gc pointer liveness doesn't change across calls. This is additional
insurance in case any other place in the runtime is dependent on that contract.
erozenfeld added a commit that referenced this pull request Apr 13, 2018
…7535)

This is a follow-up to #17501 that fixed #17398.

gc pointer reporting in fully-interruptible mode: the latter assumed that
register gc pointer liveness doesn't change across calls while #6103 introduced
codegen where it wasn't true.

doesn't change across calls.

This change inserts int3 after non-returning calls at the end of basic blocks
so that gc pointer liveness doesn't change across calls. This is additional
insurance in case any other place in the runtime is dependent on that contract.
damageboy added a commit to damageboy/simple-binary-encoding that referenced this pull request Jul 16, 2019
- Add a ThrowHelper utility class that provides non-returning methods
  for throwing exceptions, this in turn helps to reduce the code size,
  there-by increasing the chances of inlining simple (getter/setter)
  methods, which has cascading CQ improvments in its own.
  (See: dotnet/coreclr#6103)
- Convert checks in the form of `(x < 0 || x > max)` to `(uint) x > max`
  to further reduce code-size and increase inlining chances
- Update MSTest package deps
- Drop netfx.props (which I introduces in a previous PR to support
  building on linux) and replace with the now standard/canonical way of
  referencing NetFX BCL on all platforms with a <PackageReference>
damageboy added a commit to damageboy/simple-binary-encoding that referenced this pull request Jul 16, 2019
- Add a ThrowHelper utility class that provides non-returning methods
  for throwing exceptions, this in turn helps to reduce the code size,
  there-by increasing the chances of inlining simple (getter/setter)
  methods, which has cascading CQ improvments in its own.
  (See: dotnet/coreclr#6103)
- Convert checks in the form of `(x < 0 || x > max)` to `(uint) x > max`
  to further reduce code-size and increase inlining chances
- Update MSTest package deps
- Drop netfx.props (which I introduces in a previous PR to support
  building on linux) and replace with the now standard/canonical way of
  referencing NetFX BCL on all platforms with a <PackageReference>
- closes aeron-io#588
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Do not inline methods that never return

Commit migrated from dotnet/coreclr@c4da56e
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.