[API Proposal]: Improve the performance of enumerating over interface types. #62266
Comments
I couldn't figure out the best area label to add to this issue. If you have write permissions please help me learn by adding exactly one area label.
@EgorBo @AndyAyersMS Would guarded devirt + DynamicPGO help here? Is that our best avenue?
Note that in your …
@davidfowl @geeknoid With DynamicPGO we devirtualize all calls here, but we are not able to avoid the allocation (boxing) of the Enumerator (it happens just once, before the loop). I suspect that in order to also optimize that small allocation we need @AndyAyersMS's idea of using escape analysis.
Note that getting rid of the allocation, especially when enumerating empty collections, was the motivation for trying out this idea.
@huoyaoyuan Good point, I forgot to bring that up. I've always felt it was an architectural misstep to put the versioning logic in such low-level abstractions as List and Dictionary, so there's no love lost for me there. But it does change the semantics relative to plain ol' enumeration.
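(For concreteness, the versioning semantics in question: List<T> bumps an internal version on every mutation, and its enumerator throws if the collection changes mid-loop, a check that index-based enumeration would skip. A small illustration:)

using System;
using System.Collections.Generic;

class VersioningDemo
{
    static void Main()
    {
        var list = new List<int> { 1, 2, 3 };
        try
        {
            foreach (int x in list)
            {
                list.Add(x);  // bumps the list's internal version
            }
        }
        catch (InvalidOperationException e)
        {
            // "Collection was modified; enumeration operation may not execute."
            Console.WriteLine(e.Message);
        }
    }
}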
Well, it's a small allocation that doesn't happen on every iteration of the foreach, so I personally don't think it's worth a special API when it could eventually be fixed in the JIT instead.
I'm looking at this in the broader context of optimizing our service fleet. Small allocations done dozens or hundreds of times per request add up. We currently have ~300 uses of foreach throughout our code base, and we're just a set of libraries that services build on. So they'll add their own enumeration overhead on top. I highly doubt this can be systematically addressed by the compiler. There are many cases where an interface is in fact polymorphic in nature and so it can't be devirtualized. Or the code can potentially be used in LINQ queries and so it has to remain general and allocate an IEnumerable, etc. Enumeration is a very common pattern in code; dedicated optimizations to reduce the implicit overhead it induces are meaningful to the bottom line.
Tagging subscribers to this area: @dotnet/area-system-runtime

Issue Details

Background and motivation

Interface-based programming is encouraged as a best practice. Unfortunately, enumerating collections that are exposed through interface types triggers memory allocations and involves 2 interface calls per item iterated. In other words, enumerating an interface is slow:

List<int> l0 = new List<int>();
IList<int> l1 = new List<int>();

foreach (var x in l0)
{
    // speedy due to List<T>.Enumerator
}

foreach (var x in l1)
{
    // slow due to allocating an enumerator object, and making 2 interface calls
    // (IEnumerator<T>.MoveNext and IEnumerator<T>.Current) for each item
}

API Proposal

Please see https://github.com/geeknoid/AcceleratedEnumeration for a simple model that can deliver considerable perf benefits by avoiding allocations in many cases and cutting down the number of interface calls. The README file in that repo has perf numbers. The code I show provides benefits for most enumerations of IEnumerable, IList, IReadOnlyList, IDictionary, IReadOnlyDictionary, ISet, and IReadOnlySet.

The basic idea is to do something like:

foreach (var x in l.AcceleratedEnum())
{
}

The above works like normal enumeration, just faster. Although the model I show here requires explicit coding by the programmer to leverage, it would be conceivable to build similar tech directly into C# so that enumeration of interface types just gets faster by magic.

Note that my implementation also handles null collections, which is a nice simplification for the developer. You don't need to do an if check before you enumerate.

API Usage

foreach (var x in l.AcceleratedEnum())
{
}

Alternative Designs

As I mentioned above, you can imagine building this ugly stuff straight into the C# compiler. In fact, putting this into the compiler can likely deliver perf benefits by using byref semantics for the enumerator structs, which would avoid copying overhead. This is not possible right now since the C# compiler doesn't know how to deal with enumerators by reference.

Risks

The design intentionally avoids extra error checks, so edge cases will throw different exceptions than in the non-accelerated case. In particular, the implementation of IEnumerator.Current when there isn't anything in the collection throws different exceptions. That could be remedied at the cost of a few cycles.
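For context, here is a minimal sketch of the struct-enumerator pattern the proposal describes. The AcceleratedEnum name comes from the proposal itself, but the layout below is illustrative (it assumes an IReadOnlyList fast path) and is not the actual repo code:

using System.Collections.Generic;

public static class AcceleratedEnumExtensions
{
    // Extension methods dispatch statically, so calling this on a null
    // reference is fine; the enumerator then treats it as empty.
    public static AcceleratedEnumerable<T> AcceleratedEnum<T>(this IEnumerable<T>? source)
        => new AcceleratedEnumerable<T>(source);
}

public readonly struct AcceleratedEnumerable<T>
{
    private readonly IEnumerable<T>? _source;

    public AcceleratedEnumerable(IEnumerable<T>? source) => _source = source;

    // foreach binds to this pattern-based GetEnumerator, so the struct
    // enumerator is never boxed.
    public Enumerator GetEnumerator() => new Enumerator(_source);

    public struct Enumerator
    {
        private readonly IReadOnlyList<T>? _list;  // fast path: index-based access
        private readonly IEnumerator<T>? _inner;   // fallback: ordinary interface enumerator
        private readonly int _count;
        private int _index;

        public Enumerator(IEnumerable<T>? source)
        {
            _index = -1;
            if (source is IReadOnlyList<T> list)
            {
                _list = list;
                _count = list.Count;  // read once; mutation during the loop goes undetected
                _inner = null;
            }
            else
            {
                _list = null;
                _count = 0;
                _inner = source?.GetEnumerator();  // a null collection enumerates as empty
            }
        }

        // On the fast path this costs one interface call (the indexer) per item,
        // instead of MoveNext + Current.
        public bool MoveNext()
            => _list is not null ? ++_index < _count : _inner?.MoveNext() ?? false;

        public T Current => _list is not null ? _list[_index] : _inner!.Current;
    }
}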
As @EgorBo says above, using PGO to de-abstract things like this is a long-term goal. Not clear how much progress we'll make on this in the next release, as we are still working through .net 7 plans. @geeknoid if you have realistic polymorphic examples in mind, it would be nice to add those to your perf suite so we can see how well some new ideas we have in mind for PGO might cope. |
@AndyAyersMS I've got a few examples already in my benchmark. I have IEnumerable implemented on top of different concrete types, or IList implemented on top of different concrete types. If you then write code such as:
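(The code sample here was lost in extraction; a plausible reconstruction of the shape being described is:)

using System;
using System.Collections.Generic;

static class Example
{
    static void DoSomething<T>(IEnumerable<T> collection)
    {
        foreach (var item in collection)
        {
            Console.WriteLine(item);  // stand-in for per-item work
        }
    }
}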
DoSomething can be called with different implementations of IEnumerable. Now, you're already generating different code for different instantiations of T, so conceivably you could generate different code for different concrete types implementing IEnumerable too. But I suspect a lot of this analysis will peter out at some point and leave you with persistent allocations and some interface-based dispatch.
FWIW, we have LOTS of examples of this in ASP.NET and in dotnet/runtime. I made a set of changes in .NET 6 to remove IEnumerable.GetEnumerator allocations when the backing collection is a T[] (which it always is when using dependency injection). ASP.NET Core uses abstract IList, IReadOnlyList, and IEnumerable a lot. Recently we've been taking advantage of CollectionsMarshal.AsSpan to remove enumerator allocations and speed up enumeration over List<T>.

It's worth optimizing this if we can.
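For reference, a minimal sketch of the CollectionsMarshal.AsSpan technique mentioned above (requires .NET 5 or later):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

class SpanEnumeration
{
    static long Sum(List<int> values)
    {
        // AsSpan exposes the list's backing array as a Span<int>: no enumerator
        // allocation and no interface calls. The list must not be resized while
        // the span is in use.
        long total = 0;
        foreach (int v in CollectionsMarshal.AsSpan(values))
        {
            total += v;
        }
        return total;
    }

    static void Main() => Console.WriteLine(Sum(new List<int> { 1, 2, 3 }));
}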
@EgorBo can you explain why that is? I don't quite understand the connection.
It can be simplified to this (even without PGO):

static void Caller() => Callee(new FooStruct());
static void Callee(IFoo foo) => foo.SayHello();

public interface IFoo
{
    void SayHello();
}

public struct FooStruct : IFoo
{
    public void SayHello() => Console.WriteLine("Hello");
}

where the codegen for Caller is:

; Method Program:Caller()
G_M23712_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M23712_IG02: ;; offset=0004H
48B918035904FE7F0000 mov rcx, 0x7FFE04590318 ; FooStruct
E82DBD9D5F call CORINFO_HELP_NEWSFAST
4883C008 add rax, 8
C60000 mov byte ptr [rax], 0
3900 cmp dword ptr [rax], eax
48B980550098FE010000 mov rcx, 0x1FE98005580 ; "Hello"
488B09 mov rcx, gword ptr [rcx]
E8D2FCFFFF call System.Console:WriteLine(System.String)
90 nop
;; bbWeight=1 PerfScore 9.00
G_M23712_IG03: ;; offset=002FH
4883C428 add rsp, 40
C3 ret
;; bbWeight=1 PerfScore 1.25
; Total bytes of code: 52 so we were able to inline Callee but the boxing is still here (and it's useless). It's something we (Andy) eventually will fix I hope e.g. via escape-analysis (multi-use boxing problem).
My point is that if we can optimize this in the JIT, we should do it in the JIT, because this API looks like a leaky abstraction / temporary workaround to me.
As Andy noted, it's something we can handle in the JIT too, via context-sensitive PGO and/or multiple guards (#59604).
@EgorBo You mention "leaky abstraction", but I was thinking more in terms of "compiler support types". You can imagine the C# compiler transforming the foreach loop transparently, so users never know what happened. I know that some heroics are possible in the JIT, but it cannot possibly deal with all cases, so it's useful to consider the full spectrum of possibilities.

@davidfowl The CollectionsMarshal.AsSpan thing is nice. But wouldn't it be even nicer if the compiler just did this transformation for you?

Note that even if you completely ignore the implementation of my proposal, there is a clear big win possible with very little change in the C# compiler: if you have a foreach loop over an IList or IReadOnlyList, it is substantially more efficient to implement the foreach by indexing rather than by using GetEnumerator. You go from 2 interface calls per item down to 1. You lose the "feature" that enumeration throws if a mutation occurs during iteration, so presumably enabling this would require a new compiler switch, since the semantics are slightly different.
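To make the IList/IReadOnlyList point concrete, here is a hand-written sketch of the two loop shapes being compared (illustrative only, not actual compiler output):

using System;
using System.Collections.Generic;

static class LoweringSketch
{
    // What foreach over an interface compiles to today: an enumerator
    // allocation, plus two interface calls (MoveNext and Current) per item.
    static long SumViaEnumerator(IReadOnlyList<int> list)
    {
        long total = 0;
        using (IEnumerator<int> e = list.GetEnumerator())
        {
            while (e.MoveNext())
            {
                total += e.Current;
            }
        }
        return total;
    }

    // The proposed lowering: no allocation, and one interface call (the
    // indexer) per item. Count is read once, so a mutation during the loop
    // is no longer detected.
    static long SumViaIndexing(IReadOnlyList<int> list)
    {
        long total = 0;
        int count = list.Count;
        for (int i = 0; i < count; i++)
        {
            total += list[i];
        }
        return total;
    }
}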
@geeknoid BTW, your API noticeably regresses performance for cases where the underlying enumeration is not an array/list/readonlylist: it will do several casts, some of which aren't cheap (such as the cast to the covariant IReadOnlyCollection). I'd prefer PGO to do this work, since it knows for sure what kinds of objects are hidden under the abstractions.
@EgorBo Have you tried running all these tests with PGO? Curious to see what the difference between default enumeration and accelerated enumeration is with PGO in play. There is still a bunch of work ahead to fully de-abstract the simple case where PGO sees that one collection type predominates. I should sketch out what we eventually need to do. It is not necessarily heroics, as each of the sub-pieces is useful on its own.
Thanks to some help from @EgorBo, I updated my benchmark. The JIT was doing some things that gave an unfair advantage to the non-accelerated code. I've cancelled those out now, and it ends up making the accelerated enumeration technique look even more compelling overall. If the JIT can achieve this, and perhaps more, it'll be great. In the meantime, we're definitely going to leverage my hack in our code base and get those beautiful free cycles back :-)
Going to close this, as providing a new form of enumeration within the BCL is an extremely complicated task that would require revving the whole ecosystem to support it. Getting codegen improvements in the form of DPGO and guarded devirtualization is the most viable path forward here.