
Reduce CPU consumption by Timer's FireNextTimers #20302

Merged
merged 2 commits into dotnet:master from stephentoub:timerpartition
Oct 9, 2018

Conversation

stephentoub
Member

@stephentoub stephentoub commented Oct 8, 2018

(Approach is based on brainstorming with @vancem.)

Historically, in certain applications that were Timer-heavy (either via direct use or via wrappers like Task.Delay), timer-related operations like creation (ctor), destruction (Dispose), and firing (internally in FireNextTimers) were relatively expensive. A single linked queue of timers was maintained and protected by a single lock, such that every creation, destruction, and firing would require taking that lock and operating on the list.

In .NET Core 2.1, we improved this significantly to reduce contention by splitting the single lock and queue into N partitions, each with its own lock and queue (and native timer that triggers the processing of that queue). This enables lots of threads that would otherwise all be contending for timer creation/destruction/firing to now spread the load across the various partitions. This made a significantly positive and measurable impact on these timer-heavy workloads, in particular for workloads that created and destroyed lots of timers with most never actually firing (e.g. where timers are used to implement timeouts, and most things don't timeout).

However, we still see some apps that rely heavily on timers firing, in particular with apps that have tens of thousands or hundreds of thousands of periodic timers, with lots of time spent inside FireNextTimers (the function that's invoked when the native timer for a partition fires to process whatever set of timers may be ready to fire in its queue). This operation currently walks the whole list of that queue's timers, such that it needs to touch and do a little work for every scheduled timer in that partition, even if only one or a few timers actually need to fire. The more timers are scheduled, even if they're for a time far in the future, the more costly FireNextTimers becomes. And as FireNextTimers is invoked while holding that partition's lock, this also then slows down any code paths trying to create/destroy timers on the same partition.

This PR attempts to address the most impactful cases of that. Instead of each partition maintaining a single queue, we simply split the queue into two: a "short" list that contains all scheduled timers with a next firing time that's <= some absolute threshold, and a "long" list for the rest. When FireNextTimers is invoked, we walk the "short" list, processing it as we do today. If the current time is less than or equal to the absolute threshold, then we know we don't need to examine the long list and can skip it and the (hopefully) majority of timers it contains. If, however, we're beyond that time or the short list becomes empty, we continue to process the long list as we do today, with the slight modification that we then also move to the short list anything with a due time that puts it at or under the threshold, which is reset to point to a time short into the future. When new timers are added, we just add them to the appropriate list based on their due time.
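
To make the mechanics concrete, here's a minimal single-threaded sketch of the two-list idea. It is illustrative only: the names (TwoListTimerQueue, ShortThresholdMs) are invented, and the real TimerQueue also handles periodic timers, locking, and arming the native timer.

using System;
using System.Collections.Generic;

class TwoListTimerQueue
{
    const long ShortThresholdMs = 333;          // length of the short-list window
    long _absoluteThreshold;                    // absolute time at which the window ends
    readonly LinkedList<(long Due, Action Fire)> _short = new LinkedList<(long Due, Action Fire)>();
    readonly LinkedList<(long Due, Action Fire)> _long = new LinkedList<(long Due, Action Fire)>();

    // O(1): a single comparison routes the new timer to the appropriate list.
    public void Add(long dueTime, Action callback) =>
        (dueTime <= _absoluteThreshold ? _short : _long).AddLast((dueTime, callback));

    public void FireNextTimers(long now)
    {
        FireDue(_short, now);

        // Still inside the window with short timers remaining: skip the
        // (hopefully much larger) long list entirely.
        if (now < _absoluteThreshold && _short.Count > 0)
            return;

        _absoluteThreshold = now + ShortThresholdMs; // start a new window
        LinkedListNode<(long Due, Action Fire)> node = _long.First, next;
        for (; node != null; node = next)
        {
            next = node.Next;
            if (node.Value.Due <= now)
            {
                _long.Remove(node);
                node.Value.Fire();
            }
            else if (node.Value.Due <= _absoluteThreshold)
            {
                _long.Remove(node);         // due inside the new window:
                _short.AddLast(node.Value); // migrate to the short list
            }
        }
    }

    static void FireDue(LinkedList<(long Due, Action Fire)> list, long now)
    {
        LinkedListNode<(long Due, Action Fire)> node = list.First, next;
        for (; node != null; node = next)
        {
            next = node.Next;
            if (node.Value.Due <= now) { list.Remove(node); node.Value.Fire(); }
        }
    }
}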

The theory behind this is that the problematic cases are when we have lots of long-lived timers that rarely fire but today we're having to examine them every time any timer fires; by maintaining a short list that ideally has only a small subset of the timers, we can avoid touching the majority of the timers each time FireNextTimers is called, on average. This doesn't change the O(N) complexity of FireNextTimers, but it should often reduce the size of N significantly. Synthetic workloads have shown that it often reduces the cost of FireNextTimers to just 5-10% of what it was previously, and without increasing the cost of the fast add/dispose path measurably.

(An alternative approach considered is to use a priority queue / heap for the timer list. This would allow for FireNextTimers to pull off in O(log N) time each of the next timers to fire, hopefully making FireNextTimers much cheaper. However, it comes at the expense of making creation and destruction also O(log N) instead of O(1). And in cases where all timers in the list needed to fire, it would make FireNextTimers O(N log N) instead of O(N). It is, however, an approach that could be pursued if this approach proves less effective in real-world applications than expected.)

As a simulation of a workload, consider a server handling web socket connections. Each web socket connection maintains its own timer that fires every, say, 10 seconds as a keep-alive (a real workload would most likely use a longer interval, like 30 seconds or a minute, but I’ve compressed it for testing purposes). And let’s say over the course of 10 seconds, the server receives a bunch of new requests. We’ll end up with hundreds of thousands of periodic timers that are distributed fairly evenly across both timer partitions and firing time, and we can represent that with a repro like:

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        var timers = new ConcurrentQueue<Timer>();

        // One producer per core: each loop iteration creates 100 periodic 10s
        // timers, repeating for ~10 seconds, which spreads hundreds of thousands
        // of timers fairly evenly across partitions and firing times.
        Parallel.For(0, Environment.ProcessorCount, _ =>
        {
            TimeSpan ts = TimeSpan.FromSeconds(10);
            DateTimeOffset end = DateTime.UtcNow + ts;
            while (DateTime.UtcNow < end)
            {
                for (int i = 0; i < 100; i++)
                {
                    timers.Enqueue(new Timer(s => { }, null, ts, ts));
                }
                Thread.Sleep(1);
            }
        });

        Console.WriteLine("Timers: " + timers.Count);
        Console.ReadLine();
        GC.KeepAlive(timers);
    }
}

For testing purposes, I instrumented the code to see, each time FireNextTimers runs, how many timers it touches and how long the operation takes. With this PR, on average it processes only ~10% of all timers associated with the given partition each time FireNextTimers is invoked, and it similarly ends up taking only ~10% as long as FireNextTimers did in the same situation before this PR, e.g. taking 2ms instead of 20ms.

The situation remains largely the same if lots of single-fire short-lived timers are introduced on top of this, e.g.

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var timers = new ConcurrentQueue<Timer>();

        Parallel.For(0, Environment.ProcessorCount, _ =>
        {
            TimeSpan ts = TimeSpan.FromSeconds(10);
            DateTimeOffset end = DateTime.UtcNow + ts;
            while (DateTime.UtcNow < end)
            {
                for (int i = 0; i < 100; i++)
                {
                    timers.Enqueue(new Timer(s => { }, null, ts, ts));
                }
                Thread.Sleep(1);
            }
        });

        Console.WriteLine("Timers: " + timers.Count);

        var sw = new Stopwatch();

        // Measure 100 batches of 100 sequential 1ms delays while the periodic
        // timers created above remain scheduled.
        for (int trial = 0; trial < 100; trial++)
        {
            sw.Restart();
            for (int i = 0; i < 100; i++) await Task.Delay(1);
            sw.Stop();
            Console.WriteLine(sw.Elapsed);
        }

        Console.ReadLine();
        GC.KeepAlive(timers);
    }
}

The PR doesn’t change the time complexity of any of the operations, and in only a few situations does it add a small amount of cost to existing ones (e.g. moving a timer between lists when it crosses the threshold, and the associated branch to determine whether that’s required), while at the same time yielding significant savings in the problematic scenarios.

cc: @vancem, @kouvel

@stephentoub stephentoub added the tenet-performance Performance related issue label Oct 8, 2018
@stephentoub stephentoub added this to the 3.0 milestone Oct 8, 2018
@stephentoub
Member Author

@kurtschenk, would you be able to try this out with your workload (#14527 (comment)) and see what kind of difference it makes, whether it's sufficient, etc.?

@vancem

vancem commented Oct 8, 2018

@kurtschenk. If you could also characterize the timer entries, that would help. What is interesting is

  1. The histogram of time spans
  2. Whether those timer entries are likely to be canceled (as would be the case for a timeout) or not.

I looked at the previous ETL.ZIP files you sent us and they don't seem to have the FrameworkEventSource events turned on (I am not exactly sure why). If you were to collect a trace of your scenario with the default PerfView parameters, that would also be helpful.

else
{
    if (FrameworkEventSource.IsInitialized && FrameworkEventSource.Log.IsEnabled(EventLevel.Informational, FrameworkEventSource.Keywords.ThreadTransfer))
        FrameworkEventSource.Log.ThreadTransferSendObj(this, 1, string.Empty, true);

We should add an additional optional int parameter to ThreadTransferSendObj at the end and then pass the duration when logging this event. This would give us useful information from the field about the statistics of durations.

Member Author

We should add an additional optional int parameter to ThreadTransferSendObj at the end and then pass the duration when logging this event.

Sounds good, but let's do it separately from this PR.

As for the mechanics, it's ok to add arguments to an existing event like that? How does one define an optional argument with EventSource?


I am OK with a separate PR. You can add new arguments to the END of an event (because any existing parsers simply ignore the 'rest' of the payload).

I have never actually used an optional argument for an EventSource method, but it should work: the IL will have the extra arg, and the compiler just knows to supply it at the call site as needed. However, in this case the issue is moot because there is already a 'wrapper' method around the real logging method (we need to convert the Timer object to an ID). Thus you can make it optional in the wrapper, and required in the logging method.
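
As a sketch of that wrapper/logging-method split (a hypothetical event source, not the real FrameworkEventSource; the event ID and the object-to-ID mapping are placeholders):

using System.Diagnostics.Tracing;

[EventSource(Name = "Demo-ThreadTransfer")]
sealed class DemoEventSource : EventSource
{
    public static readonly DemoEventSource Log = new DemoEventSource();

    // Public wrapper: the optional argument is filled in by the C# compiler at
    // each call site, so existing callers keep compiling unchanged.
    [NonEvent]
    public void ThreadTransferSendObj(object obj, int kind, string info, bool multiDequeues, int duration = 0) =>
        ThreadTransferSend(IdForObject(obj), kind, info, multiDequeues, duration);

    // The actual event: the new field goes at the END of the payload, so
    // existing parsers (which ignore trailing payload fields) keep working.
    [Event(1)]
    private void ThreadTransferSend(long id, int kind, string info, bool multiDequeues, int duration) =>
        WriteEvent(1, id, kind, info, multiDequeues, duration);

    // Placeholder for the object-to-ID conversion the real wrapper performs.
    [NonEvent]
    private static long IdForObject(object obj) => obj.GetHashCode();
}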

Member Author

Thus you can make it optional in the wrapper, and required in the logging method.

Sorry, I know how to make something optional in C# / the IL. I thought you were suggesting making it optional in the actual event, i.e. that there was some kind of optional notion in ETW that could be represented in an [Event(...)] description. If it's ok to just update the [Event(...)] with an additional argument and that won't break a consumer, then no problem.

@stephentoub stephentoub added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Oct 8, 2018
@stephentoub
Member Author

(Working out some issues...)

@davidfowl
Member

Libuv uses a heap (just an FYI).

@stephentoub
Member Author

Yup, I looked at what they were doing.

@davidfowl
Member

I have to ask, where do we get real world data for these problems? It seems like we should be doing a heap implementation then comparing it to all the cases we already optimize for today to see how much worse it is.

@stephentoub
Member Author

stephentoub commented Oct 9, 2018

It seems like we should be doing a heap implementation then comparing it to all the cases we already optimize for today to see how much worse it is.

That's where I started. Vance suggested this direction. In my opinion, the heap-based approach is easier to explain and to convince myself of its correctness. However, the worst case is significantly worse than what's already there today. For example, adding/removing today is O(1), and if you assume that the majority of timers are short-lived and never fire (e.g. timeouts for fast operations), then you want to optimize for adding/removing... on my machine, creating a million timers and then disposing of them currently takes ~0.75s, but with the heap approach, because adding/removing is O(log N), for 1M timers that grows to ~1.0s (changing the order of things and just creating and disposing a timer 1M times is ~0.22s both before and after the heap). But 1M active timers is also a heck of a lot of timers.

@davidfowl
Member

davidfowl commented Oct 9, 2018

What’s the guidance on how to use timers? Today in ASP.NET we usually use a single timer and attach a bunch of work to it rather than create short lived timers.

There are a couple of patterns that may end up creating more timers than normal that I can think of in a typical server application:

  • The long-lived loop using Task.Delay to asynchronously sleep between work
  • Timing out asynchronous operations by using Task.Delay in conjunction with WhenAny to yield an await before things are actually complete

What do we recommend people do to use timers optimally?

@stephentoub
Member Author

There are a couple of patterns that may end up creating more timers than normal that I can think of in a typical server application

I've seen a non-trivial number of server applications with the pattern I described in my PR description, yielding many periodic timers.

Today in ASP.NET we usually use a single timer and attach a bunch of work to it rather than create short lived timers.

That's kind of what System.Threading.Timer is: a single timer that fires and then runs through a list of work associated with it, logically time/action pairs, to see which should be invoked. This assumes that if you create timer instances for different absolute due times, you expect them to be treated separately; if you're ok coalescing, you can of course do better, e.g. https://blogs.msdn.microsoft.com/pfxteam/2011/12/03/coalescing-cancellationtokens-from-timeouts/.
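
As a rough sketch of the coalescing idea from that post (the bucket granularity and names here are assumptions, and the post's actual implementation handles races more carefully): round each timeout's deadline up to a coarse bucket and share one CancellationTokenSource, and thus one underlying Timer, per bucket.

using System;
using System.Collections.Concurrent;
using System.Threading;

static class CoalescedTokens
{
    const long BucketMs = 250; // coarsening granularity (an assumption)
    static readonly ConcurrentDictionary<long, CancellationToken> s_tokens =
        new ConcurrentDictionary<long, CancellationToken>();

    public static CancellationToken FromTimeout(TimeSpan timeout)
    {
        // Round the absolute deadline up to a bucket; all timeouts landing in
        // the same bucket share one CancellationTokenSource and hence one Timer.
        long bucket = (Environment.TickCount64 + (long)timeout.TotalMilliseconds + BucketMs - 1) / BucketMs;
        return s_tokens.GetOrAdd(bucket, b =>
        {
            var cts = new CancellationTokenSource();
            cts.CancelAfter(TimeSpan.FromMilliseconds(Math.Max(0, b * BucketMs - Environment.TickCount64)));
            cts.Token.Register(() => s_tokens.TryRemove(b, out _)); // evict once fired
            return cts.Token;
        });
    }
}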

@stephentoub
Member Author

stephentoub commented Oct 9, 2018

Here's the general heap-based approach, btw:
stephentoub@4de0420?w=1
In the end, I may go back to it. This approach has its own complications, and it's nowhere near as simple as when I started with it.

@svick

svick commented Oct 9, 2018

However, it comes at the expense of making creation and destruction also O(log N) instead of O(1).

More advanced heaps have better time complexities. I have no idea if using them would actually improve performance, but it might be worth trying.

And in cases where all timers in the list needed to fire, it would make FireNextTimers O(N log N) instead of O(N).

I think it would be possible to use a modified version of the algorithm to build a heap in O(N) time: First find all timers that need to fire, then replace them with timers from the end of the heap (as is done in normal heap deletion) and finally heapify the moved timers.

If the number of removed items k is small, this will be O(k log N). But it will never go over O(N) for large k.
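
For reference, here's a hedged sketch of the remove-at-index primitive that such batched deletion builds on (a generic min-heap helper written for illustration, not anything from coreclr):

using System;
using System.Collections.Generic;

static class MinHeap
{
    // Remove the element at index i by swapping in the last element, then
    // restoring the heap property in whichever direction it was violated.
    // O(log N) per removal, so k removals cost O(k log N).
    public static void RemoveAt<T>(List<T> h, int i, Comparison<T> cmp)
    {
        h[i] = h[h.Count - 1];
        h.RemoveAt(h.Count - 1);
        if (i < h.Count) { SiftDown(h, i, cmp); SiftUp(h, i, cmp); }
    }

    static void SiftDown<T>(List<T> h, int i, Comparison<T> cmp)
    {
        while (true)
        {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < h.Count && cmp(h[l], h[m]) < 0) m = l;
            if (r < h.Count && cmp(h[r], h[m]) < 0) m = r;
            if (m == i) return;
            (h[i], h[m]) = (h[m], h[i]);
            i = m;
        }
    }

    static void SiftUp<T>(List<T> h, int i, Comparison<T> cmp)
    {
        while (i > 0)
        {
            int p = (i - 1) / 2;
            if (cmp(h[i], h[p]) >= 0) return;
            (h[i], h[p]) = (h[p], h[i]);
            i = p;
        }
    }
}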

@stephentoub
Member Author

stephentoub commented Oct 9, 2018

First find all timers that need to fire, then replace them with timers from the end of the heap (as is done in normal heap deletion) and finally heapify the moved timers.

The "first find all timers that need to fire" part is O(N) if done as a linear search, which defeats the whole purpose of this change, which is to do better than O(N) in the majority case (or, at least with a much smaller N).

@svick

svick commented Oct 9, 2018

@stephentoub I think you could make use of the heap property to make it O(k) (where k is again the number of nodes to remove). You process the heap as a binary tree recursively, but stop the recursion whenever the current node does not need to fire.

I think the total number of examined nodes should be at most 2k+1, which makes the time complexity O(k).

In C#:

IEnumerable<Node> FindNodes(Node node, uint elapsed)
{
    // reached the bottom, so stop recursion
    if (node == null)
        yield break;

    // found a node that does not need to fire, so stop recursion too
    if (elapsed < node.DueTime)
        yield break;

    yield return node;

    foreach (Node n in FindNodes(node.Left, elapsed)) yield return n;
    foreach (Node n in FindNodes(node.Right, elapsed)) yield return n;
}

(Of course the actual implementation would look very different; this is just the algorithm.)

For example, with elapsed = 10, the referenced illustration (based on an image from Wikipedia) showed green nodes being returned, blue nodes examined but not returned, and white nodes never examined.

@stephentoub
Member Author

stephentoub commented Oct 9, 2018

I think you could make use of the heap property to make it O(k)

That's probably true, though likely with significantly higher constants. The bigger concern for me though is making the ctor/dispose O(log N) instead of O(1), and the other heaps you referred to are usually amortized O(1) only if the inserts are done contiguously, which isn't the case here.

Regardless, all of that ends up adding further complication to the heap approach. This approach ends up being pretty simple (most of the changes in the PR are just explanatory comments) while still being quite effective in the cases that are actually problematic and while having ctor/disposal be a very simple add/remove from a linked list.

I have no idea if using them would actually improve performance, but it might be worth trying.

You're welcome to try. At present I'm not seeing a need so I'm not planning to. You can start from my previously linked commit if it helps.

@stephentoub stephentoub removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Oct 9, 2018
@benaadams
Member

Timer generations; like the GC :)

More advanced heaps have better time complexities. I have no idea if using them would actually improve performance, but it might be worth trying.

Sounds more like something to investigate as a follow-up; since this PR improves the current situation, it is valid as it stands. It then becomes the next baseline for the next improvement.

@svick

svick commented Oct 9, 2018

@stephentoub

The bigger concern for me though is making the ctor/dispose O(N) instead of O(1).

Isn't it O(log N)? Since a timer constructor/dispose corresponds to a heap insert/delete, which is O(log N).

But I understand that's still not good enough. Which is why I suggested looking into alternative heaps, like the Fibonacci heap. But I didn't realize that the other heaps still have O(log N) delete. And Wikipedia claims they are not actually better in practice.

So, I think this is still worth investigating, even if the chance of actually improving performance is fairly low.

@benaadams

Sounds more like something to investigate as a follow up; since this PR improves the current situation it valid as it stands. It then becomes the next baseline for the next improvement.

I definitely agree with that. I did consider opening a new issue, instead of commenting here, but then decided it would be better to keep the discussion in one place.

@stephentoub
Member Author

stephentoub commented Oct 9, 2018

Isn't it O(log N)?

Yes, that was a typo.

So, I think this is still worth investigating, even if the chance of actually improving performance is fairly low.
I definitely agree with that. I did consider opening a new issue, instead of commenting here, but then decided it would be better to keep the discussion in one place.

I personally don't think it's worth an issue, unless you're actively planning to follow up on it. It's not clear that there's even an impactful problem remaining after this PR (in general I'm not a fan of having issues opened for possible investigations that could be done to potentially improve something: in my experience, they just end up sitting there forever). If you have a real-world case where there is, that'd be great to understand, and an issue would be fine. My $.02.

@stephentoub
Member Author

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test please ("ld: can't open output file for writing: paltest_duplicatehandle_test12, errno=28 for architecture x86_64 clang: error: linker command failed with exit code 1 (use -v to see invocation)")

@vancem

vancem commented Oct 9, 2018

@stephentoub and I have iterated offline a bit on this. The heap-based approach is an alternative, but frankly it is NOT well suited to the VERY COMMON 'timeout' case where you add timers only to remove them before they fire (most timeouts don't happen). We have data that points to this as the most important case. What you want is to be 'lazy' and only start 'sorting' things when they are getting close to being fired. That way insertion and deletion of timeout timers is as cheap as possible (they don't need to protect an invariant).

Interestingly, this change can be thought of as the first step in a more general algorithm. The idea is that from the 'Fire' perspective, you need to find the minimum, so you need to sort, at least enough to find that minimum. Heaps do this, but so does a partial quicksort. Imagine one iteration of the Quicksort algorithm: it picks a pivot (in this case we picked a fixed one of 333 msec, but we could have used the standard median-of-3) and partitions the list into those above it and those below it. Now as long as the 'below' list is non-empty, to find the minimum I only have to look at that part. I can do this recursively, pivoting again and again, until the 'below' region is 1 element. I have found my minimum, but I remember those pivots, so the next time I need to search I only have to look at the last partition. When that is exhausted, the next larger partition, etc. The result is that I only VERY rarely look at the whole list (or even half of it). Thus fire becomes very efficient (our goal).

What makes this algorithm really nice is that when you add a large timeout, you are almost certainly in the 'big half' of the list. Thus with one comparison you can put yourself in the correct list in this very common case (so it is O(log N) in general, but O(1) in a common case). It also works well when many timers go off in the same tick (common in another case): all those timers are close together (in the same partition), so it is O(1) to fire them.

This is where I think we should go if we want to take the next step. However, doing this to just one level (pivot only once, and thus only segregate into a big and a small list) frankly gets you most of the benefit, and is simpler. That is why I like this approach. If we still find it is not good enough, we can add a quicksort-like extension.
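
A rough sketch of a single pivot step as described, purely for illustration (raw due times stand in for timers):

using System.Collections.Generic;

static class PivotPartition
{
    // One quicksort-style partition step: a single comparison per timer routes
    // it into the "below pivot" or "above pivot" bucket.
    public static (List<long> Below, List<long> Above) Partition(IEnumerable<long> dueTimes, long pivot)
    {
        var below = new List<long>();
        var above = new List<long>();
        foreach (long t in dueTimes)
            (t <= pivot ? below : above).Add(t);
        return (below, above);
    }
}

The PR as merged is exactly one such step with a fixed pivot; the recursive version would remember each pivot and repartition only the smallest non-empty bucket when searching for the next timer to fire.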

if (listNum == 0)
{
    bool processLongList = startAbsoluteThreshold - nowTicks < 0 || m_shortTimers == null;
    if (!processLongList)

This design is simple, but consider the following scenario.

  • You have MANY timeout timers (say 10 second durations).
  • You have threads that keep creating short delays (e.g. Delay(1)).

Under such conditions, every time the Fire method happens (which is on every tick, since the delay is short), you end up having to look at the long list (to determine the next fire time).

You could avoid this by either

  1. Tracking the minimum of the long list. OR
  2. Simply setting nextAppDomainTimerDuration to what is left between now and startAbsoluteThreshold.

By delaying, you give time for either more short timers to arrive or for other timeout timers to die. But what is nice is that you can say that you will never look at the long list more frequently than once every 333 msec, which is a very nice thing to be able to say.
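
A minimal sketch of what (2) amounts to, using the names from this thread (illustrative only; the real code would fold this into the end of FireNextTimers):

using System;

static class NextDueTime
{
    const long TicksPerMillisecond = 10_000; // one tick = 100ns

    // Remaining time in the current short-list window; using this as the native
    // timer's next due time means the long list is consulted at most once per
    // window (e.g. once per 333 msec), no matter how many Delay(1)s fire.
    public static uint RemainingWindowMs(long nowTicks, long startAbsoluteThreshold) =>
        (uint)Math.Max(1, (startAbsoluteThreshold - nowTicks) / TicksPerMillisecond);
}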

Member Author

(2) seems simple and reasonable.


I agree that one is very easy and will get the job done.

Member Author

I added a commit with that change.

Improve handling of the case where the short list ends up empty.

Also fix an issue I noticed around the setting of m_currentAbsoluteThreshold.
@davidfowl
Member

We have data that points to this as the most important case.

@vancem @stephentoub Where is that data?

@stephentoub
Member Author

Where is that data?

Case after case where developers have come to us with concerns about contention on ctor/dispose, the primary cases that motivated #14527: developers saw contention because their code was creating and disposing lots of timers that rarely fired. I don't believe we have it archived anywhere other than email / discussion list / github / stackoverflow discussions. It also matches very common coding patterns where timers are used to implement timeouts; way more often than not, those timeouts represent an error case, something that shouldn't happen. You start some operation, create a timer to cancel it in case things go awry, the operation finishes, and you dispose of the timer, e.g.

using (var cts = new CancellationTokenSource(timeout)) await OperationAsync(cts.Token);

@davidfowl
Member

Case after case where developers have come to us with concerns about contention on ctor/dispose, the primary cases that motivated #14527: developers saw contention because their code was creating and disposing lots of timers that rarely fired. I don't believe we have it archived anywhere other than email / discussion list / github / stackoverflow discussions. It also matches very common coding patterns where timers are used to implement timeouts; way more often than not, those timeouts represent an error case, something that shouldn't happen. You start some operation, create a timer to cancel it in case things go awry, the operation finishes, and you dispose of the timer, e.g.

Great. BTW I'm not asking for proof per se. I'm just looking for coding patterns to make guidance (I'm working on a talk).

@kurtschenk

kurtschenk commented Oct 9, 2018

If you could also characterize the timer entries, that would help. What is interesting is

  1. The histogram of time spans
  2. Whether those timer entries are likely to be canceled (as would be the case for a timeout) or not.

@vancem
Regarding the scenario I described here: #14527 (comment)

For 1: The TimeSpan used in Task.Delay is always the same. The default is 20s, but we currently have it set to 60s, which we had to do to get the performance we needed. See ActorProxyEventExtensions.cs.

And for 2, timers are not cancelled. When a WebSocket connection is made, the Actor subscription is set up in order to send messages back to the WebSocket client. This remains as long as the client is connected to the server.

@stephentoub
Member Author

Great. BTW I'm not asking for proof per se. I'm just looking for coding patterns to make guidance (I'm working on a talk).

Cool.

@stephentoub
Member Author

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test please (https://github.com/dotnet/coreclr/issues/20324)

@stephentoub
Member Author

stephentoub commented Oct 9, 2018

@kurtschenk, thanks. Is it fair to assume that the web socket connections are created fairly evenly distributed across time, in which case the Timers would also be evenly distributed, or is it more likely that the web socket connections are created in a burst and thus the Timers are tightly clumped?

@stephentoub
Member Author

@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test please (https://github.com/dotnet/coreclr/issues/20296)

@vancem

vancem commented Oct 9, 2018

Thanks @kurtschenk. Based on what you tell me, I think Stephen's fix should help substantially. From what you say, you have 1000s of entries that are 20 seconds long; they may be bunched, but more likely they are spread over time by random chance. On average that's 50 firings/sec (1000 / 20s), or about 17 every 333 msec. Basically the new algorithm will pull off those 17, and doing so will not visit the 1000 other entries. After those 17 it will 'refill' and enumerate them all. Thus you should cut the time in FireNextTimers by a factor of at least 10, and hopefully more like 17. Thus your 30% CPU should drop to 3% or less (hopefully).

The bigger the number of timers, the more dramatic the improvement should be.

Of course we need to validate. Note that a PerfView trace with the new code would be extra helpful.

@vancem

vancem commented Oct 9, 2018

My only remaining qualm is with respect to testing, but I don't think we need to block the PR for that. Ideally we have some tests that verify that timers really go off pretty close to when they should, and we test with various mixes of durations (all above 333 msec, a mix, things that straddle 333 msec, etc.).

@stephentoub
Member Author

My only remaining qualm is with respect to testing

Yup. I'm working on adding more to corefx.

@stephentoub stephentoub merged commit cfe51dd into dotnet:master Oct 9, 2018
@stephentoub stephentoub deleted the timerpartition branch October 9, 2018 22:05
@veikkoeeva

A quick note that when using Rx.NET, the usage I seem to come across often is having some sort of bounded buffer, such as "50 items or one second grouped into buckets with some terms" or indeed "for every bucket with some terms, alarm if not seen for x seconds", and there might be hundreds or thousands of these buckets. Rx.NET in general might be something that produces timer traffic -- though it might be that the usage is optimized too (I haven't checked).

@kevingosse

For what it's worth, the timer scalability issue hit us pretty hard at Criteo (1000-2000 QPS with < 100 ms timeouts), especially since we're still running on .NET Framework (so we have neither the timer partitions nor the spinning fix on CPUs with a large number of cores). Our approach has been to use a simple implementation of the timer-wheel algorithm. In short, all timers lower than the threshold are moved to a sort of ring buffer using the relative due time as key (since that relative due time is bounded, the implementation is really just an array of lists). This ensures O(1) complexity for inserting timers and finding the next ones to tick. Removal is not needed: since all the timers are very short lived (less than a few hundred milliseconds), we just flag them as cancelled, unsubscribe the callback (to prevent keeping alive arbitrary amounts of memory), and they'll be dequeued soon enough.

Obviously this algorithm can only work with a threshold. But since you're introducing one with your short timers/long timers queues, I wonder if it's worth considering for the short queue.
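
For readers who haven't seen one, here's a minimal single-threaded sketch of such a timer wheel (illustrative only, not Criteo's implementation; the cancellation-flagging described above is omitted):

using System;
using System.Collections.Generic;

class TimerWheel
{
    readonly List<Action>[] _slots; // one slot per millisecond up to the horizon
    long _currentTick;              // absolute millisecond the cursor is at

    public TimerWheel(int horizonMs) => _slots = new List<Action>[horizonMs];

    // O(1): index the ring buffer by relative due time.
    public void Schedule(int relativeDueMs, Action callback)
    {
        if (relativeDueMs <= 0 || relativeDueMs >= _slots.Length)
            throw new ArgumentOutOfRangeException(nameof(relativeDueMs)); // beyond the horizon
        int slot = (int)((_currentTick + relativeDueMs) % _slots.Length);
        (_slots[slot] ??= new List<Action>()).Add(callback);
    }

    // Called once per millisecond: fires everything in the current slot; no scan
    // of unrelated timers ever happens.
    public void Tick()
    {
        int slot = (int)(_currentTick % _slots.Length);
        List<Action> due = _slots[slot];
        _slots[slot] = null;
        _currentTick++;
        if (due != null)
            foreach (Action cb in due) cb();
    }
}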

@vancem

vancem commented Oct 10, 2018

@kevingosse

In general, it would be VERY helpful for anyone who has issues with the current Timer scalability to log their issue here (or we can create another issue that this one links to). Ideally such information includes a small program that shows the expected mix and scale (thus X threads adding Y timers per sec with duration statistics Z), along with whether they are canceled or not. This would help us understand whether we need to do more.

One very simple change we could make to the current implementation is to make ShortTimersThresholdMilliseconds dynamic (it would try to tune itself so that, say, 1/10 of the entries are in the short list). That would substantially help your case (where today it probably does not help much at all, since all your entries are < 333 msec).
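
Purely as illustration, a hypothetical sketch of such tuning (the 1/10 target and the halving/doubling policy are guesses, not anything concretely proposed here):

using System;

static class ThresholdTuning
{
    // After each long-list scan, nudge the window length so roughly 1/10 of all
    // timers land in the short list.
    public static long Tune(long currentWindowMs, int shortCount, int totalCount)
    {
        double fraction = totalCount == 0 ? 0.0 : (double)shortCount / totalCount;
        if (fraction > 0.10) return Math.Max(1, currentWindowMs / 2); // too many short: shrink window
        if (fraction < 0.05) return currentWindowMs * 2;              // too few: grow window
        return currentWindowMs;
    }
}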

Other data structures (like what you describe) are possible, but I still like my quicksort pivots (they cover lots of the cases and work well in the common ones). But first we need to determine if we care. (Hence my first paragraph above.)
