Max one threadpool request outstanding; when a steal takes place, store the location for the next thread to continue the (steal) search from that point.
break;
}
count = prev;
ThreadPool.RequestWorkerThread();
An issue with this is going to be that when an idle thread pool gets a burst of work items in quick succession, only one thread would be requested, and only that one thread would then be available to process all of that work. It would be fine (maybe even better) if those work items are very short, but they may not be, and it may be beneficial to have multiple threads processing them. I have some thoughts on how to approach this issue and other perf issues here, but I'd have to prototype them to validate them. I think it will need some design changes to solve these problems.
An issue with this is going to be that when an idle thread pool gets a burst of work items in quick succession, only one thread would be requested, and only one thread then will be available to process all of that work.
Except... one of the things we were trying to fix in the reverted PR https://github.com/dotnet/coreclr/issues/14017 was the over-requesting of threads, because the very first thing Dispatch does when it dequeues a work item, before running it, is to request another thread (as does Enqueue, so it double-requests):
coreclr/src/System.Private.CoreLib/src/System/Threading/ThreadPool.cs
Lines 561 to 564 in 208ea16
// If we found work, there may be more work. Ask for another thread so that the other work can be processed
// in parallel. Note that this will only ask for a max of #procs threads, so it's safe to call it for every dequeue.
//
workQueue.EnsureThreadRequested();
Rather than trying to fix that, which was problematic and difficult given the races, the idea is to work with it. So it will always have a request active while it is successfully dequeueing (except in the gap between wake-up and dequeue), though only 1 rather than the CPU count?
I believe the over-requesting is currently necessary because the thread request count tracking is fragile, mainly to reduce the chance of failure to near-zero (but not zero), theoretically. @sdmaclea tried to limit the requesting to the right amount but it ended up having races - I don't think there was anything wrong with the solution, but because of existing issues, it didn't work.
Agreed, but this takes a different approach by instead working with the over-requesting?
So with this change, while there are any items in the queue there will always[1] be one thread request active (and also briefly when there are no more items but one has just been dequeued).
Whereas currently it will very quickly grow to ProcessorCount thread requests active (since the number never decreases while items are being dequeued), as well as having a hot CompareExchange loop to decrement and increment that number.
[1] There will be a slight gap between a thread being activated and dequeuing an item, where it will drop to 0.
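For illustration, a minimal sketch of what a single outstanding request could look like, assuming a lone threadRequestOutstanding flag replaces the counted requests; the names and structure below are stand-ins, not the PR's actual diff:

```csharp
using System.Threading;

// Illustrative sketch only: a single flag replaces the counted requests, so an
// enqueue asks the runtime for at most one worker until that worker wakes up
// and clears the flag before it starts dequeuing.
internal static class SingleThreadRequestSketch
{
    private static int s_threadRequestOutstanding; // 0 = no request pending, 1 = pending

    internal static void EnsureThreadRequested()
    {
        // Only the 0 -> 1 transition issues a real request; concurrent enqueues
        // no-op, so there is no CompareExchange spin loop here.
        if (Interlocked.CompareExchange(ref s_threadRequestOutstanding, 1, 0) == 0)
        {
            // ThreadPool.RequestWorkerThread(); // hypothetical call into the runtime
        }
    }

    internal static void MarkThreadRequestSatisfied()
    {
        // Called by the woken worker before its first dequeue, so that the next
        // enqueue (or the worker itself, after a successful dequeue) can request again.
        Interlocked.Exchange(ref s_threadRequestOutstanding, 0);
    }
}
```

The point of the sketch is that only the 0 -> 1 transition talks to the runtime, so enqueues made while a request is already pending are effectively free.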
{
// Set the search start to the next queue, if it is currently set to no preference
i = (i < maxIndex) ? i + 1 : 0;
Interlocked.CompareExchange(ref WorkStealingQueueList.QueueSerachStart, i, -1);
This is an interesting idea in the sense it tries to get stealing threads to fall into a cooperative pattern of stealing, but a potential issue may be that if a thread has many work items queued locally, many threads may get stuck on stealing from that thread, increasing contention, instead of stealing from other threads that also have work items. Difficult to tell how it would play out.
The idea was to skip searching all the queues before it, as they have already been determined to be empty. However, thinking about it, if only one thread is producing work then moving the start to the queue after that thread would cause a full scan of every queue again - maybe random is best...
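To make the intent concrete, here is a hedged sketch of the consuming side, i.e. how a stealing thread might use the published hint; QueueSearchStart and the queues array are assumptions standing in for the real WorkStealingQueueList types:

```csharp
using System.Threading;

// Illustrative sketch: a stealing thread starts its scan at the shared hint so
// queues already found empty by a previous stealer are visited last.
internal static class StealScanSketch
{
    internal static int QueueSearchStart = -1; // -1 = no preference

    internal static object TrySteal(object[] localQueues)
    {
        // Begin at the published hint, falling back to index 0 when none is set.
        int start = Volatile.Read(ref QueueSearchStart);
        if (start < 0 || start >= localQueues.Length)
            start = 0;

        for (int n = 0; n < localQueues.Length; n++)
        {
            int i = (start + n) % localQueues.Length;
            object item = localQueues[i]; // stand-in for queues[i].TrySteal(...)
            if (item != null)
            {
                // Publish where the next stealer should continue, but only if no
                // other thread has already expressed a preference.
                int next = (i + 1 < localQueues.Length) ? i + 1 : 0;
                Interlocked.CompareExchange(ref QueueSearchStart, next, -1);
                return item;
            }
        }
        return null;
    }
}
```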
I see, makes sense
It wouldn't affect hill climbing, which calculates throughput based on how frequently threads return from ThreadPoolWorkQueue.Dispatch (in steady state that is closely tied to the quantum; in bursty cases it's more closely tied to the number of work items processed). But it would break the starvation heuristic, which needs a pending thread request to schedule another thread when a work item is taking too long.
Although, hill climbing's thread increases would also be useless if there are no pending thread requests.
The dequeue always makes another thread request if any item is dequeued; so that should be ok?
Yes I forgot about that. I see where you're coming from now. I'll need to think about it a bit more. Also, last time I tried adding an interlocked operation on enqueue/dequeue to limit the requests it significantly degraded the throughput.
Currently both Could add a
My motivation behind this was the time spent in
If a thread in Dispatch exits for any reason other than seeing an empty queue, I think it would need to replace itself by calling RequestThread directly instead of EnsureThreadRequested, otherwise the number of worker threads could come back down to one while there are still many work items remaining.
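A rough sketch of that exit rule, assuming a Dispatch loop bounded by a quantum; the queue and the two request methods are stand-ins for the real ThreadPoolWorkQueue members discussed above:

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative sketch: only an exit caused by an empty queue may rely on the
// outstanding-request flag; any other exit replaces the worker directly.
internal static class DispatchExitSketch
{
    private static readonly ConcurrentQueue<Action> s_workItems = new ConcurrentQueue<Action>();

    internal static void Dispatch(int quantumItems)
    {
        int processed = 0;
        while (processed < quantumItems)
        {
            if (!s_workItems.TryDequeue(out Action work))
            {
                // Saw an empty queue: safe to exit without a replacement; the next
                // enqueue will call EnsureThreadRequested() and wake a worker.
                return;
            }

            processed++;
            work();
        }

        // Exited for a reason other than an empty queue (quantum expired here, but an
        // exception path would behave the same): replace this worker unconditionally
        // so the thread count can't collapse to one while items remain.
        // RequestThread();
    }
}
```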
It looks like this may allow reordering the read of threadRequestOutstanding to before the write to m_tailIndex in LocalPush, is this possible?
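For reference, a hedged sketch of the store-load concern and the kind of full fence that would rule it out, assuming a flag-checking LocalPush roughly along the lines under discussion (all names below are stand-ins, not the PR's code):

```csharp
using System.Threading;

// Illustrative sketch only. If the read of the request flag can move above the
// write to m_tailIndex, this thread may see a stale "request pending" value and
// skip requesting, while the worker that clears that flag never sees the item.
internal sealed class LocalPushSketch
{
    private readonly object[] m_array = new object[32];
    private volatile int m_tailIndex;
    private int m_threadRequestOutstanding;

    public void LocalPush(object item)
    {
        int tail = m_tailIndex;
        m_array[tail & (m_array.Length - 1)] = item;

        // Volatile (release) write publishes the item...
        m_tailIndex = tail + 1;

        // ...but a release store followed by an acquire load can still be reordered
        // (store-load), so a full fence is needed to keep the read below from moving
        // above the write.
        Interlocked.MemoryBarrier();

        if (Volatile.Read(ref m_threadRequestOutstanding) == 0)
        {
            // EnsureThreadRequested(); // only request a worker when none is pending
        }
    }
}
```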
Differentiated between
Removed the check and went back to
Minor nit - EnsureThreadRequested sounds like it would request a thread if one is not already requested, and RequestThread sounds like it would always request a thread; maybe swap the names?

On the task and thread pool perf tests here I'm seeing a noticeable regression in throughput. Test name is the first arg; all of the tests below run with proc count threads (second arg), and for the burst work tests the third arg is a proc count multiplier for the burst length (runs one burst at a time to completion, then another burst, and so on).
I tried running the changes in my branch ThreadRequestFix again; that change tries to request the minimum number of threads by adding an interlocked operation to enqueue/dequeue. There is still a regression:
At some point I think we'd have to take the regression to fix the issue, but I think we can compensate with other improvements. I'd also like to think about it a bit more and see if there's a way.
Closing this for now, might revisit again
Latest attempt: https://github.com/dotnet/coreclr/compare/master...benaadams:RequestWorkerThread?expand=1
Better for sustained work and some bursts, but much worse for some bursts (pre/post benchmark results omitted).
As the number of cores increases, so does ThreadPool contention; the aim is to reduce contention in the plumbing by reducing the number of threads looking for work when queues are mostly empty.
Remove the CompareExchange spinning from EnsureThreadRequested and MarkThreadRequestSatisfied
Max one threadpool request outstanding
Approaching https://github.com/dotnet/coreclr/issues/14017 from a different direction
@kouvel would only having one ThreadPool request outstanding break hill climbing?
/cc @sdmaclea @stephentoub @vancem