
[WIP] Threadpool exploration RFC #18158

Closed - wants to merge 4 commits

Conversation

benaadams (Member) commented May 29, 2018

As the number of cores increases, so does ThreadPool contention; the aim is to reduce contention in the plumbing by reducing the number of threads looking for work when the queues are mostly empty.

  • Remove the CompareExchange spinning from EnsureThreadRequested and MarkThreadRequestSatisfied

  • Max one threadpool request outstanding

Approaching https://github.com/dotnet/coreclr/issues/14017 from a different direction

@kouvel would only having one ThreadPool request outstanding break hill climbing?

/cc @sdmaclea @stephentoub @vancem
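
Roughly, the idea is to collapse the outstanding-request counter down to a single flag. A minimal sketch of the shape (not the exact diff in this PR - the threadRequestOutstanding field name is illustrative, and ThreadPool.RequestWorkerThread is the existing internal entry point, so this only compiles inside corelib):

using System.Threading;

internal static class ThreadPoolWorkQueueSketch
{
    // 0 = no thread request pending, 1 = exactly one pending (illustrative field).
    private static int threadRequestOutstanding;

    internal static void EnsureThreadRequested()
    {
        // One Interlocked.Exchange instead of a CompareExchange retry loop:
        // only ask for a worker if no request is currently outstanding.
        if (Interlocked.Exchange(ref threadRequestOutstanding, 1) == 0)
            ThreadPool.RequestWorkerThread();
    }

    internal static void MarkThreadRequestSatisfied()
    {
        // The worker that wakes up clears the flag, so the next enqueue/dequeue
        // can issue a fresh request.
        Volatile.Write(ref threadRequestOutstanding, 0);
    }
}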

Review thread on commit "Max one threadpool request outstanding" ("When a steal takes place, store location for next thread to continue (steal) search from that point"), on this diff fragment:

break;
}
count = prev;
ThreadPool.RequestWorkerThread();
kouvel (Member)

An issue with this is going to be that when an idle thread pool gets a burst of work items in quick succession, only one thread would be requested, and only one thread then will be available to process all of that work. It would be fine (maybe even better) if those work items are very short, but they may not be and it may be beneficial to have multiple threads processing them. I have some thoughts on how to approach this issue and other perf issues here, but I'd have to prototype them to validate it. I think it will need some design changes to solve these problems.

benaadams (Member Author) commented May 29, 2018

> An issue with this is going to be that when an idle thread pool gets a burst of work items in quick succession, only one thread would be requested, and only one thread then will be available to process all of that work.

Except... one of the things we were trying to fix in the reverted PR https://github.com/dotnet/coreclr/issues/14017 was the over-requesting of threads; because the very first thing Dispatch does if it dequeues a work item, before running it, is to request another thread (as does Enqueue - so it double-requests):

// If we found work, there may be more work. Ask for another thread so that the other work can be processed
// in parallel. Note that this will only ask for a max of #procs threads, so it's safe to call it for every dequeue.
//
workQueue.EnsureThreadRequested();
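
i.e. the flow is roughly this shape (heavily simplified sketch, not the actual source):

void Enqueue(object callback)
{
    // ... push the work item onto the local or global queue ...
    EnsureThreadRequested();                 // request #1: a worker for the new item
}

void Dispatch()
{
    MarkThreadRequestSatisfied();            // this worker consumes one request
    object workItem = Dequeue();
    if (workItem != null)
    {
        // Request #2: ask for another worker before even running the item,
        // in case there is more work behind it.
        EnsureThreadRequested();
        // ... execute workItem, keep looping while the quantum lasts ...
    }
}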

benaadams (Member Author)

Rather than trying to fix that, which was problematic and difficult to get right around the races, the idea is to work with it. So it will always have a request active while it is successfully dequeuing (except in the gap between wake-up and dequeue), though only 1 rather than the CPU count?

kouvel (Member)

I believe the over-requesting is currently necessary because the thread request count tracking is fragile; it mainly reduces the chance of failure to near-zero (but not zero, theoretically). @sdmaclea tried to limit the requesting to the right amount but it ended up having races - I don't think there was anything wrong with the solution, but because of existing issues it didn't work.

benaadams (Member Author)

Agreed, but this takes a different approach by instead working with the over-requesting?

So with this change, while there are any items in the queue there will always[1] be one thread request active (and also briefly when there are no more items but one has just been dequeued).

Whereas currently it very quickly grows to ProcessorCount thread requests active (as the number never decreases while items are being dequeued), as well as having a hot CompareExchange loop to decrease and increase the number.

[1] There will be a slight gap between a thread being activated and dequeuing an item, where it will drop to 0
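
For comparison, the current plumbing looks roughly like this (simplified sketch - the real code uses the runtime's cached processor count and slightly different guards):

private int numOutstandingThreadRequests;

internal void EnsureThreadRequested()
{
    // Bump the outstanding-request count, up to the processor count,
    // retrying the CompareExchange under contention.
    int count = numOutstandingThreadRequests;
    while (count < Environment.ProcessorCount)
    {
        int prev = Interlocked.CompareExchange(ref numOutstandingThreadRequests, count + 1, count);
        if (prev == count)
        {
            ThreadPool.RequestWorkerThread();
            break;
        }
        count = prev;
    }
}

internal void MarkThreadRequestSatisfied()
{
    // Mirror image on the dispatch side: decrement the count, again in a CompareExchange loop.
    int count = numOutstandingThreadRequests;
    while (count > 0)
    {
        int prev = Interlocked.CompareExchange(ref numOutstandingThreadRequests, count - 1, count);
        if (prev == count)
            break;
        count = prev;
    }
}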

Review thread on this diff fragment:

{
    // Set the search start to the next queue, if it is currently set to no preference
    i = (i < maxIndex) ? i + 1 : 0;
    Interlocked.CompareExchange(ref WorkStealingQueueList.QueueSerachStart, i, -1);
kouvel (Member)

This is an interesting idea in the sense that it tries to get stealing threads to fall into a cooperative pattern of stealing, but a potential issue is that if a thread has many work items queued locally, many threads may get stuck stealing from that thread, increasing contention, instead of stealing from other threads that also have work items. Difficult to tell how it would play out.

benaadams (Member Author)

The idea was to skip the search of all the threads before it, as they have already been determined to be empty. However, thinking about it, if only one thread is producing work then moving the start to the queue after that thread would cause a full scan of every thread again - maybe random is best...
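
For illustration, the scan with a shared start hint looks something like this (sketch only - QueueSearchStart, WorkStealingQueue and TrySteal stand in for the real members):

static int QueueSearchStart = -1;                  // -1 = no preference

static object TrySteal(WorkStealingQueue[] queues, int localIndex)
{
    // Start scanning from the shared hint if one is set, otherwise from our own slot.
    int start = Volatile.Read(ref QueueSearchStart);
    if ((uint)start >= (uint)queues.Length)
        start = localIndex;

    for (int n = 0; n < queues.Length; n++)
    {
        int i = (start + n) % queues.Length;
        object item = queues[i].TrySteal();        // attempt to steal from queue i
        if (item != null)
        {
            // Publish the next queue as the new start, but only if the hint is currently
            // "no preference", so the queues this scan just found empty get skipped.
            Interlocked.CompareExchange(ref QueueSearchStart, (i + 1) % queues.Length, -1);
            return item;
        }
    }
    return null;
}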

kouvel (Member)

I see, makes sense

kouvel (Member) commented May 29, 2018

> would only having one ThreadPool request outstanding break hill climbing?

It wouldn't affect hill climbing, which calculates throughput based on how frequently threads return from ThreadPoolWorkQueue.Dispatch (in steady state that is closely tied to the quantum; in bursty cases it's more closely tied to the number of work items processed).

But it would break the starvation heuristic, which needs a pending thread request to schedule another thread when a work item is taking too long.

kouvel (Member) commented May 29, 2018

Although, hill climbing's thread increases would also be useless if there are no pending thread requests.

benaadams (Member Author)

> But it would break the starvation heuristic, which needs a pending thread request to schedule another thread when a work item is taking too long.

> Although, hill climbing's thread increases would also be useless if there are no pending thread requests.

The dequeue always makes another thread request if any item is dequeued; so that should be ok?

kouvel (Member) commented May 29, 2018

> The dequeue always makes another thread request if any item is dequeued; so that should be ok?

Yes I forgot about that. I see where you're coming from now. I'll need to think about it a bit more. Also, last I tried adding an interlocked operation on enqueue/dequeue to limit the requests it significantly degraded the throughput.

benaadams (Member Author)

> Also, last I tried adding an interlocked operation on enqueue/dequeue to limit the requests it significantly degraded the throughput.

Currently both EnsureThreadRequested and MarkThreadRequestSatisfied use an Interlocked.CompareExchange loop; this changes them to an Interlocked.Exchange and a Volatile.Write.

Could add a Volatile.Read guard first (as now, before the loop)?
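
i.e. something like this (sketch, building on the single threadRequestOutstanding flag above):

internal void EnsureThreadRequested()
{
    // Cheap Volatile.Read fast path first; only pay for the Interlocked.Exchange
    // when it looks like no request is outstanding.
    if (Volatile.Read(ref threadRequestOutstanding) == 0 &&
        Interlocked.Exchange(ref threadRequestOutstanding, 1) == 0)
    {
        ThreadPool.RequestWorkerThread();
    }
}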

benaadams (Member Author) commented May 30, 2018

My motivation behind this was the time spent in EnsureThreadRequested for some workloads (higher than Dequeue) https://github.com/dotnet/coreclr/issues/14017#issuecomment-390080961

[profiler screenshot: time spent in EnsureThreadRequested vs Dequeue]

kouvel (Member) commented May 30, 2018

If a thread in Dispatch exits for any reason other than seeing an empty queue, I think it would need to replace itself by calling RequestThread directly instead of EnsureThreadRequested, otherwise the number of worker threads could come back down to one while there are still many work items remaining.

> Could add a Volatile.Read guard first (as now, before the loop)?

It looks like this may allow the read of threadRequestOutstanding to be reordered before the write to m_tailIndex in LocalPush; is this possible (see the sketch after the steps below):

  • T0 sees threadRequestOutstanding == 1
  • T1 sets threadRequestOutstanding = 0
  • T1 tries to steal from T0 and fails because tail is not updated yet
  • T0 sets tail, and now there is a pending work item with threadRequestOutstanding == 0 and no threads to process it
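
A simplified sketch of that interleaving (illustrative only; queue fields abbreviated):

// Abbreviated fields of the local work-stealing queue:
object[] m_array; int m_mask; volatile int m_tailIndex;
int threadRequestOutstanding;

void LocalPush(object item)                       // runs on T0
{
    int tail = m_tailIndex;
    m_array[tail & m_mask] = item;
    m_tailIndex = tail + 1;                        // publish the new item to stealers
}

void Enqueue(object item)                         // runs on T0
{
    LocalPush(item);
    // EnsureThreadRequested starts with Volatile.Read(ref threadRequestOutstanding).
    // Store-load reordering lets that read be satisfied before the store to m_tailIndex
    // is visible to T1; T1 can then clear the flag and fail the steal (it still sees the
    // old tail), leaving a queued item with no pending thread request.
    EnsureThreadRequested();
}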

benaadams (Member Author)

> If a thread in Dispatch exits for any reason other than seeing an empty queue, I think it would need to replace itself by calling RequestThread directly

Differentiated between RequestThread and EnsureThreadRequested, with the second being unconditional and called on any exit other than an empty queue.

> Could add a Volatile.Read guard first (as now, before the loop)?

> It looks like this may allow reordering the read ...

Removed the check and went back to Exchange

kouvel (Member) commented May 30, 2018

Minor nit - EnsureThreadRequested sounds like it would request a thread if one is not already requested, and RequestThread sounds like it would always request a thread, maybe swap the names?

On the task and thread pool perf tests here I'm seeing a noticeable regression in throughput. Test name is the first arg; all of the tests below run with proc count threads (second arg), and for the burst work tests the third arg is a proc count multiplier for the burst length (runs one burst at a time to completion, then another burst, and so on).

Spin                                           Left score       Right score      ∆ Score   ∆ Score %
---------------------------------------------  ---------------  ---------------  --------  ---------
TaskBurstWorkThroughput 1PcT 0000.5PcWi          704.53 ±0.22%    483.91 ±0.64%   -220.62    -31.31%
TaskBurstWorkThroughput 1PcT 0001.0PcWi         1205.73 ±0.37%    754.82 ±0.55%   -450.91    -37.40%
TaskBurstWorkThroughput 1PcT 0004.0PcWi         3833.26 ±0.37%   2917.28 ±0.40%   -915.99    -23.90%
TaskBurstWorkThroughput 1PcT 0016.0PcWi         7527.02 ±0.18%   5985.17 ±0.26%  -1541.85    -20.48%
TaskBurstWorkThroughput 1PcT 0064.0PcWi         9876.11 ±0.19%   8266.56 ±0.25%  -1609.55    -16.30%
TaskBurstWorkThroughput 1PcT 0256.0PcWi        10751.04 ±0.44%   9058.54 ±0.11%  -1692.49    -15.74%
TaskSustainedWorkThroughput 1PcT               13148.42 ±0.37%  11890.63 ±0.33%  -1257.79     -9.57%
ThreadPoolBurstWorkThroughput 1PcT 0000.5PcWi    738.70 ±0.14%    491.33 ±1.55%   -247.37    -33.49%
ThreadPoolBurstWorkThroughput 1PcT 0001.0PcWi   1365.77 ±0.27%    751.64 ±0.66%   -614.13    -44.97%
ThreadPoolBurstWorkThroughput 1PcT 0004.0PcWi   3569.96 ±0.36%   2905.31 ±0.46%   -664.65    -18.62%
ThreadPoolBurstWorkThroughput 1PcT 0016.0PcWi   6128.14 ±0.65%   5428.41 ±0.34%   -699.73    -11.42%
ThreadPoolBurstWorkThroughput 1PcT 0064.0PcWi   7424.80 ±0.54%   6442.15 ±0.72%   -982.65    -13.23%
ThreadPoolBurstWorkThroughput 1PcT 0256.0PcWi   7837.07 ±0.57%   6753.00 ±0.45%  -1084.07    -13.83%
ThreadPoolSustainedWorkThroughput 1PcT          8992.27 ±0.42%   7527.19 ±0.35%  -1465.08    -16.29%
---------------------------------------------  ---------------  ---------------  --------  ---------
Total                                           4112.04 ±0.36%   3180.36 ±0.51%   -931.68    -22.66%

kouvel (Member) commented May 30, 2018

I tried running the changes in my branch ThreadRequestFix again; that change tries to request the minimum number of threads by adding an interlocked operation to enqueue/dequeue. There is still a regression:

Spin                                           Left score       Right score      ∆ Score   ∆ Score %
---------------------------------------------  ---------------  ---------------  --------  ---------
TaskBurstWorkThroughput 1PcT 0000.5PcWi          697.81 ±0.25%    708.22 ±0.67%     10.41      1.49%
TaskBurstWorkThroughput 1PcT 0001.0PcWi         1222.20 ±0.07%   1300.05 ±0.09%     77.85      6.37%
TaskBurstWorkThroughput 1PcT 0004.0PcWi         3823.98 ±0.49%   3649.61 ±0.40%   -174.36     -4.56%
TaskBurstWorkThroughput 1PcT 0016.0PcWi         7557.12 ±0.23%   6603.54 ±0.45%   -953.57    -12.62%
TaskBurstWorkThroughput 1PcT 0064.0PcWi         9957.05 ±0.38%   8390.75 ±0.12%  -1566.31    -15.73%
TaskBurstWorkThroughput 1PcT 0256.0PcWi        10835.69 ±0.08%   9064.83 ±0.08%  -1770.86    -16.34%
TaskSustainedWorkThroughput 1PcT               13248.95 ±0.08%  11234.74 ±0.06%  -2014.21    -15.20%
ThreadPoolBurstWorkThroughput 1PcT 0000.5PcWi    718.83 ±0.79%    718.14 ±0.72%     -0.69     -0.10%
ThreadPoolBurstWorkThroughput 1PcT 0001.0PcWi   1363.58 ±0.25%   1547.38 ±0.32%    183.80     13.48%
ThreadPoolBurstWorkThroughput 1PcT 0004.0PcWi   3580.13 ±0.76%   3520.76 ±0.87%    -59.37     -1.66%
ThreadPoolBurstWorkThroughput 1PcT 0016.0PcWi   6178.10 ±0.35%   5747.01 ±0.41%   -431.09     -6.98%
ThreadPoolBurstWorkThroughput 1PcT 0064.0PcWi   7405.27 ±0.37%   6758.71 ±0.63%   -646.55     -8.73%
ThreadPoolBurstWorkThroughput 1PcT 0256.0PcWi   7773.02 ±0.45%   7088.42 ±0.33%   -684.60     -8.81%
ThreadPoolSustainedWorkThroughput 1PcT          9160.64 ±0.25%   7571.38 ±0.25%  -1589.25    -17.35%
---------------------------------------------  ---------------  ---------------  --------  ---------
Total                                           4117.61 ±0.34%   3845.50 ±0.39%   -272.11     -6.61%

At some point I think we'd have to take the regression to fix the issue but I think we can compensate with other improvements. I'd also like to think about it a bit more and see if there's a way.

benaadams (Member Author)

Closing this for now, might revisit again

@benaadams benaadams closed this Aug 21, 2018
@benaadams benaadams deleted the threadpool branch November 23, 2018 18:03
benaadams (Member Author)

Latest attempt https://github.com/dotnet/coreclr/compare/master...benaadams:RequestWorkerThread?expand=1

Better for sustained and some bursts

Pre

ThreadPoolSustainedWorkThroughput 1PcT
Score: 8854.050461
Score: 8889.016221
Score: 8395.578876
Score: 8599.222212

ThreadPoolBurstWorkThroughput 1PcT 0256.0PcWi
Score: 8088.785001
Score: 7737.525681
Score: 7565.940737
Score: 7372.339747

Post

ThreadPoolSustainedWorkThroughput 1PcT
Score: 9042.440356
Score: 9176.166294
Score: 8934.281855
Score: 9366.574882

ThreadPoolBurstWorkThroughput 1PcT 0256.0PcWi
Score: 8039.497601
Score: 7674.125610
Score: 8260.829927
Score: 8160.117525

But much worse for some bursts

Pre

ThreadPoolBurstWorkThroughput 1PcT 0000.5PcWi
Score: 435.122252
Score: 440.234323
Score: 409.431425
Score: 468.664461

Post

ThreadPoolBurstWorkThroughput 1PcT 0000.5PcWi
Score: 243.718592
Score: 244.918312
Score: 240.200779
Score: 233.747782
