
[WIP] Threadpool exploration RFC #18158

Closed - wants to merge 4 commits

Conversation

benaadams (Member) commented May 29, 2018

As the number of cores increases, so does ThreadPool contention; the aim is to reduce contention in the plumbing by reducing the number of threads looking for work when the queues are mostly empty.

  • Remove the CompareExchange spinning from EnsureThreadRequested and MarkThreadRequestSatisfied

  • Max one threadpool request outstanding

Approaching https://github.com/dotnet/coreclr/issues/14017 from a different direction

@kouvel would only having one ThreadPool request outstanding break hill climbing?

/cc @sdmaclea @stephentoub @vancem
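
Roughly, the idea is to collapse the outstanding-request counter down to a single flag. A minimal sketch of the shape (not the exact diff in this PR - the threadRequestOutstanding field name is illustrative, and ThreadPool.RequestWorkerThread is the existing internal entry point, so this only compiles inside corelib):

using System.Threading;

internal static class ThreadPoolWorkQueueSketch
{
    // 0 = no thread request pending, 1 = exactly one pending (illustrative field).
    private static int threadRequestOutstanding;

    internal static void EnsureThreadRequested()
    {
        // One Interlocked.Exchange instead of a CompareExchange retry loop:
        // only ask for a worker if no request is currently outstanding.
        if (Interlocked.Exchange(ref threadRequestOutstanding, 1) == 0)
            ThreadPool.RequestWorkerThread();
    }

    internal static void MarkThreadRequestSatisfied()
    {
        // The worker that wakes up clears the flag, so the next enqueue/dequeue
        // can issue a fresh request.
        Volatile.Write(ref threadRequestOutstanding, 0);
    }
}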

Review thread on commit "Max one threadpool request outstanding" ("When a steal takes place, store location for next thread to continue (steal) search from that point"), on this diff fragment:

break;
}
count = prev;
ThreadPool.RequestWorkerThread();
kouvel (Member)

An issue with this is going to be that when an idle thread pool gets a burst of work items in quick succession, only one thread would be requested, and only one thread then will be available to process all of that work. It would be fine (maybe even better) if those work items are very short, but they may not be and it may be beneficial to have multiple threads processing them. I have some thoughts on how to approach this issue and other perf issues here, but I'd have to prototype them to validate it. I think it will need some design changes to solve these problems.

benaadams (Member Author) commented May 29, 2018

> An issue with this is going to be that when an idle thread pool gets a burst of work items in quick succession, only one thread would be requested, and only one thread then will be available to process all of that work.

Except... one of the things we were trying to fix in the reverted PR https://github.com/dotnet/coreclr/issues/14017 was the over-requesting of threads; because the very first thing Dispatch does if it dequeues a work item, before running it, is to request another thread (as does Enqueue - so it double-requests):

// If we found work, there may be more work. Ask for another thread so that the other work can be processed
// in parallel. Note that this will only ask for a max of #procs threads, so it's safe to call it for every dequeue.
//
workQueue.EnsureThreadRequested();
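
i.e. the flow is roughly this shape (heavily simplified sketch, not the actual source):

void Enqueue(object callback)
{
    // ... push the work item onto the local or global queue ...
    EnsureThreadRequested();                 // request #1: a worker for the new item
}

void Dispatch()
{
    MarkThreadRequestSatisfied();            // this worker consumes one request
    object workItem = Dequeue();
    if (workItem != null)
    {
        // Request #2: ask for another worker before even running the item,
        // in case there is more work behind it.
        EnsureThreadRequested();
        // ... execute workItem, keep looping while the quantum lasts ...
    }
}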

benaadams (Member Author)

Rather than trying to fix that, which was problematic and difficult to get right around the races, the idea is to work with it. So it will always have a request active while it is successfully dequeuing (except in the gap between wake-up and dequeue), though only 1 rather than the CPU count?

kouvel (Member)

I believe the over-requesting is currently necessary because the thread request count tracking is fragile; it mainly reduces the chance of failure to near-zero (but not zero, theoretically). @sdmaclea tried to limit the requesting to the right amount but it ended up having races - I don't think there was anything wrong with the solution, but because of existing issues it didn't work.

benaadams (Member Author)

Agreed, but this takes a different approach by instead working with the over-requesting?

So with this change, while there are any items in the queue there will always[1] be one thread request active (and also briefly when there are no more items but one has just been dequeued).

Whereas currently it very quickly grows to ProcessorCount thread requests active (as the number never decreases while items are being dequeued), as well as having a hot CompareExchange loop to decrease and increase the number.

[1] There will be a slight gap between a thread being activated and dequeuing an item, where it will drop to 0
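
For comparison, the current plumbing looks roughly like this (simplified sketch - the real code uses the runtime's cached processor count and slightly different guards):

private int numOutstandingThreadRequests;

internal void EnsureThreadRequested()
{
    // Bump the outstanding-request count, up to the processor count,
    // retrying the CompareExchange under contention.
    int count = numOutstandingThreadRequests;
    while (count < Environment.ProcessorCount)
    {
        int prev = Interlocked.CompareExchange(ref numOutstandingThreadRequests, count + 1, count);
        if (prev == count)
        {
            ThreadPool.RequestWorkerThread();
            break;
        }
        count = prev;
    }
}

internal void MarkThreadRequestSatisfied()
{
    // Mirror image on the dispatch side: decrement the count, again in a CompareExchange loop.
    int count = numOutstandingThreadRequests;
    while (count > 0)
    {
        int prev = Interlocked.CompareExchange(ref numOutstandingThreadRequests, count - 1, count);
        if (prev == count)
            break;
        count = prev;
    }
}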

Review thread on this diff fragment:

{
    // Set the search start to the next queue, if it is currently set to no preference
    i = (i < maxIndex) ? i + 1 : 0;
    Interlocked.CompareExchange(ref WorkStealingQueueList.QueueSerachStart, i, -1);
kouvel (Member)

This is an interesting idea in the sense that it tries to get stealing threads to fall into a cooperative pattern of stealing, but a potential issue is that if a thread has many work items queued locally, many threads may get stuck stealing from that thread, increasing contention, instead of stealing from other threads that also have work items. Difficult to tell how it would play out.

benaadams (Member Author)

The idea was to skip the search of all the threads before it, as they have already been determined to be empty. However, thinking about it, if only one thread is producing work then moving the start to the queue after that thread would cause a full scan of every thread again - maybe random is best...
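
For illustration, the scan with a shared start hint looks something like this (sketch only - QueueSearchStart, WorkStealingQueue and TrySteal stand in for the real members):

static int QueueSearchStart = -1;                  // -1 = no preference

static object TrySteal(WorkStealingQueue[] queues, int localIndex)
{
    // Start scanning from the shared hint if one is set, otherwise from our own slot.
    int start = Volatile.Read(ref QueueSearchStart);
    if ((uint)start >= (uint)queues.Length)
        start = localIndex;

    for (int n = 0; n < queues.Length; n++)
    {
        int i = (start + n) % queues.Length;
        object item = queues[i].TrySteal();        // attempt to steal from queue i
        if (item != null)
        {
            // Publish the next queue as the new start, but only if the hint is currently
            // "no preference", so the queues this scan just found empty get skipped.
            Interlocked.CompareExchange(ref QueueSearchStart, (i + 1) % queues.Length, -1);
            return item;
        }
    }
    return null;
}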

kouvel (Member)

I see, makes sense

kouvel (Member) commented May 29, 2018

> would only having one ThreadPool request outstanding break hill climbing?

It wouldn't affect hill climbing, which calculates throughput based on how frequently threads return from ThreadPoolWorkQueue.Dispatch (in steady state that is closely tied to the quantum; in bursty cases it's more closely tied to the number of work items processed).

But it would break the starvation heuristic, which needs a pending thread request to schedule another thread when a work item is taking too long.

kouvel (Member) commented May 29, 2018

Although, hill climbing's thread increases would also be useless if there are no pending thread requests.

benaadams (Member Author)

> But it would break the starvation heuristic, which needs a pending thread request to schedule another thread when a work item is taking too long.

> Although, hill climbing's thread increases would also be useless if there are no pending thread requests.

The dequeue always makes another thread request if any item is dequeued; so that should be ok?

kouvel (Member) commented May 29, 2018

> The dequeue always makes another thread request if any item is dequeued; so that should be ok?

Yes I forgot about that. I see where you're coming from now. I'll need to think about it a bit more. Also, last I tried adding an interlocked operation on enqueue/dequeue to limit the requests it significantly degraded the throughput.

benaadams (Member Author)

> Also, last I tried adding an interlocked operation on enqueue/dequeue to limit the requests it significantly degraded the throughput.

Currently both EnsureThreadRequested and MarkThreadRequestSatisfied use an Interlocked.CompareExchange loop; this changes them to an Interlocked.Exchange and a Volatile.Write.

Could add a Volatile.Read guard first (as now, before the loop)?
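
i.e. something like this (sketch, building on the single threadRequestOutstanding flag above):

internal void EnsureThreadRequested()
{
    // Cheap Volatile.Read fast path first; only pay for the Interlocked.Exchange
    // when it looks like no request is outstanding.
    if (Volatile.Read(ref threadRequestOutstanding) == 0 &&
        Interlocked.Exchange(ref threadRequestOutstanding, 1) == 0)
    {
        ThreadPool.RequestWorkerThread();
    }
}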

benaadams (Member Author) commented May 30, 2018

My motivation behind this was the time spent in EnsureThreadRequested for some workloads (higher than Dequeue) https://github.com/dotnet/coreclr/issues/14017#issuecomment-390080961

[profiler screenshot: time spent in EnsureThreadRequested vs Dequeue]

kouvel (Member) commented May 30, 2018

If a thread in Dispatch exits for any reason other than seeing an empty queue, I think it would need to replace itself by calling RequestThread directly instead of EnsureThreadRequested, otherwise the number of worker threads could come back down to one while there are still many work items remaining.

> Could add a Volatile.Read guard first (as now, before the loop)?

It looks like this may allow the read of threadRequestOutstanding to be reordered before the write to m_tailIndex in LocalPush; is this possible (see the sketch after the steps below):

  • T0 sees threadRequestOutstanding == 1
  • T1 sets threadRequestOutstanding = 0
  • T1 tries to steal from T0 and fails because tail is not updated yet
  • T0 sets tail, and now there is a pending work item with threadRequestOutstanding == 0 and no threads to process it
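
A simplified sketch of that interleaving (illustrative only; queue fields abbreviated):

// Abbreviated fields of the local work-stealing queue:
object[] m_array; int m_mask; volatile int m_tailIndex;
int threadRequestOutstanding;

void LocalPush(object item)                       // runs on T0
{
    int tail = m_tailIndex;
    m_array[tail & m_mask] = item;
    m_tailIndex = tail + 1;                        // publish the new item to stealers
}

void Enqueue(object item)                         // runs on T0
{
    LocalPush(item);
    // EnsureThreadRequested starts with Volatile.Read(ref threadRequestOutstanding).
    // Store-load reordering lets that read be satisfied before the store to m_tailIndex
    // is visible to T1; T1 can then clear the flag and fail the steal (it still sees the
    // old tail), leaving a queued item with no pending thread request.
    EnsureThreadRequested();
}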

benaadams (Member Author)

> If a thread in Dispatch exits for any reason other than seeing an empty queue, I think it would need to replace itself by calling RequestThread directly

Differentiated between RequestThread and EnsureThreadRequested, with the second being unconditional and called on any exit other than an empty queue.

> Could add a Volatile.Read guard first (as now, before the loop)?

> It looks like this may allow reordering the read ...

Removed the check and went back to Exchange

kouvel (Member) commented May 30, 2018

Minor nit - EnsureThreadRequested sounds like it would request a thread if one is not already requested, and RequestThread sounds like it would always request a thread, maybe swap the names?

On the task and thread pool perf tests here I'm seeing a noticeable regression in throughput. Test name is the first arg; all of the tests below run with proc count threads (second arg), and for the burst work tests the third arg is a proc count multiplier for the burst length (runs one burst at a time to completion, then another burst, and so on).

Spin                                           Left score       Right score      ∆ Score   ∆ Score %
---------------------------------------------  ---------------  ---------------  --------  ---------
TaskBurstWorkThroughput 1PcT 0000.5PcWi          704.53 ±0.22%    483.91 ±0.64%   -220.62    -31.31%
TaskBurstWorkThroughput 1PcT 0001.0PcWi         1205.73 ±0.37%    754.82 ±0.55%   -450.91    -37.40%
TaskBurstWorkThroughput 1PcT 0004.0PcWi         3833.26 ±0.37%   2917.28 ±0.40%   -915.99    -23.90%
TaskBurstWorkThroughput 1PcT 0016.0PcWi         7527.02 ±0.18%   5985.17 ±0.26%  -1541.85    -20.48%
TaskBurstWorkThroughput 1PcT 0064.0PcWi         9876.11 ±0.19%   8266.56 ±0.25%  -1609.55    -16.30%
TaskBurstWorkThroughput 1PcT 0256.0PcWi        10751.04 ±0.44%   9058.54 ±0.11%  -1692.49    -15.74%
TaskSustainedWorkThroughput 1PcT               13148.42 ±0.37%  11890.63 ±0.33%  -1257.79     -9.57%
ThreadPoolBurstWorkThroughput 1PcT 0000.5PcWi    738.70 ±0.14%    491.33 ±1.55%   -247.37    -33.49%
ThreadPoolBurstWorkThroughput 1PcT 0001.0PcWi   1365.77 ±0.27%    751.64 ±0.66%   -614.13    -44.97%
ThreadPoolBurstWorkThroughput 1PcT 0004.0PcWi   3569.96 ±0.36%   2905.31 ±0.46%   -664.65    -18.62%
ThreadPoolBurstWorkThroughput 1PcT 0016.0PcWi   6128.14 ±0.65%   5428.41 ±0.34%   -699.73    -11.42%
ThreadPoolBurstWorkThroughput 1PcT 0064.0PcWi   7424.80 ±0.54%   6442.15 ±0.72%   -982.65    -13.23%
ThreadPoolBurstWorkThroughput 1PcT 0256.0PcWi   7837.07 ±0.57%   6753.00 ±0.45%  -1084.07    -13.83%
ThreadPoolSustainedWorkThroughput 1PcT          8992.27 ±0.42%   7527.19 ±0.35%  -1465.08    -16.29%
---------------------------------------------  ---------------  ---------------  --------  ---------
Total                                           4112.04 ±0.36%   3180.36 ±0.51%   -931.68    -22.66%

kouvel (Member) commented May 30, 2018

I tried running the changes in my branch ThreadRequestFix again; that change tries to request the minimum number of threads by adding an interlocked operation to enqueue/dequeue. There is still a regression:

Spin                                           Left score       Right score      ∆ Score   ∆ Score %
---------------------------------------------  ---------------  ---------------  --------  ---------
TaskBurstWorkThroughput 1PcT 0000.5PcWi          697.81 ±0.25%    708.22 ±0.67%     10.41      1.49%
TaskBurstWorkThroughput 1PcT 0001.0PcWi         1222.20 ±0.07%   1300.05 ±0.09%     77.85      6.37%
TaskBurstWorkThroughput 1PcT 0004.0PcWi         3823.98 ±0.49%   3649.61 ±0.40%   -174.36     -4.56%
TaskBurstWorkThroughput 1PcT 0016.0PcWi         7557.12 ±0.23%   6603.54 ±0.45%   -953.57    -12.62%
TaskBurstWorkThroughput 1PcT 0064.0PcWi         9957.05 ±0.38%   8390.75 ±0.12%  -1566.31    -15.73%
TaskBurstWorkThroughput 1PcT 0256.0PcWi        10835.69 ±0.08%   9064.83 ±0.08%  -1770.86    -16.34%
TaskSustainedWorkThroughput 1PcT               13248.95 ±0.08%  11234.74 ±0.06%  -2014.21    -15.20%
ThreadPoolBurstWorkThroughput 1PcT 0000.5PcWi    718.83 ±0.79%    718.14 ±0.72%     -0.69     -0.10%
ThreadPoolBurstWorkThroughput 1PcT 0001.0PcWi   1363.58 ±0.25%   1547.38 ±0.32%    183.80     13.48%
ThreadPoolBurstWorkThroughput 1PcT 0004.0PcWi   3580.13 ±0.76%   3520.76 ±0.87%    -59.37     -1.66%
ThreadPoolBurstWorkThroughput 1PcT 0016.0PcWi   6178.10 ±0.35%   5747.01 ±0.41%   -431.09     -6.98%
ThreadPoolBurstWorkThroughput 1PcT 0064.0PcWi   7405.27 ±0.37%   6758.71 ±0.63%   -646.55     -8.73%
ThreadPoolBurstWorkThroughput 1PcT 0256.0PcWi   7773.02 ±0.45%   7088.42 ±0.33%   -684.60     -8.81%
ThreadPoolSustainedWorkThroughput 1PcT          9160.64 ±0.25%   7571.38 ±0.25%  -1589.25    -17.35%
---------------------------------------------  ---------------  ---------------  --------  ---------
Total                                           4117.61 ±0.34%   3845.50 ±0.39%   -272.11     -6.61%

At some point I think we'd have to take the regression to fix the issue but I think we can compensate with other improvements. I'd also like to think about it a bit more and see if there's a way.

benaadams (Member Author)

Closing this for now, might revisit again

@benaadams benaadams closed this Aug 21, 2018
@benaadams benaadams deleted the threadpool branch November 23, 2018 18:03
benaadams (Member Author)

Latest attempt https://github.com/dotnet/coreclr/compare/master...benaadams:RequestWorkerThread?expand=1

Better for sustained and some bursts

Pre

ThreadPoolSustainedWorkThroughput 1PcT
Score: 8854.050461
Score: 8889.016221
Score: 8395.578876
Score: 8599.222212

ThreadPoolBurstWorkThroughput 1PcT 0256.0PcWi
Score: 8088.785001
Score: 7737.525681
Score: 7565.940737
Score: 7372.339747

Post

ThreadPoolSustainedWorkThroughput 1PcT
Score: 9042.440356
Score: 9176.166294
Score: 8934.281855
Score: 9366.574882

ThreadPoolBurstWorkThroughput 1PcT 0256.0PcWi
Score: 8039.497601
Score: 7674.125610
Score: 8260.829927
Score: 8160.117525

But much worse for some bursts

Pre

ThreadPoolBurstWorkThroughput 1PcT 0000.5PcWi
Score: 435.122252
Score: 440.234323
Score: 409.431425
Score: 468.664461

Post

ThreadPoolBurstWorkThroughput 1PcT 0000.5PcWi
Score: 243.718592
Score: 244.918312
Score: 240.200779
Score: 233.747782
