
[Wip] Improve Threadpool QUWI throughput #5943

Closed · wants to merge 23 commits

Conversation

@benaadams (Member) commented Jun 23, 2016

11% improvement for regular queuing (1.6s faster over 10M QUWI)
19.4% improvement for high thread count queuing (MinWorkerThreads=500) (7.2s faster over 10M QUWI)

Test project: https://gist.github.com/benaadams/b022934e62a3ac1c4f261be3216b1111
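
For reference, a minimal sketch of this kind of measurement (the actual test is in the gist above; QuwiThroughput and the details here are illustrative, not the gist's code):

// Queue 10M work items and time how long it takes for all of them to execute.
using System;
using System.Diagnostics;
using System.Threading;

class QuwiThroughput
{
    static void Main()
    {
        const int Count = 10_000_000;
        int remaining = Count;
        var done = new ManualResetEventSlim(false);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Count; i++)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                // Signal once the last work item has run.
                if (Interlocked.Decrement(ref remaining) == 0)
                    done.Set();
            });
        }
        done.Wait();
        sw.Stop();

        Console.WriteLine($"{Count:N0} QUWI in {sw.Elapsed.TotalSeconds:F1}s");
    }
}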

10M threadpool queues and executes. Changed items are in red, ExecutionContext.Run is highlighted, and the list extends past ExecutionContext.Restore for relative comparison.

Threadpool QUWI before:
[screenshot: threadpool-before]

Threadpool QUWI after 2nd update:
[screenshot: threadpool-after-2]

@benaadams benaadams force-pushed the threapool-falsesharing branch 3 times, most recently from 5057952 to aef37e4 on June 23, 2016 05:19
 internal class WorkStealingQueue
 {
     private const int INITIAL_SIZE = 32;
-    internal volatile IThreadPoolWorkItem[] m_array = new IThreadPoolWorkItem[INITIAL_SIZE];
+    internal volatile PaddedWorkItem[] m_array = new PaddedWorkItem[INITIAL_SIZE];
@stephentoub (Member) commented on the diff:

Why are you 64-byte padding the items in the queue? The queue is owned by a single thread, and all other threads need to take a lock to access it. The owning thread also needs to take a lock when there are fewer than two items in the queue. The only case where you'd have contention that this would help with is if a thread were stealing concurrently with the owning thread pushing/popping on a list with at least two elements; at that point, they're already some distance apart, though not necessarily a full cache line. Have you shown that this change makes a notable improvement? It does so at the expense of effectively increasing the size of every work item by 56 bytes on 64-bit, since every work item reference to be stored now consumes 64 bytes instead of 8 (plus the size of the work item object itself).
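
(For context on the 56-byte figure: the PR's actual PaddedWorkItem definition isn't shown in this thread, but a cache-line-padded slot could hypothetically look like the sketch below. A reference is 8 bytes on 64-bit, so forcing each slot to a full 64-byte cache line adds 56 bytes of padding per stored item.)

using System.Runtime.InteropServices;

// Hypothetical shape of a cache-line-padded slot; not the PR's actual type.
// One reference (8 bytes on 64-bit) plus 56 bytes of padding per slot.
[StructLayout(LayoutKind.Explicit, Size = 64)]
internal struct PaddedWorkItem
{
    [FieldOffset(0)]
    internal IThreadPoolWorkItem Item;
}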

@benaadams benaadams changed the title Prevent ThreadPool false sharing [Wip} Prevent ThreadPool false sharing Jun 23, 2016
@benaadams benaadams changed the title [Wip} Prevent ThreadPool false sharing [Wip] Prevent ThreadPool false sharing Jun 23, 2016
@benaadams (Member, Author) commented:

@stephentoub as you pointed out, I don't think padding the items is helpful.

Still working on it - hot spots are Dequeue and TrySteal

@benaadams (Member, Author) commented:

Looking closer, the main effect may just be from looping over the queues (many threads) when they are mostly empty.
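
(Roughly the pattern being referred to, as a simplified sketch rather than the actual ThreadPoolWorkQueue code; LocalPop and TrySteal here are illustrative stand-ins:)

// With many threads and mostly empty queues, a dequeue that finds its
// local queue empty ends up probing every other thread's queue.
IThreadPoolWorkItem Dequeue(WorkStealingQueue local, WorkStealingQueue[] allQueues)
{
    IThreadPoolWorkItem item = local.LocalPop();
    if (item == null)
    {
        // Local queue empty: probe every other queue, including empty ones.
        foreach (WorkStealingQueue other in allQueues)
        {
            if (other != local && other.TrySteal(out item))
                break;
        }
    }
    return item;
}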

@omariom commented Jun 23, 2016

@benaadams I think you should start by finding the false sharing. This article shows how to use VS for that. Not sure if it's possible on Windows 10, but on Win 7 it was.

Update: another good tool is Intel VTune. Not sure if it works on Windows 10.

@benaadams (Member, Author) commented:

@omariom there is false sharing in stealing; however, the current implementation, even with the false sharing, is pretty hard to improve on. Still iterating, though looking at something quite different from the PR in its current state.

@omariom commented Jun 25, 2016

An interesting optimization would be to find all the places (on hot paths) where volatile reads/writes are unnecessary, replace them with plain reads, and use the Volatile class for the rest.
It may help on ARM, where volatile reads/writes are implemented as fairly expensive full barriers.
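
A minimal sketch of that suggestion, with illustrative names rather than the actual ThreadPool fields:

using System.Threading;

// Keep the field plain instead of volatile (so not every access pays for
// ordering), and make only the accesses that need it explicit via Volatile.
internal class QueueIndexes
{
    private int m_headIndex; // plain field, not volatile; name is illustrative

    // Hot path touched only by the owning thread: a plain read suffices.
    internal int HeadUnsynchronized => m_headIndex;

    // Cross-thread accesses keep their ordering guarantees explicitly.
    internal int ReadHead() => Volatile.Read(ref m_headIndex);
    internal void WriteHead(int value) => Volatile.Write(ref m_headIndex, value);
}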

@benaadams (Member, Author) commented Jun 26, 2016

> An interesting optimization would be to find all the places (on hot paths) where volatile reads/writes are unnecessary, replace them with plain reads, and use the Volatile class for the rest.

There are some areas where this might be possible. Will try it and measure the impact.

Although such a change does make me a little uncomfortable 😟

@benaadams benaadams changed the title [Wip] Prevent ThreadPool false sharing [Wip] Improve Threadpool throughput Jun 27, 2016
@benaadams (Member, Author) commented:

A bit better: getting a 6% improvement in QueueUserWorkItem throughput for 10M work items (4 cores, 1 socket), with a 9% improvement for set COMPlus_ThreadPool_ForceMinWorkerThreads=500.

Still investigating.

@benaadams (Member, Author) commented Jun 27, 2016

10% improvement for regular queuing (13.1s vs 14.7s)
14% improvement for set COMPlus_ThreadPool_ForceMinWorkerThreads=500 (31.2s vs 36.6s)

@benaadams (Member, Author) commented:

10M threadpool queues and executes. Changed items are in red, ExecutionContext.Run is highlighted, and the list extends to System.Random.Sample for relative comparison.

Threadpool QUWI before:
[screenshot: threadpool-before]

Threadpool QUWI after:
[screenshot: threadpool-after]

@benaadams (Member, Author) commented:

Test code https://gist.github.com/benaadams/b022934e62a3ac1c4f261be3216b1111

It also allocates 2112 bytes per 255 queued items in discarded QueueSegments (85MB per 10M items), which gives a process equilibrium at 300MB-400MB of memory use vs a 50MB equilibrium without these allocations.

Caching a QueueSegment as it gets dropped off the tail and then reusing it as a new head avoids these allocations, but it's also not entirely straightforward with the concurrency flows, so I'm not pursuing that at this stage.
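
(A minimal sketch of the reuse idea, assuming a simplified QueueSegment with a hypothetical Reset(); the real concurrency flows are what make this non-trivial:)

using System.Threading;

// Cache one discarded segment and hand it back out with Interlocked so the
// cache itself stays thread-safe.
internal static class SegmentCache
{
    private static QueueSegment s_cached;

    // When a segment drops off the tail, try to keep it around for reuse.
    internal static void Return(QueueSegment segment)
    {
        segment.Reset(); // hypothetical: clear indexes and item slots
        Interlocked.CompareExchange(ref s_cached, segment, null);
    }

    // When a new head segment is needed, reuse the cached one if available.
    internal static QueueSegment Rent()
        => Interlocked.Exchange(ref s_cached, null) ?? new QueueSegment();
}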

I believe the changes in this PR should not alter any of the concurrency behaviour.

@benaadams benaadams changed the title [Wip] Improve Threadpool throughput Improve Threadpool QUWI throughput Jun 28, 2016
@prajwal-aithal commented:

@dotnet-bot test Linux ARM Emulator Cross Debug Build

@benaadams (Member, Author) commented:

@dotnet-bot test Linux ARM Emulator Cross Release Build

@benaadams (Member, Author) commented:

@dotnet-bot test this please

@benaadams (Member, Author) commented Jun 28, 2016

Added QueueSegment reuse as a commit; needs the tests rerunning.

@danmoseley (Member) commented:

@benaadams what remains here to call this PR good to go?

@benaadams (Member, Author) commented:

@danmosemsft it needs to be freshened and rebased. I'll open another PR with new results, as there is a lot of noise in this one now.

@benaadams benaadams closed this Oct 14, 2016
@benaadams benaadams deleted the threapool-falsesharing branch March 27, 2018 05:11
@benaadams benaadams restored the threapool-falsesharing branch March 27, 2018 05:11
@benaadams benaadams deleted the threapool-falsesharing branch January 11, 2019 21:37