This repository has been archived by the owner on Nov 1, 2020. It is now read-only.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Improve thread pool performance on Unix #6955
Improve thread pool performance on Unix #6955
Changes from 9 commits
3a27429
5e472b7
938f134
0dc3770
41f83b9
265f41a
6b34732
9d2fad4
579bd96
750db85
e7afe31
ddfb816
81dc860
e60f673
2aca3bd
05e2183
9a3678e
9435818
4678e25
cd0979d
2a109f3
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that you may use this implementation for Mono. It shouldn't be necessary to change the implementation in CoreRT for that purpose, what it was doing before is more appropriate for CoreRT. It would probably be beneficial to separate
LowLevelLifoSemaphore
, which offers just the base LIFO semaphore functionality (no spinning, etc.), fromClrLifoSemaphore
, which does all of the rest. MaybeClrLifoSemaphore
could use a better name but I would leave that for later, as there are more abstractions that could be done.ClrLifoSemaphore
would useLowLevelLifoSemaphore
. With that, the implementation ofClrLifoSemaphore
can be shared between runtimes and the only runtime-specific dependency would beLowLevelLifoSemaphore
, for which runtimes can provide an implementation that is suitable. For CoreRT, that would be using the wait subsystem, for Mono it may be this implementation, for CoreCLR it would be the PAL, and so on.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't split up
LowLevelLifoSemaphore
andClrLifoSemaphore
because technically there's only one user of it - the thread pool. There's probably never going to be another one because there's SemaphoreSlim, so as long as you specifically don't depend on uninterruptibility and LIFO order (at the same time) there's no reason to use this one. LowLevelLock already contained spinning code, so it seemed acceptable that spinning is part of the low-level code. TheLowLevel
prefix also makes it immediately obvious that it is uninterruptible (as long as one knows the convention or read the long comment in WaitSubsystem).The runtime sharing is a bit of a PITA part that I had to put some thought into. I am trying hard to avoid a Mono-specific implementation (as long as I can avoid hurting CoreRT at the same time).
While the wait subsystem seems like an obvious choice on CoreRT it doesn't really save any code. In fact it makes the subsystem support LIFO order and uninterruptible Wait which is way more code than the primitive implementation here. Also the current implementation of WaitSubsystem has contention on a single lock which hurts the performance a bit (implementation detail, but a real problem).
My current plan is to break this dependency on WaitSubsystem on this one place and move LowLevelLock/LowLevelMonitor/LowLevelSemaphore into shared part of CoreLib as a feature (eg. FeaturePortableLowLevelSync). That would be shared by CoreRT and Mono. CoreCLR could use it (for testing at least), or simply skip the feature and implement the primitives on its own through existing PAL code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging
LowLevelLifoSemaphore
andClrLifoSemaphore
is fine, the wait portions can still be specialized for runtimes, and I think they should be. I really don't want to use this implementation of a LIFO semaphore in CoreRT or CoreCLR, it's unnecessarily more complicated. The tracking being done here and additional overhead is unnecessary for those runtimes, which already do similar tracking to support all of the other types of waits and the tracking is shared so there is no additional overhead with uninterruptibility. Adding uninterruptibility in both runtimes was actually very easy. Conversely it was actually a lot more work to add interruptibility to CoreRT's wait subsystem, which started off uninterruptible during initial implementation. If it's not easy to modify Mono's semaphore to support uninterruptible waits, I would still prefer to separate out this implementation specifically for Mono.At the moment the
LowLevel
part doesn't just mean uninterruptible, it also means it's a simple implementation that directly uses system components.LowLevelLock
does some spinning, in hindsight it probably would have made sense to separate it out. The idea was that theLowLevel
portions would be system dependencies that may be specific to runtimes (which may also be shared if possible) and anything on top of them could be shared. Even if there is only one user currently, it can help to componentize the functionality.Not really, most of the code is shared with interruptible waits so there is no overhead that is specific to uninterruptible waits.
I don't see how this implementation would be better. Typically these locks are uncontended. It's rare for two threads to both be releasing threads because the thread pool uses an interlocked operation to figure out how many threads need to be released and one of those two threads would release all of them. Perf is not a big concern but the wait subsystem is actually pretty fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Granted sometimes it can be advantageous to not componentize in order to add some features
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do have different goals for this, which makes it hard agenda to push. I think it's not unnecessarily more complicated because the amount of code removed from the wait subsystem is more than the amount of code added here. I will do some more benchmarks of different variations, but I actually think that the contention on
s_lock
inWaitSubsystem
hurts the performance significantly for heavily threaded code and this is an easy way out. The suggestion with CoreCLR was not about using it there for production code, but only as a means of bring up of the code in another runtime and for apples-to-apples performance comparison.The documentation states
LowLevelSemaphore never used the system component in Unix, it used the wait subsystem, violating any common sense about the component layering. In that sense calling it "low level" is wrong. The only other shared property is the uninterruptibility. I think it should be implemented using system components, but if I fail to push that direction then it should definitely be renamed to avoid confusion.
No overhead in terms of performance for other code paths through the WaitSubsystem. Amount of code is different story though. Performance-wise it increases contention on the single lock. It's orthogonal issue to these code changes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re the naming of
LowLevelSemaphore
: I chose that name moreso to match the naming ofLowLevelLinq
(subset of API surface reimplemented for runtime use) thanLowLevelLock
orLowLevelMonitor
(thin managed wrappers around native system components).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point is to share as much code as possible between the runtimes.
Code that is runtime specific doesn't get as much testing as the one that is shared. Likewise the performance improvements have historically been done mostly on the CoreCLR side and didn't translate to CoreRT which was a shame. Now that Mono is sharing the CoreLib code there was a push that helped to fix the gap by moving more of the code to the shared part of the CoreLib.
Ideally I would like to see the wait subsystem shared between all three runtimes, but we are still quite far away from there and it may well never happen.
Decoupling individual components - thread pool, timers, low level synchronization primitives - was an attempt to get there faster, piece by piece. Unfortunately I underestimated the complexity of the LIFO semaphore implementation and its edge cases. I assumed I could do it in less than 100 lines and simplify the WaitSubsystem at the same time to the point that it would be a win-win situation overall (no regression, cleaner component separation, the benefits of code being shared and tested by 2 runtimes and allowing testing the whole managed pool on CoreCLR as well). Eventually I ended up with an implementation that is small enough and covers all the edge cases that were brought up, but it requires unmanaged code and it's not as mature as whatever WaitSubsystem does now.
Yeah, it was flawed. I have completely rewritten it in the end, but it's out of scope of this PR. I solved couple of the issues by moving the LIFO semaphore to unmanaged code, reducing the number of PThread objects (1 mutex + 1 condition/waiting thread) and storing the LIFO entries on stack. (Unfinished code here: https://gist.github.com/filipnavara/4d58444435cb106bbc4e3ffa939e99be, missing comments and propert pthread_cond_t initalization on non-macOS systems).
I have seen it in the heavy threading+locking benchmarks before, but it could have been flaw in the benchmark methodology. I went ahead and re-run a different combinations of the code with my benchmarks. It turns out that the additional logic with interlocked
Counts
loops avoids getting to the slow path in those heavy threading scenarios.I understand it is difficult. I am not happy about the name, but I don't really have a better name right now. The "low level" part makes me feel it should not call into something "higher level" like WaitSubsystem, implementation detail or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for helpful historical note :-)
LowLevelLinq
was not really used in that part of code anymore, aside from ausing
in one of the files that seems like a leftover.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good time to share the implementation would be if a system LIFO wait support were to be found. This implementation of the component can be implemented in managed or native code, would be fairly small, and perf doesn't really matter, that I don't see a huge benefit in sharing it at the moment, especially since there are better alternatives for some runtimes. Sharing the portion above this part (the
ClrLifoSemaphore
part of tracking spinners/waiters, etc.) is a lot more useful.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not going to happen, at least not in a portable way. Every other library I found (notably Facebook Folly, which is referenced from couple of academic papers) uses an implementation similar to the one I came up with.