
Improve thread pool performance on Unix #6955

Merged · 21 commits into dotnet:master · Feb 15, 2019

Conversation

filipnavara (Member) commented Feb 5, 2019

There are a couple of objectives to these changes:

  • Improve the performance of the thread pool on Unix for certain workloads [while not regressing it for others]
  • Make the LowLevel* synchronization primitives self-contained and remove the dependency on WaitSubsystem, to allow sharing them with Mono
  • Bring the code closer to CoreCLR in case someone decides to switch to the managed thread pool one day

List of changes:

  • FirstLevelSpinWaiter is renamed to LowLevelSpinWaiter for consistency with the usage of the LowLevel prefix in the wait subsystem
  • The constants and the Wait method of LowLevelSpinWaiter are exposed for reuse from LowLevelLifoSemaphore
  • LowLevelLifoSemaphore is updated to more closely mimic CoreCLR's CLRLifoSemaphore spinning logic
  • LowLevelLifoSemaphore.Unix.cs is reimplemented on top of LowLevel[Monitor/Lock] to remove the dependency on the complex WaitSubsystem and its contention on a single lock (a rough sketch of the idea follows this list)
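
To make the last point concrete, here is a minimal, hypothetical sketch of the general idea (not the PR code): each waiter gets its own monitor and is pushed onto a LIFO list, so a release wakes the most recently arrived waiter instead of funneling every wait through the wait subsystem's shared lock. The real implementation uses the runtime-internal LowLevelLock/LowLevelMonitor types plus an interlocked fast path; the standard Monitor below is only a stand-in.

```csharp
// Hypothetical sketch: a LIFO-release semaphore where each waiter blocks on its
// own monitor, so releases wake the most recently arrived waiter and waiting
// threads do not all sleep behind one shared subsystem lock.
using System.Collections.Generic;
using System.Threading;

internal sealed class LifoSemaphoreSketch
{
    private sealed class Waiter { public bool Signaled; }

    private readonly object _lock = new object();
    private readonly Stack<Waiter> _waiters = new Stack<Waiter>(); // LIFO order
    private int _signalCount;

    public void Wait()
    {
        Waiter w;
        lock (_lock)
        {
            if (_signalCount > 0) { _signalCount--; return; } // consume a pending signal
            w = new Waiter();
            _waiters.Push(w); // newest waiter is woken first
        }
        lock (w)
        {
            while (!w.Signaled)       // guard against early or spurious wakeups
                Monitor.Wait(w);
        }
    }

    public void Release(int count)
    {
        while (count-- > 0)
        {
            Waiter w = null;
            lock (_lock)
            {
                if (_waiters.Count > 0) w = _waiters.Pop();
                else _signalCount++;  // no waiter yet; remember the signal
            }
            if (w != null)
            {
                lock (w) { w.Signaled = true; Monitor.Pulse(w); }
            }
        }
    }
}
```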

I used an improvised benchmark based on code from dotnet/coreclr#13670 (comment). It is likely flawed, but it should be useful for a basic comparison.
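
The benchmark code itself is not included in this thread (it lives in the linked coreclr issue), but the burst case measures roughly this shape: execute a fixed number of operations while keeping at most maxWorkItemCount work items queued at a time. The sketch below is an illustrative stand-in under those assumptions, not the actual benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

internal static class BurstThroughputSketch
{
    // Illustrative only: times how long the thread pool takes to execute
    // `operations` work items when at most `maxWorkItemCount` are queued at once.
    public static TimeSpan Run(int operations, int maxWorkItemCount)
    {
        int remaining = operations;
        var done = new ManualResetEventSlim(false);
        var sw = Stopwatch.StartNew();

        void WorkItem(object state)
        {
            // Each completed item re-queues itself until the budget is exhausted.
            if (Interlocked.Decrement(ref remaining) > 0)
                ThreadPool.QueueUserWorkItem(WorkItem);
            else
                done.Set();
        }

        // Seed the pool with the allowed number of concurrently queued items.
        for (int i = 0; i < maxWorkItemCount; i++)
            ThreadPool.QueueUserWorkItem(WorkItem);

        done.Wait();
        sw.Stop();
        return sw.Elapsed;
    }
}
```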

Benchmark results are included below. On macOS there is a consistent improvement over the baseline with no regression under sustained load. I will run the benchmarks on Windows too and report the results. At the moment I don't have a Linux setup to test on.

This is part of a bigger patch series that is intended to make the portable thread pool implementation shareable across .NET runtimes. Initially I am targeting CoreRT/Unix and Mono, but I would like to allow building CoreCLR with it for comparison purposes as well.

/cc @kouvel @jkoritzinsky @jkotas @benaadams @marek-safar

filipnavara (Member Author):

```
BenchmarkDotNet=v0.11.3, OS=macOS Mojave 10.14.2 (18C48a) [Darwin 18.2.0]
Intel Core i5-5675R CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
  [Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  CoreRT : .NET CoreRT 1.0.27503.0, 64bit AOT

Job=CoreRT  Runtime=CoreRT
```

| Method | Toolchain | Operations | maxWorkItemCount | Mean | Error | StdDev |
|---|---|---|---|---|---|---|
| ThreadPoolBurstWorkThroughput | Base | 100000 | 1 | 1,960.71 ms | 38.3157 ms | 71.966 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 1 | 1,085.88 ms | 34.9250 ms | 101.878 ms |
| ThreadPoolBurstWorkThroughput | Base | 100000 | 10 | 250.03 ms | 6.8982 ms | 20.339 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 10 | 145.00 ms | 5.4050 ms | 15.767 ms |
| ThreadPoolBurstWorkThroughput | Base | 100000 | 100 | 49.06 ms | 0.9795 ms | 2.580 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 100 | 36.54 ms | 2.0532 ms | 5.689 ms |
| ThreadPoolSustainedWorkThroughput | Base | 100000 | ? | 21.40 ms | 0.4233 ms | 1.144 ms |
| ThreadPoolSustainedWorkThroughput | PR | 100000 | ? | 21.12 ms | 0.6427 ms | 1.895 ms |

benaadams (Member):

/cc @VSadov

filipnavara (Member Author) commented Feb 5, 2019

The difference on Windows (FeaturePortableThreadPool enabled) seems small, but still favors the change. Benchmarking on a laptop, however, gives the thermal management a hard time, so the numbers are not very precise.

```
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.253 (1809/October2018Update/Redstone5)
Intel Core i5-7267U CPU 3.10GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.0.100-preview-009812
  [Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  CoreRT : .NET CoreRT 1.0.27503.0, 64bit AOT

Job=CoreRT  Runtime=CoreRT
```

| Method | Toolchain | Operations | maxWorkItemCount | Mean | Error | StdDev |
|---|---|---|---|---|---|---|
| ThreadPoolBurstWorkThroughput | Base | 100000 | 1 | 1,182.95 ms | 22.5835 ms | 22.1800 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 1 | 712.75 ms | 13.8821 ms | 13.6341 ms |
| ThreadPoolBurstWorkThroughput | Base | 100000 | 10 | 63.06 ms | 1.3949 ms | 1.7131 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 10 | 57.47 ms | 0.2482 ms | 0.2322 ms |
| ThreadPoolBurstWorkThroughput | Base | 100000 | 100 | 23.55 ms | 0.4624 ms | 0.6631 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 100 | 20.48 ms | 0.0812 ms | 0.0719 ms |
| ThreadPoolSustainedWorkThroughput | Base | 100000 | ? | 14.69 ms | 0.0295 ms | 0.0246 ms |
| ThreadPoolSustainedWorkThroughput | PR | 100000 | ? | 14.51 ms | 0.1514 ms | 0.1342 ms |

filipnavara force-pushed the lowlevelsync branch 3 times, most recently from feb309e to 4026aa7, on February 6, 2019, 15:45
filipnavara (Member Author) commented Feb 6, 2019

I tried running @benaadams' benchmark from https://github.com/benaadams/ThreadPoolTaskTesting. The results are below, but don't read too much into them, since on my test machine the numbers vary quite wildly from run to run (some differ by 20% between two consecutive runs). I am not sure whether there is anything I can do about that, or whether anything can be read from the numbers at all. I also consistently get Bus error: 10 from the last test.

I didn't try to force a high priority for the process, which would likely make the numbers more consistent. Unfortunately, both my MacBook and iMac eventually hit thermal issues under these tests (generally, any long-running thread pool tests) and either spin up very loud cooling or start CPU throttling.

Pre:

```
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second on 4 Cores
                                                                             Parallelism
                                  Serial          2x         16x         64x        512x
QUWI No Queues (TP)              5.092 M     4.871 M     5.168 M     5.385 M     5.327 M
- Depth    2                     4.712 M     4.439 M     4.673 M     4.714 M     4.693 M
- Depth   16                     4.922 M     4.709 M     4.930 M     4.934 M     4.903 M
- Depth   64                     4.963 M     5.004 M     5.026 M     5.017 M     4.978 M
- Depth  512                     5.027 M     5.044 M     4.963 M     4.983 M     4.959 M


QUWI Queue Local (TP)            2.586 M     4.350 M     5.972 M     6.049 M     5.982 M
- Depth    2                     3.742 M     4.902 M     5.950 M     6.010 M     5.923 M
- Depth   16                     5.772 M     6.008 M     6.183 M     6.191 M     6.088 M
- Depth   64                     5.491 M     5.478 M     5.943 M     5.669 M     5.803 M
- Depth  512                     5.876 M     5.965 M     6.069 M     6.170 M     6.087 M


SubTask Chain Return (TP)      207.876 k   771.453 k     1.242 M     1.158 M     1.145 M
- Depth    2                   242.801 k   643.887 k     1.208 M     1.242 M     1.275 M
- Depth   16                   324.677 k   738.152 k     1.389 M     1.378 M     1.463 M
- Depth   64                   324.371 k   759.935 k     1.416 M     1.419 M     1.477 M
- Depth  512                   348.896 k   886.726 k     1.810 M     1.832 M     1.796 M


SubTask Chain Awaited (TP)     165.487 k   397.849 k   696.324 k   689.829 k   736.044 k
- Depth    2                   172.645 k   410.792 k   630.540 k   719.917 k   779.266 k
- Depth   16                   199.105 k   473.097 k   771.422 k   811.847 k   848.522 k
- Depth   64                   210.068 k   518.456 k   928.639 k   933.719 k   961.701 k
- Depth  512                   244.118 k   671.726 k     1.175 M     1.198 M     1.183 M


SubTask Fanout Awaited (TP)     95.042 k   210.537 k   357.743 k   469.880 k   541.074 k
- Depth    2                   156.743 k   328.988 k   454.231 k   578.414 k   643.199 k
- Depth   16                   361.232 k   557.166 k   741.434 k   791.009 k   775.568 k
- Depth   64                   514.194 k   697.478 k   811.291 k   817.903 k   813.941 k
- Depth  512                   667.601 k   785.942 k   824.527 k   844.484 k   821.210 k


Continuation Chain (TP)         99.853 k   228.534 k   435.238 k   473.983 k   551.314 k
- Depth    2                   164.696 k   377.018 k   718.431 k   850.049 k   931.871 k
- Depth   16                   473.788 k     1.101 M     2.212 M     2.302 M     2.320 M
- Depth   64                   644.677 k     1.432 M     2.719 M     2.838 M     2.823 M
- Depth  512                   752.146 k     1.657 M     2.863 M     3.023 M     3.019 M


Continuation Fanout (TP)        76.665 k   183.625 k   318.045 k   376.486 k   423.932 k
- Depth    2                   128.625 k   344.905 k   612.668 k   609.207 k   584.534 k
- Depth   16                   657.940 k     1.204 M     1.263 M     1.284 M     1.286 M
- Depth   64                     1.136 M     1.788 M     1.474 M     1.470 M     1.463 M
- Depth  512                     1.235 M     2.074 M     1.542 M     1.550 M     1.558 M


Yield Chain Awaited (TP)       282.051 k   525.728 k   949.500 k   932.280 k     1.143 M
- Depth    2                   325.479 k   639.795 k     1.039 M     1.144 M     1.417 M
- Depth   16                   480.714 k   846.826 k     1.441 M     1.683 M     1.719 M
- Depth   64                   499.317 k   808.951 k     1.643 M     1.752 M     1.840 M
- Depth  512                Bus error: 10
```

Post:

```
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second on 4 Cores
                                                                             Parallelism
                                  Serial          2x         16x         64x        512x
QUWI No Queues (TP)              5.122 M     4.944 M     5.312 M     5.504 M     5.519 M
- Depth    2                     4.604 M     4.635 M     4.686 M     4.768 M     4.732 M
- Depth   16                     5.159 M     5.142 M     5.147 M     5.126 M     4.490 M
- Depth   64                     4.825 M     4.440 M     4.818 M     4.876 M     4.873 M
- Depth  512                     4.958 M     4.877 M     4.351 M     4.944 M     4.980 M


QUWI Queue Local (TP)            2.498 M     3.904 M     5.198 M     5.675 M     6.086 M
- Depth    2                     4.019 M     4.948 M     5.567 M     5.624 M     5.948 M
- Depth   16                     5.607 M     5.972 M     6.123 M     6.198 M     6.085 M
- Depth   64                     5.670 M     5.684 M     5.911 M     6.053 M     6.008 M
- Depth  512                     5.903 M     5.723 M     6.107 M     6.137 M     5.944 M


SubTask Chain Return (TP)      288.674 k   732.182 k     1.048 M     1.092 M     1.130 M
- Depth    2                   319.092 k   813.138 k     1.206 M     1.213 M     1.256 M
- Depth   16                   420.980 k   850.233 k     1.342 M     1.345 M     1.452 M
- Depth   64                   426.900 k   852.220 k     1.320 M     1.340 M     1.446 M
- Depth  512                   458.738 k   983.229 k     1.814 M     1.829 M     1.741 M


SubTask Chain Awaited (TP)     195.067 k   533.452 k   681.902 k   658.270 k   704.345 k
- Depth    2                   230.940 k   499.588 k   697.969 k   713.768 k   755.737 k
- Depth   16                   303.063 k   573.892 k   759.381 k   771.129 k   752.144 k
- Depth   64                   323.481 k   622.457 k   934.510 k   928.668 k   944.062 k
- Depth  512                   372.373 k   784.003 k     1.145 M     1.168 M     1.178 M


SubTask Fanout Awaited (TP)    102.172 k   279.550 k   307.417 k   455.431 k   534.595 k
- Depth    2                   189.322 k   412.879 k   447.633 k   585.927 k   630.344 k
- Depth   16                   422.503 k   549.162 k   733.071 k   778.770 k   745.301 k
- Depth   64                   580.335 k   664.699 k   774.576 k   803.601 k   806.931 k
- Depth  512                   730.656 k   780.393 k   818.605 k   830.610 k   813.356 k


Continuation Chain (TP)        112.152 k   264.359 k   429.889 k   461.448 k   554.082 k
- Depth    2                   188.273 k   446.409 k   726.150 k   830.212 k   898.507 k
- Depth   16                   473.160 k     1.101 M     2.039 M     2.313 M     2.375 M
- Depth   64                   581.500 k     1.330 M     2.787 M     2.837 M     2.796 M
- Depth  512                   643.049 k     1.561 M     3.041 M     3.016 M     3.037 M


Continuation Fanout (TP)        86.663 k   224.153 k   318.949 k   384.670 k   409.053 k
- Depth    2                   166.196 k   458.905 k   609.650 k   614.899 k   614.684 k
- Depth   16                   895.730 k     1.420 M     1.289 M     1.295 M     1.286 M
- Depth   64                     1.083 M     1.887 M     1.474 M     1.483 M     1.476 M
- Depth  512                     1.111 M     2.048 M     1.533 M     1.483 M     1.485 M


Yield Chain Awaited (TP)       250.375 k   573.909 k   943.017 k   947.237 k     1.177 M
- Depth    2                   334.602 k   702.286 k     1.051 M     1.139 M     1.413 M
- Depth   16                   457.119 k   870.957 k     1.310 M     1.623 M     1.721 M
- Depth   64                   451.213 k   812.185 k     1.628 M     1.733 M     1.840 M
- Depth  512                Bus error: 10
```

kouvel (Member) left a comment

Still reviewing, a few comments so far

```csharp
class WaiterListEntry
{
    public LowLevelMonitor _monitor;
    public WaiterListEntry _next;
```
Member:

I thought that you may use this implementation for Mono. It shouldn't be necessary to change the implementation in CoreRT for that purpose; what it was doing before is more appropriate for CoreRT. It would probably be beneficial to separate LowLevelLifoSemaphore, which offers just the base LIFO semaphore functionality (no spinning, etc.), from ClrLifoSemaphore, which does all of the rest. Maybe ClrLifoSemaphore could use a better name, but I would leave that for later, as there are more abstractions that could be done. ClrLifoSemaphore would use LowLevelLifoSemaphore. With that, the implementation of ClrLifoSemaphore can be shared between runtimes and the only runtime-specific dependency would be LowLevelLifoSemaphore, for which runtimes can provide an implementation that is suitable. For CoreRT, that would be using the wait subsystem, for Mono it may be this implementation, for CoreCLR it would be the PAL, and so on.
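
As a rough illustration of the split being proposed here (the names and shapes are illustrative, not the actual CoreRT/CoreCLR types): a thin, runtime-specific semaphore underneath, and a shared layer on top that adds the spinning and spinner/waiter bookkeeping.

```csharp
using System.Threading;

// Runtime-specific base: wait subsystem on CoreRT, PAL on CoreCLR, pthreads on
// Mono, and so on. Only this piece would differ per runtime.
internal abstract class LowLevelLifoSemaphoreBase
{
    public abstract bool WaitCore(int timeoutMs);
    public abstract void ReleaseCore(int count);
}

// Shared layer: the spinning (and, in the real code, spinner/waiter counting)
// lives here and can be identical across runtimes.
internal sealed class ClrLifoSemaphoreSketch
{
    private readonly LowLevelLifoSemaphoreBase _semaphore;
    private readonly int _spinCount;

    public ClrLifoSemaphoreSketch(LowLevelLifoSemaphoreBase semaphore, int spinCount)
    {
        _semaphore = semaphore;
        _spinCount = spinCount;
    }

    public bool Wait(int timeoutMs)
    {
        // Spin for a while in the hope that a release arrives soon; the real
        // code also tries to grab a signal via interlocked counts here.
        for (int i = 0; i < _spinCount; i++)
        {
            Thread.SpinWait(8);
        }
        // Fall back to the runtime-specific blocking wait.
        return _semaphore.WaitCore(timeoutMs);
    }

    public void Release(int count) => _semaphore.ReleaseCore(count);
}
```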

filipnavara (Member Author), Feb 7, 2019:

I didn't split up LowLevelLifoSemaphore and ClrLifoSemaphore because technically there's only one user of it - the thread pool. There's probably never going to be another one, because SemaphoreSlim exists; unless you specifically depend on uninterruptibility and LIFO order at the same time, there's no reason to use this one. LowLevelLock already contained spinning code, so it seemed acceptable for spinning to be part of the low-level code. The LowLevel prefix also makes it immediately obvious that it is uninterruptible (as long as one knows the convention or has read the long comment in WaitSubsystem).

The runtime sharing is a bit of a pain and I had to put some thought into it. I am trying hard to avoid a Mono-specific implementation (as long as I can avoid hurting CoreRT at the same time).

While the wait subsystem seems like an obvious choice on CoreRT, it doesn't really save any code. In fact it requires the subsystem to support LIFO order and an uninterruptible Wait, which is way more code than the primitive implementation here. Also, the current implementation of WaitSubsystem has contention on a single lock, which hurts performance a bit (an implementation detail, but a real problem).

My current plan is to break the dependency on WaitSubsystem in this one place and move LowLevelLock/LowLevelMonitor/LowLevelSemaphore into the shared part of CoreLib as a feature (e.g. FeaturePortableLowLevelSync). That would be shared by CoreRT and Mono. CoreCLR could use it (for testing at least), or simply skip the feature and implement the primitives on its own through the existing PAL code.

kouvel (Member), Feb 7, 2019:

Merging LowLevelLifoSemaphore and ClrLifoSemaphore is fine; the wait portions can still be specialized for runtimes, and I think they should be. I really don't want to use this implementation of a LIFO semaphore in CoreRT or CoreCLR; it's unnecessarily more complicated. The tracking being done here and the additional overhead are unnecessary for those runtimes, which already do similar tracking to support all of the other types of waits, and since that tracking is shared there is no additional overhead for uninterruptibility. Adding uninterruptibility in both runtimes was actually very easy. Conversely, it was actually a lot more work to add interruptibility to CoreRT's wait subsystem, which started off uninterruptible during the initial implementation. If it's not easy to modify Mono's semaphore to support uninterruptible waits, I would still prefer to separate out this implementation specifically for Mono.

> The LowLevel prefix also makes it immediately obvious that it is uninterruptible (as long as one knows the convention or has read the long comment in WaitSubsystem).

At the moment the LowLevel part doesn't just mean uninterruptible; it also means it's a simple implementation that directly uses system components. LowLevelLock does some spinning; in hindsight it probably would have made sense to separate that out. The idea was that the LowLevel portions would be system dependencies that may be specific to runtimes (which may also be shared where possible) and anything on top of them could be shared. Even if there is only one user currently, it can help to componentize the functionality.

> While the wait subsystem seems like an obvious choice on CoreRT, it doesn't really save any code. In fact it requires the subsystem to support LIFO order and an uninterruptible Wait, which is way more code than the primitive implementation here.

Not really; most of the code is shared with interruptible waits, so there is no overhead that is specific to uninterruptible waits.

> Also, the current implementation of WaitSubsystem has contention on a single lock, which hurts performance a bit (an implementation detail, but a real problem).

I don't see how this implementation would be better. Typically these locks are uncontended. It's rare for two threads to both be releasing threads, because the thread pool uses an interlocked operation to figure out how many threads need to be released and one of those two threads would release all of them. Perf is not a big concern here, but the wait subsystem is actually pretty fast.
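
To illustrate the point about a single thread releasing all waiters, here is a hypothetical simplification (not the actual CLRLifoSemaphore code, which packs these counters into a single Counts value):

```csharp
using System.Threading;

internal sealed class ReleaseCountingSketch
{
    private int _pendingSignals;                                  // signals not yet handed to the semaphore
    private readonly SemaphoreSlim _sem = new SemaphoreSlim(0);   // stand-in for the low-level semaphore

    public void SignalWaiters(int count)
    {
        // Publish the new signals first...
        int total = Interlocked.Add(ref _pendingSignals, count);

        // ...and only the thread that found the pending count empty owns the
        // release; a concurrent releaser's signals are picked up here too, so
        // the underlying release path is rarely hit by two threads at once.
        if (total == count)
        {
            int toRelease = Interlocked.Exchange(ref _pendingSignals, 0);
            if (toRelease > 0)
                _sem.Release(toRelease);
        }
    }
}
```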

Member:

> Even if there is only one user currently, it can help to componentize the functionality.

Granted, sometimes it can be advantageous not to componentize in order to add some features.

filipnavara (Member Author):

> I really don't want to use this implementation of a LIFO semaphore in CoreRT or CoreCLR, it's unnecessarily more complicated.

I do have different goals for this, which makes it a hard agenda to push. I don't think it's unnecessarily more complicated, because the amount of code removed from the wait subsystem is more than the amount of code added here. I will do some more benchmarks of different variations, but I actually think that the contention on s_lock in WaitSubsystem hurts performance significantly for heavily threaded code, and this is an easy way out. The suggestion regarding CoreCLR was not about using it there in production code, but only as a means of bringing the code up in another runtime and for an apples-to-apples performance comparison.

> At the moment the LowLevel part doesn't just mean uninterruptible, it also means it's a simple implementation that directly uses system components.

The documentation states:

```csharp
/// <see cref="LowLevelLock"/> and <see cref="LowLevelMonitor"/>
///   - These are "low level" in the sense they don't depend on this wait subsystem, and any waits done are not
///     interruptible
```

LowLevelSemaphore never used the system component on Unix; it used the wait subsystem, violating any common sense about the component layering. In that sense calling it "low level" is wrong. The only other shared property is the uninterruptibility. I think it should be implemented using system components, but if I fail to push in that direction then it should definitely be renamed to avoid confusion.

> Not really, most of the code is shared with interruptible waits so there is no overhead that is specific to uninterruptible waits.

There is no performance overhead for other code paths through the WaitSubsystem, but the amount of code is a different story. Performance-wise it increases contention on the single lock. That's an orthogonal issue to these code changes, though.

Member:

re the naming of LowLevelSemaphore: I chose that name moreso to match the naming of LowLevelLinq (subset of API surface reimplemented for runtime use) than LowLevelLock or LowLevelMonitor (thin managed wrappers around native system components).

filipnavara (Member Author), Feb 11, 2019:

> It is also not necessary to port the wait subsystem, that is a runtime-specific piece currently. Perhaps it could be shared in the future but it doesn't have to be at the moment. Is there some other agenda you're trying to push for?

The whole point is to share as much code as possible between the runtimes.

Code that is runtime-specific doesn't get as much testing as code that is shared. Likewise, performance improvements have historically been done mostly on the CoreCLR side and didn't translate to CoreRT, which was a shame. Now that Mono is sharing the CoreLib code, there has been a push that helped to close the gap by moving more of the code to the shared part of CoreLib.

Ideally I would like to see the wait subsystem shared between all three runtimes, but we are still quite far away from there and it may well never happen.

Decoupling individual components - the thread pool, timers, low-level synchronization primitives - was an attempt to get there faster, piece by piece. Unfortunately I underestimated the complexity of the LIFO semaphore implementation and its edge cases. I assumed I could do it in less than 100 lines and simplify the WaitSubsystem at the same time, to the point that it would be a win-win overall (no regression, cleaner component separation, the benefits of code being shared and tested by two runtimes, and the ability to test the whole managed pool on CoreCLR as well). Eventually I ended up with an implementation that is small enough and covers all the edge cases that were brought up, but it requires unmanaged code and it's not as mature as whatever WaitSubsystem does now.

> This implementation adds another monitor per thread, allocates another object per thread, has issues with OOMs...

Yeah, it was flawed. I have completely rewritten it in the end, but that's out of scope for this PR. I solved a couple of the issues by moving the LIFO semaphore to unmanaged code, reducing the number of pthread objects (1 mutex + 1 condition per waiting thread) and storing the LIFO entries on the stack. (Unfinished code here: https://gist.github.com/filipnavara/4d58444435cb106bbc4e3ffa939e99be; missing comments and proper pthread_cond_t initialization on non-macOS systems.)

> The global lock is held for a very short time, I have not seen significant contention on that lock.

I have seen it in the heavy threading+locking benchmarks before, but it could have been a flaw in the benchmark methodology. I went ahead and re-ran different combinations of the code with my benchmarks. It turns out that the additional logic with the interlocked Counts loops avoids getting to the slow path in those heavy threading scenarios (roughly the shape sketched below).
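
For illustration, the fast path referred to above looks roughly like this; a simplified stand-in that uses a plain counter instead of the packed Counts value from the actual code:

```csharp
using System.Threading;

internal sealed class FastPathSketch
{
    private int _signalCount;

    public bool TryAcquireSignalFast()
    {
        int observed = Volatile.Read(ref _signalCount);
        while (observed > 0)
        {
            // Try to consume one signal without taking any lock.
            int previous = Interlocked.CompareExchange(ref _signalCount, observed - 1, observed);
            if (previous == observed)
                return true;          // got a signal; no blocking needed
            observed = previous;      // lost the race; retry with the fresh value
        }
        return false;                 // nothing available; caller takes the slow (blocking) path
    }
}
```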

> I think the name is appropriate, I'm not too particular about the name as long as it sort of works, it's difficult to get names right.

I understand it is difficult. I am not happy about the name, but I don't really have a better name right now. The "low level" part makes me feel it should not call into something "higher level" like WaitSubsystem, implementation detail or not.

filipnavara (Member Author):

> re the naming of LowLevelSemaphore: I chose that name moreso to match the naming of LowLevelLinq (subset of API surface reimplemented for runtime use) than LowLevelLock or LowLevelMonitor (thin managed wrappers around native system components)

Ah, thanks for the helpful historical note :-) LowLevelLinq is not really used in that part of the code anymore, aside from a using in one of the files that seems like a leftover.

Member:

A good time to share the implementation would be if system LIFO wait support were to be found. This component can be implemented in managed or native code, would be fairly small, and perf doesn't really matter, so I don't see a huge benefit in sharing it at the moment, especially since there are better alternatives for some runtimes. Sharing the portion above this part (the ClrLifoSemaphore part that tracks spinners/waiters, etc.) is a lot more useful.

filipnavara (Member Author):

> A good time to share the implementation would be if system LIFO wait support were to be found.

Not going to happen, at least not in a portable way. Every other library I found (notably Facebook Folly, which is referenced from a couple of academic papers) uses an implementation similar to the one I came up with.

kouvel (Member) left a comment

Still reviewing, hope to continue tomorrow


```csharp
public static Counts CompareExchange(ref Counts location, Counts newCounts, Counts oldCounts)
{
    return new Counts { _asLong = Interlocked.CompareExchange(ref location._asLong, newCounts._asLong, oldCounts._asLong) };
```
Member:

It would be more concise to add a constructor that takes the long value and to use that instead of setting the _asLong field on each instantiation.
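
For illustration, the suggestion amounts to something like the following (a hypothetical sketch; the real Counts struct has more fields and helpers):

```csharp
using System.Threading;

internal struct Counts
{
    private long _asLong;

    // Suggested constructor wrapping the raw 64-bit representation.
    private Counts(long asLong) => _asLong = asLong;

    public static Counts CompareExchange(ref Counts location, Counts newCounts, Counts oldCounts)
    {
        // new Counts(...) instead of new Counts { _asLong = ... } at each call site.
        return new Counts(Interlocked.CompareExchange(ref location._asLong, newCounts._asLong, oldCounts._asLong));
    }
}
```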

filipnavara (Member Author):

Again, this is copied from ClrThreadPool's Counts. I would likely have done it as you suggested, but I copied the existing code for consistency.

Member:

Not a big deal; it would probably be a bit cleaner.


```csharp
public override int GetHashCode()
{
    return (int)(_asLong >> 8);
```
Member:

This should never be used; perhaps do Debug.Fail() and return 0. Alternatively, if you want a somewhat useful value (probably not worth it), it could return a sum of each of the components.
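
A sketch of that suggestion (hypothetical; assumes `using System.Diagnostics;` is in scope):

```csharp
public override int GetHashCode()
{
    // This type is not expected to be used as a hash key; make accidental use loud in debug builds.
    Debug.Fail("Counts.GetHashCode should not be called.");
    return 0;
}
```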

filipnavara (Member Author):

Only copied it for consistency with Counts in ClrThreadPool. It should never be used, so Debug.Fail may be in order.

filipnavara (Member Author) commented Feb 7, 2019

Benchmark after updating the default spin count to match the CoreCLR value [but not the PAL *2 multiplier] and the spinning algorithm:

| Method | Toolchain | WIC | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|---|
| Burst | Base | 1 | 1,428.26 ms | 31.2428 ms | 66.5810 ms | 1,408.86 ms |
| Burst | PR | 1 | 861.98 ms | 17.1336 ms | 25.6448 ms | 854.40 ms |
| Burst | Base | 10 | 240.32 ms | 4.8021 ms | 8.6593 ms | 239.42 ms |
| Burst | PR | 10 | 125.65 ms | 2.4182 ms | 2.4833 ms | 126.34 ms |
| Burst | Base | 100 | 44.68 ms | 0.8792 ms | 1.1432 ms | 44.86 ms |
| Burst | PR | 100 | 35.57 ms | 0.7081 ms | 1.4305 ms | 36.16 ms |
| Sustained | Base | ? | 20.33 ms | 0.2658 ms | 0.2356 ms | 20.43 ms |
| Sustained | PR | ? | 21.69 ms | 0.3061 ms | 0.2863 ms | 21.61 ms |

adamsitnik (Member):

@filipnavara if you are looking for more ThreadPool benchmarks, you should take a look at what we have in the performance repo:

https://github.com/dotnet/performance/blob/74fca49ecd1f0eae51b0172bd121ee7d0fdd2b6d/src/benchmarks/micro/corefx/System.Threading.ThreadPool/Perf.ThreadPool.cs#L11-L38

filipnavara (Member Author):

@adamsitnik Thanks! I looked in the performance repository before, but I missed that one. I will definitely give it a try.

adamsitnik pushed a commit to dotnet/BenchmarkDotNet that referenced this pull request Feb 8, 2019
filipnavara changed the title from "WIP: Improve thread pool performance on Unix" to "Improve thread pool performance on Unix" on Feb 11, 2019
```csharp
}
else
{
    RuntimeThread.Yield();
```
Member:

The processorCount <= 1 case is already covered in the initial spin index. It would be a bit cleaner to change the use of SpinYieldThreshold above to SpinSleep0Threshold (then SpinYieldThreshold would not be used anymore and can be removed), and revert this block to what it was before.

filipnavara (Member Author):

It is not covered, because the Wait method now also spins on every second iteration. A better solution would be to keep the spin counter inside LowLevelSpinWaiter, but that's a refactoring I would like to do separately.

Member:

Perhaps Wait could check s_processorCount and sleep instead of spinning, rather than checking the processor count on the caller side.
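
A hypothetical sketch of that suggestion; the names s_processorCount and SpinSleep0Threshold come from this discussion, while the structure and the threshold value are guesses rather than the actual LowLevelSpinWaiter code:

```csharp
using System;
using System.Threading;

internal static class LowLevelSpinWaiterSketch
{
    private static readonly int s_processorCount = Environment.ProcessorCount;
    private const int SpinSleep0Threshold = 10;   // illustrative value only

    public static void Wait(int spinIndex)
    {
        if (s_processorCount <= 1 || spinIndex >= SpinSleep0Threshold)
        {
            // On a single processor (or after spinning long enough), give up the
            // time slice instead of busy-waiting.
            Thread.Sleep(0);
        }
        else
        {
            Thread.SpinWait(20 * (spinIndex + 1));   // arbitrary growth for the sketch
        }
    }
}
```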

filipnavara (Member Author):

Yes, that's what I would want to do... but the initialization of the static variables and the reading of the processor count is kind of funky there. I wasn't sure whether it's safe to change it, or whether the code could be reached from some early static constructor initialization and result in a dangerous dependency.

Member:

Oh ok, yeah, the wait subsystem and the things it uses are used during lazy class construction. It would work to change the Initialize method into a static method (maybe renamed to EnsureInitialized or something like that) and call it before any spin-wait loops that use this type. Not sure if it's the best solution. I think I had considered attributing this class with EagerStaticClassConstruction, but maybe there was some issue with it; I don't remember the details.

filipnavara (Member Author), Feb 14, 2019:

I tried to reorganize it a little in the last commit (cffc32d). It's not perfect, but it's a bit more readable. I changed it to use PlatformHelper.ProcessorCount, which is supposed to respond to changes in the number of processors and still does reasonable caching that works with the static constructor initialization.

filipnavara (Member Author):

Damn, looks like it broke the universe. I'll have to debug it on my other machine tomorrow... Reverting for now.

filipnavara force-pushed the lowlevelsync branch 6 times, most recently from da1f2cf to 34cc5e5, on February 15, 2019, 00:15
filipnavara reopened this on Feb 15, 2019
kouvel (Member) left a comment

Thanks!

kouvel merged commit 4666d6f into dotnet:master on Feb 15, 2019
kouvel added a commit to kouvel/coreclr that referenced this pull request Feb 15, 2019
…itial signal count

- Port of a fix from dotnet/corert#6955
- The underlying semaphore is only used to wake up waiters, initially there are no waiters and the signal count should be zero. This was already the case on Windows, this fixes the Unix side. The actual initial signal count is tracked in the upper layer counts.
- The initial signal count passed in is zero anyway in the places where it's used, so it makes no difference for now, just some cleanup
stephentoub pushed a commit to dotnet/coreclr that referenced this pull request Mar 15, 2019
…itial signal count (#22632)

- Port of a fix from dotnet/corert#6955
- The underlying semaphore is only used to wake up waiters, initially there are no waiters and the signal count should be zero. This was already the case on Windows, this fixes the Unix side. The actual initial signal count is tracked in the upper layer counts.
- The initial signal count passed in is zero anyway in the places where it's used, so it makes no difference for now, just some cleanup