
Improve thread pool performance on Unix #6955

Merged · 21 commits into dotnet:master · Feb 15, 2019

Conversation

filipnavara (Member) commented Feb 5, 2019

There are a couple of objectives to these changes:

  • Improve the performance of the thread pool on Unix for certain workloads [while not regressing it for others]
  • Make the LowLevel* synchronization primitives self-contained and remove the dependency on WaitSubsystem, to allow sharing them with Mono
  • Bring the code closer to CoreCLR in case someone decides to switch to the managed thread pool one day

List of changes:

  • FirstLevelSpinWaiter is renamed to LowLevelSpinWaiter for consistency with the usage of the LowLevel prefix in the wait subsystem
  • The constants and the Wait method of LowLevelSpinWaiter are exposed for reuse from LowLevelLifoSemaphore
  • LowLevelLifoSemaphore is updated to more closely mimic CoreCLR's CLRLifoSemaphore spinning logic
  • LowLevelLifoSemaphore.Unix.cs is reimplemented on top of LowLevel[Monitor/Lock] to remove the dependency on the complex WaitSubsystem and its contention on a single lock (a rough sketch of the idea follows this list)
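
To make the last point concrete, here is a minimal, hypothetical sketch of the general idea (not the PR code): each waiter gets its own monitor and is pushed onto a LIFO list, so a release wakes the most recently arrived waiter instead of funneling every wait through the wait subsystem's shared lock. The real implementation uses the runtime-internal LowLevelLock/LowLevelMonitor types plus an interlocked fast path; the standard Monitor below is only a stand-in.

```csharp
// Hypothetical sketch: a LIFO-release semaphore where each waiter blocks on its
// own monitor, so releases wake the most recently arrived waiter and waiting
// threads do not all sleep behind one shared subsystem lock.
using System.Collections.Generic;
using System.Threading;

internal sealed class LifoSemaphoreSketch
{
    private sealed class Waiter { public bool Signaled; }

    private readonly object _lock = new object();
    private readonly Stack<Waiter> _waiters = new Stack<Waiter>(); // LIFO order
    private int _signalCount;

    public void Wait()
    {
        Waiter w;
        lock (_lock)
        {
            if (_signalCount > 0) { _signalCount--; return; } // consume a pending signal
            w = new Waiter();
            _waiters.Push(w); // newest waiter is woken first
        }
        lock (w)
        {
            while (!w.Signaled)       // guard against early or spurious wakeups
                Monitor.Wait(w);
        }
    }

    public void Release(int count)
    {
        while (count-- > 0)
        {
            Waiter w = null;
            lock (_lock)
            {
                if (_waiters.Count > 0) w = _waiters.Pop();
                else _signalCount++;  // no waiter yet; remember the signal
            }
            if (w != null)
            {
                lock (w) { w.Signaled = true; Monitor.Pulse(w); }
            }
        }
    }
}
```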

I used an improvised benchmark based on code from dotnet/coreclr#13670 (comment). It is likely flawed, but it should be useful for a basic comparison.
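
The benchmark code itself is not included in this thread (it lives in the linked coreclr issue), but the burst case measures roughly this shape: execute a fixed number of operations while keeping at most maxWorkItemCount work items queued at a time. The sketch below is an illustrative stand-in under those assumptions, not the actual benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

internal static class BurstThroughputSketch
{
    // Illustrative only: times how long the thread pool takes to execute
    // `operations` work items when at most `maxWorkItemCount` are queued at once.
    public static TimeSpan Run(int operations, int maxWorkItemCount)
    {
        int remaining = operations;
        var done = new ManualResetEventSlim(false);
        var sw = Stopwatch.StartNew();

        void WorkItem(object state)
        {
            // Each completed item re-queues itself until the budget is exhausted.
            if (Interlocked.Decrement(ref remaining) > 0)
                ThreadPool.QueueUserWorkItem(WorkItem);
            else
                done.Set();
        }

        // Seed the pool with the allowed number of concurrently queued items.
        for (int i = 0; i < maxWorkItemCount; i++)
            ThreadPool.QueueUserWorkItem(WorkItem);

        done.Wait();
        sw.Stop();
        return sw.Elapsed;
    }
}
```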

Benchmark results are included below. On macOS there is a consistent improvement over the baseline with no regression under sustained load. I will run the benchmarks on Windows too and report the results. At the moment I don't have a Linux setup to test on.

This is part of a bigger patch series that is intended to make the portable thread pool implementation shareable across .NET runtimes. Initially I am targeting CoreRT/Unix and Mono, but I would like to allow building CoreCLR with it for comparison purposes as well.

/cc @kouvel @jkoritzinsky @jkotas @benaadams @marek-safar

filipnavara (Member Author):

```
BenchmarkDotNet=v0.11.3, OS=macOS Mojave 10.14.2 (18C48a) [Darwin 18.2.0]
Intel Core i5-5675R CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
  [Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  CoreRT : .NET CoreRT 1.0.27503.0, 64bit AOT

Job=CoreRT  Runtime=CoreRT
```

| Method | Toolchain | Operations | maxWorkItemCount | Mean | Error | StdDev |
|---|---|---|---|---|---|---|
| ThreadPoolBurstWorkThroughput | Base | 100000 | 1 | 1,960.71 ms | 38.3157 ms | 71.966 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 1 | 1,085.88 ms | 34.9250 ms | 101.878 ms |
| ThreadPoolBurstWorkThroughput | Base | 100000 | 10 | 250.03 ms | 6.8982 ms | 20.339 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 10 | 145.00 ms | 5.4050 ms | 15.767 ms |
| ThreadPoolBurstWorkThroughput | Base | 100000 | 100 | 49.06 ms | 0.9795 ms | 2.580 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 100 | 36.54 ms | 2.0532 ms | 5.689 ms |
| ThreadPoolSustainedWorkThroughput | Base | 100000 | ? | 21.40 ms | 0.4233 ms | 1.144 ms |
| ThreadPoolSustainedWorkThroughput | PR | 100000 | ? | 21.12 ms | 0.6427 ms | 1.895 ms |

benaadams (Member):

/cc @VSadov

filipnavara (Member Author) commented Feb 5, 2019

The difference on Windows (FeaturePortableThreadPool enabled) seems small, but still favors the change. Benchmarking on a laptop, however, gives the thermal management a hard time, so the numbers are not very precise.

```
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.253 (1809/October2018Update/Redstone5)
Intel Core i5-7267U CPU 3.10GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.0.100-preview-009812
  [Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  CoreRT : .NET CoreRT 1.0.27503.0, 64bit AOT

Job=CoreRT  Runtime=CoreRT
```

| Method | Toolchain | Operations | maxWorkItemCount | Mean | Error | StdDev |
|---|---|---|---|---|---|---|
| ThreadPoolBurstWorkThroughput | Base | 100000 | 1 | 1,182.95 ms | 22.5835 ms | 22.1800 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 1 | 712.75 ms | 13.8821 ms | 13.6341 ms |
| ThreadPoolBurstWorkThroughput | Base | 100000 | 10 | 63.06 ms | 1.3949 ms | 1.7131 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 10 | 57.47 ms | 0.2482 ms | 0.2322 ms |
| ThreadPoolBurstWorkThroughput | Base | 100000 | 100 | 23.55 ms | 0.4624 ms | 0.6631 ms |
| ThreadPoolBurstWorkThroughput | PR | 100000 | 100 | 20.48 ms | 0.0812 ms | 0.0719 ms |
| ThreadPoolSustainedWorkThroughput | Base | 100000 | ? | 14.69 ms | 0.0295 ms | 0.0246 ms |
| ThreadPoolSustainedWorkThroughput | PR | 100000 | ? | 14.51 ms | 0.1514 ms | 0.1342 ms |

filipnavara force-pushed the lowlevelsync branch 3 times, most recently from feb309e to 4026aa7, on February 6, 2019, 15:45
filipnavara (Member Author) commented Feb 6, 2019

I tried running @benaadams' benchmark from https://github.com/benaadams/ThreadPoolTaskTesting. The results are below, but don't read too much into them, since on my test machine the numbers vary quite wildly from run to run (some differ by 20% between two consecutive runs). I am not sure whether there is anything I can do about that, or whether anything can be read from the numbers at all. I also consistently get Bus error: 10 from the last test.

I didn't try to force a high priority for the process, which would likely make the numbers more consistent. Unfortunately, both my MacBook and iMac eventually hit thermal issues under these tests (generally, any long-running thread pool tests) and either spin up very loud cooling or start CPU throttling.

Pre:

```
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second on 4 Cores
                                                                             Parallelism
                                  Serial          2x         16x         64x        512x
QUWI No Queues (TP)              5.092 M     4.871 M     5.168 M     5.385 M     5.327 M
- Depth    2                     4.712 M     4.439 M     4.673 M     4.714 M     4.693 M
- Depth   16                     4.922 M     4.709 M     4.930 M     4.934 M     4.903 M
- Depth   64                     4.963 M     5.004 M     5.026 M     5.017 M     4.978 M
- Depth  512                     5.027 M     5.044 M     4.963 M     4.983 M     4.959 M


QUWI Queue Local (TP)            2.586 M     4.350 M     5.972 M     6.049 M     5.982 M
- Depth    2                     3.742 M     4.902 M     5.950 M     6.010 M     5.923 M
- Depth   16                     5.772 M     6.008 M     6.183 M     6.191 M     6.088 M
- Depth   64                     5.491 M     5.478 M     5.943 M     5.669 M     5.803 M
- Depth  512                     5.876 M     5.965 M     6.069 M     6.170 M     6.087 M


SubTask Chain Return (TP)      207.876 k   771.453 k     1.242 M     1.158 M     1.145 M
- Depth    2                   242.801 k   643.887 k     1.208 M     1.242 M     1.275 M
- Depth   16                   324.677 k   738.152 k     1.389 M     1.378 M     1.463 M
- Depth   64                   324.371 k   759.935 k     1.416 M     1.419 M     1.477 M
- Depth  512                   348.896 k   886.726 k     1.810 M     1.832 M     1.796 M


SubTask Chain Awaited (TP)     165.487 k   397.849 k   696.324 k   689.829 k   736.044 k
- Depth    2                   172.645 k   410.792 k   630.540 k   719.917 k   779.266 k
- Depth   16                   199.105 k   473.097 k   771.422 k   811.847 k   848.522 k
- Depth   64                   210.068 k   518.456 k   928.639 k   933.719 k   961.701 k
- Depth  512                   244.118 k   671.726 k     1.175 M     1.198 M     1.183 M


SubTask Fanout Awaited (TP)     95.042 k   210.537 k   357.743 k   469.880 k   541.074 k
- Depth    2                   156.743 k   328.988 k   454.231 k   578.414 k   643.199 k
- Depth   16                   361.232 k   557.166 k   741.434 k   791.009 k   775.568 k
- Depth   64                   514.194 k   697.478 k   811.291 k   817.903 k   813.941 k
- Depth  512                   667.601 k   785.942 k   824.527 k   844.484 k   821.210 k


Continuation Chain (TP)         99.853 k   228.534 k   435.238 k   473.983 k   551.314 k
- Depth    2                   164.696 k   377.018 k   718.431 k   850.049 k   931.871 k
- Depth   16                   473.788 k     1.101 M     2.212 M     2.302 M     2.320 M
- Depth   64                   644.677 k     1.432 M     2.719 M     2.838 M     2.823 M
- Depth  512                   752.146 k     1.657 M     2.863 M     3.023 M     3.019 M


Continuation Fanout (TP)        76.665 k   183.625 k   318.045 k   376.486 k   423.932 k
- Depth    2                   128.625 k   344.905 k   612.668 k   609.207 k   584.534 k
- Depth   16                   657.940 k     1.204 M     1.263 M     1.284 M     1.286 M
- Depth   64                     1.136 M     1.788 M     1.474 M     1.470 M     1.463 M
- Depth  512                     1.235 M     2.074 M     1.542 M     1.550 M     1.558 M


Yield Chain Awaited (TP)       282.051 k   525.728 k   949.500 k   932.280 k     1.143 M
- Depth    2                   325.479 k   639.795 k     1.039 M     1.144 M     1.417 M
- Depth   16                   480.714 k   846.826 k     1.441 M     1.683 M     1.719 M
- Depth   64                   499.317 k   808.951 k     1.643 M     1.752 M     1.840 M
- Depth  512                Bus error: 10
```

Post:

```
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second on 4 Cores
                                                                             Parallelism
                                  Serial          2x         16x         64x        512x
QUWI No Queues (TP)              5.122 M     4.944 M     5.312 M     5.504 M     5.519 M
- Depth    2                     4.604 M     4.635 M     4.686 M     4.768 M     4.732 M
- Depth   16                     5.159 M     5.142 M     5.147 M     5.126 M     4.490 M
- Depth   64                     4.825 M     4.440 M     4.818 M     4.876 M     4.873 M
- Depth  512                     4.958 M     4.877 M     4.351 M     4.944 M     4.980 M


QUWI Queue Local (TP)            2.498 M     3.904 M     5.198 M     5.675 M     6.086 M
- Depth    2                     4.019 M     4.948 M     5.567 M     5.624 M     5.948 M
- Depth   16                     5.607 M     5.972 M     6.123 M     6.198 M     6.085 M
- Depth   64                     5.670 M     5.684 M     5.911 M     6.053 M     6.008 M
- Depth  512                     5.903 M     5.723 M     6.107 M     6.137 M     5.944 M


SubTask Chain Return (TP)      288.674 k   732.182 k     1.048 M     1.092 M     1.130 M
- Depth    2                   319.092 k   813.138 k     1.206 M     1.213 M     1.256 M
- Depth   16                   420.980 k   850.233 k     1.342 M     1.345 M     1.452 M
- Depth   64                   426.900 k   852.220 k     1.320 M     1.340 M     1.446 M
- Depth  512                   458.738 k   983.229 k     1.814 M     1.829 M     1.741 M


SubTask Chain Awaited (TP)     195.067 k   533.452 k   681.902 k   658.270 k   704.345 k
- Depth    2                   230.940 k   499.588 k   697.969 k   713.768 k   755.737 k
- Depth   16                   303.063 k   573.892 k   759.381 k   771.129 k   752.144 k
- Depth   64                   323.481 k   622.457 k   934.510 k   928.668 k   944.062 k
- Depth  512                   372.373 k   784.003 k     1.145 M     1.168 M     1.178 M


SubTask Fanout Awaited (TP)    102.172 k   279.550 k   307.417 k   455.431 k   534.595 k
- Depth    2                   189.322 k   412.879 k   447.633 k   585.927 k   630.344 k
- Depth   16                   422.503 k   549.162 k   733.071 k   778.770 k   745.301 k
- Depth   64                   580.335 k   664.699 k   774.576 k   803.601 k   806.931 k
- Depth  512                   730.656 k   780.393 k   818.605 k   830.610 k   813.356 k


Continuation Chain (TP)        112.152 k   264.359 k   429.889 k   461.448 k   554.082 k
- Depth    2                   188.273 k   446.409 k   726.150 k   830.212 k   898.507 k
- Depth   16                   473.160 k     1.101 M     2.039 M     2.313 M     2.375 M
- Depth   64                   581.500 k     1.330 M     2.787 M     2.837 M     2.796 M
- Depth  512                   643.049 k     1.561 M     3.041 M     3.016 M     3.037 M


Continuation Fanout (TP)        86.663 k   224.153 k   318.949 k   384.670 k   409.053 k
- Depth    2                   166.196 k   458.905 k   609.650 k   614.899 k   614.684 k
- Depth   16                   895.730 k     1.420 M     1.289 M     1.295 M     1.286 M
- Depth   64                     1.083 M     1.887 M     1.474 M     1.483 M     1.476 M
- Depth  512                     1.111 M     2.048 M     1.533 M     1.483 M     1.485 M


Yield Chain Awaited (TP)       250.375 k   573.909 k   943.017 k   947.237 k     1.177 M
- Depth    2                   334.602 k   702.286 k     1.051 M     1.139 M     1.413 M
- Depth   16                   457.119 k   870.957 k     1.310 M     1.623 M     1.721 M
- Depth   64                   451.213 k   812.185 k     1.628 M     1.733 M     1.840 M
- Depth  512                Bus error: 10
```

kouvel (Member) left a comment

Still reviewing, a few comments so far

```csharp
class WaiterListEntry
{
    public LowLevelMonitor _monitor;
    public WaiterListEntry _next;
```
Member:

I thought that you may use this implementation for Mono. It shouldn't be necessary to change the implementation in CoreRT for that purpose; what it was doing before is more appropriate for CoreRT. It would probably be beneficial to separate LowLevelLifoSemaphore, which offers just the base LIFO semaphore functionality (no spinning, etc.), from ClrLifoSemaphore, which does all of the rest. Maybe ClrLifoSemaphore could use a better name, but I would leave that for later, as there are more abstractions that could be done. ClrLifoSemaphore would use LowLevelLifoSemaphore. With that, the implementation of ClrLifoSemaphore can be shared between runtimes and the only runtime-specific dependency would be LowLevelLifoSemaphore, for which runtimes can provide an implementation that is suitable. For CoreRT, that would be using the wait subsystem, for Mono it may be this implementation, for CoreCLR it would be the PAL, and so on.
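
As a rough illustration of the split being proposed here (the names and shapes are illustrative, not the actual CoreRT/CoreCLR types): a thin, runtime-specific semaphore underneath, and a shared layer on top that adds the spinning and spinner/waiter bookkeeping.

```csharp
using System.Threading;

// Runtime-specific base: wait subsystem on CoreRT, PAL on CoreCLR, pthreads on
// Mono, and so on. Only this piece would differ per runtime.
internal abstract class LowLevelLifoSemaphoreBase
{
    public abstract bool WaitCore(int timeoutMs);
    public abstract void ReleaseCore(int count);
}

// Shared layer: the spinning (and, in the real code, spinner/waiter counting)
// lives here and can be identical across runtimes.
internal sealed class ClrLifoSemaphoreSketch
{
    private readonly LowLevelLifoSemaphoreBase _semaphore;
    private readonly int _spinCount;

    public ClrLifoSemaphoreSketch(LowLevelLifoSemaphoreBase semaphore, int spinCount)
    {
        _semaphore = semaphore;
        _spinCount = spinCount;
    }

    public bool Wait(int timeoutMs)
    {
        // Spin for a while in the hope that a release arrives soon; the real
        // code also tries to grab a signal via interlocked counts here.
        for (int i = 0; i < _spinCount; i++)
        {
            Thread.SpinWait(8);
        }
        // Fall back to the runtime-specific blocking wait.
        return _semaphore.WaitCore(timeoutMs);
    }

    public void Release(int count) => _semaphore.ReleaseCore(count);
}
```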

filipnavara (Member Author), Feb 7, 2019:

I didn't split up LowLevelLifoSemaphore and ClrLifoSemaphore because technically there's only one user of it - the thread pool. There's probably never going to be another one, because SemaphoreSlim exists; unless you specifically depend on uninterruptibility and LIFO order at the same time, there's no reason to use this one. LowLevelLock already contained spinning code, so it seemed acceptable for spinning to be part of the low-level code. The LowLevel prefix also makes it immediately obvious that it is uninterruptible (as long as one knows the convention or has read the long comment in WaitSubsystem).

The runtime sharing is a bit of a pain and I had to put some thought into it. I am trying hard to avoid a Mono-specific implementation (as long as I can avoid hurting CoreRT at the same time).

While the wait subsystem seems like an obvious choice on CoreRT, it doesn't really save any code. In fact it requires the subsystem to support LIFO order and an uninterruptible Wait, which is way more code than the primitive implementation here. Also, the current implementation of WaitSubsystem has contention on a single lock, which hurts performance a bit (an implementation detail, but a real problem).

My current plan is to break the dependency on WaitSubsystem in this one place and move LowLevelLock/LowLevelMonitor/LowLevelSemaphore into the shared part of CoreLib as a feature (e.g. FeaturePortableLowLevelSync). That would be shared by CoreRT and Mono. CoreCLR could use it (for testing at least), or simply skip the feature and implement the primitives on its own through the existing PAL code.

kouvel (Member), Feb 7, 2019:

Merging LowLevelLifoSemaphore and ClrLifoSemaphore is fine; the wait portions can still be specialized for runtimes, and I think they should be. I really don't want to use this implementation of a LIFO semaphore in CoreRT or CoreCLR; it's unnecessarily more complicated. The tracking being done here and the additional overhead are unnecessary for those runtimes, which already do similar tracking to support all of the other types of waits, and since that tracking is shared there is no additional overhead for uninterruptibility. Adding uninterruptibility in both runtimes was actually very easy. Conversely, it was actually a lot more work to add interruptibility to CoreRT's wait subsystem, which started off uninterruptible during the initial implementation. If it's not easy to modify Mono's semaphore to support uninterruptible waits, I would still prefer to separate out this implementation specifically for Mono.

> The LowLevel prefix also makes it immediately obvious that it is uninterruptible (as long as one knows the convention or has read the long comment in WaitSubsystem).

At the moment the LowLevel part doesn't just mean uninterruptible; it also means it's a simple implementation that directly uses system components. LowLevelLock does some spinning; in hindsight it probably would have made sense to separate that out. The idea was that the LowLevel portions would be system dependencies that may be specific to runtimes (which may also be shared where possible) and anything on top of them could be shared. Even if there is only one user currently, it can help to componentize the functionality.

> While the wait subsystem seems like an obvious choice on CoreRT, it doesn't really save any code. In fact it requires the subsystem to support LIFO order and an uninterruptible Wait, which is way more code than the primitive implementation here.

Not really; most of the code is shared with interruptible waits, so there is no overhead that is specific to uninterruptible waits.

> Also, the current implementation of WaitSubsystem has contention on a single lock, which hurts performance a bit (an implementation detail, but a real problem).

I don't see how this implementation would be better. Typically these locks are uncontended. It's rare for two threads to both be releasing threads, because the thread pool uses an interlocked operation to figure out how many threads need to be released and one of those two threads would release all of them. Perf is not a big concern here, but the wait subsystem is actually pretty fast.
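
To illustrate the point about a single thread releasing all waiters, here is a hypothetical simplification (not the actual CLRLifoSemaphore code, which packs these counters into a single Counts value):

```csharp
using System.Threading;

internal sealed class ReleaseCountingSketch
{
    private int _pendingSignals;                                  // signals not yet handed to the semaphore
    private readonly SemaphoreSlim _sem = new SemaphoreSlim(0);   // stand-in for the low-level semaphore

    public void SignalWaiters(int count)
    {
        // Publish the new signals first...
        int total = Interlocked.Add(ref _pendingSignals, count);

        // ...and only the thread that found the pending count empty owns the
        // release; a concurrent releaser's signals are picked up here too, so
        // the underlying release path is rarely hit by two threads at once.
        if (total == count)
        {
            int toRelease = Interlocked.Exchange(ref _pendingSignals, 0);
            if (toRelease > 0)
                _sem.Release(toRelease);
        }
    }
}
```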

Member:

> Even if there is only one user currently, it can help to componentize the functionality.

Granted, sometimes it can be advantageous not to componentize in order to add some features.

filipnavara (Member Author):

> I really don't want to use this implementation of a LIFO semaphore in CoreRT or CoreCLR, it's unnecessarily more complicated.

I do have different goals for this, which makes it a hard agenda to push. I don't think it's unnecessarily more complicated, because the amount of code removed from the wait subsystem is more than the amount of code added here. I will do some more benchmarks of different variations, but I actually think that the contention on s_lock in WaitSubsystem hurts performance significantly for heavily threaded code, and this is an easy way out. The suggestion regarding CoreCLR was not about using it there in production code, but only as a means of bringing the code up in another runtime and for an apples-to-apples performance comparison.

> At the moment the LowLevel part doesn't just mean uninterruptible, it also means it's a simple implementation that directly uses system components.

The documentation states:

```csharp
/// <see cref="LowLevelLock"/> and <see cref="LowLevelMonitor"/>
///   - These are "low level" in the sense they don't depend on this wait subsystem, and any waits done are not
///     interruptible
```

LowLevelSemaphore never used the system component on Unix; it used the wait subsystem, violating any common sense about the component layering. In that sense calling it "low level" is wrong. The only other shared property is the uninterruptibility. I think it should be implemented using system components, but if I fail to push in that direction then it should definitely be renamed to avoid confusion.

> Not really, most of the code is shared with interruptible waits so there is no overhead that is specific to uninterruptible waits.

There is no performance overhead for other code paths through the WaitSubsystem, but the amount of code is a different story. Performance-wise it increases contention on the single lock. That's an orthogonal issue to these code changes, though.

Member:

re the naming of LowLevelSemaphore: I chose that name moreso to match the naming of LowLevelLinq (subset of API surface reimplemented for runtime use) than LowLevelLock or LowLevelMonitor (thin managed wrappers around native system components).

filipnavara (Member Author), Feb 11, 2019:

> It is also not necessary to port the wait subsystem, that is a runtime-specific piece currently. Perhaps it could be shared in the future but it doesn't have to be at the moment. Is there some other agenda you're trying to push for?

The whole point is to share as much code as possible between the runtimes.

Code that is runtime-specific doesn't get as much testing as code that is shared. Likewise, performance improvements have historically been done mostly on the CoreCLR side and didn't translate to CoreRT, which was a shame. Now that Mono is sharing the CoreLib code, there has been a push that helped to close the gap by moving more of the code to the shared part of CoreLib.

Ideally I would like to see the wait subsystem shared between all three runtimes, but we are still quite far away from there and it may well never happen.

Decoupling individual components - the thread pool, timers, low-level synchronization primitives - was an attempt to get there faster, piece by piece. Unfortunately I underestimated the complexity of the LIFO semaphore implementation and its edge cases. I assumed I could do it in less than 100 lines and simplify the WaitSubsystem at the same time, to the point that it would be a win-win overall (no regression, cleaner component separation, the benefits of code being shared and tested by two runtimes, and the ability to test the whole managed pool on CoreCLR as well). Eventually I ended up with an implementation that is small enough and covers all the edge cases that were brought up, but it requires unmanaged code and it's not as mature as whatever WaitSubsystem does now.

> This implementation adds another monitor per thread, allocates another object per thread, has issues with OOMs...

Yeah, it was flawed. I have completely rewritten it in the end, but that's out of scope for this PR. I solved a couple of the issues by moving the LIFO semaphore to unmanaged code, reducing the number of pthread objects (1 mutex + 1 condition per waiting thread) and storing the LIFO entries on the stack. (Unfinished code here: https://gist.github.com/filipnavara/4d58444435cb106bbc4e3ffa939e99be; missing comments and proper pthread_cond_t initialization on non-macOS systems.)

> The global lock is held for a very short time, I have not seen significant contention on that lock.

I have seen it in the heavy threading+locking benchmarks before, but it could have been a flaw in the benchmark methodology. I went ahead and re-ran different combinations of the code with my benchmarks. It turns out that the additional logic with the interlocked Counts loops avoids getting to the slow path in those heavy threading scenarios (roughly the shape sketched below).
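
For illustration, the fast path referred to above looks roughly like this; a simplified stand-in that uses a plain counter instead of the packed Counts value from the actual code:

```csharp
using System.Threading;

internal sealed class FastPathSketch
{
    private int _signalCount;

    public bool TryAcquireSignalFast()
    {
        int observed = Volatile.Read(ref _signalCount);
        while (observed > 0)
        {
            // Try to consume one signal without taking any lock.
            int previous = Interlocked.CompareExchange(ref _signalCount, observed - 1, observed);
            if (previous == observed)
                return true;          // got a signal; no blocking needed
            observed = previous;      // lost the race; retry with the fresh value
        }
        return false;                 // nothing available; caller takes the slow (blocking) path
    }
}
```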

> I think the name is appropriate, I'm not too particular about the name as long as it sort of works, it's difficult to get names right.

I understand it is difficult. I am not happy about the name, but I don't really have a better name right now. The "low level" part makes me feel it should not call into something "higher level" like WaitSubsystem, implementation detail or not.

filipnavara (Member Author):

> re the naming of LowLevelSemaphore: I chose that name moreso to match the naming of LowLevelLinq (subset of API surface reimplemented for runtime use) than LowLevelLock or LowLevelMonitor (thin managed wrappers around native system components)

Ah, thanks for the helpful historical note :-) LowLevelLinq is not really used in that part of the code anymore, aside from a using in one of the files that seems like a leftover.

Member:

A good time to share the implementation would be if system LIFO wait support were to be found. This component can be implemented in managed or native code, would be fairly small, and perf doesn't really matter, so I don't see a huge benefit in sharing it at the moment, especially since there are better alternatives for some runtimes. Sharing the portion above this part (the ClrLifoSemaphore part that tracks spinners/waiters, etc.) is a lot more useful.

filipnavara (Member Author):

> A good time to share the implementation would be if system LIFO wait support were to be found.

Not going to happen, at least not in a portable way. Every other library I found (notably Facebook Folly, which is referenced from a couple of academic papers) uses an implementation similar to the one I came up with.

kouvel (Member) left a comment

Still reviewing, hope to continue tomorrow


```csharp
public static Counts CompareExchange(ref Counts location, Counts newCounts, Counts oldCounts)
{
    return new Counts { _asLong = Interlocked.CompareExchange(ref location._asLong, newCounts._asLong, oldCounts._asLong) };
```
Member:

It would be more concise to add a constructor that takes the long value and to use that instead of setting the _asLong field on each instantiation.
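
For illustration, the suggestion amounts to something like the following (a hypothetical sketch; the real Counts struct has more fields and helpers):

```csharp
using System.Threading;

internal struct Counts
{
    private long _asLong;

    // Suggested constructor wrapping the raw 64-bit representation.
    private Counts(long asLong) => _asLong = asLong;

    public static Counts CompareExchange(ref Counts location, Counts newCounts, Counts oldCounts)
    {
        // new Counts(...) instead of new Counts { _asLong = ... } at each call site.
        return new Counts(Interlocked.CompareExchange(ref location._asLong, newCounts._asLong, oldCounts._asLong));
    }
}
```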

filipnavara (Member Author):

Again, this is copied from ClrThreadPool's Counts. I would likely have done it as you suggested, but I copied the existing code for consistency.

Member:

Not a big deal; it would probably be a bit cleaner.


```csharp
public override int GetHashCode()
{
    return (int)(_asLong >> 8);
```
Member:

This should never be used; perhaps do Debug.Fail() and return 0. Alternatively, if you want a somewhat useful value (probably not worth it), it could return a sum of each of the components.
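
A sketch of that suggestion (hypothetical; assumes `using System.Diagnostics;` is in scope):

```csharp
public override int GetHashCode()
{
    // This type is not expected to be used as a hash key; make accidental use loud in debug builds.
    Debug.Fail("Counts.GetHashCode should not be called.");
    return 0;
}
```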

filipnavara (Member Author):

Only copied it for consistency with Counts in ClrThreadPool. It should never be used, so Debug.Fail may be in order.

filipnavara (Member Author) commented Feb 7, 2019

Benchmark after updating the default spin count to match the CoreCLR value [but not the PAL *2 multiplier] and the spinning algorithm:

| Method | Toolchain | WIC | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|---|
| Burst | Base | 1 | 1,428.26 ms | 31.2428 ms | 66.5810 ms | 1,408.86 ms |
| Burst | PR | 1 | 861.98 ms | 17.1336 ms | 25.6448 ms | 854.40 ms |
| Burst | Base | 10 | 240.32 ms | 4.8021 ms | 8.6593 ms | 239.42 ms |
| Burst | PR | 10 | 125.65 ms | 2.4182 ms | 2.4833 ms | 126.34 ms |
| Burst | Base | 100 | 44.68 ms | 0.8792 ms | 1.1432 ms | 44.86 ms |
| Burst | PR | 100 | 35.57 ms | 0.7081 ms | 1.4305 ms | 36.16 ms |
| Sustained | Base | ? | 20.33 ms | 0.2658 ms | 0.2356 ms | 20.43 ms |
| Sustained | PR | ? | 21.69 ms | 0.3061 ms | 0.2863 ms | 21.61 ms |

adamsitnik (Member):

@filipnavara if you are looking for more ThreadPool benchmarks, you should take a look at what we have in the performance repo:

https://github.com/dotnet/performance/blob/74fca49ecd1f0eae51b0172bd121ee7d0fdd2b6d/src/benchmarks/micro/corefx/System.Threading.ThreadPool/Perf.ThreadPool.cs#L11-L38

filipnavara (Member Author):

@adamsitnik Thanks! I looked in the performance repository before, but I missed that one. I will definitely give it a try.

adamsitnik pushed a commit to dotnet/BenchmarkDotNet that referenced this pull request Feb 8, 2019
filipnavara changed the title from "WIP: Improve thread pool performance on Unix" to "Improve thread pool performance on Unix" on Feb 11, 2019
```csharp
}
else
{
    RuntimeThread.Yield();
```
Member:

The processorCount <= 1 case is already covered in the initial spin index. It would be a bit cleaner to change the use of SpinYieldThreshold above to SpinSleep0Threshold (then SpinYieldThreshold would not be used anymore and can be removed), and revert this block to what it was before.

filipnavara (Member Author):

It is not covered, because the Wait method now also spins on every second iteration. A better solution would be to keep the spin counter inside LowLevelSpinWaiter, but that's a refactoring I would like to do separately.

Member:

Perhaps Wait could check s_processorCount and sleep instead of spinning, rather than checking the processor count on the caller side.
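
A hypothetical sketch of that suggestion; the names s_processorCount and SpinSleep0Threshold come from this discussion, while the structure and the threshold value are guesses rather than the actual LowLevelSpinWaiter code:

```csharp
using System;
using System.Threading;

internal static class LowLevelSpinWaiterSketch
{
    private static readonly int s_processorCount = Environment.ProcessorCount;
    private const int SpinSleep0Threshold = 10;   // illustrative value only

    public static void Wait(int spinIndex)
    {
        if (s_processorCount <= 1 || spinIndex >= SpinSleep0Threshold)
        {
            // On a single processor (or after spinning long enough), give up the
            // time slice instead of busy-waiting.
            Thread.Sleep(0);
        }
        else
        {
            Thread.SpinWait(20 * (spinIndex + 1));   // arbitrary growth for the sketch
        }
    }
}
```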

filipnavara (Member Author):

Yes, that's what I would want to do... but the initialization of the static variables and the reading of the processor count is kind of funky there. I wasn't sure whether it's safe to change it, or whether the code could be reached from some early static constructor initialization and result in a dangerous dependency.

Member:

Oh ok, yeah, the wait subsystem and the things it uses are used during lazy class construction. It would work to change the Initialize method into a static method (maybe renamed to EnsureInitialized or something like that) and call it before any spin-wait loops that use this type. Not sure if it's the best solution. I think I had considered attributing this class with EagerStaticClassConstruction, but maybe there was some issue with it; I don't remember the details.

filipnavara (Member Author), Feb 14, 2019:

I tried to reorganize it a little in the last commit (cffc32d). It's not perfect, but it's a bit more readable. I changed it to use PlatformHelper.ProcessorCount, which is supposed to respond to changes in the number of processors and still does reasonable caching that works with the static constructor initialization.

filipnavara (Member Author):

Damn, looks like it broke the universe. I'll have to debug it on my other machine tomorrow... Reverting for now.

filipnavara force-pushed the lowlevelsync branch 6 times, most recently from da1f2cf to 34cc5e5, on February 15, 2019, 00:15
filipnavara reopened this on Feb 15, 2019
kouvel (Member) left a comment

Thanks!

kouvel merged commit 4666d6f into dotnet:master on Feb 15, 2019
kouvel added a commit to kouvel/coreclr that referenced this pull request Feb 15, 2019
…itial signal count

- Port of a fix from dotnet/corert#6955
- The underlying semaphore is only used to wake up waiters, initially there are no waiters and the signal count should be zero. This was already the case on Windows, this fixes the Unix side. The actual initial signal count is tracked in the upper layer counts.
- The initial signal count passed in is zero anyway in the places where it's used, so it makes no difference for now, just some cleanup
stephentoub pushed a commit to dotnet/coreclr that referenced this pull request Mar 15, 2019
…itial signal count (#22632)

- Port of a fix from dotnet/corert#6955
- The underlying semaphore is only used to wake up waiters, initially there are no waiters and the signal count should be zero. This was already the case on Windows, this fixes the Unix side. The actual initial signal count is tracked in the upper layer counts.
- The initial signal count passed in is zero anyway in the places where it's used, so it makes no difference for now, just some cleanup