
Add normalized equivalent of YieldProcessor, retune some spin loops #13670

Merged: 3 commits, Sep 1, 2017

Conversation

@kouvel (Member) commented Aug 30, 2017

Part of fix for https://github.com/dotnet/coreclr/issues/13388

Normalized equivalent of YieldProcessor

  • The delay incurred by YieldProcessor is measured once lazily at run-time
  • Added YieldProcessorNormalized, which yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration (a sketch of the measurement idea follows this list).
  • Changed Thread.SpinWait to use YieldProcessorNormalized
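
For illustration only, the measurement idea can be sketched in managed code roughly as follows. This is a hypothetical sketch, not the code in this change: the actual implementation lives in the runtime and measures the native YieldProcessor (pause) instruction directly, and the class, constant, and method names below are made up.

using System;
using System.Diagnostics;
using System.Threading;

internal static class YieldNormalizationSketch
{
    // Hypothetical target duration for one normalized yield: ~125 cycles,
    // roughly 37 ns at ~3.4 GHz.
    private const double TargetNsPerNormalizedYield = 37.0;

    private static int s_spinsPerNormalizedYield = 1;

    // Measures once how many spin units are needed to approximate the target duration.
    // Thread.SpinWait(1) is used here as a stand-in for one YieldProcessor call.
    public static void Measure()
    {
        const int MeasureSpins = 100000;
        Stopwatch sw = Stopwatch.StartNew();
        Thread.SpinWait(MeasureSpins);
        sw.Stop();

        double nsPerSpin = sw.Elapsed.TotalMilliseconds * 1000000.0 / MeasureSpins;
        s_spinsPerNormalizedYield =
            Math.Max(1, (int)Math.Round(TargetNsPerNormalizedYield / nsPerSpin));
    }

    // Yields for approximately the normalized duration, regardless of how long one
    // YieldProcessor takes on the current processor.
    public static void YieldProcessorNormalized()
    {
        Thread.SpinWait(s_spinsPerNormalizedYield);
    }
}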

Thread.SpinWait divide count by 7 experiment

  • At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it affects perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized (seven YieldProcessor calls take about as long as one YieldProcessorNormalized). See the numbers below.
  • Not too many regressions, and the overall perf is roughly as expected: not much change on the Sandy Bridge processor, and a significant improvement on the Skylake processor.
    • I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
    • ReaderWriterLockSlim seems to perform worse on Skylake; the current spin heuristics do not translate well there

Spin tuning

  • At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
  • General observations
    • YieldProcessor stage
      • In many places, this stage currently does very long spins on YieldProcessor per iteration of the spin loop. In the last YieldProcessor iteration, the delay amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
      • Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
      • Long spins on YieldProcessor do, in some cases, help reduce contention in high-contention scenarios, effectively taking some threads away into a long delay. Sleep(1) works much better for this but has a much higher delay, so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor spin. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
      • Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (this was tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
    • Yield/Sleep(0) stage
      • Observed a couple of issues:
        • When there are no threads to switch to, Yield and Sleep(0) become no-ops, which turns the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switching if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
        • If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
      • I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor, as this enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold.
    • Sleep(1) stage
      • I didn't see any benefit in the tests from interleaving Sleep(1) calls with some Yield/Sleep(0) calls; perf actually seemed a bit worse. If the Sleep(1) stage is reached, there is probably a lot of contention, and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in-between seems to add back some of that contention.
        • Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
      • For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
        • Spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available) benefit from not doing Sleep(1) at all and from spinning longer in the other stages
        • Infinite spin loops usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors such as how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects.
        • Added an internal overload of SpinWait.SpinOnce to take the Sleep(1) threshold as a parameter
  • SpinWait - Tweaked the spin strategy as mentioned above (a sketch of the resulting staged strategy follows this list)
  • ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (total delay is still significantly less than before). Retained the previous behavior of having Sleep(1) if a higher spin count is requested.
  • Task - It was using the same heuristics as ManualResetEventSlim; copied the changes here as well
  • SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
  • SpinLock - It was using very long YieldProcessor spins. Changed to use SpinWait, removed the processor count multiplier, and simplified.
  • ReaderWriterLockSlim - This one is complicated as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.
  • The perf (see numbers in PR below) seems to be much better than both the baseline and the Thread.SpinWait divide by 7 experiment
    • On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better and not the other way around.
    • On Skylake, some significant regressions are in SemaphoreSlim throughput (which I'm discounting as I mentioned above in the experiment) and CountdownEvent add/signal throughput. The latter can probably be improved later.
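
For illustration, the staged strategy described above can be sketched like this. The overall structure mirrors the condition quoted in the review comments below, but the thresholds, the cadence of Sleep(0) vs. Yield, and the spin counts are placeholders rather than the exact values in this change:

using System;
using System.Threading;

internal struct SpinWaitSketch
{
    // Placeholder thresholds; the real values were tuned based on perf measurements.
    private const int YieldThreshold = 10;        // spin-only iterations before yielding
    private const int DefaultSleep1Threshold = 20;

    private int _count;

    public void SpinOnce(int sleep1Threshold = DefaultSleep1Threshold)
    {
        if (_count >= YieldThreshold &&
            (_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0))
        {
            if (_count >= sleep1Threshold)
            {
                // Sleep(1) stage: under heavy contention, take this thread out of the
                // picture for a while on every iteration from here on.
                Thread.Sleep(1);
            }
            else if ((_count - YieldThreshold) % 4 == 2)
            {
                // Occasionally use Sleep(0) instead of Yield (placeholder cadence).
                Thread.Sleep(0);
            }
            else
            {
                Thread.Yield();
            }
        }
        else
        {
            // YieldProcessor stage, interleaved with the yields above: growing but
            // capped spins, so that individual spins stay short.
            Thread.SpinWait((Math.Min(_count, 15) + 1) * 4);
        }

        _count = _count == int.MaxValue ? YieldThreshold : _count + 1;
    }
}

A spin loop would call SpinOnce repeatedly until its condition is satisfied, passing a low sleep1Threshold for infinite spin loops and a high one (effectively disabling Sleep(1)) when a proper wait follows the spinning.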

@kouvel kouvel added area-System.Threading tenet-performance Performance related issue labels Aug 30, 2017
@kouvel kouvel added this to the 2.1.0 milestone Aug 30, 2017
@kouvel kouvel self-assigned this Aug 30, 2017
@kouvel (Member, Author) commented Aug 30, 2017

Numbers from the Thread.SpinWait divide count by 7 experiment:

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score        Right score       ∆ Score %
------------------------------------------  ----------------  ----------------  ---------
BarrierSyncRate 1Pc                            131.65 ±0.13%     146.93 ±0.32%     11.61%
ConcurrentQueueThroughput 1Pc               16092.28 ±12.34%  16040.14 ±11.77%     -0.32%
ConcurrentStackThroughput 1Pc                39470.21 ±0.42%   38814.31 ±0.95%     -1.66%
CountdownEventAddCountSignalThroughput 1Pc    7088.57 ±1.33%    6872.57 ±0.76%     -3.05%
MresWaitDrainRate 1Pc                          556.26 ±0.71%     555.32 ±0.51%     -0.17%
MresWaitDrainRate 1Pc Delay                    524.40 ±1.22%     525.69 ±0.67%      0.24%
MresWaitDrainRate 2Pc                          679.15 ±0.52%     683.60 ±1.00%      0.66%
MresWaitDrainRate 2Pc Delay                    668.11 ±1.76%     673.72 ±0.98%      0.84%
MresWaitLatency 1Pc                            505.88 ±0.38%     497.54 ±0.88%     -1.65%
MresWaitLatency 1Pc Delay                      442.80 ±0.51%     442.81 ±0.71%      0.00%
SemaphoreSlimLatency 1Pc                       114.14 ±0.50%     110.93 ±1.42%     -2.81%
SemaphoreSlimLatency 1Pc Delay                 117.67 ±0.49%     118.51 ±0.48%      0.71%
SemaphoreSlimLatency 2Pc                        70.93 ±1.04%      76.91 ±0.88%      8.43%
SemaphoreSlimLatency 2Pc Delay                  91.47 ±1.90%     100.60 ±0.78%      9.99%
SemaphoreSlimThroughput 1Pc                   455.94 ±60.07%    183.44 ±25.25%    -59.77%
SemaphoreSlimWaitDrainRate 1Pc                  71.31 ±0.95%      71.86 ±1.89%      0.77%
SemaphoreSlimWaitDrainRate 1Pc Delay            68.33 ±2.13%      69.88 ±1.63%      2.27%
SemaphoreSlimWaitDrainRate 2Pc                  93.52 ±2.42%      91.41 ±2.91%     -2.26%
SemaphoreSlimWaitDrainRate 2Pc Delay            86.89 ±2.17%      90.84 ±0.71%      4.55%
SpinLockLatency 1Pc                            286.86 ±1.00%     284.45 ±0.95%     -0.84%
SpinLockLatency 1Pc Delay                      216.97 ±0.72%     212.25 ±0.69%     -2.18%
SpinLockLatency 2Pc                            142.92 ±2.09%     149.15 ±1.39%      4.36%
SpinLockLatency 2Pc Delay                       75.96 ±4.76%      80.89 ±4.59%      6.49%
SpinLockThroughput 1Pc                       44828.02 ±0.48%   44630.26 ±0.33%     -0.44%
------------------------------------------  ----------------  ----------------  ---------
Total                                          418.59 ±5.39%     408.67 ±2.77%     -2.37%

RwSB vs RwS                         Left score       Right score       ∆ Score %
----------------------------------  ---------------  ----------------  ---------
Concurrency_OnlyReadersPcx01        23249.72 ±0.26%   22543.68 ±0.44%     -3.04%
Concurrency_OnlyReadersPcx04        23244.97 ±0.11%   22974.79 ±0.52%     -1.16%
Concurrency_OnlyReadersPcx16        22999.13 ±0.06%   22638.62 ±0.79%     -1.57%
Concurrency_OnlyReadersPcx64        15791.12 ±4.44%   16368.50 ±5.28%      3.66%
Concurrency_OnlyWritersPcx01        22007.28 ±0.97%   23679.77 ±0.70%      7.60%
Concurrency_OnlyWritersPcx04        21556.55 ±1.41%   23412.23 ±1.16%      8.61%
Concurrency_OnlyWritersPcx16        21269.57 ±1.14%   23823.90 ±0.76%     12.01%
Concurrency_OnlyWritersPcx64        22935.28 ±1.47%   21531.70 ±1.55%     -6.12%
Concurrency_Pcx01Readers_01Writers  10900.17 ±2.26%   11174.93 ±6.12%      2.52%
Concurrency_Pcx01Readers_02Writers  7239.27 ±12.43%   6809.70 ±10.01%     -5.93%
Concurrency_Pcx04Readers_01Writers  16317.54 ±3.14%   13822.82 ±7.76%    -15.29%
Concurrency_Pcx04Readers_02Writers  14166.41 ±5.33%   10170.40 ±9.35%    -28.21%
Concurrency_Pcx04Readers_04Writers  15039.61 ±5.99%   15157.21 ±7.66%      0.78%
Concurrency_Pcx16Readers_01Writers  9526.18 ±17.75%  10408.31 ±18.95%      9.26%
Concurrency_Pcx16Readers_02Writers  7491.28 ±17.99%   3812.76 ±34.55%    -49.10%
Concurrency_Pcx16Readers_04Writers  8238.04 ±18.55%   17888.85 ±9.92%    117.15%
Concurrency_Pcx16Readers_08Writers  15473.47 ±7.52%   17661.60 ±7.00%     14.14%
Concurrency_Pcx64Readers_01Writers   1621.35 ±7.95%   2945.33 ±30.30%     81.66%
Concurrency_Pcx64Readers_02Writers  6725.27 ±21.51%   6391.88 ±26.82%     -4.96%
Concurrency_Pcx64Readers_04Writers  15767.55 ±7.69%  15424.21 ±10.83%     -2.18%
Concurrency_Pcx64Readers_08Writers  15849.51 ±6.85%   17318.19 ±6.69%      9.27%
Concurrency_Pcx64Readers_16Writers  21549.13 ±3.39%   20514.35 ±3.81%     -4.80%
----------------------------------  ---------------  ----------------  ---------
Total                               13437.83 ±6.98%   13773.15 ±9.71%      2.50%

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score       Right score      ∆ Score %
------------------------------------------  ---------------  ---------------  ---------
BarrierSyncRate 1Pc                          5693.90 ±3.09%   9364.70 ±1.50%     64.47%
ConcurrentQueueThroughput 1Pc               37904.54 ±4.21%  25803.97 ±6.12%    -31.92%
ConcurrentStackThroughput 1Pc               47125.33 ±0.11%  48910.94 ±0.17%      3.79%
CountdownEventAddCountSignalThroughput 1Pc  34265.28 ±0.53%  13560.28 ±1.37%    -60.43%
MresWaitDrainRate 1Pc                         338.35 ±0.82%    699.29 ±0.35%    106.68%
MresWaitDrainRate 1Pc Delay                   342.14 ±0.50%    657.57 ±0.49%     92.19%
MresWaitDrainRate 2Pc                         634.33 ±0.19%    984.25 ±0.47%     55.16%
MresWaitDrainRate 2Pc Delay                   612.98 ±0.09%    884.72 ±0.48%     44.33%
MresWaitLatency 1Pc                           414.26 ±0.49%    610.93 ±0.36%     47.48%
MresWaitLatency 1Pc Delay                     454.59 ±1.06%    578.27 ±0.31%     27.21%
SemaphoreSlimLatency 1Pc                      351.74 ±0.48%    253.97 ±1.18%    -27.80%
SemaphoreSlimLatency 1Pc Delay                207.57 ±1.14%    167.06 ±0.82%    -19.52%
SemaphoreSlimLatency 2Pc                       51.93 ±3.80%     47.84 ±6.23%     -7.88%
SemaphoreSlimLatency 2Pc Delay                 46.49 ±3.01%     31.28 ±4.39%    -32.71%
SemaphoreSlimThroughput 1Pc                 14368.74 ±0.89%  14531.60 ±1.36%      1.13%
SemaphoreSlimWaitDrainRate 1Pc                 21.04 ±1.99%     65.49 ±1.59%    211.30%
SemaphoreSlimWaitDrainRate 1Pc Delay           21.28 ±2.45%     61.86 ±1.77%    190.74%
SemaphoreSlimWaitDrainRate 2Pc                 25.50 ±0.43%     86.66 ±0.47%    239.85%
SemaphoreSlimWaitDrainRate 2Pc Delay           25.19 ±0.43%     83.42 ±0.47%    231.10%
SpinLockLatency 1Pc                           337.00 ±0.47%    392.09 ±0.72%     16.35%
SpinLockLatency 1Pc Delay                     326.97 ±1.27%    342.38 ±1.18%      4.71%
SpinLockLatency 2Pc                           164.61 ±2.36%    173.41 ±2.06%      5.35%
SpinLockLatency 2Pc Delay                     148.40 ±3.75%    148.05 ±3.77%     -0.24%
SpinLockThroughput 1Pc                      55420.72 ±0.32%  58856.20 ±0.48%      6.20%
------------------------------------------  ---------------  ---------------  ---------
Total                                         536.33 ±1.42%    687.48 ±1.60%     28.18%

RwSB vs RwS                         Left score        Right score       ∆ Score %
----------------------------------  ----------------  ----------------  ---------
Concurrency_OnlyReadersPcx01         27479.34 ±0.15%   25137.00 ±1.24%     -8.52%
Concurrency_OnlyReadersPcx04         27464.91 ±0.17%   27044.91 ±0.27%     -1.53%
Concurrency_OnlyReadersPcx16         26662.72 ±0.52%   26741.52 ±0.67%      0.30%
Concurrency_OnlyReadersPcx64         26062.34 ±0.37%   26194.72 ±0.56%      0.51%
Concurrency_OnlyWritersPcx01         27062.37 ±1.15%   25318.99 ±1.46%     -6.44%
Concurrency_OnlyWritersPcx04         23594.37 ±3.73%   22894.77 ±3.36%     -2.97%
Concurrency_OnlyWritersPcx16         27225.09 ±1.94%   22369.25 ±2.08%    -17.84%
Concurrency_OnlyWritersPcx64        17451.93 ±11.31%   20954.91 ±3.54%     20.07%
Concurrency_Pcx01Readers_01Writers   7739.63 ±10.32%    9596.75 ±7.24%     23.99%
Concurrency_Pcx01Readers_02Writers   4714.65 ±14.40%   6436.92 ±16.55%     36.53%
Concurrency_Pcx04Readers_01Writers   9490.11 ±12.25%  11517.93 ±11.15%     21.37%
Concurrency_Pcx04Readers_02Writers   5379.94 ±17.06%    9158.23 ±8.38%     70.23%
Concurrency_Pcx04Readers_04Writers   5575.55 ±25.88%   5534.47 ±15.56%     -0.74%
Concurrency_Pcx16Readers_01Writers   7841.46 ±16.61%   8558.61 ±27.37%      9.15%
Concurrency_Pcx16Readers_02Writers   4355.57 ±12.71%   5121.77 ±32.87%     17.59%
Concurrency_Pcx16Readers_04Writers   2760.20 ±15.98%   5366.95 ±20.87%     94.44%
Concurrency_Pcx16Readers_08Writers   4930.88 ±26.49%   4958.81 ±26.60%      0.57%
Concurrency_Pcx64Readers_01Writers   9728.04 ±17.06%    229.29 ±11.63%    -97.64%
Concurrency_Pcx64Readers_02Writers   5646.32 ±14.26%     185.42 ±9.17%    -96.72%
Concurrency_Pcx64Readers_04Writers    4520.20 ±5.61%    716.59 ±50.04%    -84.15%
Concurrency_Pcx64Readers_08Writers    4649.90 ±8.66%   1404.82 ±41.50%    -69.79%
Concurrency_Pcx64Readers_16Writers   5842.78 ±14.01%   5082.68 ±24.73%    -13.01%
----------------------------------  ----------------  ----------------  ---------
Total                                9705.62 ±10.83%   6630.10 ±15.71%    -31.69%

@kouvel (Member, Author) commented Aug 30, 2017

Numbers from this PR:

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score        Right score      ∆ Score %
------------------------------------------  ----------------  ---------------  ---------
BarrierSyncRate 1Pc                            131.65 ±0.13%    156.62 ±0.09%     18.97%
ConcurrentQueueThroughput 1Pc               16092.28 ±12.34%  37180.17 ±1.16%    131.04%
ConcurrentStackThroughput 1Pc                39470.21 ±0.42%  40290.88 ±0.94%      2.08%
CountdownEventAddCountSignalThroughput 1Pc    7088.57 ±1.33%  33169.51 ±1.76%    367.93%
MresWaitDrainRate 1Pc                          556.26 ±0.71%    630.82 ±0.42%     13.40%
MresWaitDrainRate 1Pc Delay                    524.40 ±1.22%    591.94 ±1.39%     12.88%
MresWaitDrainRate 2Pc                          679.15 ±0.52%    787.68 ±1.46%     15.98%
MresWaitDrainRate 2Pc Delay                    668.11 ±1.76%    793.68 ±0.27%     18.80%
MresWaitLatency 1Pc                            505.88 ±0.38%    571.14 ±0.55%     12.90%
MresWaitLatency 1Pc Delay                      442.80 ±0.51%    563.03 ±0.47%     27.15%
SemaphoreSlimLatency 1Pc                       114.14 ±0.50%    228.34 ±1.02%    100.06%
SemaphoreSlimLatency 1Pc Delay                 117.67 ±0.49%    170.44 ±3.62%     44.84%
SemaphoreSlimLatency 2Pc                        70.93 ±1.04%    220.51 ±1.75%    210.89%
SemaphoreSlimLatency 2Pc Delay                  91.47 ±1.90%    142.45 ±2.65%     55.74%
SemaphoreSlimThroughput 1Pc                   455.94 ±60.07%   914.93 ±12.88%    100.67%
SemaphoreSlimWaitDrainRate 1Pc                  71.31 ±0.95%    474.65 ±7.27%    565.60%
SemaphoreSlimWaitDrainRate 1Pc Delay            68.33 ±2.13%   293.44 ±18.24%    329.45%
SemaphoreSlimWaitDrainRate 2Pc                  93.52 ±2.42%    597.74 ±3.01%    539.17%
SemaphoreSlimWaitDrainRate 2Pc Delay            86.89 ±2.17%    605.61 ±2.29%    596.97%
SpinLockLatency 1Pc                            286.86 ±1.00%    291.97 ±1.29%      1.78%
SpinLockLatency 1Pc Delay                      217.72 ±0.81%    209.04 ±2.73%     -3.99%
SpinLockLatency 2Pc                            142.92 ±2.09%    269.79 ±1.03%     88.77%
SpinLockLatency 2Pc Delay                       75.96 ±4.76%    179.64 ±1.71%    136.49%
SpinLockThroughput 1Pc                       44828.02 ±0.48%  48026.99 ±0.77%      7.14%
------------------------------------------  ----------------  ---------------  ---------
Total                                          418.65 ±5.39%    799.70 ±2.96%     91.02%

RwSB vs RwS                         Left score       Right score       ∆ Score %
----------------------------------  ---------------  ----------------  ---------
Concurrency_OnlyReadersPcx01        23249.72 ±0.26%   23281.16 ±0.11%      0.14%
Concurrency_OnlyReadersPcx04        23244.97 ±0.11%   22884.67 ±0.17%     -1.55%
Concurrency_OnlyReadersPcx16        22999.13 ±0.06%   22711.34 ±0.13%     -1.25%
Concurrency_OnlyReadersPcx64        15791.12 ±4.44%   15103.34 ±2.92%     -4.36%
Concurrency_OnlyWritersPcx01        22007.28 ±0.97%   23716.49 ±0.33%      7.77%
Concurrency_OnlyWritersPcx04        21556.55 ±1.41%   23451.14 ±0.42%      8.79%
Concurrency_OnlyWritersPcx16        21269.57 ±1.14%   23611.47 ±0.40%     11.01%
Concurrency_OnlyWritersPcx64        22935.28 ±1.47%   22004.41 ±0.59%     -4.06%
Concurrency_Pcx01Readers_01Writers  10900.17 ±2.26%   10834.01 ±4.16%     -0.61%
Concurrency_Pcx01Readers_02Writers  7239.27 ±12.43%    7426.22 ±8.20%      2.58%
Concurrency_Pcx04Readers_01Writers  16317.54 ±3.14%   16688.11 ±2.54%      2.27%
Concurrency_Pcx04Readers_02Writers  14166.41 ±5.33%   12205.80 ±3.45%    -13.84%
Concurrency_Pcx04Readers_04Writers  15039.61 ±5.99%   10169.08 ±3.46%    -32.38%
Concurrency_Pcx16Readers_01Writers  9526.18 ±17.75%   16779.35 ±5.23%     76.14%
Concurrency_Pcx16Readers_02Writers  7491.28 ±17.99%   14328.17 ±5.72%     91.26%
Concurrency_Pcx16Readers_04Writers  8238.04 ±18.55%   12386.79 ±5.58%     50.36%
Concurrency_Pcx16Readers_08Writers  15473.47 ±7.52%   14103.08 ±8.87%     -8.86%
Concurrency_Pcx64Readers_01Writers   1621.35 ±7.95%   15344.30 ±6.25%    846.39%
Concurrency_Pcx64Readers_02Writers  6725.27 ±21.51%   13493.06 ±5.50%    100.63%
Concurrency_Pcx64Readers_04Writers  15767.55 ±7.69%   11061.25 ±6.63%    -29.85%
Concurrency_Pcx64Readers_08Writers  15849.51 ±6.85%   12762.56 ±9.17%    -19.48%
Concurrency_Pcx64Readers_16Writers  21549.13 ±3.39%  12718.30 ±11.93%    -40.98%
----------------------------------  ---------------  ----------------  ---------
Total                               13437.83 ±6.98%   15420.22 ±4.23%     14.75%

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score       Right score      ∆ Score %
------------------------------------------  ---------------  ---------------  ---------
BarrierSyncRate 1Pc                          5693.90 ±3.09%   7549.66 ±6.62%     32.59%
ConcurrentQueueThroughput 1Pc               37904.54 ±4.21%  47712.92 ±1.50%     25.88%
ConcurrentStackThroughput 1Pc               47125.33 ±0.11%  49046.62 ±0.26%      4.08%
CountdownEventAddCountSignalThroughput 1Pc  34265.28 ±0.53%  24094.85 ±3.12%    -29.68%
MresWaitDrainRate 1Pc                         338.35 ±0.82%    781.15 ±0.25%    130.87%
MresWaitDrainRate 1Pc Delay                   342.14 ±0.50%    774.52 ±0.49%    126.37%
MresWaitDrainRate 2Pc                         634.33 ±0.19%    981.16 ±0.15%     54.68%
MresWaitDrainRate 2Pc Delay                   612.98 ±0.09%    964.77 ±0.43%     57.39%
MresWaitLatency 1Pc                           414.26 ±0.49%    890.84 ±0.79%    115.04%
MresWaitLatency 1Pc Delay                     454.59 ±1.06%    844.91 ±0.43%     85.86%
SemaphoreSlimLatency 1Pc                      351.74 ±0.48%    285.31 ±1.06%    -18.89%
SemaphoreSlimLatency 1Pc Delay                207.57 ±1.14%    234.78 ±2.05%     13.11%
SemaphoreSlimLatency 2Pc                       51.93 ±3.80%    280.19 ±1.01%    439.55%
SemaphoreSlimLatency 2Pc Delay                 46.49 ±3.01%    226.19 ±2.62%    386.58%
SemaphoreSlimThroughput 1Pc                 14368.74 ±0.89%   5504.80 ±3.43%    -61.69%
SemaphoreSlimWaitDrainRate 1Pc                 21.04 ±1.99%   176.38 ±18.92%    738.46%
SemaphoreSlimWaitDrainRate 1Pc Delay           21.28 ±2.45%   130.11 ±21.55%    511.58%
SemaphoreSlimWaitDrainRate 2Pc                 25.50 ±0.43%    517.83 ±0.15%   1930.81%
SemaphoreSlimWaitDrainRate 2Pc Delay           25.41 ±0.31%    466.44 ±0.50%   1735.61%
SpinLockLatency 1Pc                           337.00 ±0.47%    410.08 ±1.63%     21.68%
SpinLockLatency 1Pc Delay                     326.97 ±1.27%    347.44 ±2.04%      6.26%
SpinLockLatency 2Pc                           164.61 ±2.36%    357.89 ±2.32%    117.42%
SpinLockLatency 2Pc Delay                     148.40 ±3.75%    321.58 ±1.31%    116.69%
SpinLockThroughput 1Pc                      55420.72 ±0.32%  61147.67 ±0.72%     10.33%
------------------------------------------  ---------------  ---------------  ---------
Total                                         536.52 ±1.41%   1141.26 ±3.22%    112.71%

RwSB vs RwS                         Left score        Right score       ∆ Score %
----------------------------------  ----------------  ----------------  ---------
Concurrency_OnlyReadersPcx01         27479.34 ±0.15%   27099.63 ±0.26%     -1.38%
Concurrency_OnlyReadersPcx04         27464.91 ±0.17%   26101.59 ±0.85%     -4.96%
Concurrency_OnlyReadersPcx16         26662.72 ±0.52%   26892.08 ±0.16%      0.86%
Concurrency_OnlyReadersPcx64         26062.34 ±0.37%   25022.53 ±0.32%     -3.99%
Concurrency_OnlyWritersPcx01         27062.37 ±1.15%   28764.28 ±0.36%      6.29%
Concurrency_OnlyWritersPcx04         23594.37 ±3.73%   28707.49 ±0.29%     21.67%
Concurrency_OnlyWritersPcx16         27225.09 ±1.94%   24213.02 ±5.62%    -11.06%
Concurrency_OnlyWritersPcx64        17451.93 ±11.31%   26971.54 ±1.39%     54.55%
Concurrency_Pcx01Readers_01Writers   7739.63 ±10.32%    9620.09 ±8.70%     24.30%
Concurrency_Pcx01Readers_02Writers   4714.65 ±14.40%   11722.87 ±8.26%    148.65%
Concurrency_Pcx04Readers_01Writers   9490.11 ±12.25%   11005.10 ±7.46%     15.96%
Concurrency_Pcx04Readers_02Writers   5379.94 ±17.06%    7972.85 ±9.15%     48.20%
Concurrency_Pcx04Readers_04Writers   5575.55 ±25.88%   10421.93 ±9.10%     86.92%
Concurrency_Pcx16Readers_01Writers   7841.46 ±16.61%  14165.26 ±10.45%     80.65%
Concurrency_Pcx16Readers_02Writers   4355.57 ±12.71%   9627.95 ±12.92%    121.05%
Concurrency_Pcx16Readers_04Writers   2760.20 ±15.98%   5689.54 ±27.34%    106.13%
Concurrency_Pcx16Readers_08Writers   4930.88 ±26.49%  12242.60 ±13.48%    148.28%
Concurrency_Pcx64Readers_01Writers   9728.04 ±17.06%  10822.92 ±20.22%     11.25%
Concurrency_Pcx64Readers_02Writers   5646.32 ±14.26%   8372.09 ±19.54%     48.28%
Concurrency_Pcx64Readers_04Writers    4520.20 ±5.61%   8471.40 ±21.00%     87.41%
Concurrency_Pcx64Readers_08Writers    4649.90 ±8.66%   6870.99 ±22.40%     47.77%
Concurrency_Pcx64Readers_16Writers   5842.78 ±14.01%   8183.58 ±10.72%     40.06%
----------------------------------  ----------------  ----------------  ---------
Total                                9705.62 ±10.83%   13739.79 ±9.92%     41.57%

/// when the resource becomes available.
/// </summary>
internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
internal const int Sleep1ThresholdForSpinBeforeWait = 40; // should be greater than MaxSpinCountBeforeWait
Member:

What is MaxSpinCountBeforeWait?

Member Author:

Oops renamed that one, will fix

// (_count - YieldThreshold) % 2 == 0: The purpose of this check is to interleave Thread.Yield/Sleep(0) with
// Thread.SpinWait. Otherwise, the following issues occur:
// - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a
// busy -spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may
Member:

Nit: extra space in "busy -spin"

// contention), they may switch between one another, delaying work that can make progress.
if ((
_count >= YieldThreshold &&
(_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0)
Member:

Nit: the formatting here reads strangely to me

Member Author:

It's formatted similarly to:

if (a ||
    b)

where

a ==
    (
        c &&
        d
    )

This is how I typically format multi-line expressions, trying to align parentheses and putting each type of expression (&& or ||) separately, one condition per line unless the whole expression fits on one line. What would you suggest instead? I can separate parts of it into locals if you prefer.

@stephentoub (Member) commented:

Thanks, @kouvel. Do you have any throughput numbers on the thread pool with this change?

@kouvel (Member, Author) commented Aug 30, 2017

The only use of Thread.SpinWait I found in the thread pool is in RegisteredWaitHandleSafe.Unregister, which I don't think is interesting. I have not measured the perf for Task.SpinWait; I can do that if you would like.

@kouvel (Member, Author) commented Aug 30, 2017

Code used for the Spin perf tests:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

internal class Program
{
    private static readonly int ProcessorCount = Environment.ProcessorCount;

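    // Command line: <testName> <threadCount> [<workItemCount>]
    //   <threadCount> is either an absolute thread count with a "T" suffix (e.g. "4T") or a
    //   multiple of the processor count with a "PcT" suffix (e.g. "1PcT").
    //   The burst-work tests additionally take a work item count with a "PcWi" suffix,
    //   also a multiple of the processor count.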
    private static void Main(string[] args)
    {
        int ai = 1;
        int threadCount;
        if (args[ai].EndsWith("PcT"))
        {
            double pcMultiplier;
            if (!double.TryParse(args[ai].Substring(0, args[ai].Length - "PcT".Length), out pcMultiplier))
                return;
            threadCount = Math.Max(1, (int)Math.Round(ProcessorCount * pcMultiplier));
        }
        else if (args[ai].EndsWith("T"))
        {
            if (!int.TryParse(args[ai].Substring(0, args[ai].Length - "T".Length), out threadCount))
                return;
        }
        else
            return;
        ++ai;

        switch (args[0])
        {
            case "MresWaitDrainRate":
                MresWaitDrainRate(threadCount);
                break;
            case "MresWaitLatency":
                MresWaitLatency(threadCount);
                break;
            case "SemaphoreSlimWaitDrainRate":
                SemaphoreSlimWaitDrainRate(threadCount);
                break;
            case "SemaphoreSlimLatency":
                SemaphoreSlimLatency(threadCount);
                break;
            case "SemaphoreSlimThroughput":
                SemaphoreSlimThroughput(threadCount);
                break;
            case "SpinLockLatency":
                SpinLockLatency(threadCount);
                break;
            case "SpinLockThroughput":
                SpinLockThroughput(threadCount);
                break;
            case "ConcurrentBagThroughput":
                ConcurrentBagThroughput(threadCount);
                break;
            case "ConcurrentBagFairness":
                ConcurrentBagFairness(threadCount);
                break;
            case "ConcurrentQueueThroughput":
                ConcurrentQueueThroughput(threadCount);
                break;
            case "ConcurrentQueueFairness":
                ConcurrentQueueFairness(threadCount);
                break;
            case "ConcurrentStackThroughput":
                ConcurrentStackThroughput(threadCount);
                break;
            case "ConcurrentStackFairness":
                ConcurrentStackFairness(threadCount);
                break;
            case "BarrierSyncRate":
                BarrierSyncRate(threadCount);
                break;
            case "CountdownEventSyncRate":
                CountdownEventSyncRate(threadCount);
                break;
            case "ThreadPoolSustainedWorkThroughput":
                ThreadPoolSustainedWorkThroughput(threadCount);
                break;
            case "ThreadPoolBurstWorkThroughput":
                {
                    if (ai >= args.Length || !args[ai].EndsWith("PcWi"))
                        return;
                    double workItemCountPcMultiplier;
                    if (!double.TryParse(args[ai].Substring(0, args[ai].Length - "PcWi".Length), out workItemCountPcMultiplier))
                        return;
                    int maxWorkItemCount = Math.Max(1, (int)Math.Round(ProcessorCount * workItemCountPcMultiplier));

                    ThreadPoolBurstWorkThroughput(threadCount, maxWorkItemCount);
                    break;
                }
            case "TaskSustainedWorkThroughput":
                TaskSustainedWorkThroughput(threadCount);
                break;
            case "TaskBurstWorkThroughput":
                {
                    if (ai >= args.Length || !args[ai].EndsWith("PcWi"))
                        return;
                    double workItemCountPcMultiplier;
                    if (!double.TryParse(args[ai].Substring(0, args[ai].Length - "PcWi".Length), out workItemCountPcMultiplier))
                        return;
                    int maxWorkItemCount = Math.Max(1, (int)Math.Round(ProcessorCount * workItemCountPcMultiplier));

                    TaskBurstWorkThroughput(threadCount, maxWorkItemCount);
                    break;
                }
            case "MonitorEnterExitThroughput_ThinLock":
                MonitorEnterExitThroughput(1, false, false);
                break;
            case "MonitorEnterExitThroughput_AwareLock":
                MonitorEnterExitThroughput(1, false, true);
                break;
            case "MonitorReliableEnterExitThroughput_ThinLock":
                MonitorReliableEnterExitThroughput(1, false, false);
                break;
            case "MonitorReliableEnterExitThroughput_AwareLock":
                MonitorReliableEnterExitThroughput(1, false, true);
                break;
            case "MonitorTryEnterExitWhenUnlockedThroughput_ThinLock":
                MonitorTryEnterExitWhenUnlockedThroughput_ThinLock(1);
                break;
            case "MonitorTryEnterExitWhenUnlockedThroughput_AwareLock":
                MonitorTryEnterExitWhenUnlockedThroughput_AwareLock(1);
                break;
            case "MonitorTryEnterWhenLockedThroughput_ThinLock":
                MonitorTryEnterWhenLockedThroughput_ThinLock(1);
                break;
            case "MonitorTryEnterWhenLockedThroughput_AwareLock":
                MonitorTryEnterWhenLockedThroughput_AwareLock(1);
                break;
            case "MonitorReliableEnterExitLatency":
                MonitorReliableEnterExitLatency(threadCount);
                break;
            case "MonitorEnterExitThroughput":
                MonitorEnterExitThroughput(threadCount, true, false);
                break;
            case "MonitorReliableEnterExitThroughput":
                MonitorReliableEnterExitThroughput(threadCount, true, false);
                break;
            case "MonitorTryEnterExitThroughput":
                MonitorTryEnterExitThroughput(threadCount, true, false);
                break;
            case "MonitorReliableEnterExit1PcTOtherWorkThroughput":
                MonitorReliableEnterExit1PcTOtherWorkThroughput(threadCount);
                break;
            case "MonitorReliableEnterExitRoundRobinThroughput":
                MonitorReliableEnterExitRoundRobinThroughput(threadCount);
                break;
            case "MonitorReliableEnterExitFairness":
                MonitorReliableEnterExitFairness(threadCount);
                break;
            case "BufferMemoryCopyThroughput":
                BufferMemoryCopyThroughput(threadCount);
                break;
        }
    }

    [ThreadStatic]
    private static Random t_rng;

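    // Measures how quickly a full set of threads waiting on a ManualResetEventSlim can be woken
    // (drained): a signal thread sets the event after a short random delay, and the last waiter
    // to wake resets it and records one completed drain cycle.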
    private static void MresWaitDrainRate(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var allWaitersWoken0 = new ManualResetEvent(false);
        var allWaitersWoken1 = new ManualResetEvent(false);
        int waiterWokenCount = 0;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var e = new ManualResetEventSlim(false);

        ThreadStart waitThreadStart = () =>
        {
            var localThreadCount = threadCount;
            var localThreadOperationCounts = threadOperationCounts;
            startTest.WaitOne();
            var allWaitersWoken = allWaitersWoken0;
            while (true)
            {
                e.Wait();
                if (Interlocked.Increment(ref waiterWokenCount) == localThreadCount)
                {
                    ++localThreadOperationCounts[16];
                    waiterWokenCount = 0;
                    e.Reset();
                    (allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0).Reset();
                    allWaitersWoken.Set();
                }
                else
                    allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start();
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            var rng = new Random(0);
            var allWaitersWoken = allWaitersWoken0;
            startTest.WaitOne();
            while (true)
            {
                Delay(RandomShortDelay(rng));
                e.Set();
                allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

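    // Measures wake-up latency: the signal thread pulses the event (Set immediately followed by
    // Reset), and each waiter that gets through the pulse increments its per-thread counter.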
    private static void MresWaitLatency(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var continueWaitThreads = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var e = new ManualResetEventSlim(false);

        ParameterizedThreadStart waitThreadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            startTest.WaitOne();
            while (true)
            {
                e.Wait();
                ++localThreadOperationCounts[threadIndex];
                continueWaitThreads.WaitOne();
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            var rng = new Random(0);
            startTest.WaitOne();
            while (true)
            {
                Delay(RandomShortDelay(rng));
                continueWaitThreads.Reset();
                e.Set();
                e.Reset();
                continueWaitThreads.Set();
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        Run(startTest, threadOperationCounts);
    }

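    // Same drain-rate pattern as MresWaitDrainRate, but the signal thread releases threadCount
    // permits on a SemaphoreSlim instead of setting an event.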
    private static void SemaphoreSlimWaitDrainRate(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var allWaitersWoken0 = new ManualResetEvent(false);
        var allWaitersWoken1 = new ManualResetEvent(false);
        int waiterWokenCount = 0;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var ss = new SemaphoreSlim(0);

        ThreadStart waitThreadStart = () =>
        {
            var localThreadCount = threadCount;
            var localThreadOperationCounts = threadOperationCounts;
            var allWaitersWoken = allWaitersWoken0;
            startTest.WaitOne();
            while (true)
            {
                ss.Wait();
                if (Interlocked.Increment(ref waiterWokenCount) == localThreadCount)
                {
                    ++localThreadOperationCounts[16];
                    waiterWokenCount = 0;
                    (allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0).Reset();
                    allWaitersWoken.Set();
                }
                else
                    allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start();
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            var localThreadCount = threadCount;
            var rng = new Random(0);
            var allWaitersWoken = allWaitersWoken0;
            startTest.WaitOne();
            while (true)
            {
                Delay(RandomShortDelay(rng));
                ss.Release(localThreadCount);
                allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

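    // Uses SemaphoreSlim(1) as a lock. After releasing, each thread waits until a different
    // thread has acquired the lock before continuing, so the score reflects how quickly the
    // lock is handed off between threads.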
    private static void SemaphoreSlimLatency(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        int previousLockThreadId = -1;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var ss = new SemaphoreSlim(1);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(threadIndex);
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                ss.Wait();
                previousLockThreadId = threadId;
                Delay(d0);
                ss.Release();
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
                while (previousLockThreadId == threadId)
                    Delay(4);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

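    // Like SemaphoreSlimLatency, but without waiting for another thread to take the lock in
    // between, so the score reflects raw acquire/release throughput under contention.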
    private static void SemaphoreSlimThroughput(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var ss = new SemaphoreSlim(1);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(threadIndex);
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                ss.Wait();
                Delay(d0);
                ss.Release();
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

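    // Same lock-handoff pattern as SemaphoreSlimLatency, but using SpinLock.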
    private static void SpinLockLatency(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        int previousLockThreadId = -1;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new SpinLock(enableThreadOwnerTracking: false);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(threadIndex);
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                bool lockTaken = false;
                m.Enter(ref lockTaken);
                previousLockThreadId = threadId;
                Delay(d0);
                m.Exit();
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
                while (previousLockThreadId == threadId)
                    Delay(4);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

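    // SpinLock acquire/release throughput with short random delays inside and outside the lock.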
    private static void SpinLockThroughput(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new SpinLock(enableThreadOwnerTracking: false);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(threadIndex);
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                bool lockTaken = false;
                m.Enter(ref lockTaken);
                Delay(d0);
                m.Exit();
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

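    // Add/TryTake throughput on a shared ConcurrentBag with short random delays between operations.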
    private static void ConcurrentBagThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var cb = new ConcurrentBag<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localCb = cb;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                localCb.Add(threadId);
                Delay(d0);
                int item;
                localCb.TryTake(out item);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

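    // Same operations as ConcurrentBagThroughput, but also accumulates how long each Add/TryTake
    // took per thread so that fairness across threads can be scored.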
    private static void ConcurrentBagFairness(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var threadWaitDurationsUs = new double[(threadCount + 1) * 16];
        var cb = new ConcurrentBag<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localThreadWaitDurationsUs = threadWaitDurationsUs;
            var localCb = cb;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);

                var startTicks = Clock.Ticks;
                localCb.Add(threadId);
                var stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d0);

                startTicks = Clock.Ticks;
                int item;
                localCb.TryTake(out item);
                stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        RunFairness(startTest, threadOperationCounts, threadWaitDurationsUs);
    }

    private static void ConcurrentQueueThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var cq = new ConcurrentQueue<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localCq = cq;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                localCq.Enqueue(threadId);
                Delay(d0);
                int item;
                localCq.TryDequeue(out item);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void ConcurrentQueueFairness(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var threadWaitDurationsUs = new double[(threadCount + 1) * 16];
        var cq = new ConcurrentQueue<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localThreadWaitDurationsUs = threadWaitDurationsUs;
            var localCq = cq;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);

                var startTicks = Clock.Ticks;
                localCq.Enqueue(threadId);
                var stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d0);

                startTicks = Clock.Ticks;
                int item;
                localCq.TryDequeue(out item);
                stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        RunFairness(startTest, threadOperationCounts, threadWaitDurationsUs);
    }

    private static void ConcurrentStackThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var cs = new ConcurrentStack<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localCs = cs;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                localCs.Push(threadId);
                Delay(d0);
                int item;
                localCs.TryPop(out item);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void ConcurrentStackFairness(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var threadWaitDurationsUs = new double[(threadCount + 1) * 16];
        var cs = new ConcurrentStack<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localThreadWaitDurationsUs = threadWaitDurationsUs;
            var localCs = cs;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);

                var startTicks = Clock.Ticks;
                localCs.Push(threadId);
                var stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d0);

                startTicks = Clock.Ticks;
                int item;
                localCs.TryPop(out item);
                stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        RunFairness(startTest, threadOperationCounts, threadWaitDurationsUs);
    }

    private static void BarrierSyncRate(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var delayComplete0 = new ManualResetEvent(false);
        var delayComplete1 = new ManualResetEvent(false);
        int syncThreadCount = threadCount;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var b = new Barrier(threadCount);

        var rng = new Random(0);
        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadCount = threadCount;
            var localDelayComplete0 = delayComplete0;
            var localDelayComplete1 = delayComplete1;
            var localThreadOperationCounts = threadOperationCounts;
            var localB = b;
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
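                // All threads rendezvous on the barrier; the two event handshakes below make sure every
                // thread has passed the barrier before the next round begins, and exactly one thread per
                // round records an operation (the measured sync rate) and injects a small random delay.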
                localB.SignalAndWait();
                if (Interlocked.Decrement(ref syncThreadCount) == 0)
                {
                    syncThreadCount = localThreadCount;
                    localDelayComplete1.Reset();
                    localDelayComplete0.Set();
                }
                else
                    localDelayComplete0.WaitOne();
                if (Interlocked.Decrement(ref syncThreadCount) == 0)
                {
                    ++localThreadOperationCounts[16];
                    Delay(RandomShortDelay(rng));
                    syncThreadCount = localThreadCount;
                    localDelayComplete0.Reset();
                    localDelayComplete1.Set();
                }
                else
                    localDelayComplete1.WaitOne();
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void CountdownEventSyncRate(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var delayComplete0 = new ManualResetEvent(false);
        var delayComplete1 = new ManualResetEvent(false);
        int syncThreadCount = threadCount;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var cde = new CountdownEvent(threadCount * 2);

        var rng = new Random(0);
        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadCount = threadCount;
            var localDelayComplete0 = delayComplete0;
            var localDelayComplete1 = delayComplete1;
            var localThreadOperationCounts = threadOperationCounts;
            var localCde = cde;
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                localCde.Signal(2);
                if (Interlocked.Decrement(ref syncThreadCount) == 0)
                {
                    syncThreadCount = localThreadCount;
                    localDelayComplete1.Reset();
                    localDelayComplete0.Set();
                }
                else
                    localDelayComplete0.WaitOne();
                if (Interlocked.Decrement(ref syncThreadCount) == 0)
                {
                    ++localThreadOperationCounts[16];
                    Delay(RandomShortDelay(rng));
                    syncThreadCount = localThreadCount;
                    localCde.Reset(localThreadCount * 2);
                    localDelayComplete0.Reset();
                    localDelayComplete1.Set();
                }
                else
                    localDelayComplete1.WaitOne();
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void ThreadPoolSustainedWorkThroughput(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        ThreadPool.SetMinThreads(threadCount, threadCount);
        ThreadPool.SetMaxThreads(threadCount, threadCount);

        WaitCallback workItemStart = null;
        workItemStart = data =>
        {
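            // Queue a replacement work item before doing the simulated work so the pool's queue never drains.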
            ThreadPool.QueueUserWorkItem(workItemStart);
            var rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random(0);
            Delay(RandomShortDelay(rng));
            Interlocked.Increment(ref threadOperationCounts[16]);
        };

        var producerThread = new Thread(() =>
        {
            var localWorkItemStart = workItemStart;
            startTest.WaitOne();
            int initialWorkItemCount = ProcessorCount + threadCount * 4;
            for (int i = 0; i < initialWorkItemCount; ++i)
                ThreadPool.QueueUserWorkItem(localWorkItemStart);
        });
        producerThread.IsBackground = true;
        producerThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

    private static void ThreadPoolBurstWorkThroughput(int threadCount, int maxWorkItemCount)
    {
        var startTest = new ManualResetEvent(false);
        var workComplete = new AutoResetEvent(false);
        int workItemCountToQueue = 0;
        int workItemCountToComplete = 0;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        ThreadPool.SetMinThreads(threadCount, threadCount);
        ThreadPool.SetMaxThreads(threadCount, threadCount);

        WaitCallback workItemStart = null;
        workItemStart = data =>
        {
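            // Each work item takes two "tickets" from the remaining count and queues up to two more items,
            // so the burst of maxWorkItemCount work items fans out exponentially from the single seed item.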
            int n = Interlocked.Add(ref workItemCountToQueue, -2);
            if (n >= -1)
            {
                var localWorkItemStart = workItemStart;
                ThreadPool.QueueUserWorkItem(localWorkItemStart);
                if (n >= 0)
                    ThreadPool.QueueUserWorkItem(localWorkItemStart);
            }
            var rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random(0);
            Delay(RandomShortDelay(rng));
            Interlocked.Increment(ref threadOperationCounts[16]);
            if (Interlocked.Decrement(ref workItemCountToComplete) == 0)
                workComplete.Set();
        };

        var producerThread = new Thread(() =>
        {
            var localMaxWorkItemCount = maxWorkItemCount;
            var localWorkItemStart = workItemStart;
            startTest.WaitOne();
            while (true)
            {
                workItemCountToQueue = localMaxWorkItemCount - 1;
                workItemCountToComplete = localMaxWorkItemCount;
                ThreadPool.QueueUserWorkItem(localWorkItemStart);
                workComplete.WaitOne();
            }
        });
        producerThread.IsBackground = true;
        producerThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

    private static void TaskSustainedWorkThroughput(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        ThreadPool.SetMinThreads(threadCount, threadCount);
        ThreadPool.SetMaxThreads(threadCount, threadCount);

        Action workItemStart = null;
        workItemStart = () =>
        {
            Task.Run(workItemStart);
            var rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random(0);
            Delay(RandomShortDelay(rng));
            Interlocked.Increment(ref threadOperationCounts[16]);
        };

        Action initialWorkItemStart = () =>
        {
            var localWorkItemStart = workItemStart;
            for (int i = 0; i < 4; ++i)
                Task.Run(localWorkItemStart);
        };

        var producerThread = new Thread(() =>
        {
            var localThreadCount = threadCount;
            var localInitialWorkItemStart = initialWorkItemStart;
            startTest.WaitOne();
            int initialWorkItemCount = ProcessorCount + threadCount;
            for (int i = 0; i < initialWorkItemCount; ++i)
                Task.Run(localInitialWorkItemStart);
        });
        producerThread.IsBackground = true;
        producerThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

    private static void TaskBurstWorkThroughput(int threadCount, int maxWorkItemCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        ThreadPool.SetMinThreads(threadCount, threadCount);
        ThreadPool.SetMaxThreads(threadCount, threadCount);

        Action<object> workItemStart = null;
        workItemStart = async data =>
        {
            Task t0 = null, t1 = null;
            int toQueue = (int)data;
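            // data carries how many more work items this one is responsible for creating; split it roughly
            // in half between two child tasks so the burst fans out as a binary tree.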
            if (toQueue > 1)
            {
                var localWorkItemStart = workItemStart;
                --toQueue;
                t0 = new Task(localWorkItemStart, toQueue - toQueue / 2);
                t0.Start();
                t1 = new Task(localWorkItemStart, toQueue / 2);
                t1.Start();
            }
            else if (toQueue != 0)
            {
                t0 = new Task(workItemStart, 0);
                t0.Start();
            }
            var rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random(0);
            Delay(RandomShortDelay(rng));
            Interlocked.Increment(ref threadOperationCounts[16]);
            if (t0 != null)
            {
                await t0;
                if (t1 != null)
                    await t1;
            }
        };

        var producerThread = new Thread(() =>
        {
            var localMaxWorkItemCount = maxWorkItemCount;
            var localWorkItemStart = workItemStart;
            startTest.WaitOne();
            while (true)
            {
                var t = new Task(localWorkItemStart, localMaxWorkItemCount - 1);
                t.Start();
                t.Wait();
            }
        });
        producerThread.IsBackground = true;
        producerThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

    private static void MonitorReliableEnterExitLatency(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        int previousLockThreadId = -1;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                lock (localM)
                {
                    previousLockThreadId = threadId;
                    Delay(d0);
                }
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
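                // Don't re-acquire until another thread has taken the lock, so each acquisition above
                // involves a handoff between threads rather than the same thread's uncontended fast path.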
                while (previousLockThreadId == threadId)
                    Delay(4);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorReliableEnterExitThroughput(int threadCount, bool delay, bool convertToAwareLock)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        if (convertToAwareLock)
            Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localDelay = delay;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = localDelay ? new Random(threadIndex) : null;
            threadReady.Set();
            if (convertToAwareLock)
            {
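                // Contend on the lock held by the main thread; blocking here inflates the thin lock into
                // an aware lock (syncblock-based) before the measurement starts, hence convertToAwareLock.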
                Monitor.Enter(localM);
                Monitor.Exit(localM);
            }
            startTest.WaitOne();
            if (localDelay)
            {
                while (true)
                {
                    var d0 = RandomShortDelay(rng);
                    var d1 = RandomShortDelay(rng);
                    lock (localM)
                        Delay(d0);
                    ++localThreadOperationCounts[threadIndex];
                    Delay(d1);
                }
            }
            else
            {
                while (true)
                {
                    lock (localM)
                    {
                    }
                    ++localThreadOperationCounts[threadIndex];
                }
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        if (convertToAwareLock)
        {
            Thread.Sleep(50);
            Monitor.Exit(m);
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorEnterExitThroughput(int threadCount, bool delay, bool convertToAwareLock)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        if (convertToAwareLock)
            Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localDelay = delay;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = localDelay ? new Random(threadIndex) : null;
            threadReady.Set();
            if (convertToAwareLock)
            {
                Monitor.Enter(localM);
                Monitor.Exit(localM);
            }
            startTest.WaitOne();
            if (localDelay)
            {
                while (true)
                {
                    var d0 = RandomShortDelay(rng);
                    var d1 = RandomShortDelay(rng);
                    Monitor.Enter(localM);
                    Delay(d0);
                    Monitor.Exit(localM);
                    ++localThreadOperationCounts[threadIndex];
                    Delay(d1);
                }
            }
            else
            {
                while (true)
                {
                    Monitor.Enter(localM);
                    Monitor.Exit(localM);
                    ++localThreadOperationCounts[threadIndex];
                }
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        if (convertToAwareLock)
        {
            Thread.Sleep(50);
            Monitor.Exit(m);
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorTryEnterExitThroughput(int threadCount, bool delay, bool convertToAwareLock)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        if (convertToAwareLock)
            Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localDelay = delay;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = localDelay ? new Random(threadIndex) : null;
            threadReady.Set();
            if (convertToAwareLock)
            {
                Monitor.Enter(localM);
                Monitor.Exit(localM);
            }
            startTest.WaitOne();
            if (localDelay)
            {
                while (true)
                {
                    var d0 = RandomShortDelay(rng);
                    var d1 = RandomShortDelay(rng);
                    if (!Monitor.TryEnter(localM, -1))
                        return;
                    Delay(d0);
                    Monitor.Exit(localM);
                    ++localThreadOperationCounts[threadIndex];
                    Delay(d1);
                }
            }
            else
            {
                while (true)
                {
                    if (!Monitor.TryEnter(localM, -1))
                        return;
                    Monitor.Exit(localM);
                    ++localThreadOperationCounts[threadIndex];
                }
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        if (convertToAwareLock)
        {
            Thread.Sleep(50);
            Monitor.Exit(m);
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorReliableEnterExit1PcTOtherWorkThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var otherWorkThreadOperationCounts = new int[(ProcessorCount + 1) * 16];
        var m = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = new Random((int)data);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                lock (localM)
                    Delay(d0);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        ParameterizedThreadStart otherWorkThreadStart = data =>
        {
            int threadIndex = (int)data;
            var localOtherWorkThreadOperationCounts = otherWorkThreadOperationCounts;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                Delay(RandomShortDelay(rng));
                ++localOtherWorkThreadOperationCounts[threadIndex];
            }
        };
        var otherWorkThreads = new Thread[ProcessorCount];
        for (int i = 0; i < otherWorkThreads.Length; ++i)
        {
            var t = new Thread(otherWorkThreadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            otherWorkThreads[i] = t;
        }

        RunWithOtherWork(startTest, threadOperationCounts, otherWorkThreadOperationCounts);
    }

    private static void MonitorReliableEnterExitRoundRobinThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var mutexes = new object[threadCount];
        for (int i = 0; i < mutexes.Length; ++i)
            mutexes[i] = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localMutexes = mutexes;
            int mutexCount = localMutexes.Length;
            int mutexIndex = (threadIndex / 16 - 1) % mutexCount;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                lock (localMutexes[mutexIndex])
                    Delay(d0);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
                mutexIndex = (mutexIndex + 1) % mutexCount;
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorReliableEnterExitFairness(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var threadWaitDurationsUs = new double[(threadCount + 1) * 16];
        var m = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localThreadWaitDurationsUs = threadWaitDurationsUs;
            var localM = m;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);

                var startTicks = Clock.Ticks;
                long stopTicks;
                lock (localM)
                {
                    stopTicks = Clock.Ticks;
                    Delay(d0);
                }
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        RunFairness(startTest, threadOperationCounts, threadWaitDurationsUs);
    }

    private static void MonitorTryEnterExitWhenUnlockedThroughput_ThinLock(int threadCount)
    {
        threadCount = 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                if (!Monitor.TryEnter(localM))
                    return;
                Monitor.Exit(localM);
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorTryEnterExitWhenUnlockedThroughput_AwareLock(int threadCount)
    {
        threadCount = 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            threadReady.Set();
            Monitor.Enter(localM);
            Monitor.Exit(localM);
            startTest.WaitOne();
            while (true)
            {
                if (!Monitor.TryEnter(localM))
                    return;
                Monitor.Exit(localM);
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Thread.Sleep(50);
        Monitor.Exit(m);

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorTryEnterWhenLockedThroughput_ThinLock(int threadCount)
    {
        threadCount = 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                if (Monitor.TryEnter(localM))
                    return;
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
        Monitor.Exit(m);
    }

    private static void MonitorTryEnterWhenLockedThroughput_AwareLock(int threadCount)
    {
        threadCount = 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            threadReady.Set();
            if (Monitor.TryEnter(localM, 50))
                return;
            startTest.WaitOne();
            while (true)
            {
                if (Monitor.TryEnter(localM))
                    return;
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Thread.Sleep(50);

        Run(startTest, threadOperationCounts);
        Monitor.Exit(m);
    }

    private static unsafe void BufferMemoryCopyThroughput(int maxBytes)
    {
        const int threadCount = 1;
        int minBytes = maxBytes <= 8 ? 1 : maxBytes / 2 + 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(0);
            var src = stackalloc byte[maxBytes];
            var dst = stackalloc byte[maxBytes];
            for (int i = 0; i < maxBytes; ++i)
                src[i] = (byte)rng.Next();
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                Buffer.MemoryCopy(src, dst, maxBytes, rng.Next(minBytes, maxBytes + 1));
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts, iterations: 1);
    }

    private static void Run(
        ManualResetEvent startTest,
        int[] threadOperationCounts,
        bool hasOneResult = false,
        int iterations = 4)
    {
        var sw = new Stopwatch();
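        // Per-thread counters are stored at stride 16 ((i + 1) * 16) in the int arrays, presumably so each
        // counter occupies its own 64-byte cache line and concurrent increments don't false-share.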
        int threadCount = threadOperationCounts.Length / 16 - 1;
        var afterWarmupOperationCounts = new long[threadCount];
        var operationCounts = new long[threadCount];
        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        //while (true)
        for (int j = 0; j < iterations; ++j)
        {
            for (int i = 0; i < threadCount; ++i)
                afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];

            // Measure

            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            for (int i = 0; i < threadCount; ++i)
                operationCounts[i] = threadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < threadCount; ++i)
                operationCounts[i] -= afterWarmupOperationCounts[i];

            double score = operationCounts.Sum() / sw.Elapsed.TotalMilliseconds;
            Console.WriteLine("Score: {0:0.000000}", score);
        }
    }

    private static void RunWithOtherWork(
        ManualResetEvent startTest,
        int[] threadOperationCounts,
        int[] otherWorkThreadOperationCounts,
        int iterations = 4)
    {
        var sw = new Stopwatch();
        int threadCount = threadOperationCounts.Length / 16 - 1;
        int otherWorkThreadCount = otherWorkThreadOperationCounts.Length / 16 - 1;
        var afterWarmupOperationCounts = new long[threadCount];
        var otherWorkAfterWarmupOperationCounts = new long[otherWorkThreadCount];
        var operationCounts = new long[threadCount];
        var otherWorkOperationCounts = new long[otherWorkThreadCount];
        var operationCountSums = new double[2];
        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        //while (true)
        for (int j = 0; j < iterations; ++j)
        {
            for (int i = 0; i < afterWarmupOperationCounts.Length; ++i)
                afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < otherWorkAfterWarmupOperationCounts.Length; ++i)
                otherWorkAfterWarmupOperationCounts[i] = otherWorkThreadOperationCounts[(i + 1) * 16];

            // Measure

            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            for (int i = 0; i < operationCounts.Length; ++i)
                operationCounts[i] = threadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < otherWorkOperationCounts.Length; ++i)
                otherWorkOperationCounts[i] = otherWorkThreadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < operationCounts.Length; ++i)
                operationCounts[i] -= afterWarmupOperationCounts[i];
            for (int i = 0; i < otherWorkOperationCounts.Length; ++i)
                otherWorkOperationCounts[i] -= otherWorkAfterWarmupOperationCounts[i];

            operationCountSums[0] = operationCounts.Sum();
            operationCountSums[1] = otherWorkOperationCounts.Sum();
            double score = operationCountSums.GeometricMean(1, otherWorkThreadCount) / sw.Elapsed.TotalMilliseconds;
            Console.WriteLine("Score: {0:0.000000}", score);
        }
    }

    private static void RunFairness(
        ManualResetEvent startTest,
        int[] threadOperationCounts,
        double[] threadWaitDurationsUs,
        int iterations = 4)
    {
        var sw = new Stopwatch();
        int threadCount = threadWaitDurationsUs.Length / 16 - 1;
        var afterWarmupOperationCounts = new long[threadCount];
        var afterWarmupWaitDurationsUs = new double[threadCount];
        var operationCounts = new long[threadCount];
        var waitDurationsUs = new double[threadCount];
        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        //while (true)
        for (int j = 0; j < iterations; ++j)
        {
            for (int i = 0; i < threadCount; ++i)
                afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < threadCount; ++i)
                afterWarmupWaitDurationsUs[i] = threadWaitDurationsUs[(i + 1) * 16];

            // Measure

            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            for (int i = 0; i < threadCount; ++i)
            {
                int ti = (i + 1) * 16;
                operationCounts[i] = threadOperationCounts[ti];
                waitDurationsUs[i] = threadWaitDurationsUs[ti];
            }
            for (int i = 0; i < threadCount; ++i)
            {
                operationCounts[i] -= afterWarmupOperationCounts[i];
                waitDurationsUs[i] -= afterWarmupWaitDurationsUs[i];
            }

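            // BiasWaitDurationUsAgainstLongWaits squares each wait, so this sqrt of the mean is an RMS-style
            // average: a few long (unfair) waits lower the score far more than many short ones.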
            double averageWaitDurationUs = Math.Sqrt(waitDurationsUs.Sum() / operationCounts.Sum());
            if (averageWaitDurationUs < 1)
                averageWaitDurationUs = 1;
            double score = 100_000 / averageWaitDurationUs;
            Console.WriteLine($"Score: {score:0.000000}");
        }
    }

    private static double BiasWaitDurationUsAgainstLongWaits(double waitDurationUs) =>
        waitDurationUs <= 1 ? 1 : waitDurationUs * waitDurationUs;

    internal static class Clock
    {
        private static readonly long s_swFrequency = Stopwatch.Frequency;
        private static readonly double s_swFrequencyDouble = s_swFrequency;

        public static long Ticks => Stopwatch.GetTimestamp();
        public static double TicksToS(long ticks) => ticks / s_swFrequencyDouble;
        public static double TicksToMs(long ticks) => ticks * 1000 / s_swFrequencyDouble;
        public static double TicksToUs(long ticks) => ticks * (1000 * 1000) / s_swFrequencyDouble;
    }

    private static uint RandomShortDelay(Random rng) => (uint)rng.Next(4, 10);
    private static uint RandomMediumDelay(Random rng) => (uint)rng.Next(10, 15);
    private static uint RandomLongDelay(Random rng) => (uint)rng.Next(15, 20);

    private static int[] s_delayValues = new int[32];

    private static void Delay(uint n)
    {
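        // Burn CPU by computing Fib(n) recursively; storing the result into a shared slot (after a full
        // memory barrier) keeps the computation from being optimized away.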
        Interlocked.MemoryBarrier();
        s_delayValues[16] += (int)Fib(n);
    }

    private static uint Fib(uint n)
    {
        if (n <= 1)
            return n;
        return Fib(n - 2) + Fib(n - 1);
    }
}

@kouvel
Member Author

kouvel commented Aug 30, 2017

Code used for ReaderWriterLockSlim perf:

            var sw = new Stopwatch();
            var scores = new double[16];
            var startThreads = new ManualResetEvent(false);
            bool stop = false;

            var counts = new int[64];
            var readerThreads = new Thread[readerThreadCount];
            ThreadStart readThreadStart =
                () =>
                {
                    startThreads.WaitOne();
                    while (!stop)
                    {
                        rw.EnterReadLock();
                        rw.ExitReadLock();
                        Interlocked.Increment(ref counts[16]);
                    }
                };
            for (int i = 0; i < readerThreadCount; ++i)
            {
                readerThreads[i] = new Thread(readThreadStart);
                readerThreads[i].IsBackground = true;
                readerThreads[i].Start();
            }

            var writeLockAcquireAndReleasedInnerIterationCountTimes = new AutoResetEvent(false);
            var writerThreads = new Thread[writerThreadCount];
            ThreadStart writeThreadStart =
                () =>
                {
                    startThreads.WaitOne();
                    while (!stop)
                    {
                        rw.EnterWriteLock();
                        rw.ExitWriteLock();
                        Interlocked.Increment(ref counts[32]);
                    }
                };
            for (int i = 0; i < writerThreadCount; ++i)
            {
                writerThreads[i] = new Thread(writeThreadStart);
                writerThreads[i].IsBackground = true;
                writerThreads[i].Start();
            }

            startThreads.Set();

            // Warmup

            Thread.Sleep(4000);

            // Actual run
            for(int i = 0; i < scores.Length; ++i)
            {
                counts[16] = 0;
                counts[32] = 0;
                Interlocked.MemoryBarrier();

                sw.Restart();
                Thread.Sleep(500);
                sw.Stop();

                int readCount = counts[16];
                int writeCount = counts[32];

                double elapsedMs = sw.Elapsed.TotalMilliseconds;
                scores[i] =
                    new double[]
                    {
                        Math.Max(1, (readCount + writeCount) / elapsedMs),
                        Math.Max(1, writeCount / elapsedMs)
                    }.GeometricMean(readerThreadCount, writerThreadCount);
            }

            return scores;

@stephentoub
Member

stephentoub commented Aug 30, 2017

The only use of Thread.SpinWait I found in the thread pool is in RegisteredWaitHandleSafe.Unregister, which I don't think is interesting. I have not measured the perf for Task.SpinWait; I can do that if you would like.

ThreadPool's global queue is a ConcurrentQueue, and CQ uses System.Threading.SpinWait when there is contention on various operations, including dequeues.
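
For context, a minimal sketch of the kind of contention loop that uses System.Threading.SpinWait; the shared counter and optimistic update here are hypothetical stand-ins, not ConcurrentQueue's actual code:

```
using System.Threading;

static class SpinWaitOnContentionSketch
{
    private static int s_value; // hypothetical shared state

    // Optimistically update shared state; on contention, back off with SpinWait.SpinOnce, which
    // starts with short busy-waits and escalates to Yield/Sleep as its spin count grows.
    public static void Increment()
    {
        var spinner = new SpinWait();
        while (true)
        {
            int observed = Volatile.Read(ref s_value);
            if (Interlocked.CompareExchange(ref s_value, observed + 1, observed) == observed)
                return;
            spinner.SpinOnce();
        }
    }
}
```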

@kouvel
Member Author

kouvel commented Aug 30, 2017

Ah ok, I included ConcurrentQueue; I'll add a test for the thread pool as well.

@kouvel
Member Author

kouvel commented Aug 30, 2017

Updated code above with the added thread pool throughput test. Looks like there's no change:

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score      Right score     ∆ Score  ∆ Score %
------------------------------------------  --------------  --------------  -------  ---------
ThreadPoolThroughput 1Pc                    7322.26 ±0.65%  7443.96 ±0.73%   121.71      1.66%
ThreadPoolThroughput 2Pc                    7377.70 ±0.63%  7467.42 ±0.82%    89.72      1.22%
ThreadPoolThroughput 4Pc                    7329.01 ±0.75%  7330.87 ±1.00%     1.86      0.03%
------------------------------------------  --------------  --------------  -------  ---------

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score      Right score     ∆ Score  ∆ Score %
------------------------------------------  --------------  --------------  -------  ---------
ThreadPoolThroughput 1Pc                    9434.79 ±0.55%  9484.14 ±0.54%    49.35      0.52%
ThreadPoolThroughput 2Pc                    9384.44 ±0.41%  9376.15 ±0.41%    -8.30     -0.09%
ThreadPoolThroughput 4Pc                    9390.46 ±0.62%  9387.43 ±0.75%    -3.03     -0.03%
------------------------------------------  --------------  --------------  -------  ---------

@kouvel
Member Author

kouvel commented Aug 31, 2017

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel
Member Author

kouvel commented Aug 31, 2017

@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness

@kouvel
Member Author

kouvel commented Aug 31, 2017

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel
Member Author

kouvel commented Aug 31, 2017

@dotnet-bot test Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel kouvel requested a review from tarekgh August 31, 2017 09:54
/// A suggested number of spin iterations before doing a proper wait, such as waiting on an event that becomes signaled
/// when the resource becomes available.
/// </summary>
internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
Member

35

Did we get this number from experimenting with different scenarios? Just curious how we came up with it. And doesn't the number of processors matter?

Member Author

I experimented with ManualResetEventSlim to get an initial number, applied the same number to other similar situations, and then tweaked it up and down to see what was working. Spinning less can lead to early waiting and more context switching; spinning more can decrease latency but may use up some CPU time unnecessarily. It also depends on the situation; for SemaphoreSlim, for instance, I had to double the spin iterations because the waiting there is a lot more expensive. It further depends on the likelihood of the spin being successful and on how long the wait would be, but those are not accounted for here.

Member Author

I don't think including the number of processors (N) works well. Multiplying by N increases spinning on each thread by N, so total spinning across N threads increases by N^2. When more processors are contending on a resource, it may even be better to spin less and wait sooner to reduce contention, since with more processors something like a mutex naturally has more potential for contention.
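
To make the trade-off concrete, a rough sketch of the spin-before-wait pattern that a constant like this feeds into, assuming a ManualResetEventSlim-style setup; the condition delegate, event, and the count used here are illustrative, not the actual library code:

```
using System;
using System.Threading;

static class SpinBeforeWaitSketch
{
    private const int SpinCountBeforeWait = 35; // illustrative stand-in for SpinCountforSpinBeforeWait

    // Spin briefly in case the condition becomes true almost immediately; otherwise fall back to a
    // proper wait so the thread doesn't keep burning CPU or get context-switched out mid-spin.
    public static void WaitFor(Func<bool> condition, ManualResetEventSlim signal)
    {
        var spinner = new SpinWait();
        while (!condition())
        {
            if (spinner.Count >= SpinCountBeforeWait)
            {
                signal.Wait(); // the event is signaled when the resource becomes available
                return;
            }
            spinner.SpinOnce();
        }
    }
}
```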

// usually better for that.
//
int n = RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration;
if (_count <= 30 && (1 << _count) < n)
Member

30

It would be nice to comment on how we chose this number.

Member Author

I'll add a reference to Thread::InitializeYieldProcessorNormalized that describes and calculates it
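
For readers following along, a paraphrased sketch of what the capped check amounts to; the fixed value and method shape are simplified stand-ins for the measured RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration and the real SpinWait code:

```
using System.Threading;

static class CappedExponentialSpinSketch
{
    // Stand-in for RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration, which the runtime measures
    // lazily at run time; a fixed value is used here purely for illustration.
    private static readonly int s_optimalMaxSpinsPerIteration = 8;

    // Grow the per-iteration spin exponentially with the spin count, but cap it so a single
    // iteration never spins far longer than the tuned maximum on fast-YieldProcessor hardware.
    public static void SpinIteration(int count)
    {
        int spins =
            count <= 30 && (1 << count) < s_optimalMaxSpinsPerIteration // "count <= 30" also guards the shift
                ? 1 << count
                : s_optimalMaxSpinsPerIteration;
        Thread.SpinWait(spins);
    }
}
```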

{
get
{
if (s_optimalMaxSpinWaitsPerSpinIteration != 0)
Member

s_optimalMaxSpinWaitsPerSpinIteration

Looks like this one can be converted to a readonly field initialized with GetOptimalMaxSpinWaitsPerSpinIterationInternal() so we can avoid checking for the 0 value.

Member Author

I didn't want to do that since the first call would trigger the measurement, which takes about 10 ms. Static construction of RuntimeThread probably happens during startup for most apps.
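
A minimal sketch of the lazy-initialization trade-off being described; the measurement below is a hypothetical stand-in that merely simulates a ~10 ms cost:

```
using System.Threading;

static class LazilyMeasuredValueSketch
{
    private static int s_measuredValue; // 0 means "not measured yet"; real results are non-zero

    // A readonly field would force the ~10 ms measurement during type initialization; checking for 0
    // on each access defers that cost until the value is actually needed.
    public static int Value
    {
        get
        {
            int value = s_measuredValue;
            return value != 0 ? value : Measure();
        }
    }

    private static int Measure()
    {
        Thread.Sleep(10);       // stand-in for the actual timing loop
        const int measured = 7; // hypothetical result
        Interlocked.CompareExchange(ref s_measuredValue, measured, 0);
        return s_measuredValue;
    }
}
```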

}

return IsCompleted;
return false;
Member

return false;

Is it possible that, between exiting the loop and executing the return, the task gets into a completed state? I am asking to know whether we should keep returning IsCompleted.

Member Author

Functionally it doesn't make any difference; the caller will do the right thing. Previously it made sense to check IsCompleted before returning because the loop would have stopped immediately after a wait, but it was redundant to check IsCompleted first in the loop because it had already been checked immediately before the loop. So I changed the loop to wait first and check later; now the loop exits right after checking IsCompleted, and it would be redundant to check it again before returning.
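
A generic sketch of the two loop shapes being compared; the completion predicate and spin budget are stand-ins, not the actual Task internals:

```
using System;
using System.Threading;

static class SpinLoopShapesSketch
{
    // Old shape: check, then spin. The condition may become true during the final spin,
    // so it has to be re-read before returning.
    public static bool CheckThenSpin(Func<bool> isCompleted, int maxSpins)
    {
        var spinner = new SpinWait();
        while (!isCompleted())
        {
            if (spinner.Count >= maxSpins)
                return isCompleted();
            spinner.SpinOnce();
        }
        return true;
    }

    // New shape: spin, then check. The loop exits right after observing completion,
    // so a final re-read of the condition would be redundant.
    public static bool SpinThenCheck(Func<bool> isCompleted, int maxSpins)
    {
        var spinner = new SpinWait();
        while (spinner.Count < maxSpins)
        {
            spinner.SpinOnce();
            if (isCompleted())
                return true;
        }
        return false;
    }
}
```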

@tarekgh tarekgh left a comment

:shipit:

@kouvel kouvel added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 31, 2017
@kouvel kouvel merged commit 03bf95c into dotnet:master Sep 1, 2017
@kouvel kouvel deleted the SpinFix branch September 1, 2017 20:09
kouvel added a commit to kouvel/coreclr that referenced this pull request Sep 2, 2017
In dotnet#13670, I mistakenly made the spin loop infinite; that is now fixed.

As a result the numbers I had provided in that PR for SemaphoreSlim were skewed, and fixing it caused the throughput to get even lower. To compensate, I have found and fixed one culprit for the low throughput problem:
- Every release wakes up a waiter. Effectively, when there is a thread acquiring and releasing the semaphore, waiters don't get to remain in a wait state.
- Added a field to keep track of how many waiters were pulsed to wake but have not yet woken, and took that into account in Release() to not wake up more waiters than necessary.
- Retuned and increased the number of spin iterations. The total spin delay is still less than before the above PR.
kouvel added a commit that referenced this pull request Sep 5, 2017
kouvel added a commit to kouvel/coreclr that referenced this pull request Sep 12, 2017
Closes https://github.com/dotnet/coreclr/issues/5928

Replaced UnfairSemaphore with a new implementation in CLRLifoSemaphore
- UnfairSemaphore had some benefits:
  - It tracked the number of spinners and avoided waking up waiters as long as the signal count could be satisfied by spinners
  - Since spinners get priority over waiters, that's the main "unfair" part of it that allows hot threads to remain hot and cold threads to remain cold. However, waiters are still released in FIFO order.
  - Spinning helps with throughput when incoming work is bursty
- All of the above benefits were retained in CLRLifoSemaphore and some were improved:
  - Similarly to UnfairSemaphore, the number of spinners is tracked and spinners are given preference to avoid waking up waiters
  - For waiting, on Windows, an I/O completion port is used since it releases waiters in LIFO order. For Unix, added a prioritized wait function to the PAL to register waiters in reverse order for LIFO release behavior. This allows cold waiters to time out more easily since they will be used less frequently.
  - Similarly to SemaphoreSlim, the number of waiters that were signaled to wake but have not yet woken is tracked to help avoid waking up an excessive number of waiters
  - Added some YieldProcessorNormalized() calls to the spin loop. This avoids thrashing on Sleep(0) by adding a delay to the spin loop to allow it to be more effective when there are no threads to switch to, or the only other threads to switch to are other similar spinners.
  - Removed the processor count multiplier on the max spin count and retuned the default max spin count. The processor count multiplier was causing excessive CPU usage on machines with many processors.

Perf results

For the test case in https://github.com/dotnet/coreclr/issues/5928, CPU time spent in UnfairSemaphore::Wait was halved. CPU time % spent in UnfairSemaphore::Wait relative to time spent in WorkerThreadStart reduced from about 88% to 78%.

Updated spin perf code here: dotnet#13670
- NPc = (N * proc count) threads
- MPcWi = (M * proc count) work items
- BurstWorkThroughput queues that many work items in a burst, then releases the thread pool threads to process all of them, and once all are processed, repeats
- SustainedWorkThroughput has work items queue another of itself with some initial number of work items such that the work item count never reaches zero

```
Spin                                          Left score      Right score     ∆ Score %
--------------------------------------------  --------------  --------------  ---------
ThreadPoolBurstWorkThroughput 1Pc 000.25PcWi   276.10 ±1.09%   268.90 ±1.36%     -2.61%
ThreadPoolBurstWorkThroughput 1Pc 000.50PcWi   362.63 ±0.47%   388.82 ±0.33%      7.22%
ThreadPoolBurstWorkThroughput 1Pc 001.00PcWi   498.33 ±0.32%   797.01 ±0.29%     59.94%
ThreadPoolBurstWorkThroughput 1Pc 004.00PcWi  1222.52 ±0.42%  1348.78 ±0.47%     10.33%
ThreadPoolBurstWorkThroughput 1Pc 016.00PcWi  1672.72 ±0.48%  1724.06 ±0.47%      3.07%
ThreadPoolBurstWorkThroughput 1Pc 064.00PcWi  1853.94 ±0.25%  1868.36 ±0.45%      0.78%
ThreadPoolBurstWorkThroughput 1Pc 256.00PcWi  1849.30 ±0.24%  1902.58 ±0.48%      2.88%
ThreadPoolSustainedWorkThroughput 1Pc         1495.62 ±0.78%  1505.89 ±0.20%      0.69%
--------------------------------------------  --------------  --------------  ---------
Total                                          922.22 ±0.51%  1004.59 ±0.51%      8.93%
```

Numbers on Linux were similar with a slightly different spread and no regressions.

I also tried the plaintext benchmark from https://github.com/aspnet/benchmarks on Windows (couldn't get it to build on Linux at the time). No noticeable change to throughput or latency, and the CPU time spent in UnfairSemaphore::Wait decreased a little from ~2% to ~0.5% in CLRLifoSemaphore::Wait.
kouvel added a commit to kouvel/coreclr that referenced this pull request Sep 17, 2017
kouvel added a commit to kouvel/coreclr that referenced this pull request Sep 19, 2017
kouvel added a commit to kouvel/coreclr that referenced this pull request Sep 23, 2017
- Removed asm helpers on Windows and used portable C++ helpers instead
- Rearranged the fast-path code to improve it a bit and match the asm more closely (the general shape of the fast path is sketched below)

Perf:
- The asm helpers are a bit faster. The code generated for the portable helpers is almost the same now; the remaining differences are:
  - There were some layout issues where hot paths were in the wrong place and return paths were not cloned. Instrumenting some of the tests below with PGO on x64 resolved all of the layout issues. I couldn't get PGO instrumentation to work on x86 but I imagine it would be the same there.
  - Register usage
    - x64: All of the Enter functions are using one or two (TryEnter is using two) callee-saved registers for no apparent reason, forcing them to be saved and restored. r10 and r11 seem to be available but they're not being used.
    - x86: Similarly to x64, the compiled functions are pushing and popping 2-3 additional registers in the hottest fast paths.
    - I believe this is the main remaining gap and PGO is not helping with this
- On Linux, perf is at least as good as before for the most part
- Perf tests used for below are updated in PR dotnet#13670

My guess is that these regressions are small and unlikely to show up as real-world regressions. It would simplify the code and ease maintenance a bit to remove the asm, but since it looks like the register allocation issues would not be resolved easily, I'm not sure if we want to remove the asm code at this time. @jkotas and @vancem, thoughts?
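
For context on why the codegen details above matter, here is a C# sketch of the kind of thin-lock fast path being discussed. The actual helpers are C++/asm in the VM and handle much more (contention, recursion, waiters, lock inflation), so the type and field names here are purely illustrative.

```csharp
using System;
using System.Threading;

sealed class ThinLockSketch
{
    private int _lockWord; // 0 = free, otherwise the owning managed thread id

    public bool TryEnter()
    {
        // Hot path: a single compare-exchange on the lock word. At this size,
        // an extra saved/restored callee-saved register or a poorly placed
        // slow path shows up directly in the throughput numbers below.
        int currentThreadId = Environment.CurrentManagedThreadId;
        return Interlocked.CompareExchange(ref _lockWord, currentThreadId, 0) == 0;
    }

    public void Exit()
    {
        // Assumes the caller owns the lock; the real helpers validate
        // ownership and handle contention and inflation on the slow path.
        Volatile.Write(ref _lockWord, 0);
    }
}
```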

Numbers (no PGO):

Windows x64

```
Spin                                              Left score       Right score      ∆ Score %
------------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                          800.56 ±0.33%    821.97 ±0.30%      2.67%
MonitorEnterExitLatency 4T                         1533.25 ±0.34%   1553.82 ±0.13%      1.34%
MonitorEnterExitLatency 7T                         1676.14 ±0.26%   1678.14 ±0.18%      0.12%
MonitorEnterExitThroughput Delay 1T                5174.77 ±0.25%   5125.56 ±0.27%     -0.95%
MonitorEnterExitThroughput Delay 2T                4982.38 ±0.22%   4937.79 ±0.19%     -0.90%
MonitorEnterExitThroughput Delay 4T                4720.41 ±0.37%   4694.09 ±0.24%     -0.56%
MonitorEnterExitThroughput Delay 7T                3741.20 ±0.33%   3778.06 ±0.20%      0.99%
MonitorEnterExitThroughput_AwareLock 1T           63445.04 ±0.20%  61540.28 ±0.23%     -3.00%
MonitorEnterExitThroughput_ThinLock 1T            59720.83 ±0.20%  59754.62 ±0.12%      0.06%
MonitorReliableEnterExitLatency 2T                  809.31 ±0.23%    809.58 ±0.41%      0.03%
MonitorReliableEnterExitLatency 4T                 1569.47 ±0.45%   1577.43 ±0.71%      0.51%
MonitorReliableEnterExitLatency 7T                 1681.65 ±0.25%   1678.01 ±0.20%     -0.22%
MonitorReliableEnterExitThroughput Delay 1T        4956.40 ±0.41%   4957.46 ±0.24%      0.02%
MonitorReliableEnterExitThroughput Delay 2T        4794.52 ±0.18%   4756.23 ±0.25%     -0.80%
MonitorReliableEnterExitThroughput Delay 4T        4560.22 ±0.25%   4522.03 ±0.35%     -0.84%
MonitorReliableEnterExitThroughput Delay 7T        3902.19 ±0.55%   3875.81 ±0.13%     -0.68%
MonitorReliableEnterExitThroughput_AwareLock 1T   61944.11 ±0.20%  58083.95 ±0.08%     -6.23%
MonitorReliableEnterExitThroughput_ThinLock 1T    59632.31 ±0.25%  58972.48 ±0.07%     -1.11%
MonitorTryEnterExitThroughput_AwareLock 1T        62345.13 ±0.14%  57159.99 ±0.14%     -8.32%
MonitorTryEnterExitThroughput_ThinLock 1T         59725.76 ±0.15%  58050.35 ±0.16%     -2.81%
------------------------------------------------  ---------------  ---------------  ---------
Total                                              6795.49 ±0.28%   6723.21 ±0.23%     -1.06%
```

Windows x86

```
Spin                                              Left score       Right score      ∆ Score %
------------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                          958.97 ±0.37%    987.28 ±0.32%      2.95%
MonitorEnterExitLatency 4T                         1675.18 ±0.41%   1704.64 ±0.08%      1.76%
MonitorEnterExitLatency 7T                         1825.49 ±0.09%   1769.50 ±0.12%     -3.07%
MonitorEnterExitThroughput Delay 1T                5083.01 ±0.27%   5047.10 ±0.37%     -0.71%
MonitorEnterExitThroughput Delay 2T                4854.54 ±0.13%   4825.31 ±0.14%     -0.60%
MonitorEnterExitThroughput Delay 4T                4628.89 ±0.17%   4579.92 ±0.56%     -1.06%
MonitorEnterExitThroughput Delay 7T                4125.52 ±0.48%   4096.78 ±0.20%     -0.70%
MonitorEnterExitThroughput_AwareLock 1T           61841.28 ±0.13%  57429.31 ±0.44%     -7.13%
MonitorEnterExitThroughput_ThinLock 1T            59746.69 ±0.19%  57971.43 ±0.10%     -2.97%
MonitorReliableEnterExitLatency 2T                  983.26 ±0.22%    998.25 ±0.33%      1.52%
MonitorReliableEnterExitLatency 4T                 1758.10 ±0.14%   1723.63 ±0.19%     -1.96%
MonitorReliableEnterExitLatency 7T                 1832.24 ±0.08%   1776.61 ±0.10%     -3.04%
MonitorReliableEnterExitThroughput Delay 1T        5023.19 ±0.05%   4980.49 ±0.08%     -0.85%
MonitorReliableEnterExitThroughput Delay 2T        4846.04 ±0.03%   4792.58 ±0.11%     -1.10%
MonitorReliableEnterExitThroughput Delay 4T        4608.14 ±0.09%   4574.90 ±0.06%     -0.72%
MonitorReliableEnterExitThroughput Delay 7T        4123.20 ±0.10%   4075.92 ±0.11%     -1.15%
MonitorReliableEnterExitThroughput_AwareLock 1T   57951.11 ±0.11%  57006.12 ±0.21%     -1.63%
MonitorReliableEnterExitThroughput_ThinLock 1T    58006.06 ±0.18%  58018.28 ±0.07%      0.02%
MonitorTryEnterExitThroughput_AwareLock 1T        60701.63 ±0.04%  53374.77 ±0.15%    -12.07%
MonitorTryEnterExitThroughput_ThinLock 1T         58169.82 ±0.05%  56023.58 ±0.69%     -3.69%
------------------------------------------------  ---------------  ---------------  ---------
Total                                              7037.46 ±0.17%   6906.42 ±0.22%     -1.86%
```

Linux x64

```
Spin repeater                                    Left score       Right score      ∆ Score %
-----------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                        3755.92 ±1.51%   3802.80 ±0.62%      1.25%
MonitorEnterExitLatency 4T                        3448.14 ±1.69%   3493.84 ±1.58%      1.33%
MonitorEnterExitLatency 7T                        2593.97 ±0.13%   2655.21 ±0.15%      2.36%
MonitorEnterExitThroughput Delay 1T               4854.52 ±0.12%   4873.08 ±0.11%      0.38%
MonitorEnterExitThroughput Delay 2T               4659.19 ±0.85%   4695.61 ±0.38%      0.78%
MonitorEnterExitThroughput Delay 4T               4163.01 ±1.46%   4190.94 ±1.37%      0.67%
MonitorEnterExitThroughput Delay 7T               3012.69 ±0.45%   3123.75 ±0.32%      3.69%
MonitorEnterExitThroughput_AwareLock 1T          56665.09 ±0.16%  58524.86 ±0.24%      3.28%
MonitorEnterExitThroughput_ThinLock 1T           57476.36 ±0.68%  57573.08 ±0.61%      0.17%
MonitorReliableEnterExitLatency 2T                3952.35 ±0.45%   3937.80 ±0.49%     -0.37%
MonitorReliableEnterExitLatency 4T                3001.75 ±1.02%   3008.55 ±0.76%      0.23%
MonitorReliableEnterExitLatency 7T                2456.20 ±0.65%   2479.78 ±0.09%      0.96%
MonitorReliableEnterExitThroughput Delay 1T       4907.10 ±0.85%   4940.83 ±0.23%      0.69%
MonitorReliableEnterExitThroughput Delay 2T       4750.81 ±0.62%   4725.81 ±0.87%     -0.53%
MonitorReliableEnterExitThroughput Delay 4T       4329.93 ±1.18%   4360.67 ±1.04%      0.71%
MonitorReliableEnterExitThroughput Delay 7T       3180.52 ±0.27%   3255.88 ±0.51%      2.37%
MonitorReliableEnterExitThroughput_AwareLock 1T  54559.89 ±0.09%  55785.74 ±0.20%      2.25%
MonitorReliableEnterExitThroughput_ThinLock 1T   55936.06 ±0.36%  55519.74 ±0.80%     -0.74%
MonitorTryEnterExitThroughput_AwareLock 1T       52694.96 ±0.18%  54282.77 ±0.12%      3.01%
MonitorTryEnterExitThroughput_ThinLock 1T        54942.18 ±0.24%  55031.84 ±0.38%      0.16%
-----------------------------------------------  ---------------  ---------------  ---------
Total                                             8326.45 ±0.65%   8420.07 ±0.54%      1.12%
```
kouvel added a commit that referenced this pull request Sep 26, 2017
@kouvel kouvel mentioned this pull request Sep 27, 2017
MichalStrehovsky added a commit to MichalStrehovsky/corert that referenced this pull request Apr 13, 2019