ConcurrentQueue spending excess time in SpinWait #44077

Closed
alexcovington opened this issue Oct 30, 2020 · 17 comments · Fixed by #44265

@alexcovington
Contributor

alexcovington commented Oct 30, 2020

Description

I've noticed that when a ConcurrentQueue instance has many enqueuers/dequeuers, a lot of extra time is spent in SpinWait.SpinOnce. This seems to be because the SpinWait.SpinOnce call passes the optional parameter sleep1Threshold: -1, which disables the Thread.Sleep call that a thread would otherwise eventually make after spinning too long.
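To make the role of the threshold concrete, here is a toy Python model of a spin-then-sleep backoff. It is a simplified sketch, not the runtime's SpinWait implementation: the function names are hypothetical, `yield_threshold=10` is an assumed illustrative value, and the real code has more phases. It only shows that a negative threshold means a waiter never reaches the sleep step, while a non-negative one caps how long it keeps spinning and yielding.

```python
def spin_once_action(count, sleep1_threshold, yield_threshold=10):
    """Return what a waiter does on its count-th wait (0-based) in this toy model.

    sleep1_threshold < 0 disables sleeping entirely, mirroring the -1 case
    described above. yield_threshold=10 is an assumed, illustrative value.
    """
    if sleep1_threshold >= 0 and count >= sleep1_threshold:
        return "sleep1"  # give up the CPU for roughly a millisecond
    if count >= yield_threshold:
        return "yield"   # let another ready thread run, then retry
    return "spin"        # busy-wait on the CPU


def summarize(iterations, sleep1_threshold):
    """Count how many of the first `iterations` waits fall into each action."""
    counts = {"spin": 0, "yield": 0, "sleep1": 0}
    for i in range(iterations):
        counts[spin_once_action(i, sleep1_threshold)] += 1
    return counts


print(summarize(100, -1))  # {'spin': 10, 'yield': 90, 'sleep1': 0}
print(summarize(100, 20))  # {'spin': 10, 'yield': 10, 'sleep1': 80}
```

In this model, a waiter with threshold -1 stays on the CPU (spinning or yielding) for all 100 waits, whereas a waiter with threshold 20 sleeps for 80 of them.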

When I change the parameter from sleep1Threshold: -1 to sleep1Threshold: Thread.OptimalMaxSpinWaitsPerSpinIteration, I see a significant increase in performance on some of my machines in certain microbenchmark cases. I ran the benchmarks against local builds from the release/5.0-rc2 branch, with and without the change to the sleeping behavior. The microbenchmarks I ran against are from the dotnet/performance repository, and can be reproduced with:

sudo python3 ./script/benchmarks_ci.py -c Release -f netcoreapp5.0 --filter '*ConcurrentQueue*' --corerun $CORERUN_PATH --bdn-artifacts $BDN_ARTIFACTS_DIR

I've also included BenchmarkDotNet results from the dotnet/performance Microbenchmarks to show the effect of the change below.

Configuration

Each machine is an x64 machine running Ubuntu 20.04. More information in the BenchmarkDotNet results.

Regression?

It looks like this was changed to improve performance back in .NET Core 3.0, based on this merge.

Data

Skylake

Base (SpinOnce(sleep1Threshold: -1))
------------------------------------

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-alpha.1.20528.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-EQUTZA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-ISAALG : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
InvocationCount=1  IterationCount=100  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  

|                     Namespace |                                  Type |          Method |        Job | MaxWarmupIterationCount | MinWarmupIterationCount | UnrollFactor | WarmupCount | Count |    Size |              Mean |            Error |            StdDev |            Median |               Min |               Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------------ |-------------------------------------- |---------------- |----------- |------------------------ |------------------------ |------------- |------------ |------ |-------- |------------------:|-----------------:|------------------:|------------------:|------------------:|------------------:|-------:|-------:|------:|----------:|
|            System.Collections |                CtorDefaultSize<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |       ? |          72.31 ns |         0.056 ns |          0.164 ns |          72.24 ns |          72.08 ns |          72.75 ns | 0.1376 |      - |     - |     576 B |
|            System.Collections |               CtorDefaultSize<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |       ? |          85.67 ns |         0.249 ns |          0.723 ns |          86.14 ns |          84.69 ns |          86.65 ns | 0.1988 |      - |     - |     832 B |
|      System.Collections.Tests |         Add_Remove_SteadyState<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |   512 |       ? |          20.71 ns |         0.001 ns |          0.002 ns |          20.71 ns |          20.71 ns |          20.72 ns |      - |      - |     - |         - |
|      System.Collections.Tests |        Add_Remove_SteadyState<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |   512 |       ? |          21.20 ns |         0.001 ns |          0.003 ns |          21.20 ns |          21.19 ns |          21.21 ns |      - |      - |     - |         - |
|            System.Collections |             CtorFromCollection<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       7,447.65 ns |         1.621 ns |          4.755 ns |       7,448.25 ns |       7,437.28 ns |       7,460.69 ns | 1.0432 |      - |     - |    4448 B |
|            System.Collections |            CtorFromCollection<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       8,138.52 ns |         1.695 ns |          4.809 ns |       8,139.33 ns |       8,124.25 ns |       8,148.08 ns | 2.0214 | 0.0978 |     - |    8544 B |
|            System.Collections |                 IterateForEach<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       4,358.00 ns |         4.003 ns |         11.290 ns |       4,358.73 ns |       4,331.23 ns |       4,383.26 ns |      - |      - |     - |      72 B |
|            System.Collections |                IterateForEach<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,604.88 ns |         2.521 ns |          7.272 ns |       5,603.75 ns |       5,586.93 ns |       5,625.49 ns |      - |      - |     - |      72 B |
|            System.Collections |              CreateAddAndClear<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       7,811.29 ns |         1.145 ns |          3.190 ns |       7,811.23 ns |       7,799.31 ns |       7,819.26 ns | 2.3137 |      - |     - |    9792 B |
|            System.Collections |             CreateAddAndClear<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       8,553.62 ns |         2.325 ns |          6.166 ns |       8,555.10 ns |       8,537.96 ns |       8,569.89 ns | 4.2855 |      - |     - |   17984 B |
| System.Collections.Concurrent |  AddRemoveFromDifferentThreads<Int32> | ConcurrentQueue | Job-ISAALG |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  31,107,778.63 ns | 2,041,343.278 ns |  5,690,449.134 ns |  28,001,835.00 ns |  26,040,926.50 ns |  49,191,892.50 ns |      - |      - |     - |    9168 B |
| System.Collections.Concurrent | AddRemoveFromDifferentThreads<String> | ConcurrentQueue | Job-ISAALG |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  30,865,184.17 ns | 1,633,527.591 ns |  4,444,132.284 ns |  28,404,088.50 ns |  27,454,580.00 ns |  44,963,492.00 ns |      - |      - |     - |  526032 B |
| System.Collections.Concurrent |       AddRemoveFromSameThreads<Int32> | ConcurrentQueue | Job-ISAALG |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 187,964,095.59 ns | 4,103,621.184 ns | 11,970,455.607 ns | 189,390,999.50 ns | 144,958,019.50 ns | 212,779,102.50 ns |      - |      - |     - |    1192 B |
| System.Collections.Concurrent |      AddRemoveFromSameThreads<String> | ConcurrentQueue | Job-ISAALG |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 180,518,506.05 ns | 4,279,085.954 ns | 12,616,981.480 ns | 181,038,632.00 ns | 141,594,659.00 ns | 205,936,025.00 ns |      - |      - |     - |    4008 B |


Diff (SpinOnce(sleep1Threshold: Thread.OptimalMaxSpinWaitsPerSpinIteration))
----------------------------------------------------------------------------

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-alpha.1.20528.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-MIAVZY : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-SIKWHO : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
InvocationCount=1  IterationCount=100  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  

|                     Namespace |                                  Type |          Method |        Job | MaxWarmupIterationCount | MinWarmupIterationCount | UnrollFactor | WarmupCount | Count |    Size |             Mean |            Error |           StdDev |           Median |              Min |              Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------------ |-------------------------------------- |---------------- |----------- |------------------------ |------------------------ |------------- |------------ |------ |-------- |-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|-------:|-------:|------:|----------:|
|            System.Collections |                CtorDefaultSize<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |       ? |         87.06 ns |         0.049 ns |         0.143 ns |         87.12 ns |         86.76 ns |         87.38 ns | 0.1395 |      - |     - |     584 B |
|            System.Collections |               CtorDefaultSize<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |       ? |        101.25 ns |         0.017 ns |         0.043 ns |        101.25 ns |        101.16 ns |        101.42 ns | 0.2007 |      - |     - |     840 B |
|      System.Collections.Tests |         Add_Remove_SteadyState<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |   512 |       ? |         21.28 ns |         0.001 ns |         0.004 ns |         21.28 ns |         21.27 ns |         21.29 ns |      - |      - |     - |         - |
|      System.Collections.Tests |        Add_Remove_SteadyState<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |   512 |       ? |         21.27 ns |         0.001 ns |         0.002 ns |         21.27 ns |         21.26 ns |         21.27 ns |      - |      - |     - |         - |
|            System.Collections |             CtorFromCollection<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      7,717.92 ns |         2.011 ns |         5.834 ns |      7,717.93 ns |      7,704.51 ns |      7,730.39 ns | 1.0483 |      - |     - |    4456 B |
|            System.Collections |            CtorFromCollection<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      8,386.39 ns |         1.475 ns |         4.185 ns |      8,386.13 ns |      8,378.25 ns |      8,397.01 ns | 2.0161 | 0.1008 |     - |    8552 B |
|            System.Collections |                 IterateForEach<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      4,867.26 ns |         1.113 ns |         3.229 ns |      4,867.12 ns |      4,862.70 ns |      4,876.69 ns |      - |      - |     - |      72 B |
|            System.Collections |                IterateForEach<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      5,658.29 ns |         2.533 ns |         7.469 ns |      5,656.20 ns |      5,647.76 ns |      5,677.85 ns |      - |      - |     - |      72 B |
|            System.Collections |              CreateAddAndClear<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      8,204.22 ns |         1.247 ns |         3.578 ns |      8,204.07 ns |      8,195.18 ns |      8,212.79 ns | 2.3306 |      - |     - |    9840 B |
|            System.Collections |             CreateAddAndClear<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      8,603.39 ns |         2.441 ns |         6.641 ns |      8,601.63 ns |      8,596.32 ns |      8,626.47 ns | 4.3044 |      - |     - |   18032 B |
| System.Collections.Concurrent |  AddRemoveFromDifferentThreads<Int32> | ConcurrentQueue | Job-SIKWHO |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 31,377,565.45 ns | 1,724,624.611 ns | 4,920,451.059 ns | 29,505,311.50 ns | 26,032,254.00 ns | 44,129,165.00 ns |      - |      - |     - |  526880 B |
| System.Collections.Concurrent | AddRemoveFromDifferentThreads<String> | ConcurrentQueue | Job-SIKWHO |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 31,061,566.93 ns | 1,654,339.740 ns | 4,638,945.997 ns | 28,762,799.00 ns | 27,510,170.00 ns | 45,084,255.00 ns |      - |      - |     - | 1050656 B |
| System.Collections.Concurrent |       AddRemoveFromSameThreads<Int32> | ConcurrentQueue | Job-SIKWHO |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 83,496,964.09 ns |   415,818.700 ns | 1,166,000.215 ns | 83,076,257.00 ns | 81,850,532.00 ns | 86,766,334.00 ns |      - |      - |     - |     424 B |
| System.Collections.Concurrent |      AddRemoveFromSameThreads<String> | ConcurrentQueue | Job-SIKWHO |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 88,636,063.77 ns |   350,925.444 ns | 1,006,870.882 ns | 88,184,104.00 ns | 87,145,891.00 ns | 91,002,926.00 ns |      - |      - |     - |     424 B |


Comparison
----------
summary:
better: 2, geomean: 2.148
worse: 3, geomean: 1.039
total diff: 5

| Slower                                                                   | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------------------ | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.CreateAddAndClear<Int32>.ConcurrentQueue(Size: 512)   |      1.05 |          7804.02 |          8202.04 |         |
| System.Collections.CtorFromCollection<Int32>.ConcurrentQueue(Size: 512)  |      1.04 |          7442.15 |          7713.63 |         |
| System.Collections.CtorFromCollection<String>.ConcurrentQueue(Size: 512) |      1.03 |          8136.48 |          8383.06 |         |

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.Concurrent.AddRemoveFromSameThreads<Int32>.ConcurrentQueue(Si |      2.24 |     187936495.50 |      83897866.50 |         |
| System.Collections.Concurrent.AddRemoveFromSameThreads<String>.ConcurrentQueue(S |      2.06 |     182276745.00 |      88472010.00 |         |
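The geomean figures in the summary can be reproduced from the median columns of the two tables above. The following Python sketch assumes the summary is a plain geometric mean of the per-benchmark median ratios (base/diff for faster results, diff/base for slower ones, as the column headers suggest); the helper name `geomean` is my own.

```python
import math


def geomean(ratios):
    """Geometric mean of a non-empty list of positive ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# (base median ns, diff median ns) copied from the Skylake tables above.
faster = [(187936495.50, 83897866.50), (182276745.00, 88472010.00)]
slower = [(7804.02, 8202.04), (7442.15, 7713.63), (8136.48, 8383.06)]

better = geomean([base / diff for base, diff in faster])
worse = geomean([diff / base for base, diff in slower])

print(f"better: {len(faster)}, geomean: {better:.3f}")  # better: 2, geomean: 2.148
print(f"worse: {len(slower)}, geomean: {worse:.3f}")    # worse: 3, geomean: 1.039
```

Both values agree with the summary, which puts the roughly 2x gain on the contended benchmarks next to a 3 to 5 percent slowdown on the three single-threaded ones.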

Ryzen

Base (SpinOnce(sleep1Threshold: -1))
------------------------------------

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=6.0.100-alpha.1.20528.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-YGFIWA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-JNBMSA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
InvocationCount=1  IterationCount=100  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  

|                     Namespace |                                  Type |          Method |        Job | MaxWarmupIterationCount | MinWarmupIterationCount | UnrollFactor | WarmupCount | Count |    Size |              Mean |             Error |             StdDev |            Median |               Min |               Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------------ |-------------------------------------- |---------------- |----------- |------------------------ |------------------------ |------------- |------------ |------ |-------- |------------------:|------------------:|-------------------:|------------------:|------------------:|------------------:|-------:|-------:|------:|----------:|
|            System.Collections |                CtorDefaultSize<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |       ? |          85.04 ns |          0.543 ns |           1.577 ns |          85.18 ns |          82.04 ns |          88.88 ns | 0.0342 |      - |     - |     576 B |
|            System.Collections |               CtorDefaultSize<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |       ? |          95.75 ns |          0.059 ns |           0.165 ns |          95.74 ns |          95.22 ns |          96.24 ns | 0.0496 |      - |     - |     832 B |
|      System.Collections.Tests |         Add_Remove_SteadyState<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |   512 |       ? |          12.31 ns |          0.003 ns |           0.007 ns |          12.31 ns |          12.30 ns |          12.33 ns |      - |      - |     - |         - |
|      System.Collections.Tests |        Add_Remove_SteadyState<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |   512 |       ? |          13.45 ns |          0.079 ns |           0.234 ns |          13.55 ns |          12.95 ns |          13.87 ns |      - |      - |     - |         - |
|            System.Collections |             CtorFromCollection<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,151.24 ns |          1.689 ns |           4.764 ns |       5,149.44 ns |       5,144.51 ns |       5,165.32 ns | 0.2479 |      - |     - |    4448 B |
|            System.Collections |            CtorFromCollection<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,992.60 ns |          6.465 ns |          18.340 ns |       5,986.52 ns |       5,972.87 ns |       6,039.16 ns | 0.5035 | 0.0240 |     - |    8544 B |
|            System.Collections |                 IterateForEach<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       4,392.32 ns |          0.578 ns |           1.611 ns |       4,392.13 ns |       4,389.04 ns |       4,397.64 ns |      - |      - |     - |      72 B |
|            System.Collections |                IterateForEach<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,370.14 ns |          0.403 ns |           1.163 ns |       5,369.99 ns |       5,367.94 ns |       5,373.05 ns |      - |      - |     - |      72 B |
|            System.Collections |              CreateAddAndClear<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,042.72 ns |          2.979 ns |           8.401 ns |       5,040.23 ns |       5,028.93 ns |       5,066.91 ns | 0.5662 |      - |     - |    9792 B |
|            System.Collections |             CreateAddAndClear<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,811.73 ns |          4.256 ns |          12.142 ns |       5,808.91 ns |       5,789.65 ns |       5,846.74 ns | 1.0692 | 0.0697 |     - |   17984 B |
| System.Collections.Concurrent |  AddRemoveFromDifferentThreads<Int32> | ConcurrentQueue | Job-JNBMSA |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  19,289,620.33 ns |  1,373,898.353 ns |   3,807,065.866 ns |  19,122,269.00 ns |  14,069,073.00 ns |  30,219,178.00 ns |      - |      - |     - | 2100176 B |
| System.Collections.Concurrent | AddRemoveFromDifferentThreads<String> | ConcurrentQueue | Job-JNBMSA |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  23,454,660.90 ns |    781,060.146 ns |   2,071,260.706 ns |  23,092,573.50 ns |  18,199,132.00 ns |  30,285,417.00 ns |      - |      - |     - |   33776 B |
| System.Collections.Concurrent |       AddRemoveFromSameThreads<Int32> | ConcurrentQueue | Job-JNBMSA |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 567,072,512.67 ns | 64,097,635.471 ns | 188,993,324.379 ns | 659,905,224.50 ns |  63,575,588.00 ns | 755,952,744.00 ns |      - |      - |     - |    9128 B |
| System.Collections.Concurrent |      AddRemoveFromSameThreads<String> | ConcurrentQueue | Job-JNBMSA |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 595,239,541.73 ns | 54,895,214.567 ns | 161,859,778.713 ns | 685,948,586.00 ns | 134,833,576.50 ns | 783,141,323.50 ns |      - |      - |     - |   16808 B |


Diff (SpinOnce(sleep1Threshold: Thread.OptimalMaxSpinWaitsPerSpinIteration))
----------------------------------------------------------------------------

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=6.0.100-alpha.1.20528.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-AHLXGP : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-OCAWOY : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
InvocationCount=1  IterationCount=100  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  

|                     Namespace |                                  Type |          Method |        Job | MaxWarmupIterationCount | MinWarmupIterationCount | UnrollFactor | WarmupCount | Count |    Size |              Mean |             Error |            StdDev |            Median |              Min |               Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------------ |-------------------------------------- |---------------- |----------- |------------------------ |------------------------ |------------- |------------ |------ |-------- |------------------:|------------------:|------------------:|------------------:|-----------------:|------------------:|-------:|-------:|------:|----------:|
|            System.Collections |                CtorDefaultSize<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |       ? |          81.89 ns |          0.501 ns |          1.445 ns |          81.75 ns |         78.85 ns |          85.34 ns | 0.0347 |      - |     - |     584 B |
|            System.Collections |               CtorDefaultSize<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |       ? |          98.97 ns |          0.164 ns |          0.470 ns |          98.98 ns |         98.20 ns |         100.32 ns | 0.0500 |      - |     - |     840 B |
|      System.Collections.Tests |         Add_Remove_SteadyState<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |   512 |       ? |          12.36 ns |          0.011 ns |          0.032 ns |          12.35 ns |         12.32 ns |          12.42 ns |      - |      - |     - |         - |
|      System.Collections.Tests |        Add_Remove_SteadyState<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |   512 |       ? |          13.69 ns |          0.035 ns |          0.102 ns |          13.69 ns |         13.38 ns |          13.95 ns |      - |      - |     - |         - |
|            System.Collections |             CtorFromCollection<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,160.01 ns |          1.392 ns |          3.764 ns |       5,159.72 ns |      5,149.85 ns |       5,169.32 ns | 0.2483 |      - |     - |    4456 B |
|            System.Collections |            CtorFromCollection<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       6,018.71 ns |         10.216 ns |         29.638 ns |       6,007.39 ns |      5,985.17 ns |       6,104.02 ns | 0.5048 | 0.0240 |     - |    8552 B |
|            System.Collections |                 IterateForEach<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       4,390.67 ns |          0.744 ns |          2.097 ns |       4,389.89 ns |      4,387.79 ns |       4,397.29 ns |      - |      - |     - |      72 B |
|            System.Collections |                IterateForEach<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,369.40 ns |          0.349 ns |          0.979 ns |       5,369.37 ns |      5,367.52 ns |       5,372.03 ns |      - |      - |     - |      72 B |
|            System.Collections |              CreateAddAndClear<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,069.14 ns |          2.806 ns |          7.869 ns |       5,069.77 ns |      5,053.56 ns |       5,086.62 ns | 0.5684 |      - |     - |    9840 B |
|            System.Collections |             CreateAddAndClear<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,833.39 ns |          4.384 ns |         12.508 ns |       5,833.01 ns |      5,786.71 ns |       5,865.83 ns | 1.0740 | 0.0700 |     - |   18032 B |
| System.Collections.Concurrent |  AddRemoveFromDifferentThreads<Int32> | ConcurrentQueue | Job-OCAWOY |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  18,958,972.16 ns |  1,357,806.277 ns |  3,851,870.408 ns |  18,124,957.50 ns | 14,039,125.50 ns |  30,775,269.50 ns |      - |      - |     - |  527168 B |
| System.Collections.Concurrent | AddRemoveFromDifferentThreads<String> | ConcurrentQueue | Job-OCAWOY |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  22,155,658.25 ns |    955,012.487 ns |  2,646,335.104 ns |  22,400,435.00 ns | 14,281,531.00 ns |  28,474,005.00 ns |      - |      - |     - |   33528 B |
| System.Collections.Concurrent |       AddRemoveFromSameThreads<Int32> | ConcurrentQueue | Job-OCAWOY |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 109,724,591.48 ns | 10,191,909.446 ns | 29,568,577.342 ns | 108,815,782.50 ns | 51,159,775.50 ns | 180,713,226.50 ns |      - |      - |     - |    2488 B |
| System.Collections.Concurrent |      AddRemoveFromSameThreads<String> | ConcurrentQueue | Job-OCAWOY |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  97,721,116.74 ns |  8,854,378.736 ns | 25,404,872.403 ns |  98,165,676.50 ns | 52,096,356.50 ns | 150,054,499.50 ns |      - |      - |     - |    8384 B |


Comparison
-------------

summary:
better: 3, geomean: 3.522
total diff: 3

No Slower results for the provided threshold = 1% and noise filter = 50ns.

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.Concurrent.AddRemoveFromSameThreads<String>.ConcurrentQueue(S |      6.99 |     685948586.00 |      98165676.50 |         |
| System.Collections.Concurrent.AddRemoveFromSameThreads<Int32>.ConcurrentQueue(Si |      6.06 |     659905224.50 |     108815782.50 |         |
| System.Collections.Concurrent.AddRemoveFromDifferentThreads<String>.ConcurrentQu |      1.03 |      23092573.50 |      22400435.00 |         |

Analysis

Here is the change I made on my fork. This branch is based off master, so BenchmarkDotNet may complain about versioning if you just clone this branch alone. I can push a branch off of release/5.0-rc2 if that would be convenient.

Basically, changing the threshold value used in ConcurrentQueueSegment from -1 to a value that allows threads to sleep seems to reduce the time threads spend spin-waiting. I tried a few values and found that Thread.OptimalMaxSpinWaitsPerSpinIteration gave the best result, but this was trial and error, so it may not be optimal. Removing the parameter entirely, so that SpinWait.SpinOnce() uses its default behavior, also improved performance, but not as much as using Thread.OptimalMaxSpinWaitsPerSpinIteration.

I'm wondering if there is a case where -1 is still optimal, or could this be changed?

Please let me know if I can include any other information or clarify anything above.

Edit: Needed to remove EPYC results, but this problem does appear on EPYC with similar results to Ryzen. Please see internal email thread for those numbers.

@alexcovington alexcovington added the tenet-performance Performance related issue label Oct 30, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels Oct 30, 2020
@ghost

ghost commented Oct 30, 2020

Tagging subscribers to this area: @eiriktsarpalis, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@stephentoub
Member

cc: @kouvel

@kouvel
Member

kouvel commented Oct 30, 2020

Thanks @alexcovington, some of those are large differences indeed. Since Sleep(1) can cause very long delays, I'm curious if removing some of the spin-wait paths in ConcurrentQueue instead would also help. Could you please try this change in my branch called CqSpinWaitFix and see how it compares to the baseline on your Ryzen and EPYC machines? There would be more frequent memory/interlocked operations and it may not improve as much as with the Sleep(1), but I'm hoping that the slower times are mostly because of spin-wait lag.

@adamsitnik
Member

@kouvel would it be useful to run the TechEmpower benchmarks as well? I could run them on Citrine, AMD, and ARM machines using your changes and @alexcovington's.

@stephentoub
Member

> would it be useful to run the TechEmpower benchmarks as well?

Yes :)

@kouvel
Member

kouvel commented Nov 2, 2020

That would be great @adamsitnik, thanks!

@alexcovington
Contributor Author

alexcovington commented Nov 2, 2020

@kouvel I'm seeing some changes, but not nearly as drastic as changing the threshold (numbers based off the release/5.0-rc2 branch):

Skylake

summary:
better: 4, geomean: 1.051
worse: 2, geomean: 1.923
total diff: 6

| Slower                                                                           | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Collections.Concurrent.AddRemoveFromSameThreads<Int32>.ConcurrentQueue(Si |      1.93 |     184350066.00 |     355616451.50 |         |
| System.Collections.Concurrent.AddRemoveFromSameThreads<String>.ConcurrentQueue(S |      1.92 |     181463522.00 |     347893063.50 | bimodal |

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.CtorFromCollection<Int32>.ConcurrentQueue(Size: 512)          |      1.07 |          7449.98 |          6942.39 |         |
| System.Collections.CtorFromCollection<String>.ConcurrentQueue(Size: 512)         |      1.06 |          8127.04 |          7656.38 |         |
| System.Collections.CreateAddAndClear<String>.ConcurrentQueue(Size: 512)          |      1.04 |          8559.30 |          8245.53 |         |
| System.Collections.Concurrent.AddRemoveFromDifferentThreads<String>.ConcurrentQu |      1.03 |      29232848.00 |      28355177.00 |         |
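
As a sanity check on the summary lines, the geomean figures can be recomputed from the median columns. The sketch below assumes the summary is a plain geometric mean of the per-benchmark ratios (diff/base for the slower rows, base/diff for the faster rows); that is an assumption about the comparison tool, not something stated in this thread. The numbers are copied from the Skylake tables above.

```python
import math


def geomean(ratios):
    """Geometric mean, computed via logs to avoid a large intermediate product."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# (base median ns, diff median ns) pairs from the Skylake tables.
skylake_slower = [
    (184350066.00, 355616451.50),
    (181463522.00, 347893063.50),
]
skylake_faster = [
    (7449.98, 6942.39),
    (8127.04, 7656.38),
    (8559.30, 8245.53),
    (29232848.00, 28355177.00),
]

# "Slower" rows report diff/base; "Faster" rows report base/diff.
worse = geomean([diff / base for base, diff in skylake_slower])
better = geomean([base / diff for base, diff in skylake_faster])
print(f"worse geomean:  {worse:.3f}")
print(f"better geomean: {better:.3f}")
```

Under that assumption the results should land close to the reported 1.923 and 1.051. Using the rounded ratio column instead of the medians gives slightly different values because of rounding.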

Ryzen

summary:
better: 7, geomean: 1.139
worse: 2, geomean: 1.345
total diff: 9

| Slower                                                                           | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.Concurrent.AddRemoveFromSameThreads<Int32>.ConcurrentQueue(Si |      1.43 |     582391563.50 |     830691209.00 |         |
| System.Collections.Concurrent.AddRemoveFromSameThreads<String>.ConcurrentQueue(S |      1.27 |     693116413.50 |     879139261.00 |         |

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Collections.Concurrent.AddRemoveFromDifferentThreads<String>.ConcurrentQu |      1.50 |      23355512.00 |      15603404.00 | several?|
| System.Collections.Concurrent.AddRemoveFromDifferentThreads<Int32>.ConcurrentQue |      1.16 |      16426934.50 |      14183772.50 | several?|
| System.Collections.CreateAddAndClear<Int32>.ConcurrentQueue(Size: 512)           |      1.15 |          5029.00 |          4376.70 |         |
| System.Collections.CtorFromCollection<Int32>.ConcurrentQueue(Size: 512)          |      1.09 |          5156.07 |          4724.18 |         |
| System.Collections.CreateAddAndClear<String>.ConcurrentQueue(Size: 512)          |      1.07 |          5794.20 |          5391.80 |         |
| System.Collections.CtorFromCollection<String>.ConcurrentQueue(Size: 512)         |      1.04 |          5984.43 |          5736.33 |         |
| System.Collections.IterateForEach<String>.ConcurrentQueue(Size: 512)             |      1.02 |          5372.92 |          5280.67 |         |

I can email you the EPYC numbers, but the change for EPYC is about the same as Ryzen in this case.

Edit: Pasted the wrong numbers for Ryzen, updated my comment above.

@kouvel
Member

kouvel commented Nov 2, 2020

Thanks @alexcovington. Looks like it's just the contention in the same-thread tests. Since there are only two threads in the test doing enqueue-dequeue in a loop, the Sleep(1) would basically turn the test into a single-threaded test for the most part. In the different-thread test the two threads would mostly not be contending with one another. I'm still leaning towards not adding the Sleep(1) due to the long delays it can add.

Do any of you think the same-thread test is realistic enough to be worth optimizing for? Maybe the test can be modified to be a bit more realistic, like for each iteration to do a batch of enqueues, then some random work, then a batch of dequeues with a bit of random work in-between dequeues.
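
For illustration, the loop shape described above (a batch of enqueues, some random work, then a batch of dequeues with a bit of work in between) could look roughly like the following. This is a Python sketch of the workload shape only, not the dotnet/performance benchmark, and Python's GIL means it says nothing about real contention. All names and constants (`busy_work`, `worker`, `run`, the batch size of 8, the work ranges) are hypothetical.

```python
import queue
import random
import threading


def busy_work(rng, max_steps):
    """Burn a small, random amount of CPU to stand in for unrelated work."""
    total = 0
    for step in range(rng.randint(0, max_steps)):
        total += step
    return total


def worker(shared, iterations, batch_size, seed, counts, slot):
    rng = random.Random(seed)  # per-thread RNG so runs are repeatable
    dequeued = 0
    for _ in range(iterations):
        for item in range(batch_size):  # a batch of enqueues
            shared.put(item)
        busy_work(rng, 500)  # some random work
        for _ in range(batch_size):  # a batch of dequeues...
            # Never blocks forever: each thread dequeues no more than it enqueued.
            shared.get()
            dequeued += 1
            busy_work(rng, 50)  # ...with a bit of work in between
    counts[slot] = dequeued


def run(threads=2, iterations=200, batch_size=8):
    shared = queue.Queue()
    counts = [0] * threads
    pool = [
        threading.Thread(
            target=worker,
            args=(shared, iterations, batch_size, slot, counts, slot),
        )
        for slot in range(threads)
    ]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return sum(counts), shared.empty()


total_dequeued, drained = run()
print(total_dequeued, drained)  # prints: 3200 True
```

The point of the extra work between operations is that the two threads spend less of their time hitting the queue at the same instant, which is closer to how a queue is typically used than a tight enqueue-dequeue loop.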

@kouvel
Member

kouvel commented Nov 2, 2020

We haven't run the TechEmpower tests with these comparisons so that may also be interesting.

@alexcovington
Contributor Author

@kouvel I ran the TechEmpower benchmark using Crank and the following options:

$ crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml --config amd.benchmarks.yml --profile $PROFILE --scenario plaintext --application.framework netcoreapp5.0 --application.options.outputFiles /path/to/runtime/$ARTIFACT_DIR/bin/testhost/\*\* --output $PROFILE.$ARTIFACT_DIR.json

Can you confirm this is the right way to run the TechEmpower benchmarks with a local build? I'll be sending the numbers in an internal email thread shortly.

@sebastienros
Member

@alexcovington that looks correct, and it would be better to confirm you get the same number when building the assets without your changes or when using the nightly runtime.

I assume right now it's using rc2, because these are the latest published bits. You can get the rtm ones (release branch) with --application.framework net5.0 --application.channel edge. The runtime version should be rendered in the output.

You can also use both saved results to generate a comparison table: crank compare file1.json file2.json and share it here.

@adamsitnik
Member

I've run the TechEmpower benchmarks using a copy of System.Private.CoreLib.dll from the RC2 branch, the RC2 branch with @alexcovington's changes applied, and the RC2 branch with @kouvel's changes applied.

So far I have results only for the 12-core and 28-core Intel x64 machines:

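| Machine                      | Benchmark |        RC2 |       Alex |     Kouvel |
| ---------------------------- | --------- | ----------:| ----------:| ----------:|
| Perf (Intel x64 12 cores)    | Plaintext |  5,874,635 |  5,867,430 |  5,856,361 |
|                              | JSON      |    635,586 |    637,510 |    635,513 |
|                              | Fortunes  |            |            |            |
| Citrine (Intel x64 28 cores) | Plaintext | 10,883,034 | 10,715,820 | 10,911,054 |
|                              | JSON      |  1,204,613 |  1,208,411 |  1,208,898 |
|                              | Fortunes  |    419,920 |    418,355 |    428,360 |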

I don't see a significant difference, except for a reproducible gain of roughly 10k for Fortunes on Citrine with @kouvel's changes.
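
To put those numbers in proportion, the relative change of each variant against the RC2 baseline can be computed directly from the Citrine figures above. This is just worked arithmetic; the `percent_change` helper and the dictionary layout are illustrative, and no claim is made here about run-to-run noise.

```python
# Citrine (Intel x64 28 cores) results copied from the comment above.
citrine = {
    "Plaintext": {"RC2": 10_883_034, "alexcovington": 10_715_820, "kouvel": 10_911_054},
    "JSON": {"RC2": 1_204_613, "alexcovington": 1_208_411, "kouvel": 1_208_898},
    "Fortunes": {"RC2": 419_920, "alexcovington": 418_355, "kouvel": 428_360},
}


def percent_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


changes = {}
for benchmark, results in citrine.items():
    for variant in ("alexcovington", "kouvel"):
        delta = percent_change(results["RC2"], results[variant])
        changes[(benchmark, variant)] = delta
        print(f"{benchmark:<9} {variant:<13} {delta:+.2f}%")
```

On these figures the Fortunes gain with @kouvel's changes works out to about +2.0% (428,360 vs. 419,920), and the largest move in the other direction is Plaintext with @alexcovington's changes at about -1.5%.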

@adamsitnik
Member

@sebastienros when I am trying to use the aspnet-citrine-amd machine I get the following error:

Job failed at runtime:
WRK Client
args: -c 512 http://10.0.0.106:5000/json --latency -d 15s -w 15s -t 32 --header Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7 --header Connection: keep-alive
[STDERR] Unhandled exception. System.Net.Http.HttpRequestException: No route to host

and I am unable to ping the machine. Is it offline?

For the Mono machine I am getting a different error:

  mono:
    jobs:
      db:
        endpoints: 
          - http://asp-citrine-amd:5001
      application:
        endpoints: 
          - http://asp-mono-lin:5001
        variables:
          databaseServer: 10.0.0.106
      load:
        endpoints: 
          - http://asp-mono-load:5002
        variables:
          serverUri: http://asp-mono-lin
The specified endpoint url 'http://asp-mono-lin:5001' for 'application' is invalid or not responsive: "The operation was canceled."

@sebastienros is there anything I could do to make it work?

@sebastienros
Member

@adamsitnik the AMD machine is not available anymore. NIC issues: AMD (Alex) couldn't repro the problem, Mellanox (card brand) support didn't help because it's a Dell machine, and the labs people closed the ticket. Next step is for me to contact Dell and hope they can diagnose it.

Then for mono, I know they have moved the machines and got new IPs, but I don't think they gave me the new values or updated the records. I will ask to get the records updated, but you shouldn't mix a citrine machine with the mono machine if you have other options. I can't guarantee the stability and efficiency of the network between these machines.

@sebastienros
Member

@adamsitnik might be worth using a scenario that explicitly uses the structure that is changed in this PR, and I don't know how much Kestrel/Json relies on it, even indirectly (did a profile say?). And maybe just create a more realistic usage of the concurrent queue within a web app? Might make sense for any concurrent data structure btw, though I haven't checked how the micro benchmarks are built.

@sebastienros
Member

Forgot to mention that crank can also run BDN (BenchmarkDotNet) benchmarks now, without any change to the benchmark. Here is the documentation about the feature, pointing to some examples using the dotnet/performance repo: https://github.com/dotnet/crank/blob/master/docs/microbenchmarks.md

This means you can use the labs machines to run the benchmarks, not just the machines you have access to. Or run the micro benchmarks and the TE ones on the same machines.

@adamsitnik
Member

@sebastienros I want to use a machine that has more cores than the Citrine machine to see how @alexcovington's proposal affects #36447, where profiles have proven that ConcurrentQueue can be a performance bottleneck. I just hope that it's going to improve our scalability for machines with many cores.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 6, 2020
@eiriktsarpalis eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Apr 19, 2021