Surprising performance regression on Unix VMs #59145

Closed
adamsitnik opened this issue Sep 15, 2021 · 6 comments · Fixed by #59300

Labels: area-System.Threading, os-linux, tenet-performance


@adamsitnik
Member

For the following very simple benchmark:

[Benchmark]
public ConcurrentBag<int> ConcurrentBag() => new ConcurrentBag<int>();

We can observe a huge perf drop for Linux VMs (bare metal machines are not affected).

In the following table all AMD EPYC 7452 machines are Azure VMs, everything else is bare metal:

| Result | Ratio | Modality | Operating System | Bit | Processor Name |
|--------|-------|----------|------------------|-----|----------------|
| Faster | 1.73 | several? | Windows 10.0.19043.1165 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores |
| Faster | 1.87 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 |
| Slower | 0.50 | several? | Windows 10.0.20348 | X64 | AMD EPYC 7452 |
| Faster | 1.28 | | Windows 10.0.18363.1621 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Same | 0.88 | several? | Windows 8.1 | X64 | Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) |
| Same | 1.00 | several? | Windows 10.0.19042.685 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) |
| Same | 1.12 | | Windows 10.0.19043.1165 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Faster | 1.20 | bimodal | Windows 10.0.22454 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) |
| Same | 0.97 | several? | Windows 10.0.22451 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) |
| Same | 0.92 | bimodal | Windows 10.0.19042.1165 | X64 | Intel Core i9-9900T CPU 2.10GHz |
| Slower | 0.74 | several? | Windows 7 SP1 | X64 | Intel Core2 Duo CPU T9600 2.80GHz |
| Slower | 0.04 | | centos 8 | X64 | AMD EPYC 7452 |
| Slower | 0.03 | | debian 10 | X64 | AMD EPYC 7452 |
| Slower | 0.22 | bimodal | rhel 7 | X64 | AMD EPYC 7452 |
| Slower | 0.22 | bimodal | sles 15 | X64 | AMD EPYC 7452 |
| Slower | 0.03 | several? | opensuse-leap 15.3 | X64 | AMD EPYC 7452 |
| Faster | 1.20 | bimodal | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Same | 0.91 | several? | ubuntu 18.04 | X64 | Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge) |
| Slower | 0.79 | | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) |
| Faster | 2.85 | | ubuntu 16.04 | Arm64 | Unknown processor |
| Same | 1.01 | bimodal | Windows 10.0.19043.1165 | Arm64 | Microsoft SQ1 3.0 GHz |
| Faster | 1.29 | bimodal | Windows 10.0.22000 | Arm64 | Microsoft SQ1 3.0 GHz |
| Same | 0.81 | multimodal | Windows 10.0.19043.1165 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores |
| Faster | 1.35 | several? | Windows 10.0.18363.1621 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Same | 1.05 | multimodal | Windows 10.0.19043.1165 | Arm | Microsoft SQ1 3.0 GHz |
| Same | 1.02 | | macOS Big Sur 11.5.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) |
| Faster | 1.44 | several? | macOS Big Sur 11.5.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) |
| Same | 1.13 | bimodal | macOS Big Sur 11.4 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) |

Initially I thought that it might be caused by the fact that I was using VMs with a single physical core (Standard_D2a_v4 Azure VMs with AMD EPYC 7452, 1 CPU, 2 logical cores and 1 physical core). For RHEL, however, I used a VM with 2 physical (4 logical) cores (Standard_D4as_v4), and the regression can be observed there as well. The regression does not affect Windows Server (also running on a Standard_D2a_v4 Azure VM).

Repro

Create a Standard_D2a_v4 Azure VM with CentOS, RHEL, SLES or openSUSE, connect over SSH, install git and python3, and run:

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net5.0 net6.0 --filter 'System.Collections.CtorDefaultSize<Int32>.ConcurrentBag'

@jkotas @stephentoub @danmoseley @kouvel In my opinion we need to investigate this and understand why it has regressed before we ship .NET 6.0, as it might be a symptom of a bigger problem related to Unix VMs.

@adamsitnik adamsitnik added area-System.Collections os-linux Linux OS (any supported distro) tenet-performance Performance related issue labels Sep 15, 2021
@adamsitnik adamsitnik added this to the 6.0.0 milestone Sep 15, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 15, 2021
@ghost

ghost commented Sep 15, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: adamsitnik
Assignees: -
Labels:

area-System.Collections, os-linux, tenet-performance

Milestone: 6.0.0

@adamsitnik
Member Author

I've just run it on a CentOS VM and confirmed that it was not a one-time thing (.NET 5.0 first, then .NET 6.0):

 BenchmarkDotNet=v0.13.1.1603-nightly, OS=centos 8                                                                                           
 AMD EPYC 7452, 1 CPU, 2 logical cores and 1 physical core                                                                                   
 .NET SDK=6.0.100-rc.1.21417.19                                                                                                              
   [Host]     : .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT                                                                                     
   Job-PVNFCA : .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT                


 -------------------- Histogram --------------------
 [284.808 ns ; 327.384 ns) | @@@                    
 [327.384 ns ; 366.386 ns) | @@@@@@@@@              
 [366.386 ns ; 409.337 ns) | @@@@@@                 
 [409.337 ns ; 424.316 ns) |                        
 [424.316 ns ; 467.938 ns) | @@                     
 ---------------------------------------------------
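
| Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Allocated |
|--------|------|-------|--------|--------|-----|-----|-------|-------|-----------|
| ConcurrentBag | 365.1 ns | 35.02 ns | 40.33 ns | 360.6 ns | 304.3 ns | 448.4 ns | 0.0018 | 0.0009 | 128 B |
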
BenchmarkDotNet=v0.13.1.1603-nightly, OS=centos 8         
AMD EPYC 7452, 1 CPU, 2 logical cores and 1 physical core 
.NET SDK=6.0.100-rc.1.21417.19                            
  [Host]     : .NET 6.0.0 (6.0.21.41701), X64 RyuJIT      
  Job-EUZAQF : .NET 6.0.0 (6.0.21.41701), X64 RyuJIT      

 -------------------- Histogram -------------------- 
 [5.727 us ; 6.274 us) | @@@@                        
 [6.274 us ; 6.800 us) | @@@@@@@@@@                  
 [6.800 us ; 7.074 us) | @                           
 [7.074 us ; 7.599 us) | @@@@                        
 [7.599 us ; 7.958 us) | @                           
 --------------------------------------------------- 
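
| Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|------|-------|--------|--------|-----|-----|-------|-------|-------|-----------|
| ConcurrentBag | 6.660 us | 0.4721 us | 0.5437 us | 6.525 us | 5.874 us | 7.695 us | 0.0021 | 0.0013 | 0.0004 | 128 B |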
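
To put the two CentOS means above side by side, here is a small arithmetic sketch. The class and member names are hypothetical, and only the two mean values come from the runs above.

```csharp
using System;

// Hypothetical helper for illustration; only the two means are taken from the runs above.
public static class Slowdown
{
    public const double Net50MeanNs = 365.1;   // .NET 5.0 mean, in nanoseconds
    public const double Net60MeanNs = 6660.0;  // .NET 6.0 mean (6.660 us), in nanoseconds

    // How many times slower the .NET 6.0 run is compared to the .NET 5.0 run.
    public static double Factor() => Net60MeanNs / Net50MeanNs;

    // The inverse of Factor(): the .NET 5.0 mean divided by the .NET 6.0 mean.
    public static double Ratio() => Net50MeanNs / Net60MeanNs;
}
```

`Slowdown.Factor()` comes out at roughly 18, and `Slowdown.Ratio()` at roughly 0.055. Assuming the Ratio column in the table is computed the same way (the thread does not say), that is the same order of magnitude as the 0.04 reported for centos 8.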

@adamsitnik
Member Author

For a ConcurrentBag<string> the regression reproduces only on Linux VMs with 1 physical core.

@jkotas
Member

jkotas commented Sep 15, 2021

All that new ConcurrentBag<int> does is new ThreadLocal<WorkStealingQueue>(). Do you see the same regression if you just run new ThreadLocal? #56956 is potentially related. cc @davidwrighton

ThreadLocal depends on the finalizer for cleanup. My guess is that the finalizer thread is not able to clean up the ThreadLocals fast enough in this config, due to #56956 or some other subtle change, so the ThreadLocal instances accumulate and make the microbenchmark much slower.
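
One way to check this without ConcurrentBag is to time bare ThreadLocal construction, with and without an explicit Dispose. This is a minimal sketch: `ThreadLocalChurn` and `Measure` are hypothetical names, not part of dotnet/performance, and a Stopwatch loop is far cruder than a BenchmarkDotNet run.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper for illustration; not part of dotnet/performance.
public static class ThreadLocalChurn
{
    // Creates `count` ThreadLocal<object> instances and returns the elapsed time.
    // With dispose: false, every instance is left for the finalizer thread to clean up,
    // which is the situation the microbenchmark is in.
    public static TimeSpan Measure(int count, bool dispose)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            var local = new ThreadLocal<object>();
            if (dispose)
            {
                // Cleans up on the calling thread instead of waiting for finalization.
                local.Dispose();
            }
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Comparing `ThreadLocalChurn.Measure(1_000_000, dispose: false)` with `ThreadLocalChurn.Measure(1_000_000, dispose: true)` on an affected VM would show whether the slowdown follows the instances that wait for finalization. The iteration count is an arbitrary choice.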

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 15, 2021
@kouvel kouvel self-assigned this Sep 18, 2021
@ghost

ghost commented Sep 18, 2021

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: adamsitnik
Assignees: kouvel
Labels:

area-System.Collections, area-System.Threading, os-linux, tenet-performance

Milestone: 6.0.0

@kouvel kouvel modified the milestones: 6.0.0, 7.0.0 Sep 18, 2021
@kouvel
Member

kouvel commented Sep 18, 2021

> My guess is that the finalizer thread is not able to clean up the ThreadLocals fast enough

I believe that is what's happening. When the finalizer/dispose returns an ID, the next ID to try may reset to that ID, which may be a low ID. After that ID is reused, the next constructor has to do more work inside the lock due to the linear lookup for a free ID from a low starting point. The lock would be held for a bit longer and finalization would slow down. Eventually, a balance may be struck where both the number of IDs in use and the number of IDs queued for freeing are much higher than before, and perf would be generally slower.

The test seems to be very sensitive to timing and VM configuration. The underlying issue was preexisting and #56956 added a tiny amount of code inside the locks, and along with other variables the timings may have changed enough to create some kind of feedback loop.

Eliminating the linear search seems to fix the issue, and also seems to make the results more stable. On my local VM configured in a similar way on an Intel processor, there was about a 50% regression before and the fix seems to significantly improve the perf.
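
The difference between the two lookup strategies can be sketched as follows. This is a simplified illustration written from the description above, not the actual ThreadLocal code or the code in the fix: `LinearIdPool` and `FreeListIdPool` are hypothetical names, locking is omitted, and a single stack of free IDs stands in for whatever collections the fix uses.

```csharp
using System;
using System.Collections.Generic;

// Simplified illustration only; hypothetical types, no locking.
public sealed class LinearIdPool
{
    private readonly List<bool> _used = new List<bool>();
    private int _nextToTry;

    public int Get()
    {
        // Linear scan for a free slot, starting from the current starting point.
        for (int i = _nextToTry; i < _used.Count; i++)
        {
            if (!_used[i])
            {
                _used[i] = true;
                _nextToTry = i + 1;
                return i;
            }
        }

        // No free slot found: grow the table and hand out the new slot.
        _used.Add(true);
        _nextToTry = _used.Count;
        return _used.Count - 1;
    }

    public void Return(int id)
    {
        _used[id] = false;
        // A returned low ID pulls the starting point back, so once that ID is reused
        // the following Get() calls have to scan past every used slot above it.
        if (id < _nextToTry)
        {
            _nextToTry = id;
        }
    }
}

public sealed class FreeListIdPool
{
    private readonly Stack<int> _free = new Stack<int>();
    private int _count;

    // O(1): reuse a returned ID if there is one, otherwise mint a new one.
    public int Get() => _free.Count > 0 ? _free.Pop() : _count++;

    // O(1): remember the ID for reuse.
    public void Return(int id) => _free.Push(id);
}
```

With `LinearIdPool`, once many IDs are in use and a low one has just been returned and reused, each subsequent `Get()` scans past all the used slots above it. With `FreeListIdPool`, the cost of `Get()` does not depend on how many IDs are outstanding.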

I'll put up a fix for 7.0. It doesn't seem severe enough to port to 6.0, as the degenerate situation seems unlikely to linger or show up in a significant way in real-world cases.

AMD EPYC 7452 1-core 2-thread Debian VM:

5.0:

-------------------- Histogram --------------------
[218.372 ns ; 308.494 ns) | @@@@@@@@@@@
[308.494 ns ; 389.712 ns) | @@@
[389.712 ns ; 479.833 ns) | @@@@@
[479.833 ns ; 552.070 ns) | @
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.1.1603-nightly, OS=debian 10
AMD EPYC 7452, 1 CPU, 2 logical cores and 1 physical core
.NET SDK=6.0.100-rc.1.21417.19
  [Host]     : .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT
  Job-RFGVOX : .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT
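
| Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Allocated |
|--------|------|-------|--------|--------|-----|-----|-------|-----------|
| ConcurrentBag | 331.3 ns | 80.92 ns | 93.19 ns | 307.0 ns | 219.5 ns | 507.0 ns | 0.0011 | 128 B |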

6.0 after fix:

-------------------- Histogram --------------------
[252.481 ns ; 262.758 ns) | @
[262.758 ns ; 273.738 ns) | @@@@@@@
[273.738 ns ; 283.887 ns) | @@@@@@@@
[283.887 ns ; 296.699 ns) | @@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.1.1603-nightly, OS=debian 10
AMD EPYC 7452, 1 CPU, 2 logical cores and 1 physical core
.NET SDK=6.0.100-rc.1.21417.19
  [Host]     : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
  Job-AVFKAA : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
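
| Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Allocated |
|--------|------|-------|--------|--------|-----|-----|-------|-----------|
| ConcurrentBag | 276.5 ns | 9.11 ns | 10.50 ns | 277.2 ns | 257.6 ns | 293.9 ns | 0.0010 | 128 B |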

Intel i7-8700 1-core 2-thread Ubuntu VM:

5.0:

-------------------- Histogram --------------------
[  417.478 ns ;   589.771 ns) | @
[  589.771 ns ;   798.253 ns) | @@@@@@@@
[  798.253 ns ; 1,036.725 ns) | @@@@@@
[1,036.725 ns ; 1,255.634 ns) | @@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.1.1603-nightly, OS=ubuntu 20.04
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 2 logical cores and 1 physical core
.NET SDK=6.0.100-rc.1.21417.19
  [Host]     : .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT
  Job-KTFNDP : .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT
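
| Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|------|-------|--------|--------|-----|-----|-------|-------|-------|-----------|
| ConcurrentBag | 856.2 ns | 187.2 ns | 215.6 ns | 849.9 ns | 521.7 ns | 1,229.2 ns | 0.0219 | 0.0109 | 0.0027 | 128 B |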

6.0 after fix:

-------------------- Histogram --------------------
[321.533 ns ; 334.416 ns) | @@@@@@@
[334.416 ns ; 346.646 ns) | @@@@@@@@
[346.646 ns ; 365.368 ns) | @@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.1.1603-nightly, OS=ubuntu 20.04
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 2 logical cores and 1 physical core
.NET SDK=6.0.100-rc.1.21417.19
  [Host]     : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DYJEHA : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
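
| Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Allocated |
|--------|------|-------|--------|--------|-----|-----|-------|-------|-----------|
| ConcurrentBag | 338.6 ns | 11.41 ns | 12.21 ns | 337.7 ns | 322.6 ns | 362.8 ns | 0.0203 | 0.0101 | 128 B |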

kouvel added a commit to kouvel/runtime that referenced this issue Sep 18, 2021
- Replaced the linear search for a free ID with a pair of collections that operate in O(1) time for insertion and removal
- See dotnet#59145 (comment) for more information
- Fixes dotnet#59145
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 18, 2021
kouvel added a commit that referenced this issue Sep 30, 2021
…se (#59300)

- Replaced the linear search for a free ID with a pair of collections that operate in O(1) time for insertion and removal
- See #59145 (comment) for more information
- Fixes #59145
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 30, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021