Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor GC performance on AMD processors (32-bit runtime) #34478

Closed
jkotas opened this issue Apr 2, 2020 · 10 comments · Fixed by #34488
Closed

Poor GC performance on AMD processors (32-bit runtime) #34478

jkotas opened this issue Apr 2, 2020 · 10 comments · Fixed by #34488
Milestone

Comments

@jkotas
Copy link
Member

jkotas commented Apr 2, 2020

I have noticed that GC is running a lot more often than it should on AMD processors (32-bit runtime):

Small repro:

using System;

class Test
{
    static object[] o = new object[1000];

    static void Main()
    {
        Random r = new Random();
        for (;;)
        {
            int start = Environment.TickCount;
            for (int i = 0; i < 1000000000; i++)
                o[r.Next(o.Length)] = new object();
            int end = Environment.TickCount;
            Console.WriteLine(end-start);
        }
   }
}

Results - AMD Ryzen 9 3950X, 32-bit runtime:

Actual: 20+ seconds per iteration
Expected: <15 seconds per iteration (on par with 64-bit runtime)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 2, 2020
@jkotas
Copy link
Member Author

jkotas commented Apr 2, 2020

The problem is that the code to get cache size on x86 is broken for any recent processors: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/util.cpp#L1735

@Maoni0 I remember we talked about deleting the GetCacheSizeFromCpuId path in the past. What kind of testing you would like to see to make this happen?

As far as I can tell, this is GetCacheSizeFromCpuId is broken for any recent AMD processor (and likely Intel as well).

@Maoni0
Copy link
Member

Maoni0 commented Apr 2, 2020

thanks for reporting this, Jan! what are the values you seeing from the CpuId implementation vs the OS impl? and presumably the OS impl returns the correct size?

@Maoni0 Maoni0 added this to the 5.0 milestone Apr 2, 2020
@Maoni0 Maoni0 removed the untriaged New issue has not been triaged by the area owner label Apr 2, 2020
@jkotas
Copy link
Member Author

jkotas commented Apr 2, 2020

I have tried this on:

  • AMD Ryzen 9
  • Intel Xeon E5
  • Intel Core i7

CpuId implementation returns 0 on all them. The penalty for Intel is actually even higher than the penalty for AMD (e.g. the small repro is 1.5x slower than what it should be on Xeon E5)

@jkotas
Copy link
Member Author

jkotas commented Apr 2, 2020

dotnet/coreclr#27323 broke the CpuId path.

@Maoni0
Copy link
Member

Maoni0 commented Apr 2, 2020

ahh, so it was not that it was broken; it just got broken because of #27323?

@jkotas
Copy link
Member Author

jkotas commented Apr 2, 2020

#27323 broke it completely. It worked somewhat before that change, but it still was not quite right.

Before #27323, AMD Ryzen 9 machine:

  • CpuId returns cache size 2,621,440
  • OS returns cache size 16,777,216
  • The above test performance with OS cache size is ~14.4 seconds
  • The above test performance with CpuId cache size is ~15.3 seconds (6% slower)

@Maoni0
Copy link
Member

Maoni0 commented Apr 3, 2020

and 16mb in this case is the right cache size, correct?

@jkotas
Copy link
Member Author

jkotas commented Apr 3, 2020

The spec at https://www.amd.com/en/products/cpu/amd-ryzen-9-3950x says it has total 64MB L3 cache,16 cores, 32 logical CPUs.

GetLogicalProcessorInformation API reports each 4 cores sharing 16MB of L3 cache. 4 * 16MB = 64MB.

What should be the right cache size reported to the GC in this case?

@jkotas
Copy link
Member Author

jkotas commented Apr 3, 2020

FWIW, for the two Intel machines that I have available, the cache sizes reported by the CpuId path and the OS path are the same (with the fix for bug introduced by #27323).

@Maoni0
Copy link
Member

Maoni0 commented Apr 3, 2020

it would be correct to report 16mb to GC in this case. I suspect all the hand written code from before should just be deleted 'cause the OS is able to get the correct cache size. the only reason I didn't get rid of it before was that I didn't have time to test at all on 32-bit.

jkotas added a commit to jkotas/runtime that referenced this issue Apr 3, 2020
jkotas added a commit that referenced this issue Apr 3, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants