Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust weights of blocks with profile inside loops in non-profiled methods #71659

Merged
merged 2 commits into from
Jul 6, 2022

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jul 5, 2022

Fixes #71649 (and a couple of related performance regressions in dotnet/performance).

Repro:

using System;
using System.Linq;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Threading;

public class Program
{
    public static void Main()
    {
        for (int i = 0; i < 50; i++)
        {
            new Program().Test();
            Thread.Sleep(50);
        }

        Console.WriteLine("Done");
        //Console.ReadKey();
    }

    uint[] input_uint = Enumerable.Range(0, 1000).Select(i => (uint)i).ToArray();

    [MethodImpl(MethodImplOptions.NoInlining |
                MethodImplOptions.AggressiveOptimization)]
    public int Test()
    {
        int sum = 0;
        uint[] input = input_uint;
        for (int i = 0; i < input.Length; i++)
            sum += BitOperations.TrailingZeroCount(input[i]);
        return sum;
    }
}

When we run this, Test is not profiled, but it imports inside its loop BitOperations.TrailingZeroCount which is profiled (static BCL profile). Then optSetBlockWeiths divides its weight by 2 because it doesn't dominate all returns and the current method does not have profile data, see here.
At the same time, optScaleLoopBlocks does not touch loop body because it has profile data so it doesn't scale it back to something "hot". My PR changes that - if the root method is not profiled - we try to adjust weights in such loops.

Codegen diff https://www.diffchecker.com/dIVV1Iwy
Left: ReadyToRun=0, Right: ReadyToRun=1 - take a look at loop body (its weight is 1 in case of static pgo so loop aligner does not think it's profitable to align)

cc @AndyAyersMS

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 5, 2022
@ghost ghost assigned EgorBo Jul 5, 2022
@ghost
Copy link

ghost commented Jul 5, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #71649 (and a couple of related performance regressions in dotnet/performance).

Repro:

using System;
using System.Linq;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Threading;

public class Program
{
    public static void Main()
    {
        for (int i = 0; i < 50; i++)
        {
            new Program().Test();
            Thread.Sleep(50);
        }

        Console.WriteLine("Done");
        //Console.ReadKey();
    }

    uint[] input_uint = Enumerable.Range(0, 1000).Select(i => (uint)i).ToArray();

    [MethodImpl(MethodImplOptions.NoInlining |
                MethodImplOptions.AggressiveOptimization)]
    public int Test()
    {
        int sum = 0;
        uint[] input = input_uint;
        for (int i = 0; i < input.Length; i++)
            sum += BitOperations.TrailingZeroCount(input[i]);
        return sum;
    }
}

When we run this, Test is not profiled, but it imports inside its loop BitOperations.TrailingZeroCount which is profiled (static BCL profile). Then optSetBlockWeiths divides its weight by 2 because it doesn't dominate all returns and the current method does not have profile data, see here.
At the same time, optScaleLoopBlocks does not touch loop body because it has profile data so it doesn't scale it back to something "hot".

Codegen diff https://www.diffchecker.com/dIVV1Iwy
Left: ReadyToRun=0, Right: ReadyToRun=1 - take a look at loop body (its weight is 1 in case of static pgo so loop aligner does not think it's profitable to align)

cc @AndyAyersMS

Author: EgorBo
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing something like this but was concerned that it might have much wider impact than I'd like. But it might be worth a try.

We should prepare for a follow-on perf assessment to determine if this is a net improvement.

Looking at SPMI there are quite a few affected methods. Oddly no diffs from benchmarks. I wonder if that collection is broken right now or something?

@EgorBo
Copy link
Member Author

EgorBo commented Jul 6, 2022

@AndyAyersMS how do I run SPMI and get a sort of statistics - how many methods with PGO data, how many loops were clonned - etc?

@AndyAyersMS
Copy link
Member

@AndyAyersMS how do I run SPMI and get a sort of statistics - how many methods with PGO data, how many loops were clonned - etc?

You would have to do something custom... we have never gotten around to implementing the "generalized metrics" work that would make this sort of thing easy.

@EgorBo
Copy link
Member Author

EgorBo commented Jul 6, 2022

I hacked it locally - only +10 new loop clonning, around 200 regressions (among 1200 in libraries.pmi) are due to loop-alignment (so number are better with -jitoption JitAlignLoops=0) - the rest look like "random" block shuffling, nothing looks super offensive so far so I assume we can merge and then watch for improvements/regressions.

@EgorBo
Copy link
Member Author

EgorBo commented Jul 12, 2022

Improvements on linux-x64 dotnet/perf-autofiling-issues#6724

@mrsharm
Copy link
Member

mrsharm commented Aug 10, 2022

We noticed that this PR is associated with the following regression across a few configurations. Specifically looking at the x64 Windows historical data, we observed a clear regression:

image

Strangely, we cannot find an associated issue in the Perf Auto-filler and runtime repos for this configuration.

System.Collections.IndexerSet.Array(Size: 512)

Result Ratio Alloc Delta Operating System Bit Processor Name
Same 1.00 +0 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Same 1.00 +0 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Same 0.99 +0 macOS Monterey 12.3 Arm64 Apple M1 Max
Slower 0.88 +0 Windows 10 X64 Intel Xeon CPU E5-1650 v4 3.60GHz
Slower 0.83 +0 Windows 10 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake)
Slower 0.85 +0 Windows 10 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake)
Slower 0.84 +0 Windows 10 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower 0.83 +0 Windows 10 X64 Intel Core i9-10900K CPU 3.70GHz
Slower 0.57 +0 Windows 11 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores
Slower 0.59 +0 Windows 11 X64 AMD Ryzen 9 3950X
Slower 0.66 +0 Windows 11 X64 AMD Ryzen 9 5900X
Slower 0.64 +0 Windows 11 X64 AMD Ryzen 9 5950X
Slower 0.83 +0 Windows 11 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.84 +0 Windows 11 X64 Intel Core i9-10900K CPU 3.70GHz
Slower 0.77 +0 Windows 11 X64 11th Gen Intel Core i9-11900H 2.50GHz
Slower 0.86 +0 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz
Faster 1.16 +0 ubuntu 18.04 X64 Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge)
Slower 0.85 +0 ubuntu 18.04 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.67 +0 ubuntu 20.04 X64 AMD Ryzen 9 5900X
Slower 0.83 +0 ubuntu 20.04 X64 Intel Core i9-10900K CPU 3.70GHz
Same 1.00 +0 Windows 10 X86 Intel Xeon CPU E5-1650 v4 3.60GHz
Same 0.98 +0 Windows 10 X86 Intel Core i7-6700 CPU 3.40GHz (Skylake)
Same 1.00 +0 Windows 11 X86 AMD Ryzen Threadripper PRO 3945WX 12-Cores
Slower 0.89 +0 macOS Big Sur 11.6.8 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell)
Slower 0.86 +0 macOS Monterey 12.3.1 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell)
Slower 0.88 +0 macOS Monterey 12.4 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell)

@ghost ghost locked as resolved and limited conversation to collaborators Sep 10, 2022
@danmoseley
Copy link
Member

@EgorBo is the regression above resolved or do we need an issue for it?

@jozkee
Copy link
Member

jozkee commented Oct 13, 2022

Ping @EgorBo. This popped up in the 6.0 vs 7.0 RC2 report as well.

System.Collections.IndexerSet.Array(Size: 512)

Result Ratio Alloc Delta Operating System Bit Processor Name Modality
Slower 0.87 +0 ubuntu 18.04 Arm64 Unknown processor
Faster 1.32 +0 Windows 11 Arm64 Unknown processor
Faster 1.13 +0 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Faster 1.15 +0 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Faster 1.25 +0 macOS Monterey 12.6 Arm64 Apple M1
Faster 1.26 +0 macOS Monterey 12.6 Arm64 Apple M1 Max
Slower 0.78 +0 Windows 10 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower 0.77 +0 Windows 11 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores
Slower 0.64 +0 Windows 11 X64 AMD Ryzen 9 5900X
Slower 0.66 +0 Windows 11 X64 AMD Ryzen 9 7950X
Slower 0.79 +0 Windows 11 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.80 +0 debian 11 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.67 +0 ubuntu 18.04 X64 AMD Ryzen 9 5900X
Slower 0.86 +0 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz
Slower 0.66 +0 ubuntu 20.04 X64 AMD Ryzen 9 5900X
Faster 1.97 +0 ubuntu 20.04 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower 0.86 +0 ubuntu 20.04 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.85 +0 macOS Big Sur 11.7 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell)
Slower 0.85 +0 macOS Monterey 12.6 X64 Intel Core i7-4870HQ CPU 2.50GHz (Haswell)

@EgorBo
Copy link
Member Author

EgorBo commented Oct 17, 2022

Thanks, looking at the issue, it was expected to see improvements and regressions from that change

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PGO: Static profile affects loop alignment where it's not supposed to do
5 participants