Adjust weights of blocks with profile inside loops in non-profiled methods #71659

EgorBo · 2022-07-05T18:11:06Z

Fixes #71649 (and a couple of related performance regressions in dotnet/performance).

Repro:

using System;
using System.Linq;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Threading;

public class Program
{
    public static void Main()
    {
        for (int i = 0; i < 50; i++)
        {
            new Program().Test();
            Thread.Sleep(50);
        }

        Console.WriteLine("Done");
        //Console.ReadKey();
    }

    uint[] input_uint = Enumerable.Range(0, 1000).Select(i => (uint)i).ToArray();

    [MethodImpl(MethodImplOptions.NoInlining |
                MethodImplOptions.AggressiveOptimization)]
    public int Test()
    {
        int sum = 0;
        uint[] input = input_uint;
        for (int i = 0; i < input.Length; i++)
            sum += BitOperations.TrailingZeroCount(input[i]);
        return sum;
    }
}

When we run this, Test is not profiled, but its loop body contains code imported from BitOperations.TrailingZeroCount, which is profiled (static BCL profile). optSetBlockWeights then divides that block's weight by 2 because the block doesn't dominate all returns and the current method does not have profile data.
At the same time, optScaleLoopBlocks does not touch the loop body because it has profile data, so it doesn't scale it back up to something "hot". My PR changes that: if the root method is not profiled, we try to adjust weights in such loops.
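
To make the interaction concrete, here is a minimal sketch of the two steps described above. It is a simplified model, not the JIT's actual code: the `Block` struct, the function names, the `adjustInNonProfiledRoot` switch and the scale factor of 8 are all assumptions made for illustration, and the weights are not meant to match a real JIT dump.

```cpp
#include <vector>

// Hypothetical, simplified stand-in for a JIT basic block.
struct Block
{
    double weight;           // normalized weight; 1.0 means "runs once per call"
    bool   hasProfileWeight; // weight came from profile data (e.g. an inlined, profiled BCL method)
    bool   inLoop;           // block is part of a loop body
    bool   dominatesReturns; // block dominates all return blocks
};

constexpr double kLoopScale = 8.0; // assumed loop scale factor, for illustration only

// Step 1 (models the halving): in a method without profile data, a block
// that does not dominate all returns has its weight divided by 2.
void HalveNonDominatingBlocks(std::vector<Block>& blocks, bool rootIsProfiled)
{
    if (rootIsProfiled)
        return;
    for (Block& b : blocks)
        if (!b.dominatesReturns)
            b.weight /= 2.0;
}

// Step 2 (models loop scaling): loop blocks are scaled up, except blocks whose
// profile weight is kept. With adjustInNonProfiledRoot set, a profile weight is
// only kept when the root method itself is profiled.
void ScaleLoopBlocks(std::vector<Block>& blocks, bool rootIsProfiled, bool adjustInNonProfiledRoot)
{
    for (Block& b : blocks)
    {
        if (!b.inLoop)
            continue;
        const bool keepProfileWeight =
            b.hasProfileWeight && (rootIsProfiled || !adjustInNonProfiledRoot);
        if (keepProfileWeight)
            continue;
        b.weight *= kLoopScale;
    }
}

// Runs both steps on one profiled loop-body block inside a non-profiled root method.
double SimulateLoopBodyWeight(bool adjustInNonProfiledRoot)
{
    std::vector<Block> blocks = {{1.0, true, true, false}};
    HalveNonDominatingBlocks(blocks, /* rootIsProfiled */ false);
    ScaleLoopBlocks(blocks, /* rootIsProfiled */ false, adjustInNonProfiledRoot);
    return blocks[0].weight;
}
```

In this model, `SimulateLoopBodyWeight(false)` leaves the loop body at 0.5 (halved and never scaled back), while `SimulateLoopBodyWeight(true)` yields 4.0.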

Codegen diff https://www.diffchecker.com/dIVV1Iwy
Left: ReadyToRun=0, Right: ReadyToRun=1. Take a look at the loop body: its weight is 1 in the static PGO case, so the loop aligner does not consider it profitable to align.
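
As a rough illustration of why a low weight suppresses alignment, a weight-gated check could look like the sketch below. The threshold value and the function name are assumptions made for this example; a real aligner presumably weighs other factors as well, such as loop size.

```cpp
// Hypothetical minimum block weight for a loop to be considered for alignment.
// The value is an assumption for illustration, not taken from this PR.
constexpr double kMinAlignBlockWeight = 4.0;

// Returns true when the loop body is "hot enough" to be worth padding for alignment.
bool IsAlignmentCandidate(double loopBodyWeight)
{
    return loopBodyWeight >= kMinAlignBlockWeight;
}
```

Under these assumptions, a loop body with weight 1 is rejected, while one that was scaled up to 8 is accepted.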

cc @AndyAyersMS

ghost · 2022-07-05T18:11:21Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

EgorBo · 2022-07-05T18:14:27Z

Fixes these perf regressions: https://pvscmdupload.blob.core.windows.net/autofilereport/autofilereports/06_28_2022/refs/heads/main_x64_ubuntu%2018.04_Regression/System.Numerics.Tests.Perf_BitOperations.html

AndyAyersMS

I thought about doing something like this but was concerned that it might have much wider impact than I'd like. But it might be worth a try.

We should prepare for a follow-on perf assessment to determine if this is a net improvement.

Looking at SPMI there are quite a few affected methods. Oddly no diffs from benchmarks. I wonder if that collection is broken right now or something?

EgorBo · 2022-07-06T17:42:29Z

@AndyAyersMS how do I run SPMI and get some statistics, e.g. how many methods have PGO data, how many loops were cloned, etc.?

AndyAyersMS · 2022-07-06T17:45:55Z

> @AndyAyersMS how do I run SPMI and get some statistics, e.g. how many methods have PGO data, how many loops were cloned, etc.?

You would have to do something custom... we have never gotten around to implementing the "generalized metrics" work that would make this sort of thing easy.

EgorBo · 2022-07-06T18:32:54Z

I hacked it locally: only +10 new loop clonings, and around 200 regressions (among 1200 in libraries.pmi) are due to loop alignment (so the numbers are better with -jitoption JitAlignLoops=0). The rest look like "random" block shuffling. Nothing looks super offensive so far, so I assume we can merge and then watch for improvements/regressions.

EgorBo · 2022-07-12T18:39:03Z

Improvements on linux-x64 dotnet/perf-autofiling-issues#6724

AndyAyersMS · 2022-07-20T16:50:35Z

Regressions:

arm64: [Perf] Changes at 7/6/2022 9:44:27 PM perf-autofiling-issues#6746
x64 alpine: [Perf] Changes at 7/6/2022 9:44:27 PM perf-autofiling-issues#6717
x64 ubuntu: [Perf] Changes at 7/7/2022 12:54:58 PM perf-autofiling-issues#6704
arm64 ubuntu: [Perf] Changes at 7/6/2022 9:44:27 PM perf-autofiling-issues#6637

mrsharm · 2022-08-10T22:35:26Z

We noticed that this PR is associated with the following regression across a few configurations. Specifically looking at the x64 Windows historical data, we observed a clear regression:

Strangely, we cannot find an associated issue in the Perf Auto-filler and runtime repos for this configuration.

System.Collections.IndexerSet.Array(Size: 512)

Result	Ratio	Alloc Delta	Operating System	Bit	Processor Name
Same	1.00	+0	Windows 11	Arm64	Microsoft SQ1 3.0 GHz
Same	1.00	+0	Windows 11	Arm64	Microsoft SQ1 3.0 GHz
Same	0.99	+0	macOS Monterey 12.3	Arm64	Apple M1 Max
Slower	0.88	+0	Windows 10	X64	Intel Xeon CPU E5-1650 v4 3.60GHz
Slower	0.83	+0	Windows 10	X64	Intel Core i7-6700 CPU 3.40GHz (Skylake)
Slower	0.85	+0	Windows 10	X64	Intel Core i7-6700 CPU 3.40GHz (Skylake)
Slower	0.84	+0	Windows 10	X64	Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower	0.83	+0	Windows 10	X64	Intel Core i9-10900K CPU 3.70GHz
Slower	0.57	+0	Windows 11	X64	AMD Ryzen Threadripper PRO 3945WX 12-Cores
Slower	0.59	+0	Windows 11	X64	AMD Ryzen 9 3950X
Slower	0.66	+0	Windows 11	X64	AMD Ryzen 9 5900X
Slower	0.64	+0	Windows 11	X64	AMD Ryzen 9 5950X
Slower	0.83	+0	Windows 11	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower	0.84	+0	Windows 11	X64	Intel Core i9-10900K CPU 3.70GHz
Slower	0.77	+0	Windows 11	X64	11th Gen Intel Core i9-11900H 2.50GHz
Slower	0.86	+0	ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz
Faster	1.16	+0	ubuntu 18.04	X64	Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge)
Slower	0.85	+0	ubuntu 18.04	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower	0.67	+0	ubuntu 20.04	X64	AMD Ryzen 9 5900X
Slower	0.83	+0	ubuntu 20.04	X64	Intel Core i9-10900K CPU 3.70GHz
Same	1.00	+0	Windows 10	X86	Intel Xeon CPU E5-1650 v4 3.60GHz
Same	0.98	+0	Windows 10	X86	Intel Core i7-6700 CPU 3.40GHz (Skylake)
Same	1.00	+0	Windows 11	X86	AMD Ryzen Threadripper PRO 3945WX 12-Cores
Slower	0.89	+0	macOS Big Sur 11.6.8	X64	Intel Core i5-4278U CPU 2.60GHz (Haswell)
Slower	0.86	+0	macOS Monterey 12.3.1	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)
Slower	0.88	+0	macOS Monterey 12.4	X64	Intel Core i5-4278U CPU 2.60GHz (Haswell)

danmoseley · 2022-09-12T17:34:35Z

@EgorBo is the regression above resolved or do we need an issue for it?

jozkee · 2022-10-13T22:00:07Z

Ping @EgorBo. This popped up in the 6.0 vs 7.0 RC2 report as well.

System.Collections.IndexerSet.Array(Size: 512)

Result	Ratio	Alloc Delta	Operating System	Bit	Processor Name
Slower	0.87	+0	ubuntu 18.04	Arm64	Unknown processor
Faster	1.32	+0	Windows 11	Arm64	Unknown processor
Faster	1.13	+0	Windows 11	Arm64	Microsoft SQ1 3.0 GHz
Faster	1.15	+0	Windows 11	Arm64	Microsoft SQ1 3.0 GHz
Faster	1.25	+0	macOS Monterey 12.6	Arm64	Apple M1
Faster	1.26	+0	macOS Monterey 12.6	Arm64	Apple M1 Max
Slower	0.78	+0	Windows 10	X64	Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower	0.77	+0	Windows 11	X64	AMD Ryzen Threadripper PRO 3945WX 12-Cores
Slower	0.64	+0	Windows 11	X64	AMD Ryzen 9 5900X
Slower	0.66	+0	Windows 11	X64	AMD Ryzen 9 7950X
Slower	0.79	+0	Windows 11	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower	0.80	+0	debian 11	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower	0.67	+0	ubuntu 18.04	X64	AMD Ryzen 9 5900X
Slower	0.86	+0	ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz
Slower	0.66	+0	ubuntu 20.04	X64	AMD Ryzen 9 5900X
Faster	1.97	+0	ubuntu 20.04	X64	Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower	0.86	+0	ubuntu 20.04	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower	0.85	+0	macOS Big Sur 11.7	X64	Intel Core i5-4278U CPU 2.60GHz (Haswell)
Slower	0.85	+0	macOS Monterey 12.6	X64	Intel Core i7-4870HQ CPU 2.50GHz (Haswell)

EgorBo · 2022-10-17T22:10:58Z

Thanks, I'm looking at the issue. Both improvements and regressions were expected from that change.

Don't ignore blocks with weights inside non-profiled methods

ce68eb7

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Jul 5, 2022

ghost assigned EgorBo Jul 5, 2022

Update optimizer.cpp

8658a98

AndyAyersMS approved these changes Jul 6, 2022

runfoapp bot mentioned this pull request Jul 6, 2022

system.text.regularexpressions.tests Failing on ARM64 linux #71722

Closed

EgorBo merged commit f29ae44 into dotnet:main Jul 6, 2022

DrewScoggins mentioned this pull request Jul 12, 2022

[Perf] Regressions in System.Tests.Perf_String #72026

Closed

EgorBo mentioned this pull request Jul 12, 2022

[Perf] Changes at 7/6/2022 9:44:27 PM dotnet/perf-autofiling-issues#6724

Closed

EgorBo deleted the fix-profile-weights branch July 12, 2022 18:39

This was referenced Jul 14, 2022

[Perf] Changes at 7/6/2022 9:44:27 PM dotnet/perf-autofiling-issues#6747

Closed

[Perf] Changes at 7/13/2022 4:49:41 PM dotnet/perf-autofiling-issues#6745

Closed

[Perf] Changes at 7/6/2022 9:44:27 PM dotnet/perf-autofiling-issues#6756

Closed

AndyAyersMS mentioned this pull request Jul 20, 2022

[Perf] Changes at 7/6/2022 9:44:27 PM dotnet/perf-autofiling-issues#6762

Closed

AndyAyersMS mentioned this pull request Jul 21, 2022

[Perf] Changes at 7/6/2022 9:44:27 PM dotnet/perf-autofiling-issues#6702

Closed

JulieLeeMSFT mentioned this pull request Jul 28, 2022

What's new in .NET 7 Preview 7 [WIP] dotnet/core#7455

Closed

AndyAyersMS mentioned this pull request Aug 6, 2022

LinqBenchmarks.Where00ForX has regressed on x86 #67968

Closed

mrsharm mentioned this pull request Aug 12, 2022

.NET 7.0 Preview 7 Microbenchmarks Performance Study Report #73866

Closed

24 tasks

ghost locked as resolved and limited conversation to collaborators Sep 10, 2022
