Regressions in System.Text.Json.Document.Tests.Perf_EnumerateArray #67176

Closed
danmoseley opened this issue Mar 25, 2022 · 24 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

danmoseley (Member) commented Mar 25, 2022

broken out of #67101

[benchmark regression chart]

Commit range: 6bf873a...bc5e386.

The only relevant change seems to be "JIT: enable reading PGO data when switching to optimized" (https://github.com/dotnet/runtime/pull/66618, commit https://github.com/dotnet/runtime/commit/2737da5817ab121fb43ff5577d1e146e0ec6d213).

category:performance
theme:benchmarks
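
For context, Perf_EnumerateArray measures walking the elements of a large parsed JSON document. A minimal sketch of that enumeration pattern (not the actual dotnet/performance benchmark source; the class and method names below are illustrative):

```csharp
using System.Text.Json;

public static class EnumerateArraySketch
{
    // Parse a JSON array once, then enumerate it the way the benchmark's
    // measured loop does: each element access goes through JsonDocument
    // internals, including guards such as CheckNotDisposed.
    public static int CountStrings(string json)
    {
        using JsonDocument document = JsonDocument.Parse(json);
        int count = 0;
        foreach (JsonElement element in document.RootElement.EnumerateArray())
        {
            if (element.ValueKind == JsonValueKind.String)
            {
                count++;
            }
        }
        return count;
    }
}
```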

danmoseley added the area-CodeGen-coreclr label on Mar 25, 2022
The dotnet-issue-labeler bot added the untriaged label on Mar 25, 2022
ghost commented Mar 25, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


JulieLeeMSFT removed the untriaged label and added this to the 7.0.0 milestone on Mar 28, 2022
AndyAyersMS (Member) commented

Hopefully addressed by the PGO update we're waiting for... cc @EgorBo

If not I'll take a look...

AndyAyersMS (Member) commented

Windows ARM64:

[benchmark chart]

A related benchmark shows a similar but smaller regression:

[benchmark chart]

Windows x64 shows a similar but much smaller regression:

[benchmark chart]

These are likely the same issues we see in #67318 where block reordering driven by PGO data is interfering with loop optimizations. Will need to drill in to be sure.

AndyAyersMS (Member) commented

The regression has persisted even after we deferred some block reordering in #69878:

[benchmark chart]

Need to investigate this one.

AndyAyersMS (Member) commented

Can repro this locally, digging in.

| Method | Job | Toolchain | TestCase | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EnumerateArray | Job-YJYIXG | \base-rel\corerun.exe | Json400KB | 2.052 us | 0.0020 us | 0.0019 us | 2.052 us | 2.049 us | 2.055 us | 1.00 | - | NA |
| EnumerateArray | Job-SGSSCH | \diff-rel\corerun.exe | Json400KB | 2.400 us | 0.0045 us | 0.0042 us | 2.398 us | 2.396 us | 2.410 us | 1.17 | - | NA |

AndyAyersMS (Member) commented Jun 30, 2022

Trying to find a way to run these with the -p ETW profiler option, but running into issues because the host is arm64.

Got past it by updating BDN to use BenchmarkDotNet.Diagnostics.Windows 3.0.1, doing a private build (which I had to do on an x64 box; see dotnet/BenchmarkDotNet#2028), and then updating the performance repo to use this build of BDN.
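
For reference, the ETW profiler can also be attached from a config instead of the -p ETW switch. A minimal sketch, assuming the BenchmarkDotNet.Diagnostics.Windows package is referenced and the process runs elevated on Windows; the Program class and the wiring below are illustrative, and the exact EtwProfiler constructor and namespace may vary between BDN versions:

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnostics.Windows;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
    {
        // Attach the ETW profiler as a diagnoser; the resulting .etl traces
        // land under BenchmarkDotNet.Artifacts and can be opened in PerfView.
        IConfig config = DefaultConfig.Instance
            .AddDiagnoser(new EtwProfiler());

        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);
    }
}
```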

AndyAyersMS (Member) commented

Assembly for the key loops. Main differences are:

  • the block layout. In the after picture one of the loop blocks (IG25) ends up being placed after some non loop blocks (not shown)
  • the address of the callee CheckNotDisposed is indirect in the after picture, and the indirection cell requires a large constant materialization
  • there is one missed CSE.

Likely it is the difference in the call to CheckNotDisposed that accounts for most of the perf difference. Because the indir introduced for the call only shows up in lowering, it's not a candidate for CSE/hoisting (there are two other call sites in this method). We should possibly expand this earlier.

Hoisting the IAT_PVALUE indir may end up being risky, as the indir target presumably might get changed. Perhaps for frequently running loops we could split the loop somehow and refresh the hoisted value every so often.

But we could easily hoist/CSE the creation of the indir addresses.

Before:
G_M9251_IG17:              ;; offset=0180H
        B9404BB3          ldr     w19, [fp,#72]
        B9404FB4          ldr     w20, [fp,#76]
        6B14027F          cmp     w19, w20
        5400062A          bge     G_M9251_IG29

G_M9251_IG18:              ;; offset=0190H
        7100027F          cmp     w19, #0
        540000AA          bge     G_M9251_IG20

G_M9251_IG19:              ;; offset=0198H
        B9405BA0          ldr     w0, [fp,#88]
        11003000          add     w0, w0, #12
        B9004BA0          str     w0, [fp,#72]
        1400001D          b       G_M9251_IG24

G_M9251_IG20:              ;; offset=01A8H
        F9402BB5          ldr     x21, [fp,#80]
        39C002BF          ldrsb   wzr, [x21]
        AA1503E0          mov     x0, x21
        ;; CheckNotDisposed
        97F95131          bl      **
        F9401AA0          ldr     x0, [x21,#48]
        B50000A0          cbnz    x0, G_M9251_IG21
        350003B3          cbnz    w19, G_M9251_IG27
        AA1F03E1          mov     x1, xzr
        2A1F03E2          mov     w2, wzr
        14000009          b       G_M9251_IG22

G_M9251_IG21:              ;; offset=01D0H
        B9400801          ldr     w1, [x0,#8]
        6B13003F          cmp     w1, w19
        540002E3          blo     G_M9251_IG27
        91004002          add     x2, x0, #16
        4B130020          sub     w0, w1, w19
        2A1303E1          mov     w1, w19
        8B010041          add     x1, x2, x1
        2A0003E2          mov     w2, w0

G_M9251_IG22:              ;; offset=01F0H
        7100305F          cmp     w2, #12
        540001AB          blt     G_M9251_IG26
        B940003F          ldr     wzr, [x1]
        B9400820          ldr     w0, [x1,#8]
        531C7C01          lsr     w1, w0, #28
        53001C21          uxtb    w1, w1
        7100143F          cmp     w1, #5
        5400018B          blt     G_M9251_IG28
        11003261          add     w1, w19, #12

G_M9251_IG23:              ;; offset=0214H
        B9004BA1          str     w1, [fp,#72]

G_M9251_IG24:              ;; offset=0218H
        B9404BA0          ldr     w0, [fp,#72]
        6B14001F          cmp     w0, w20
        54FFFB0B          blt     G_M9251_IG17

After:

G_M9251_IG14:              ;; offset=0168H
        B94043B4          ldr     w20, [fp,#64]
        B94047B6          ldr     w22, [fp,#68]
        6B16029F          cmp     w20, w22
        540007AA          bge     G_M9251_IG27

G_M9251_IG15:              ;; offset=0178H
        7100029F          cmp     w20, #0
        540000AA          bge     G_M9251_IG17

G_M9251_IG16:              ;; offset=0180H
        B94053A0          ldr     w0, [fp,#80]
        11003000          add     w0, w0, #12
        B90043A0          str     w0, [fp,#64]
        1400001D          b       G_M9251_IG21

G_M9251_IG17:              ;; offset=0190H
        F94027B3          ldr     x19, [fp,#72]
        39C0027F          ldrsb   wzr, [x19]
        AA1303E0          mov     x0, x19
        D2935001          movz    x1, #0x9a80    
        F2AE1C21          movk    x1, #0x70e1 LSL #16
        F2CFFFA1          movk    x1, #0x7ffd LSL #32
        F9400021          ldr     x1, [x1]
        ;; CheckNotDisposed
        D63F0020          blr     x1            
        F9401A60          ldr     x0, [x19,#48]
        B4000480          cbz     x0, G_M9251_IG25
        B9400801          ldr     w1, [x0,#8]
        6B14003F          cmp     w1, w20
        54000283          blo     G_M9251_IG23
        91004001          add     x1, x0, #16
        2A1403E2          mov     w2, w20
        8B020021          add     x1, x1, x2
        B9400800          ldr     w0, [x0,#8]    ;; -CSE
        4B140002          sub     w2, w0, w20

G_M9251_IG18:              ;; offset=01D8H
        7100305F          cmp     w2, #12
        5400026B          blt     G_M9251_IG24
        B940003F          ldr     wzr, [x1]
        B9400820          ldr     w0, [x1,#8]
        531C7C01          lsr     w1, w0, #28
        53001C21          uxtb    w1, w1
        7100143F          cmp     w1, #5
        5400030B          blt     G_M9251_IG26

G_M9251_IG19:              ;; offset=01F8H
        11003281          add     w1, w20, #12

G_M9251_IG20:              ;; offset=01FCH
        B90043A1          str     w1, [fp,#64]

G_M9251_IG21:              ;; offset=0200H
        B94043A0          ldr     w0, [fp,#64]
        6B16001F          cmp     w0, w22
        54FFFB0B          blt     G_M9251_IG14

G_M9251_IG25:              ;; offset=0244H
        35FFFE74          cbnz    w20, G_M9251_IG23
        AA1F03E1          mov     x1, xzr
        2A1F03E2          mov     w2, wzr
        17FFFFE2          b       G_M9251_IG18
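
For context on the callee: CheckNotDisposed is JsonDocument's disposal guard, and it runs on each element access inside this enumeration loop (it also accounts for a visible share of the samples in the profiles later in this thread). A hedged sketch of that kind of guard; the type and field names here are illustrative, not the actual System.Text.Json source:

```csharp
using System;

// Illustrative stand-in for JsonDocument's disposal guard; the real
// CheckNotDisposed is internal to System.Text.Json and is not reproduced here.
internal sealed class DocumentLike : IDisposable
{
    private byte[] _utf8Json = new byte[16];

    private void CheckNotDisposed()
    {
        if (_utf8Json == null)
        {
            throw new ObjectDisposedException(nameof(DocumentLike));
        }
    }

    // Called for every element the enumerator yields, so the (indirect) call
    // sits on the hot path shown in the assembly listings above.
    public int GetElementByte(int index)
    {
        CheckNotDisposed();
        return _utf8Json[index % _utf8Json.Length];
    }

    public void Dispose() => _utf8Json = null;
}
```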

AndyAyersMS (Member) commented

I cherry-picked the changes from #70988, and while I see a more compact call in the AFTER case, it doesn't make the benchmark any faster. So seemingly the perf issue lies elsewhere?

Profiling strongly implicates code within the method. Comparing two BDN runs with the same number of iterations/invocations, we see all the extra exclusive cycles in EnumerateArray.

[profiler comparison screenshot]

AndyAyersMS (Member) commented

Looking at the raw sample data from the ETL, it seems that this bit of code in the AFTER version, right after the bl, is especially costly:

        F9401A60          ldr     x0, [x19,#48]
        B4000480          cbz     x0, G_M9251_IG25
        B9400801          ldr     w1, [x0,#8]
        6B14003F          cmp     w1, w20
        54000283          blo     G_M9251_IG23

The BEFORE code layout inverts the first branch.

x0 is never zero here, so the branch is always taken in BEFORE and never taken in AFTER, and it should be quite predictable either way. I can't see offhand why this bit of code should be slower; a not-taken branch should generally be cheaper. There is a dependence chain here, but it's the same in the BEFORE code.

AndyAyersMS (Member) commented Jun 30, 2022

Ubuntu arm64 shows a similar regression at the same change. Feels like I must be missing something obvious here.

[benchmark chart]

cc @EgorBo

AndyAyersMS (Member) commented

Also (note to self): if you tell PerfView to annotate the source for a method and then look in the log, it will dump the raw counts; you'll see something like:

GetSourceLine: Getting source line for code address index 0000215A
GetSourceLine: address for code address is 7ffd70e54ad4 module microbenchmarks
GetSourceLine: Found JITTed method System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateArray(), index 0000086E token 6001add
GetSourceLine: Found an il-to-native mapping MethodIdx 0000086E Start 7ffd70e54a80 Len 33c
GetSourceLine: NativeOffset 54 ILOffset = c
FindSymbolFilePath: *{ Locating PDB c:\repos\performance\artifacts\obj\MicroBenchmarks\Release\net7.0\MicroBenchmarks.pdb GUID d7c3c786-f5ca-4fea-a8a2-646b668b2aba Age 1 Version 
FindSymbolFilePath: Pdb is for DLL c:\repos\performance\artifacts\bin\microbenchmarks\release\net7.0\e3f010d1-375f-440a-8d2c-94eda8290ad4\bin\release\net7.0\publish\microbenchmarks.dll
FindSymbolFilePath: } Hit Cache, returning c:\repos\performance\artifacts\obj\MicroBenchmarks\Release\net7.0\MicroBenchmarks.pdb
Opened Pdb file c:\repos\performance\artifacts\obj\MicroBenchmarks\Release\net7.0\MicroBenchmarks.pdb
Opened Portable Pdb Source File: c:\repos\performance\src\benchmarks\micro\libraries\System.Text.Json\Document\EnumerateArray.cs

Metric as a function of code address
      Address    :   Line     Metric
    7ffd70e54a94 :     57        1.0
    7ffd70e54aa8 :     57        1.0
    7ffd70e54ab4 :     57        3.0
    7ffd70e54ab8 :     57        1.0
    7ffd70e54ad4 :     57        3.0
    7ffd70e54ad8 :     57        3.0
    7ffd70e54ae4 :     57        1.0
    7ffd70e54aec :     57        1.0
    7ffd70e54b08 :     57        1.0
    7ffd70e54b10 :     57        2.0
    7ffd70e54b20 :     57        2.0
    7ffd70e54b60 :      0        1.0
    7ffd70e54b6c :      0        1.0
    7ffd70e54b88 :      0        1.0
    7ffd70e54bc4 :      0        4.0
    7ffd70e54bdc :      0        7.0
    7ffd70e54be8 :      0      206.0
    7ffd70e54bec :      0       25.0
    7ffd70e54bf0 :      0       29.0

This gives similar (though not exactly the same) data as my instructions retired tool with the new -show-samples option, which for the AFTER run above produces:

Samples for corerun: 5935 events for Benchmark Intervals
Jitting           : 00.52% 2.3E+05  samples 1729 methods
  JitInterface    : 00.23% 1E+05    samples
Jit-generated code: 99.28% 4.41E+07 samples
  Jitted code     : 99.28% 4.41E+07 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.00% 0        samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 99.28% 4.41E+07 samples
  R2R code        : 00.00% 0        samples

93.09%   4.136E+07   Tier-1   [MicroBenchmarks]Perf_EnumerateArray.EnumerateArray()
05.96%   2.65E+06    Tier-1   [System.Text.Json]JsonDocument.CheckNotDisposed()
00.29%   1.3E+05     native   clrjit.dll
00.23%   1E+05       Tier-1   [e3f010d1-375f-440a-8d2c-94eda8290ad4]Runnable_0.WorkloadActionUnroll(int64)
00.14%   6E+04       native   coreclr.dll
00.14%   6E+04       native   ntoskrnl.exe
Raw samples for [MicroBenchmarks]Perf_EnumerateArray.EnumerateArray() at 0x00007FFD70E54A80 -- 0x00007FFD70E54DBC (length 0x033C)
0x0014 : 1
0x0028 : 1
0x0034 : 3
0x0038 : 1
0x0058 : 3
0x0064 : 1
0x006C : 1
0x0088 : 1
0x0090 : 2
0x00A0 : 2
0x00E0 : 1
0x00EC : 1
0x0108 : 1
0x0144 : 4
0x015C : 6
0x0168 : 186

The difference may be that my tool uses the exact BDN events to filter the samples, while the perfview report above is using an "include filter" on WorkloadActionUnroll and so may capture a bit more.

AndyAyersMS (Member) commented

Similar results on the M1

BenchmarkDotNet=v0.13.1.1786-nightly, OS=macOS Monterey 12.3 (21E230) [Darwin 21.4.0]
Apple M1 Max, 1 CPU, 10 logical and 10 physical cores
.NET SDK=7.0.100-preview.2.22153.17
  [Host]     : .NET 6.0.3 (6.0.322.12309), Arm64 RyuJIT
  Job-VMNNRZ : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT
  Job-DELMRH : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20  
MinIterationCount=15  WarmupCount=1  
| Method | Job | Toolchain | TestCase | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EnumerateArray | Job-VMNNRZ | AFTER | Json400KB | 2.314 us | 0.0020 us | 0.0018 us | 2.314 us | 2.313 us | 2.318 us | 1.22 | - | NA |
| EnumerateArray | Job-DELMRH | BEFORE | Json400KB | 1.901 us | 0.0030 us | 0.0027 us | 1.902 us | 1.895 us | 1.905 us | 1.00 | - | NA |

EgorBo (Member) commented Jul 1, 2022

@AndyAyersMS so if I understand you correctly, you blame an indirect call to CheckNotDisposed (via precode, or maybe even via jump stub + precode)?

EgorBo (Member) commented Jul 1, 2022

@AndyAyersMS

> it doesn't make the benchmark any faster. So seemingly the perf issue lies elsewhere?

Good news: it turns out my PR #70988 doesn't need the && ftn->HasNativeCodeSlot() check. With it removed, I now see this codegen diff: https://www.diffchecker.com/nKMTF7nE

EgorBo (Member) commented Jul 1, 2022

However, the actual improvement from the direct call is still small (~1%)

AndyAyersMS (Member) commented

Right, the older version of #70988 had the same effect on the codegen and didn't fix the perf issue here.

EgorBo (Member) commented Jul 1, 2022

The benchmark is basically "fixed" with #71534

AndyAyersMS (Member) commented

Interesting. Would still be nice to understand why there was a regression here. Since it repros on many different devices it seems like it is something fundamental and not just a microarchitectural problem. Seems like there might be something important to be learned.

EgorBo (Member) commented Jul 1, 2022

I agree, but so far I haven't been able to identify anything unusual in VTune on Windows x64; I guess I need to do it on Windows ARM64, where the regression is way more visible. On win-x64 it looks like it slightly fixed itself over time:

[benchmark chart]

AndyAyersMS (Member) commented

My current hunch is that something in the stub chain between this method and CheckNotDisposed ends up blowing out the return branch predictor, and that is why we see extra sample hits just after the bl. But I could be wrong. Trying to step through the actual code sequence now to see for sure.

EgorBo (Member) commented Jul 1, 2022

Btw, performance was slightly faster (>2%) if I removed this condition together with #70988 (without the && ftn->HasNativeCodeSlot() check).

That means: just call the absolute address; don't try to reloc it (or emit jump stubs).

AndyAyersMS (Member) commented Jul 8, 2022

Running on an Ampere; an overview from perf:

BASE

Performance counter stats for 'dotnet run -c Release -f net7.0 -- --filter System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateArray --parameter-filter TestCase:Json400KB --corerun /home/andya/bugs/r67176/base-rel/corerun --iterationCount 100 --invocationCount 100000':

         53,735.74 msec task-clock                #    1.185 CPUs utilized          
            50,373      context-switches          #  937.421 /sec                   
                91      cpu-migrations            #    1.693 /sec                   
           422,809      page-faults               #    7.868 K/sec                  
   152,920,646,006      cycles                    #    2.846 GHz                    
   230,502,866,275      instructions              #    1.51  insn per cycle         
   <not supported>      branches                                                    
       986,085,875      branch-misses                                               

      45.357072152 seconds time elapsed

      49.674995000 seconds user
       4.062895000 seconds sys

DIFF

Performance counter stats for 'dotnet run -c Release -f net7.0 -- --filter System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateArray --parameter-filter TestCase:Json400KB --corerun /home/andya/bugs/r67176/diff-rel/corerun --iterationCount 100 --invocationCount 100000':

         61,693.81 msec task-clock                #    1.185 CPUs utilized          
            48,765      context-switches          #  790.436 /sec                   
                91      cpu-migrations            #    1.475 /sec                   
           430,535      page-faults               #    6.979 K/sec                  
   176,402,651,222      cycles                    #    2.859 GHz                    
   237,532,018,384      instructions              #    1.35  insn per cycle         
   <not supported>      branches                                                    
       979,340,016      branch-misses                                               

      52.049848600 seconds time elapsed

      57.298960000 seconds user
       4.391213000 seconds sys

So this seems to be primarily a stall issue. Drilling in a bit:

BASE

    37,537,472,128      stalled-cycles-backend    #    0.00% backend cycles idle    
    29,884,183,341      stalled-cycles-frontend  

DIFF

    51,543,373,659      stalled-cycles-backend    #    0.00% backend cycles idle    
    29,460,067,083      stalled-cycles-frontend  

So it looks like a backend stall. I was not able to pin this down to any particular backend event, even following the approach in the Arm Neoverse N1 Core: Performance Analysis Methodology. I also consulted the Arm Neoverse N1 PMU Guide.

Sampling on stall_backend and annotating (DIFF on left, BASE on right), we still see the hits clustering after the blr in the DIFF codegen (the codegen is very similar to the Windows codegen above).

[annotated sampling screenshot]

AndyAyersMS (Member) commented

This regression is now fixed, thanks to @EgorBo and #71534 (note that right after the fix went in, we had a big widespread regression, which is also now fixed):

[benchmark chart]

I would still like to better understand the root cause of this regression, but it's no longer a 7.0 issue, so I will move this to Future.

AndyAyersMS modified the milestones: 7.0.0 → Future on Jul 11, 2022
AndyAyersMS (Member) commented

We no longer benchmark that bit of hardware, so not much we can do at this point.

The github-actions bot locked the conversation and limited it to collaborators on Jul 20, 2024.