Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regressions in SeekUnroll #60626

Closed
performanceautofiler bot opened this issue Oct 19, 2021 · 6 comments · Fixed by #60721
Closed

Regressions in SeekUnroll #60626

performanceautofiler bot opened this issue Oct 19, 2021 · 6 comments · Fixed by #60721
Assignees
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark untriaged New issue has not been triaged by the area owner
Milestone

Comments

@performanceautofiler
Copy link

Run Information

Architecture x64
OS Windows 10.0.18362
Baseline 33c0b570c2516747285db3bb63b77a74d4d4246e
Compare 7ecced4a339cb18e59d19c2ffa4e429072e309d9
Diff Diff

Regressions in PerfLabTests.LowLevelPerf

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
ObjectStringIsString - Duration of single invocation 123.40 μs 155.92 μs 1.26 0.06 False
GenericGenericMethod - Duration of single invocation 155.92 μs 211.63 μs 1.36 0.06 False

graph
graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'PerfLabTests.LowLevelPerf*'

Payloads

Baseline
Compare

Histogram

PerfLabTests.LowLevelPerf.ObjectStringIsString


PerfLabTests.LowLevelPerf.GenericGenericMethod


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

### Run Information
Architecture x64
OS Windows 10.0.18362
Baseline 33c0b570c2516747285db3bb63b77a74d4d4246e
Compare 7ecced4a339cb18e59d19c2ffa4e429072e309d9
Diff Diff

Regressions in SeekUnroll

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
Test - Duration of single invocation 1.25 secs 1.56 secs 1.25 0.37 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'SeekUnroll*'

Payloads

Baseline
Compare

Histogram

SeekUnroll.Test(boxedIndex: 3)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture x64
OS Windows 10.0.18362
Baseline 33c0b570c2516747285db3bb63b77a74d4d4246e
Compare 7ecced4a339cb18e59d19c2ffa4e429072e309d9
Diff Diff

Regressions in System.Tests.Perf_Char

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
GetUnicodeCategory - Duration of single invocation 4.40 ns 5.74 ns 1.30 0.10 False

Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Tests.Perf_Char*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Char.GetUnicodeCategory(c: 'א')


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@kunalspathak kunalspathak changed the title [Perf] Changes at 10/15/2021 8:43:07 AM Regressions in SeekUnroll Oct 19, 2021
@kunalspathak kunalspathak transferred this issue from dotnet/perf-autofiling-issues Oct 19, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Oct 19, 2021
@kunalspathak
Copy link
Member

Caused by #60228
@jakobbotsch

@kunalspathak kunalspathak added arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark labels Oct 19, 2021
@ghost
Copy link

ghost commented Oct 19, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture x64
OS Windows 10.0.18362
Baseline 33c0b570c2516747285db3bb63b77a74d4d4246e
Compare 7ecced4a339cb18e59d19c2ffa4e429072e309d9
Diff Diff

Regressions in PerfLabTests.LowLevelPerf

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
ObjectStringIsString - Duration of single invocation 123.40 μs 155.92 μs 1.26 0.06 False
GenericGenericMethod - Duration of single invocation 155.92 μs 211.63 μs 1.36 0.06 False

graph
graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'PerfLabTests.LowLevelPerf*'

Payloads

Baseline
Compare

Histogram

PerfLabTests.LowLevelPerf.ObjectStringIsString


PerfLabTests.LowLevelPerf.GenericGenericMethod


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

### Run Information
Architecture x64
OS Windows 10.0.18362
Baseline 33c0b570c2516747285db3bb63b77a74d4d4246e
Compare 7ecced4a339cb18e59d19c2ffa4e429072e309d9
Diff Diff

Regressions in SeekUnroll

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
Test - Duration of single invocation 1.25 secs 1.56 secs 1.25 0.37 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'SeekUnroll*'

Payloads

Baseline
Compare

Histogram

SeekUnroll.Test(boxedIndex: 3)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture x64
OS Windows 10.0.18362
Baseline 33c0b570c2516747285db3bb63b77a74d4d4246e
Compare 7ecced4a339cb18e59d19c2ffa4e429072e309d9
Diff Diff

Regressions in System.Tests.Perf_Char

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
GetUnicodeCategory - Duration of single invocation 4.40 ns 5.74 ns 1.30 0.10 False

Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Tests.Perf_Char*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Char.GetUnicodeCategory(c: 'א')


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author: performanceautofiler[bot]
Assignees: -
Labels:

os-windows, tenet-performance, tenet-performance-benchmarks, arch-x64, area-CodeGen-coreclr, untriaged

Milestone: -

@kunalspathak
Copy link
Member

More regressions: dotnet/perf-autofiling-issues#1926

@jakobbotsch jakobbotsch self-assigned this Oct 19, 2021
@jakobbotsch jakobbotsch added this to the 7.0.0 milestone Oct 19, 2021
@jakobbotsch
Copy link
Member

Investigating now. Initial results are below.

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1288 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-alpha.1.21518.14
  [Host]     : .NET 7.0.0 (7.0.21.48001), X64 RyuJIT
  Job-VWGBDZ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-QJEQHT : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-VXMGKV : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method Job Toolchain boxedIndex Mean Error StdDev Median Min Max Ratio Code Size Allocated
SeekUnroll Job-VWGBDZ \main\corerun.exe 1 864.5 ms 0.78 ms 0.65 ms 864.8 ms 863.1 ms 865.5 ms 1.00 392 B -
SeekUnroll Job-QJEQHT \no_lea\corerun.exe 1 867.0 ms 4.41 ms 4.12 ms 867.1 ms 861.0 ms 875.0 ms 1.00 380 B -
SeekUnroll Job-VXMGKV \no_vsd_reg\corerun.exe 1 860.4 ms 1.26 ms 1.17 ms 860.3 ms 858.5 ms 862.9 ms 1.00 392 B -
SeekUnroll Job-VWGBDZ \main\corerun.exe 3 862.5 ms 3.38 ms 3.00 ms 861.3 ms 859.7 ms 869.7 ms 1.00 392 B -
SeekUnroll Job-QJEQHT \no_lea\corerun.exe 3 861.2 ms 1.35 ms 1.13 ms 861.0 ms 859.0 ms 863.6 ms 1.00 380 B 10,224 B
SeekUnroll Job-VXMGKV \no_vsd_reg\corerun.exe 3 865.1 ms 4.56 ms 4.26 ms 864.1 ms 859.9 ms 871.7 ms 1.00 392 B -
SeekUnroll Job-VWGBDZ \main\corerun.exe 11 1,077.1 ms 3.55 ms 3.32 ms 1,077.2 ms 1,071.8 ms 1,084.9 ms 1.00 392 B -
SeekUnroll Job-QJEQHT \no_lea\corerun.exe 11 920.4 ms 2.32 ms 2.17 ms 920.0 ms 917.5 ms 924.2 ms 0.85 380 B -
SeekUnroll Job-VXMGKV \no_vsd_reg\corerun.exe 11 1,076.5 ms 2.53 ms 2.24 ms 1,075.9 ms 1,073.9 ms 1,081.4 ms 1.00 392 B -
SeekUnroll Job-VWGBDZ \main\corerun.exe 19 1,199.4 ms 2.44 ms 2.28 ms 1,199.3 ms 1,195.2 ms 1,203.3 ms 1.00 392 B -
SeekUnroll Job-QJEQHT \no_lea\corerun.exe 19 1,132.8 ms 2.12 ms 1.98 ms 1,132.9 ms 1,130.2 ms 1,136.9 ms 0.94 380 B -
SeekUnroll Job-VXMGKV \no_vsd_reg\corerun.exe 19 1,216.5 ms 2.34 ms 2.07 ms 1,216.8 ms 1,213.2 ms 1,220.4 ms 1.01 392 B -
SeekUnroll Job-VWGBDZ \main\corerun.exe 27 1,412.3 ms 3.22 ms 3.01 ms 1,410.9 ms 1,407.8 ms 1,419.3 ms 1.00 392 B -
SeekUnroll Job-QJEQHT \no_lea\corerun.exe 27 1,498.2 ms 1.30 ms 1.01 ms 1,498.3 ms 1,496.9 ms 1,500.1 ms 1.06 380 B -
SeekUnroll Job-VXMGKV \no_vsd_reg\corerun.exe 27 1,406.6 ms 3.35 ms 3.13 ms 1,407.0 ms 1,401.4 ms 1,413.2 ms 1.00 392 B -

It looks like we are no longer containing some immediates after f40247e. In SeekUnroll.InnerLoop: the diff from that change to main is:

@@ -66,10 +66,11 @@ M00_L02:
        mov       rsi,rcx
        mov       rdi,rdx
        xor       ebx,ebx
-       mov       rcx,7FFD0690BD78
+       mov       rcx,7FFD0690BC80
        mov       edx,2A
        call      CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
-       cmp       dword ptr [7FFD06A9C2B8],0
+       mov       rax,7FFD0690C1C0
+       cmp       dword ptr [rax],0
        jle       near ptr M01_L02
 M01_L00:
        vmovupd   ymm0,[rdi]
@@ -104,7 +105,8 @@ M01_L01:
        add       eax,edx
        mov       [rsi],eax
        inc       ebx
-       cmp       ebx,[7FFD06A9C2B8]
+       mov       rax,7FFD0690C1C0
+       cmp       ebx,[rax]
        jl        near ptr M01_L00
 M01_L02:
        vzeroupper
@@ -113,5 +115,5 @@ M01_L02:
        pop       rsi
        pop       rdi
        ret
-; Total bytes of code 196
+; Total bytes of code 208

@jakobbotsch
Copy link
Member

The containment of that immediate seems to flip back and forth, which is strange. In some runs in main it is contained and in some runs it isn't.
It can clearly be contained in the following case since it is very close to the instruction:

[2021/10/20 14:04:26][INFO]  00007ffd`2c989b53        48B8C0C17F2CFD7F0000 mov      rax, 0x7FFD2C7FC1C0
[2021/10/20 14:04:26][INFO]  00007ffd`2c989b5d        3B18                 cmp      ebx, dword ptr [rax]

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Oct 21, 2021
Due to dotnet#60712 we cannot unconditionally generate lea for handles, since
the runtime does not expect us to generate ask for reloc hints for
arbitrary constants. In addition, for many Intel CPUs it looks like
rip-relative lea has greater latency than mov with an 8-byte immediate.

Instead of reverting the change I'm just turning it off here, since the
change also unified a couple of functions to simplify the handling.

Fix dotnet#60627
Fix dotnet#60626
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Oct 21, 2021
jakobbotsch added a commit that referenced this issue Oct 22, 2021
Due to #60712 we cannot unconditionally generate lea for handles, since
the runtime does not expect us to generate ask for reloc hints for
arbitrary constants. In addition, for many Intel CPUs it looks like
rip-relative lea has greater latency than mov with an 8-byte immediate.

Instead of reverting the change I'm just turning it off here, since the
change also unified a couple of functions to simplify the handling.

Fix #60626
Fix #60627
Fix #60629
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Oct 22, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Nov 21, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants