Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Perf] Regressions in Integer Formatting #54166

Closed
DrewScoggins opened this issue Jun 14, 2021 · 21 comments · Fixed by #57400
Closed

[Perf] Regressions in Integer Formatting #54166

DrewScoggins opened this issue Jun 14, 2021 · 21 comments · Fixed by #57400
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark
Milestone

Comments

@DrewScoggins
Copy link
Member

DrewScoggins commented Jun 14, 2021

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_Int32

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
ToString - Duration of single invocation 24.70 ns 34.44 ns 1.39 0.21
TryFormat - Duration of single invocation 19.39 ns 21.12 ns 1.09 0.17

graph
graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Int32*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Int32.ToString(value: 2147483647)


System.Tests.Perf_Int32.TryFormat(value: 2147483647)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_UInt64

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
TryFormat - Duration of single invocation 10.78 ns 12.11 ns 1.12 0.13

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt64*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_UInt64.TryFormat(value: 12345)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_Version

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
TryFormatL - Duration of single invocation 96.02 ns 103.50 ns 1.08 0.15

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Version*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Version.TryFormatL


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 28e63279342bd2f6ca43442d864f487613a53bc9
Compare d49bcbe0441f5c954cddcbe28a222eb34917bcaf
Diff Diff

Regressions in System.Tests.Perf_UInt32

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
ToString - Duration of single invocation 22.92 ns 25.28 ns 1.10 0.24
TryFormat - Duration of single invocation 17.54 ns 20.77 ns 1.18 0.01

graph
graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt32*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_UInt32.ToString(value: 4294967295)


System.Tests.Perf_UInt32.TryFormat(value: 4294967295)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added arch-arm64 os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark labels Jun 14, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jun 14, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@DrewScoggins DrewScoggins changed the title [Perf] Changes at 5/18/2021 5:35:47 PM [Perf] Regressions in Integer Formatting Jun 14, 2021
@ghost
Copy link

ghost commented Jun 15, 2021

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_Int32

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
ToString - Duration of single invocation 24.70 ns 34.44 ns 1.39 0.21
TryFormat - Duration of single invocation 19.39 ns 21.12 ns 1.09 0.17

graph
graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Int32*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Int32.ToString(value: 2147483647)


System.Tests.Perf_Int32.TryFormat(value: 2147483647)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_UInt64

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
TryFormat - Duration of single invocation 10.78 ns 12.11 ns 1.12 0.13

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt64*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_UInt64.TryFormat(value: 12345)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_Version

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
TryFormatL - Duration of single invocation 96.02 ns 103.50 ns 1.08 0.15

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Version*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Version.TryFormatL


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 28e63279342bd2f6ca43442d864f487613a53bc9
Compare d49bcbe0441f5c954cddcbe28a222eb34917bcaf
Diff Diff

Regressions in System.Tests.Perf_UInt32

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
ToString - Duration of single invocation 22.92 ns 25.28 ns 1.10 0.24
TryFormat - Duration of single invocation 17.54 ns 20.77 ns 1.18 0.01

graph
graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt32*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_UInt32.ToString(value: 4294967295)


System.Tests.Perf_UInt32.TryFormat(value: 4294967295)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author: DrewScoggins
Assignees: -
Labels:

arch-arm64, area-System.Runtime, os-windows, tenet-performance, tenet-performance-benchmarks, untriaged

Milestone: -

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 12, 2021
@tannergooding tannergooding added this to the Future milestone Jul 12, 2021
@danmoseley
Copy link
Member

danmoseley commented Aug 10, 2021

@tannergooding looking at the list of changes, fdbef6c popped out, but Dragon doesn't do much actual division does it?
#52893

@tannergooding
Copy link
Member

Dragon is for floating-point. Integer formatting frequently divides by 10: https://source.dot.net/#System.Private.CoreLib/Number.Formatting.cs,c66436177df5dc9b

@stephentoub
Copy link
Member

cc: @pentp

@pentp
Copy link
Contributor

pentp commented Aug 11, 2021

I don't have an arm64 machine to do perf tests, so I'm trying to get some assembly diffs, but I can't get jit-dasm to work with anything I try. From a clean repo I built the following stuff:

.\build.cmd Clr.Runtime+Clr.CoreLib+Clr.NativeCoreLib -a x64 -c Release
.\build.cmd Clr.Runtime+Clr.CoreLib+Clr.NativeCoreLib -a arm64 -c Release
.\build.cmd Clr.AllJits -a x64 -c Checked

I then tried to get dasm output:

jit-dasm-pmi.exe --altjit windows.x64.Checked\clrjit_win_arm64_x64.dll --corerun windows.x64.Release\corerun.exe windows.arm64.Release\IL\System.Private.CoreLib.dll -v
Copying default jit: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release ==> C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\-backup
JIT DASM PMI failed: Access to the path 'C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release' is denied.
Restoring default jit: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\-backup ==> C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release
Unhandled exception. System.IO.FileNotFoundException: Could not find file 'C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\-backup'.

I also tried with -j option and it complains about the file being in use (by itself probably, nothing else has it open):

jit-dasm-pmi.exe -j windows.x64.Release\clrjit.dll --altjit windows.x64.Checked\clrjit_win_arm64_x64.dll --corerun windows.x64.Release\corerun.exe windows.arm64.Release\IL\System.Private.CoreLib.dll -v
Copying default jit: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit.dll ==> C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit-backup.dll
Copying in the test jit: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit.dll ==> C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit.dll
JIT DASM PMI failed: The process cannot access the file 'C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit.dll' because it is being used by another process.

I then tried to manually copy the clrjit.dll files and run with --nocopy:

jit-dasm-pmi.exe -j windows.x64.Release\clrjit.dll --altjit windows.x64.Checked\clrjit_win_arm64_x64.dll --corerun windows.x64.Release\corerun.exe windows.arm64.Release\IL\System.Private.CoreLib.dll -v --nocopy
...
Running: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\corerun.exe C:\dev\jitutils\bin\pmi.dll DRIVEALL-QUIET windows.arm64.Release\IL\System.Private.CoreLib.dll
...
Managed assembly: C:\dev\jitutils\bin\pmi.dll
Arguments (2): DRIVEALL-QUIET windows.arm64.Release\IL\System.Private.CoreLib.dll
BEGIN: coreclr_initialize failed - Error: 0x80004005
END: coreclr_initialize failed - Error: 0x80004005
1 errors compiling set.

I've clearly misunderstood how these utilities work. Can they even be used to get arm64 asm dumps on win-x64 or is there a better/easier way?

@danmoseley danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Runtime labels Aug 11, 2021
@danmoseley
Copy link
Member

@dotnet/jit-contrib could someone help @pentp here?

I've updated the area to codegen as I can't see a relevant change on the libraries side, let me know if that's not right.

@danmoseley danmoseley removed this from the Future milestone Aug 11, 2021
@danmoseley danmoseley added the untriaged New issue has not been triaged by the area owner label Aug 11, 2021
@danmoseley
Copy link
Member

Untriaging for codegen team to do.

@JulieLeeMSFT JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Aug 11, 2021
@JulieLeeMSFT
Copy link
Member

CC @AndyAyersMS PTAL.

@AndyAyersMS
Copy link
Member

Test history implicate #52893. In particular the change was merged, reverted, and then re-merged, which matches the up-down-up seen here:

newplot (66)

As for how to get diffs, it should be something like:

;; builds as you do above, in both "ref" and "diff" repos

jit-diff diff --base --base_root <path to ref repo> --core_root <path to release core_root in either repo> --altjit clrjit_win_arm64_x64.dll -t X --pmi

jit-diff diff --diff --diff_root <path to diff repo> --core_root <path to release core_root in either repo> --altjit clrjit_win_arm64_x64.dll -t X --pmi

jit-analyze --base X\base --diff X\diff

@JulieLeeMSFT
Copy link
Member

Thanks @AndyAyersMS.
@pentp then what is the plan? Are we going to revert #52893?

@AndyAyersMS
Copy link
Member

#52893 clearly improves the cost of division on x64, so at this point I'd say there is something unexpected going on that is arm64 specific and we should try and figure out what that is.

EG Here's the windows x64 results for that same test over the same time interval.

newplot (67)

@pentp if you can't get diffs using the above let me know and I'll actually go try it myself to make sure I didn't leave out anything important.

@AndyAyersMS AndyAyersMS removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 12, 2021
@AndyAyersMS AndyAyersMS added this to the 6.0.0 milestone Aug 12, 2021
@AndyAyersMS
Copy link
Member

Marking as 6.0 at least until we've better understood what's going on.

@tannergooding
Copy link
Member

I wonder if its partially due to certain operations, such as integer division, being significantly more expensive on x86/x64 (as you only have div which computes the remainder as well).

On A64, the benefits of avoiding div are lessened and some of the algorithms that avoid div might expect you can have things like lea instead.

@EgorBo
Copy link
Member

EgorBo commented Aug 12, 2021

@AndyAyersMS can it also be PGO related? dc7598e

image

@pentp
Copy link
Contributor

pentp commented Aug 12, 2021

So I managed to finally generate a diff: https://www.diffchecker.com/Rr7cRI2V
I suspect the slowdown comes from using mul Xd, Xn, Xm instead of umull Xd, Wn, Wm?

Edit: ARM software optimization guides seem to indicate that mul and umull have identical latencies/throughput. umulh is more expensive, but it's only used once and it should still be faster than a bunch of lsr/sub/add instructions...

@AndyAyersMS
Copy link
Member

I think these are running on Surface Pro X machines, which are A76 (big).

@echesakov can you ask the Arm folks to take a look and see if they have any ideas?

@echesakov
Copy link
Contributor

@TamarChristinaArm Do you think you or someone on your team would be able to help us with the issue?

@TamarChristinaArm
Copy link
Contributor

@TamarChristinaArm Do you think you or someone on your team would be able to help us with the issue?

Sure thing!, taking a look.

@TamarChristinaArm
Copy link
Contributor

So I managed to finally generate a diff: https://www.diffchecker.com/Rr7cRI2V
I suspect the slowdown comes from using mul Xd, Xn, Xm instead of umull Xd, Wn, Wm?

I think this is correct culprid, the loop has a long dependency chain off the value out of the mul/umull with little independent instructions so it's bound by these two and 64-bit mul has twice the latency of umull.

Edit: ARM software optimization guides seem to indicate that mul and umull have identical latencies/throughput. umulh is more expensive, but it's only used once and it should still be faster than a bunch of lsr/sub/add instructions...

64-bit mul has a latency of 4(3) and umull has 2(1) on Cortex-A76. Were you perhaps comparing the AArch32 instructions? Note that on AArch64 mul and umull aren't real instructions but are architectural aliases for madd and umaddl respectively so you need to look at the latencies for those.

The mul seems to be unneeded though, both inputs are 32-bits so you should be able to just use umull there as before.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 13, 2021
@TamarChristinaArm
Copy link
Contributor

As a side note, while looking at the diff I noticed some other general improvements that could be made in the future:

sxtw    x3, w6	
lsl     x3, x3, #1	
add     x2, x2, x3

can become

add     x2, x2, w6, sxtw, #1

and

lsl     w3, w3, #1
sub     w3, w0, w3

can become

sub     w3, w0, w3, lsl 1

lastly

sub     x2, x2, #2
strh    w1, [x2]

can become

strh    w1, [x2, #-2]!

and probably want to move the constant out of the loop.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 17, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Sep 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark
Projects
None yet