[Perf] Regressions in Integer Formatting #54166

DrewScoggins · 2021-06-14T18:58:03Z

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare	8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff	Diff

Regressions in System.Tests.Perf_Int32

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
ToString - Duration of single invocation	24.70 ns	34.44 ns	1.39	0.21
TryFormat - Duration of single invocation	19.39 ns	21.12 ns	1.09	0.17

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Int32*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Int32.ToString(value: 2147483647)

System.Tests.Perf_Int32.TryFormat(value: 2147483647)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare	8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff	Diff

Regressions in System.Tests.Perf_UInt64

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
TryFormat - Duration of single invocation	10.78 ns	12.11 ns	1.12	0.13

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt64*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_UInt64.TryFormat(value: 12345)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare	8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff	Diff

Regressions in System.Tests.Perf_Version

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
TryFormatL - Duration of single invocation	96.02 ns	103.50 ns	1.08	0.15

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Version*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Version.TryFormatL

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	28e63279342bd2f6ca43442d864f487613a53bc9
Compare	d49bcbe0441f5c954cddcbe28a222eb34917bcaf
Diff	Diff

Regressions in System.Tests.Perf_UInt32

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
ToString - Duration of single invocation	22.92 ns	25.28 ns	1.10	0.24
TryFormat - Duration of single invocation	17.54 ns	20.77 ns	1.18	0.01

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt32*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_UInt32.ToString(value: 4294967295)

System.Tests.Perf_UInt32.TryFormat(value: 4294967295)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2021-06-14T18:58:06Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2021-06-15T15:33:30Z

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare	8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff	Diff

Regressions in System.Tests.Perf_Int32

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
ToString - Duration of single invocation	24.70 ns	34.44 ns	1.39	0.21
TryFormat - Duration of single invocation	19.39 ns	21.12 ns	1.09	0.17

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Int32*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Int32.ToString(value: 2147483647)

System.Tests.Perf_Int32.TryFormat(value: 2147483647)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare	8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff	Diff

Regressions in System.Tests.Perf_UInt64

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
TryFormat - Duration of single invocation	10.78 ns	12.11 ns	1.12	0.13

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt64*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_UInt64.TryFormat(value: 12345)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare	8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff	Diff

Regressions in System.Tests.Perf_Version

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
TryFormatL - Duration of single invocation	96.02 ns	103.50 ns	1.08	0.15

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Version*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Version.TryFormatL

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	28e63279342bd2f6ca43442d864f487613a53bc9
Compare	d49bcbe0441f5c954cddcbe28a222eb34917bcaf
Diff	Diff

Regressions in System.Tests.Perf_UInt32

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
ToString - Duration of single invocation	22.92 ns	25.28 ns	1.10	0.24
TryFormat - Duration of single invocation	17.54 ns	20.77 ns	1.18	0.01

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt32*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_UInt32.ToString(value: 4294967295)

System.Tests.Perf_UInt32.TryFormat(value: 4294967295)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author:	DrewScoggins
Assignees:	-
Labels:	`arch-arm64`, `area-System.Runtime`, `os-windows`, `tenet-performance`, `tenet-performance-benchmarks`, `untriaged`
Milestone:	-

danmoseley · 2021-08-10T05:17:29Z

@tannergooding looking at the list of changes, fdbef6c popped out, but Dragon doesn't do much actual division does it?
#52893

tannergooding · 2021-08-10T05:21:39Z

Dragon is for floating-point. Integer formatting frequently divides by 10: https://source.dot.net/#System.Private.CoreLib/Number.Formatting.cs,c66436177df5dc9b

stephentoub · 2021-08-10T11:52:20Z

cc: @pentp

pentp · 2021-08-11T21:43:02Z

I don't have an arm64 machine to do perf tests, so I'm trying to get some assembly diffs, but I can't get jit-dasm to work with anything I try. From a clean repo I built the following stuff:

.\build.cmd Clr.Runtime+Clr.CoreLib+Clr.NativeCoreLib -a x64 -c Release
.\build.cmd Clr.Runtime+Clr.CoreLib+Clr.NativeCoreLib -a arm64 -c Release
.\build.cmd Clr.AllJits -a x64 -c Checked

I then tried to get dasm output:

jit-dasm-pmi.exe --altjit windows.x64.Checked\clrjit_win_arm64_x64.dll --corerun windows.x64.Release\corerun.exe windows.arm64.Release\IL\System.Private.CoreLib.dll -v
Copying default jit: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release ==> C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\-backup
JIT DASM PMI failed: Access to the path 'C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release' is denied.
Restoring default jit: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\-backup ==> C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release
Unhandled exception. System.IO.FileNotFoundException: Could not find file 'C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\-backup'.

I also tried with -j option and it complains about the file being in use (by itself probably, nothing else has it open):

jit-dasm-pmi.exe -j windows.x64.Release\clrjit.dll --altjit windows.x64.Checked\clrjit_win_arm64_x64.dll --corerun windows.x64.Release\corerun.exe windows.arm64.Release\IL\System.Private.CoreLib.dll -v
Copying default jit: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit.dll ==> C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit-backup.dll
Copying in the test jit: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit.dll ==> C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit.dll
JIT DASM PMI failed: The process cannot access the file 'C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\clrjit.dll' because it is being used by another process.

I then tried to manually copy the clrjit.dll files and run with --nocopy:

jit-dasm-pmi.exe -j windows.x64.Release\clrjit.dll --altjit windows.x64.Checked\clrjit_win_arm64_x64.dll --corerun windows.x64.Release\corerun.exe windows.arm64.Release\IL\System.Private.CoreLib.dll -v --nocopy
...
Running: C:\dev\ref-runtime\artifacts\bin\coreclr\windows.x64.Release\corerun.exe C:\dev\jitutils\bin\pmi.dll DRIVEALL-QUIET windows.arm64.Release\IL\System.Private.CoreLib.dll
...
Managed assembly: C:\dev\jitutils\bin\pmi.dll
Arguments (2): DRIVEALL-QUIET windows.arm64.Release\IL\System.Private.CoreLib.dll
BEGIN: coreclr_initialize failed - Error: 0x80004005
END: coreclr_initialize failed - Error: 0x80004005
1 errors compiling set.

I've clearly misunderstood how these utilities work. Can they even be used to get arm64 asm dumps on win-x64 or is there a better/easier way?

danmoseley · 2021-08-11T22:03:09Z

@dotnet/jit-contrib could someone help @pentp here?

I've updated the area to codegen as I can't see a relevant change on the libraries side, let me know if that's not right.

danmoseley · 2021-08-11T22:04:32Z

Untriaging for codegen team to do.

JulieLeeMSFT · 2021-08-11T23:47:18Z

CC @AndyAyersMS PTAL.

AndyAyersMS · 2021-08-12T00:00:51Z

Test history implicate #52893. In particular the change was merged, reverted, and then re-merged, which matches the up-down-up seen here:

As for how to get diffs, it should be something like:

;; builds as you do above, in both "ref" and "diff" repos

jit-diff diff --base --base_root <path to ref repo> --core_root <path to release core_root in either repo> --altjit clrjit_win_arm64_x64.dll -t X --pmi

jit-diff diff --diff --diff_root <path to diff repo> --core_root <path to release core_root in either repo> --altjit clrjit_win_arm64_x64.dll -t X --pmi

jit-analyze --base X\base --diff X\diff

JulieLeeMSFT · 2021-08-12T01:09:31Z

Thanks @AndyAyersMS.
@pentp then what is the plan? Are we going to revert #52893?

AndyAyersMS · 2021-08-12T01:18:12Z

#52893 clearly improves the cost of division on x64, so at this point I'd say there is something unexpected going on that is arm64 specific and we should try and figure out what that is.

EG Here's the windows x64 results for that same test over the same time interval.

@pentp if you can't get diffs using the above let me know and I'll actually go try it myself to make sure I didn't leave out anything important.

AndyAyersMS · 2021-08-12T16:41:21Z

Marking as 6.0 at least until we've better understood what's going on.

tannergooding · 2021-08-12T17:01:54Z

I wonder if its partially due to certain operations, such as integer division, being significantly more expensive on x86/x64 (as you only have div which computes the remainder as well).

On A64, the benefits of avoiding div are lessened and some of the algorithms that avoid div might expect you can have things like lea instead.

EgorBo · 2021-08-12T17:16:58Z

@AndyAyersMS can it also be PGO related? dc7598e

pentp · 2021-08-12T19:59:57Z

So I managed to finally generate a diff: https://www.diffchecker.com/Rr7cRI2V
I suspect the slowdown comes from using mul Xd, Xn, Xm instead of umull Xd, Wn, Wm?

Edit: ARM software optimization guides seem to indicate that mul and umull have identical latencies/throughput. umulh is more expensive, but it's only used once and it should still be faster than a bunch of lsr/sub/add instructions...

AndyAyersMS · 2021-08-12T21:38:51Z

I think these are running on Surface Pro X machines, which are A76 (big).

@echesakov can you ask the Arm folks to take a look and see if they have any ideas?

echesakov · 2021-08-12T22:08:50Z

@TamarChristinaArm Do you think you or someone on your team would be able to help us with the issue?

TamarChristinaArm · 2021-08-13T09:48:32Z

@TamarChristinaArm Do you think you or someone on your team would be able to help us with the issue?

Sure thing!, taking a look.

TamarChristinaArm · 2021-08-13T10:12:34Z

So I managed to finally generate a diff: https://www.diffchecker.com/Rr7cRI2V
I suspect the slowdown comes from using mul Xd, Xn, Xm instead of umull Xd, Wn, Wm?

I think this is correct culprid, the loop has a long dependency chain off the value out of the mul/umull with little independent instructions so it's bound by these two and 64-bit mul has twice the latency of umull.

Edit: ARM software optimization guides seem to indicate that mul and umull have identical latencies/throughput. umulh is more expensive, but it's only used once and it should still be faster than a bunch of lsr/sub/add instructions...

64-bit mul has a latency of 4(3) and umull has 2(1) on Cortex-A76. Were you perhaps comparing the AArch32 instructions? Note that on AArch64 mul and umull aren't real instructions but are architectural aliases for madd and umaddl respectively so you need to look at the latencies for those.

The mul seems to be unneeded though, both inputs are 32-bits so you should be able to just use umull there as before.

TamarChristinaArm · 2021-08-16T11:34:08Z

As a side note, while looking at the diff I noticed some other general improvements that could be made in the future:

sxtw    x3, w6	
lsl     x3, x3, #1	
add     x2, x2, x3

can become

add     x2, x2, w6, sxtw, #1

and

lsl     w3, w3, #1
sub     w3, w0, w3

can become

sub     w3, w0, w3, lsl 1

lastly

sub     x2, x2, #2
strh    w1, [x2]

can become

strh    w1, [x2, #-2]!

and probably want to move the constant out of the loop.

DrewScoggins added arch-arm64 os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark labels Jun 14, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jun 14, 2021

DrewScoggins changed the title ~~[Perf] Changes at 5/18/2021 5:35:47 PM~~ [Perf] Regressions in Integer Formatting Jun 14, 2021

jeffschwMSFT added the area-System.Runtime label Jun 15, 2021

tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 12, 2021

tannergooding added this to the Future milestone Jul 12, 2021

danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Runtime labels Aug 11, 2021

danmoseley removed this from the Future milestone Aug 11, 2021

danmoseley added the untriaged New issue has not been triaged by the area owner label Aug 11, 2021

JulieLeeMSFT assigned AndyAyersMS Aug 11, 2021

JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Aug 11, 2021

danmoseley mentioned this issue Aug 12, 2021

Faster unsigned division by constants #52893

Merged

AndyAyersMS removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 12, 2021

AndyAyersMS added this to the 6.0.0 milestone Aug 12, 2021

pentp mentioned this issue Aug 13, 2021

Fix ARM64 unsigned div by const perf regression #57372

Closed

ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 13, 2021

pentp mentioned this issue Aug 14, 2021

Fix ARM64 unsigned div by const perf regression #57400

Merged

echesakov closed this as completed in #57400 Aug 17, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 17, 2021

ghost locked as resolved and limited conversation to collaborators Sep 16, 2021

[Perf] Regressions in Integer Formatting #54166

[Perf] Regressions in Integer Formatting #54166

Comments

DrewScoggins commented Jun 14, 2021 • edited Loading

Run Information

Regressions in System.Tests.Perf_Int32

Repro

Payloads

Histogram

System.Tests.Perf_Int32.ToString(value: 2147483647)

System.Tests.Perf_Int32.TryFormat(value: 2147483647)

Docs

Run Information

Regressions in System.Tests.Perf_UInt64

Repro

Payloads

Histogram

System.Tests.Perf_UInt64.TryFormat(value: 12345)

Docs

Run Information

Regressions in System.Tests.Perf_Version

Repro

Payloads

Histogram

System.Tests.Perf_Version.TryFormatL

Docs

Run Information

Regressions in System.Tests.Perf_UInt32

Repro

Payloads

Histogram

System.Tests.Perf_UInt32.ToString(value: 4294967295)

System.Tests.Perf_UInt32.TryFormat(value: 4294967295)

Docs

dotnet-issue-labeler bot commented Jun 14, 2021

ghost commented Jun 15, 2021

Run Information

Regressions in System.Tests.Perf_Int32

Repro

Payloads

Histogram

System.Tests.Perf_Int32.ToString(value: 2147483647)

System.Tests.Perf_Int32.TryFormat(value: 2147483647)

Docs

Run Information

Regressions in System.Tests.Perf_UInt64

Repro

Payloads

Histogram

System.Tests.Perf_UInt64.TryFormat(value: 12345)

Docs

Run Information

Regressions in System.Tests.Perf_Version

Repro

Payloads

Histogram

System.Tests.Perf_Version.TryFormatL

Docs

Run Information

Regressions in System.Tests.Perf_UInt32

Repro

Payloads

Histogram

System.Tests.Perf_UInt32.ToString(value: 4294967295)

System.Tests.Perf_UInt32.TryFormat(value: 4294967295)

Docs

danmoseley commented Aug 10, 2021 • edited Loading

tannergooding commented Aug 10, 2021

stephentoub commented Aug 10, 2021

pentp commented Aug 11, 2021

danmoseley commented Aug 11, 2021

danmoseley commented Aug 11, 2021

JulieLeeMSFT commented Aug 11, 2021

AndyAyersMS commented Aug 12, 2021

JulieLeeMSFT commented Aug 12, 2021

AndyAyersMS commented Aug 12, 2021

AndyAyersMS commented Aug 12, 2021

tannergooding commented Aug 12, 2021

EgorBo commented Aug 12, 2021

pentp commented Aug 12, 2021 • edited Loading

DrewScoggins commented Jun 14, 2021 •

edited

Loading

danmoseley commented Aug 10, 2021 •

edited

Loading

pentp commented Aug 12, 2021 •

edited

Loading