Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major performance regression of the array initialization pattern #62864

Closed
adamsitnik opened this issue Jul 22, 2022 · 17 comments · Fixed by #62868
Closed

Major performance regression of the array initialization pattern #62864

adamsitnik opened this issue Jul 22, 2022 · 17 comments · Fixed by #62868

Comments

@adamsitnik
Copy link
Member

In dotnet/performance#2534 I've bumped BDN version to see if @MichalStrehovsky fix (dotnet/BenchmarkDotNet#2046) helped. It did, but new issue popped out.

86 out of 4334 benchmarks that were working fine in the past (2 months ago was the last time I run them), are now failing with PlatformNotSupportedException:

 System.PlatformNotSupportedException: Operation is not supported on this platform.
    at Internal.Runtime.CompilerHelpers.ThrowHelpers.ThrowPlatformNotSupportedException() + 0x24
    at System.Net.Primitives.Tests.IPAddressPerformanceTests.<ByteAddresses>d__0.MoveNext() + 0x45
    at System.Linq.Enumerable.<OfTypeIterator>d__65`1.MoveNext() + 0x6b
    at System.Linq.Enumerable.MaxInteger[TSource, TResult](IEnumerable`1, Func`2) + 0x48
    at System.Net.Primitives.Tests.IPAddressPerformanceTests..ctor() + 0x9d
    at BenchmarkDotNet.Autogenerated.Runnable_1..ctor() + 0x2a
    at BenchmarkDotNet.Autogenerated.Runnable_1.Run(IHost, String) + 0x6e
    at BenchmarkDotNet.Autogenerated.UniqueProgramName.AfterAssemblyLoadingAttached(String[]) + 0x1a2

for code that seems to be very simple:

https://github.com/dotnet/performance/blob/062e33e11c917ce408ce4b447552185d6565b035/src/benchmarks/micro/libraries/System.Net.Primitives/IPAddressPerformanceTests.cs#L17-L26

I have moved this code to a standalone app and was not able to repro the issue.

Repro steps:

git clone https://github.com/dotnet/performance.git
cd performance
py .\scripts\benchmarks_ci.py -f nativeaot7.0 --filter System.Net.Primitives.Tests.IPAddressPerformanceTests.Ctor_Span --bdn-arguments "--keepFiles true"

BDN reports the path where to project is stored (the last path in the line below):

// start dotnet  build -c Release -r win-x64 --no-restore /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 /p:Deterministic=true /p:Optimize=true /p:NuGetPackageRoot="D:\projects\performance\artifacts\packages" in D:\projects\performance\artifacts\bin\MicroBenchmarks\Release\net7.0\Job-NYBBDR

cc @jkotas

@ghost ghost added the untriaged Issues and PRs which have not yet been triaged by a lead label Jul 22, 2022
@MichalStrehovsky
Copy link
Member

The IL for the method in question looks like this - there's a GC::AllocateUninitializedArray instead of newarr. This is breaking pattern recognition in RyuJIT. In NativeAOT this is a PlatformNotSupported because RuntimeHelpers::InitializeArray needs to be expanded as an intrinsic. In CoreCLR, breaking the pattern recognition is a deoptimization. Good thing we found it.

Cc @jaredpar to make sure this is intentional new codegen for this.

RyuJIT will need to be adapted to recognize the new pattern.

.method private final hidebysig newslot virtual 
	instance bool MoveNext () cil managed 
{
	.override method instance bool [System.Runtime]System.Collections.IEnumerator::MoveNext()
	// Method begins at RVA 0x701e8
	// Header size: 12
	// Code size: 117 (0x75)
	.maxstack 4
	.locals init (
		[0] int32
	)

	IL_0000: ldarg.0
	IL_0001: ldfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_0006: stloc.0
	IL_0007: ldloc.0
	IL_0008: switch (IL_001b, IL_0043, IL_006c)

	IL_0019: ldc.i4.0
	IL_001a: ret

	IL_001b: ldarg.0
	IL_001c: ldc.i4.m1
	IL_001d: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_0022: ldarg.0
	IL_0023: ldc.i4.4
	IL_0024: ldc.i4.0
	IL_0025: call !!0[] [System.Runtime]System.GC::AllocateUninitializedArray<uint8>(int32, bool)
	IL_002a: dup
	IL_002b: ldtoken field int32 '<PrivateImplementationDetails>'::A58DB492CDCA7F4880696F0FD6743AE3D8F89D85DF07A95D10C4DD82D3C5881A
	IL_0030: call void [System.Runtime]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [System.Runtime]System.Array, valuetype [System.Runtime]System.RuntimeFieldHandle)
	IL_0035: stfld object System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>2__current'
	IL_003a: ldarg.0
	IL_003b: ldc.i4.1
	IL_003c: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_0041: ldc.i4.1
	IL_0042: ret

	IL_0043: ldarg.0
	IL_0044: ldc.i4.m1
	IL_0045: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_004a: ldarg.0
	IL_004b: ldc.i4.s 16
	IL_004d: ldc.i4.0
	IL_004e: call !!0[] [System.Runtime]System.GC::AllocateUninitializedArray<uint8>(int32, bool)
	IL_0053: dup
	IL_0054: ldtoken field valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=16' '<PrivateImplementationDetails>'::E531DA00E6A1B6D170859CF4F00DF857B1338B4950DB57E2B81B6D7A2ACD8981
	IL_0059: call void [System.Runtime]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [System.Runtime]System.Array, valuetype [System.Runtime]System.RuntimeFieldHandle)
	IL_005e: stfld object System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>2__current'
	IL_0063: ldarg.0
	IL_0064: ldc.i4.2
	IL_0065: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_006a: ldc.i4.1
	IL_006b: ret

	IL_006c: ldarg.0
	IL_006d: ldc.i4.m1
	IL_006e: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_0073: ldc.i4.0
	IL_0074: ret
} // end of method '<ByteAddresses>d__0'::MoveNext

@ghost
Copy link

ghost commented Jul 22, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

In dotnet/performance#2534 I've bumped BDN version to see if @MichalStrehovsky fix (dotnet/BenchmarkDotNet#2046) helped. It did, but new issue popped out.

86 out of 4334 benchmarks that were working fine in the past (2 months ago was the last time I run them), are now failing with PlatformNotSupportedException:

 System.PlatformNotSupportedException: Operation is not supported on this platform.
    at Internal.Runtime.CompilerHelpers.ThrowHelpers.ThrowPlatformNotSupportedException() + 0x24
    at System.Net.Primitives.Tests.IPAddressPerformanceTests.<ByteAddresses>d__0.MoveNext() + 0x45
    at System.Linq.Enumerable.<OfTypeIterator>d__65`1.MoveNext() + 0x6b
    at System.Linq.Enumerable.MaxInteger[TSource, TResult](IEnumerable`1, Func`2) + 0x48
    at System.Net.Primitives.Tests.IPAddressPerformanceTests..ctor() + 0x9d
    at BenchmarkDotNet.Autogenerated.Runnable_1..ctor() + 0x2a
    at BenchmarkDotNet.Autogenerated.Runnable_1.Run(IHost, String) + 0x6e
    at BenchmarkDotNet.Autogenerated.UniqueProgramName.AfterAssemblyLoadingAttached(String[]) + 0x1a2

for code that seems to be very simple:

https://github.com/dotnet/performance/blob/062e33e11c917ce408ce4b447552185d6565b035/src/benchmarks/micro/libraries/System.Net.Primitives/IPAddressPerformanceTests.cs#L17-L26

I have moved this code to a standalone app and was not able to repro the issue.

Repro steps:

git clone https://github.com/dotnet/performance.git
cd performance
py .\scripts\benchmarks_ci.py -f nativeaot7.0 --filter System.Net.Primitives.Tests.IPAddressPerformanceTests.Ctor_Span --bdn-arguments "--keepFiles true"

BDN reports the path where to project is stored (the last path in the line below):

// start dotnet  build -c Release -r win-x64 --no-restore /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 /p:Deterministic=true /p:Optimize=true /p:NuGetPackageRoot="D:\projects\performance\artifacts\packages" in D:\projects\performance\artifacts\bin\MicroBenchmarks\Release\net7.0\Job-NYBBDR

cc @jkotas

Author: adamsitnik
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@ghost ghost removed the untriaged Issues and PRs which have not yet been triaged by a lead label Jul 22, 2022
@jakobbotsch
Copy link
Member

This is breaking pattern recognition in RyuJIT.

Specifically in this function:
https://github.com/dotnet/runtime/blob/a5e678dd58c365764d0889e137f5953533dbb451/src/coreclr/jit/importer.cpp#L3123-L3133

@MichalStrehovsky
Copy link
Member

Yep, looks like intentional: #62392

The perf repo uses a more bleeding edge Roslyn than we do.

@adamsitnik
Copy link
Member Author

there's a GC::AllocateUninitializedArray instead of newarr.

I remember trying to use GC::AllocateUninitializedArray in a few places and realizing that it was often slower than newarr. It even has a fallback depending on size:

https://github.com/dotnet/runtime/blob/5391db5fc3a0282ae30c40929895fa61a4a70976/src/coreclr/System.Private.CoreLib/src/System/GC.CoreCLR.cs#L683-L687

I wonder if it's going to cause perf regressions as well.

@MichalStrehovsky
Copy link
Member

Mono also does the pattern match optimization for it. I don't know if there are any runtimes that would benefit from this for InitializeArray since they just rewrite the whole sequence into a more efficient thing anyway.

Maybe a better fix would be for Roslyn not to do this?

@kasperk81
Copy link

@adamperlin @333fred reverting #62392 will fix this issue and performance penalty. it is a deoptimization. AllocateUninitializedArray has worse codegen compared to new[] dotnet/runtime#65999 (comment).

@adamsitnik
Copy link
Member Author

cc @VSadov

@marek-safar
Copy link
Contributor

/cc @lambdageek @vargaz

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged Issues and PRs which have not yet been triaged by a lead label Jul 22, 2022
@jkotas jkotas transferred this issue from dotnet/runtime Jul 22, 2022
@jkotas jkotas removed untriaged Issues and PRs which have not yet been triaged by a lead area-CodeGen-coreclr labels Jul 22, 2022
@jkotas
Copy link
Member

jkotas commented Jul 22, 2022

Please revert #62392 ASAP.

It breaks the pattern matching done by the JIT for array initialization. It will result in major performance regression on most runtimes and it breaks NativeAOT that does not support non-standard array initialization patterns currently.

@jkotas
Copy link
Member

jkotas commented Jul 22, 2022

Quick standalone performance test:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 100_000_000; i++)
    CreateInitializedArray();
Console.WriteLine(sw.ElapsedMilliseconds);

[MethodImpl(MethodImplOptions.NoInlining)]
static int[] CreateInitializedArray()
{
    return new int[] { 1, 2, 3, 4 };
}

Run using:

set DOTNET_TieredCompilation=0
dotnet run -c Release`

.NET 7 SDK Preview 6: 390ms
Bleeding edge Roslyn (4.4.0-1.22371.1): 10899ms

As you can see, this change introduced 27x regression of the simple array initialization pattern used in this micro-benchmark.

BTW: I do not see any performance numbers in #62392 . In dotnet/runtime, we require performance numbers demonstrating the improvement for all performance optimization PRs. It is not unusual for performance optimizations to not have the intended effect.

(EDIT: Corrected the perf numbers.)

@jkotas jkotas changed the title [NativeAOT] Unexpected PlatformNotSupportedException thrown Major performance regression of the array initialization pattern Jul 22, 2022
@tannergooding
Copy link
Member

As you can see, this change introduced 27x regression of the simple array initialization pattern used in this micro-benchmark.

Even with Roslyn reverting the change, we should likely have an issue in the dotnet/runtime repo tracking fixing this on our end. There is no reason that GC.AllocateUninitializedArray for constant inputs can't be replaced with a call to newarr to ensure there is no regression for "small inputs". That would remove an existing "pit of failure" from an API that looks performance oriented. However, additionally, there is really no reason why we couldn't also optimize GC.AllocateUninitializedArray for non-constant length but constant pinned = false to also have the correct fast paths so that there is no real perf difference.

Someone just needs to do the work to ensure the relevant handling exists and to ensure that the JIT can properly optimize the code. The actual method is "large" today and isn't inline friendly, even though the checks being done are effectively all constant. For a non-constant length, this function should really be getting inlined as: if (length < 2048 / sizeof(T)) { return new T[length]; } else { return AllocateNewArray(...); } -or- even better, have a CORINFO_HELP_NEWARR_* API to ensure that this is really just newarr with a "should I zero or not" flags check (particularly since the "failure" cases are really just dependent on whether T is a reference type or contains references, which is always known at JIT time).

@jkotas
Copy link
Member

jkotas commented Jul 22, 2022

Yes, we can do all that, but the benefits are vanishingly small.

GC.AllocateUninitializedArray is also a security trap due to uninitialized data. I kind of like that the API is a performance trap since it forces people to use it only in places where it is actually needed.

@kasperk81
Copy link

kasperk81 commented Jul 22, 2022

the original api proposal issue is still active. AllocateUninitializedArray is only profitable for larger arrays dotnet/runtime#27146 (comment). we should have an analyzer to clarify it, overall its name is very confusing and it can easily lead developers to think it's pro performance.

@tannergooding
Copy link
Member

Yes, we can do all that, but the benefits are vanishingly small.

Part of this is the existing overhead. For large allocations, the costs can be fairly visible and there are cases where those are frequent enough or where they are happening in contexts where the costs do matter.

GC.AllocateUninitializedArray is also a security trap due to uninitialized data.

This is no more a trap than NativeMemory.Alloc and while NativeMemory.Alloc is often a suitable choice, there are many scenarios where it is not.

I kind of like that the API is a performance trap since it forces people to use it only in places where it is actually needed.

I don't think that a performance oriented API should be a performance trap. If we want people to be aware of the potential pitfalls of it, that's something that [UnsafeCallersOnly] or similar should be for. As the API stands today, it has paths in it to handle the small case but to use it "correctly" (even in cases where it is actually needed), the devs realistically need to duplicate those same checks.

I'm all for telling people "this is dangerous, be mindful of the holes". However, I don't think we should be intentionally making it more problematic to use correctly where there is a decently simple fix that can be made.

@jkotas
Copy link
Member

jkotas commented Jul 22, 2022

Submitted doc update at dotnet/dotnet-api-docs#8249

@jkotas
Copy link
Member

jkotas commented Jul 22, 2022

use it "correctly" (even in cases where it is actually needed), the devs realistically need to duplicate those same checks.

I disagree. We are using it correctly in our own CoreLib in StringBuilder, ArrayPool and number of other places. Number of these places were heavily micro-benchmarked when the uses of this API were introduced to make sure that there is no regression, and none of them have the checks duplicated.

there is a decently simple fix that can be made.

I do not think that the fix is simple. The GC is heavily optimized for giving you zero-initialized memory. If we were to build uninitialized memory path that is as optimized as the existing zero-initialized path, it would be fairly complex change, and it would be hard to make it pay-for-play.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants