Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize Enumerable.Range(...).ToArray() #80633

Closed

Conversation

neon-sunset
Copy link
Contributor

@neon-sunset neon-sunset commented Jan 13, 2023

Loop body is almost the same to what LLVM produces for Rust's let vec: Vec<i32> = (0..count).collect() with x86-64-v3 target except the unroll factor it chooses is 4 while here we pick 2 instead because it saturates store throughput on Zen 3 anyways and has much smaller codegen size.

NB: on ARM64 it emits a pair of strs vs LLVM's single stp but I'm not sure if that's an optimization .NET can do today out of two adjacent Unsafe.WriteUnaligned.

Codegen (.NET 8 daily build + FullPGO):

; BenchFixture.Enumerable2+RangeIterator.InitializeSpanCore(System.Span`1<Int32>)
       vzeroupper
       mov       rax,[rdx]
       mov       edx,[rdx+8]
       mov       r8d,edx
       sar       r8d,1F
       and       r8d,0F
       add       r8d,edx
       and       r8d,0FFFFFFF0
       mov       r9d,edx
       sub       r9d,r8d
       vmovupd   ymm0,[7FFDC3F91140]
       vpbroadcastd ymm1,dword ptr [rcx+8]
       mov       r8,1457FC040C8
       mov       r8,[r8]
       vpaddd    ymm1,ymm1,[r8+8]
       vpaddd    ymm2,ymm1,[7FFDC3F91160]
       mov       r8d,edx
       sub       r8d,r9d
       movsxd    r8,r8d
       lea       r8,[rax+r8*4]
       cmp       rax,r8
       je        short M03_L01
       nop       dword ptr [rax]
M03_L00:
       vmovupd   [rax],ymm1
       vmovupd   [rax+20],ymm2
       vpaddd    ymm1,ymm1,ymm0
       vpaddd    ymm2,ymm2,ymm0
       add       rax,40
       cmp       rax,r8
       jne       short M03_L00
M03_L01:
       sub       edx,r9d
       mov       r8d,[rcx+8]
       add       edx,r8d
       add       r8d,[rcx+0C]
       cmp       edx,r8d
       jge       short M03_L03
M03_L02:
       mov       [rax],edx
       add       rax,4
       inc       edx
       cmp       edx,r8d
       jl        short M03_L02
M03_L03:
       vzeroupper
       ret
; Total bytes of code 158

This is a port of https://github.com/neon-sunset/RangeExtensions/blob/main/RangeExtensions/RangeEnumerable.ICollection.cs#L144 except it doesn't need to support bi-directional scenario.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jan 13, 2023
@ghost
Copy link

ghost commented Jan 13, 2023

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: neon-sunset
Assignees: -
Labels:

area-System.Linq, community-contribution

Milestone: -

@neon-sunset
Copy link
Contributor Author

neon-sunset commented Jan 13, 2023

Numbers (there is sadly a lot of variability on my system so please consider median too):

BenchmarkDotNet=v0.13.3, OS=Windows 11 (10.0.22621.963)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-alpha.1.23062.10
  [Host]     : .NET 8.0.0 (8.0.23.5802), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.5802), X64 RyuJIT AVX2


|     Method | Value |         Mean |      Error |     StdDev |       Median | Ratio | RatioSD | Code Size |
|----------- |------ |-------------:|-----------:|-----------:|-------------:|------:|--------:|----------:|
| ToArrayOld |    10 |     7.160 ns |  0.2435 ns |  0.6748 ns |     7.395 ns |  1.00 |    0.00 |      84 B |
| ToArrayNew |    10 |     9.170 ns |  0.4456 ns |  1.3140 ns |     8.591 ns |  1.29 |    0.21 |     220 B |
|            |       |              |            |            |              |       |         |           |
| ToArrayOld |   100 |    38.979 ns |  0.8190 ns |  2.0395 ns |    38.957 ns |  1.00 |    0.00 |      84 B |
| ToArrayNew |   100 |    27.410 ns |  0.5920 ns |  1.5802 ns |    27.395 ns |  0.71 |    0.06 |     363 B |
|            |       |              |            |            |              |       |         |           |
| ToArrayOld | 10000 | 3,298.214 ns | 60.8086 ns | 81.1778 ns | 3,276.112 ns |  1.00 |    0.00 |      84 B |
| ToArrayNew | 10000 |   644.097 ns | 12.3432 ns | 26.8330 ns |   640.007 ns |  0.20 |    0.01 |     390 B |

(data is for direct ToArray cost without interface calls and RangeIterator allocation)

@stephentoub
Copy link
Member

Thanks, @neon-sunset, for working on this. I've avoided doing this kind of vectorization in the past because I couldn't find any meaningful use outside of tests, and the impact on maintainability / size wasn't worth it. Can you share any information on where you see this optimization benefiting consumers? Any examples you can point to? Thanks!

@neon-sunset
Copy link
Contributor Author

neon-sunset commented Jan 13, 2023

Can you share any information on where you see this optimization benefiting consumers?

Sorry, I cannot. The initial implementation was done simply because .NET numbers weren't competitive with Rust or C++ and it was fun to iterate with BDN assembly output to achieve parity.

As far as usages in dotnet/runtime go, I do not see an easy way to locate all occurrences where an array is manually initialized to a range of values while most usages of Enumerable.Range itself seem to be distributed in tests just like you mentioned.

Also, List initialization can still be improved even without vectorization through providing ICollection<T> stub or allowing List<T> constructor to resolve IListProvider<T>.

@stephentoub
Copy link
Member

stephentoub commented Jan 16, 2023

https://grep.app/search?q=Enumerable%5C.Range%5C%28%5B%5E%29%5D%2B%5C%29%5C.ToArray&regexp=true&case=true

Yes. Now which of those aren't:

  • In a test, and/or
  • In a sample, and/or
  • So small a range that vectorization doesn't help or even hurt, and/or
  • On a use-once or only a few times code path

?

@neon-sunset
Copy link
Contributor Author

neon-sunset commented Jan 17, 2023

Pushed a fix for impl. and feedback. Will close the PR since a change is deemed inappropriate for System.LINQ after the tests pass so that someone else can pick it up if it is ever reconsidered in the future.

Will also open a separate issue because materialization to List<T> can be sped up quite a bit by resolving IListProvider<T> and initializing an array used to construct the list directly.

@stephentoub
Copy link
Member

since a change is deemed inappropriate

To be clear, I'd be happy if we could demonstrate this was valuable. I just haven't seen enough evidence of that yet.

@neon-sunset
Copy link
Contributor Author

neon-sunset commented Jan 17, 2023

There are a few interesting places that initialize arrays directly:

and last but not least https://grep.app/search?q=%5Bi%5D%20%3D%20i%3B&case=true&filter[lang][0]=C%23&filter[repo][0]=mathnet/mathnet-numerics&filter[path][0]=src/Numerics/

Still, Enumerable.Range itself is mostly used in tests for initializing short arrays. I agree with your initial comment that merging the PR is of questionable profitability - most perf-sensitive locations that could benefit from such change would be then hurt by the cost of virtual calls and RangeIterator allocation or may just outright not use System.Linq because it is usually a performance concern, not an improvement.

If you don't think it's worth keeping around or moving the implementation to some internal helper method, feel free to close. Thanks!

@neon-sunset neon-sunset marked this pull request as ready for review January 18, 2023 00:18
Comment on lines 91 to 96
int end = _end;
ref int pos = ref MemoryMarshal.GetReference(destination);
for (int cur = _start; cur < end; cur++)
{
pos = cur;
pos = ref Unsafe.Add(ref pos, 1);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need unsafe code here to avoid bounds checks.

int cur = _start;
for (int i = 0; i < destination.Length; i++)
{
    destination[i] = cur;
    cur++;
}

@@ -84,6 +82,62 @@ public int TryGetLast(out bool found)
found = true;
return _end - 1;
}

// Destination *must* be non-empty and exactly match the range length
private void InitializeSpan(Span<int> destination)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be inlined. I'd expect this to regress for very short ranges due to the extra method call that wasn't there before.

xtqqczze added a commit to xtqqczze/dotnet-runtime that referenced this pull request Jan 22, 2023
stephentoub pushed a commit that referenced this pull request Jan 23, 2023
* Remove `RangeIterator.ToArray` bounds check

#80633 (comment)

* Eliminate additional `ToArray` bounds checks
mdh1418 pushed a commit to mdh1418/runtime that referenced this pull request Jan 24, 2023
* Remove `RangeIterator.ToArray` bounds check

dotnet#80633 (comment)

* Eliminate additional `ToArray` bounds checks
@xtqqczze
Copy link
Contributor

@neon-sunset Could you report new BenchmarkDotNet results after #81001?

@jeffhandley
Copy link
Member

Hi @neon-sunset. Thanks for getting this contribution going. There are a couple of feedback comments that still need to be addressed, and there's a conflict that has emerged too. I'm going to mark this as needs-author-action to await your next iteration. Let us know if you need assistance moving it forward at this point though.

@jeffhandley jeffhandley added the needs-author-action An issue or pull request that requires more info or actions from the author. label Feb 14, 2023
@neon-sunset
Copy link
Contributor Author

Hi @jeffhandley Thanks, I will address feedback later this week.

@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Feb 15, 2023
@neon-sunset neon-sunset reopened this Feb 24, 2023
@neon-sunset neon-sunset changed the title Vectorize Enumerable.Range(...).ToArray() [WIP] Vectorize Enumerable.Range(...).ToArray() Feb 24, 2023
@neon-sunset
Copy link
Contributor Author

Failures seem to be unrelated, I think?
Replacing GC.AllocateUninitializedArray with simple new int[] fixes regression on small inputs. In fact, it does not seem to be profitable at all. AVX2 improvement should be greater, but I cannot test this because my other device has died.

Below numbers are for direct ToArray() calls. On short lengths, the cost of interface calls and iterator allocation dwarfs everything else :(

BenchmarkDotNet=v0.13.3, OS=macOS 13.2.1 (22D68) [Darwin 22.3.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK=8.0.100-preview.2.23121.17
  [Host]        : .NET 8.0.0 (8.0.23.11601), Arm64 RyuJIT AdvSIMD
  DefaultJob    : .NET 8.0.0 (8.0.23.11601), Arm64 RyuJIT AdvSIMD
  NativeAOT 8.0 : .NET 8.0.0-preview.2.23116.1, Arm64 NativeAOT AdvSIMD

|        Method |           Job |       Runtime |  Length |          Mean |       Error |      StdDev | Ratio |
|-------------- |-------------- |-------------- |-------- |--------------:|------------:|------------:|------:|
|       ToArray |    DefaultJob |      .NET 8.0 |       1 |      3.025 ns |   0.0167 ns |   0.0156 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |       1 |      3.066 ns |   0.0082 ns |   0.0069 ns |  1.01 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |       1 |      3.294 ns |   0.0119 ns |   0.0111 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |       1 |      3.504 ns |   0.0036 ns |   0.0028 ns |  1.07 |
|               |               |               |         |               |             |             |       |
|       ToArray |    DefaultJob |      .NET 8.0 |       5 |      5.767 ns |   0.0210 ns |   0.0196 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |       5 |      5.991 ns |   0.0241 ns |   0.0226 ns |  1.04 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |       5 |      5.707 ns |   0.0130 ns |   0.0121 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |       5 |      5.475 ns |   0.0175 ns |   0.0155 ns |  0.96 |
|               |               |               |         |               |             |             |       |
|       ToArray |    DefaultJob |      .NET 8.0 |      10 |      7.837 ns |   0.0246 ns |   0.0230 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |      10 |      5.552 ns |   0.0199 ns |   0.0176 ns |  0.71 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |      10 |      9.208 ns |   0.0270 ns |   0.0253 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |      10 |      5.989 ns |   0.0116 ns |   0.0108 ns |  0.65 |
|               |               |               |         |               |             |             |       |
|       ToArray |    DefaultJob |      .NET 8.0 |     100 |     50.587 ns |   0.0668 ns |   0.0592 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |     100 |     23.730 ns |   0.0718 ns |   0.0600 ns |  0.47 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |     100 |     45.158 ns |   0.1053 ns |   0.0985 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |     100 |     21.474 ns |   0.0487 ns |   0.0407 ns |  0.48 |
|               |               |               |         |               |             |             |       |
|       ToArray |    DefaultJob |      .NET 8.0 |  100000 | 46,401.337 ns | 636.9110 ns | 595.7669 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |  100000 | 23,084.711 ns | 350.5279 ns | 327.8841 ns |  0.50 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |  100000 | 63,969.073 ns | 182.6120 ns | 170.8154 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |  100000 | 40,710.868 ns | 148.9555 ns | 124.3846 ns |  0.64 |

@neon-sunset neon-sunset requested review from EgorBo and stephentoub and removed request for EgorBo February 24, 2023 12:08
@neon-sunset neon-sunset changed the title [WIP] Vectorize Enumerable.Range(...).ToArray() Vectorize Enumerable.Range(...).ToArray() Feb 24, 2023
@xtqqczze
Copy link
Contributor

Replacing GC.AllocateUninitializedArray with simple new int[] fixes regression on small inputs.

I'm surprised this is measurable, since AllocateUninitializedArray<T> already uses new[] for small arrays:

// small arrays are allocated using `new[]` as that is generally faster.
#pragma warning disable 8500 // sizeof of managed types
if (length < 2048 / sizeof(T))
#pragma warning restore 8500
{
return new T[length];
}

@xtqqczze
Copy link
Contributor

Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores

So the best case ratio is 0.47, presumably with much larger code size.

@jeffhandley
Copy link
Member

@krwq What do you think is the right path forward for this PR?

@neon-sunset
Copy link
Contributor Author

neon-sunset commented Mar 27, 2023

@jeffhandley I will take another go at the change to make it more similar to how SpanHelpers.IndexOf* is implemented to get rid of the regression on small lengths (M1 hides it completely but Zen 3 does not). Currently, on win-x64, it regresses by 20% (plus extra 1-2ns) at ranges like 5 but improves by about 3.6x at lengths 1000+.

@jeffhandley jeffhandley added the needs-author-action An issue or pull request that requires more info or actions from the author. label Mar 27, 2023
@xtqqczze
Copy link
Contributor

@neon-sunset Might be worth a read: #84115.

@ghost
Copy link

ghost commented Apr 18, 2023

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

@ghost
Copy link

ghost commented May 3, 2023

This pull request will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the pull request, but please note that it will be locked if it remains inactive for another 30 days.

@ghost ghost closed this May 3, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 2, 2023
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Linq community-contribution Indicates that the PR has been added by a community member needs-author-action An issue or pull request that requires more info or actions from the author. no-recent-activity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants