-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize Enumerable.Range(...).ToArray() #80633
Vectorize Enumerable.Range(...).ToArray() #80633
Conversation
Tagging subscribers to this area: @dotnet/area-system-linq Issue Detailsnull
|
Numbers (there is sadly a lot of variability on my system so please consider median too):
(data is for direct |
Thanks, @neon-sunset, for working on this. I've avoided doing this kind of vectorization in the past because I couldn't find any meaningful use outside of tests, and the impact on maintainability / size wasn't worth it. Can you share any information on where you see this optimization benefiting consumers? Any examples you can point to? Thanks! |
Sorry, I cannot. The initial implementation was done simply because .NET numbers weren't competitive with Rust or C++ and it was fun to iterate with BDN assembly output to achieve parity. As far as usages in Also, List initialization can still be improved even without vectorization through providing |
Yes. Now which of those aren't:
? |
Pushed a fix for impl. and feedback. Will close the PR since a change is deemed inappropriate for Will also open a separate issue because materialization to |
To be clear, I'd be happy if we could demonstrate this was valuable. I just haven't seen enough evidence of that yet. |
There are a few interesting places that initialize arrays directly:
and last but not least https://grep.app/search?q=%5Bi%5D%20%3D%20i%3B&case=true&filter[lang][0]=C%23&filter[repo][0]=mathnet/mathnet-numerics&filter[path][0]=src/Numerics/ Still, If you don't think it's worth keeping around or moving the implementation to some internal helper method, feel free to close. Thanks! |
int end = _end; | ||
ref int pos = ref MemoryMarshal.GetReference(destination); | ||
for (int cur = _start; cur < end; cur++) | ||
{ | ||
pos = cur; | ||
pos = ref Unsafe.Add(ref pos, 1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need unsafe code here to avoid bounds checks.
int cur = _start;
for (int i = 0; i < destination.Length; i++)
{
destination[i] = cur;
cur++;
}
@@ -84,6 +82,62 @@ public int TryGetLast(out bool found) | |||
found = true; | |||
return _end - 1; | |||
} | |||
|
|||
// Destination *must* be non-empty and exactly match the range length | |||
private void InitializeSpan(Span<int> destination) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't be inlined. I'd expect this to regress for very short ranges due to the extra method call that wasn't there before.
* Remove `RangeIterator.ToArray` bounds check #80633 (comment) * Eliminate additional `ToArray` bounds checks
* Remove `RangeIterator.ToArray` bounds check dotnet#80633 (comment) * Eliminate additional `ToArray` bounds checks
@neon-sunset Could you report new BenchmarkDotNet results after #81001? |
Hi @neon-sunset. Thanks for getting this contribution going. There are a couple of feedback comments that still need to be addressed, and there's a conflict that has emerged too. I'm going to mark this as |
Hi @jeffhandley Thanks, I will address feedback later this week. |
296f2bc
to
4ca8590
Compare
Failures seem to be unrelated, I think? Below numbers are for direct BenchmarkDotNet=v0.13.3, OS=macOS 13.2.1 (22D68) [Darwin 22.3.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK=8.0.100-preview.2.23121.17
[Host] : .NET 8.0.0 (8.0.23.11601), Arm64 RyuJIT AdvSIMD
DefaultJob : .NET 8.0.0 (8.0.23.11601), Arm64 RyuJIT AdvSIMD
NativeAOT 8.0 : .NET 8.0.0-preview.2.23116.1, Arm64 NativeAOT AdvSIMD
| Method | Job | Runtime | Length | Mean | Error | StdDev | Ratio |
|-------------- |-------------- |-------------- |-------- |--------------:|------------:|------------:|------:|
| ToArray | DefaultJob | .NET 8.0 | 1 | 3.025 ns | 0.0167 ns | 0.0156 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 1 | 3.066 ns | 0.0082 ns | 0.0069 ns | 1.01 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 1 | 3.294 ns | 0.0119 ns | 0.0111 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 1 | 3.504 ns | 0.0036 ns | 0.0028 ns | 1.07 |
| | | | | | | | |
| ToArray | DefaultJob | .NET 8.0 | 5 | 5.767 ns | 0.0210 ns | 0.0196 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 5 | 5.991 ns | 0.0241 ns | 0.0226 ns | 1.04 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 5 | 5.707 ns | 0.0130 ns | 0.0121 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 5 | 5.475 ns | 0.0175 ns | 0.0155 ns | 0.96 |
| | | | | | | | |
| ToArray | DefaultJob | .NET 8.0 | 10 | 7.837 ns | 0.0246 ns | 0.0230 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 10 | 5.552 ns | 0.0199 ns | 0.0176 ns | 0.71 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 9.208 ns | 0.0270 ns | 0.0253 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 5.989 ns | 0.0116 ns | 0.0108 ns | 0.65 |
| | | | | | | | |
| ToArray | DefaultJob | .NET 8.0 | 100 | 50.587 ns | 0.0668 ns | 0.0592 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 100 | 23.730 ns | 0.0718 ns | 0.0600 ns | 0.47 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 45.158 ns | 0.1053 ns | 0.0985 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 21.474 ns | 0.0487 ns | 0.0407 ns | 0.48 |
| | | | | | | | |
| ToArray | DefaultJob | .NET 8.0 | 100000 | 46,401.337 ns | 636.9110 ns | 595.7669 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 100000 | 23,084.711 ns | 350.5279 ns | 327.8841 ns | 0.50 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 100000 | 63,969.073 ns | 182.6120 ns | 170.8154 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 100000 | 40,710.868 ns | 148.9555 ns | 124.3846 ns | 0.64 | |
I'm surprised this is measurable, since runtime/src/coreclr/System.Private.CoreLib/src/System/GC.CoreCLR.cs Lines 658 to 664 in 7829252
|
So the best case ratio is 0.47, presumably with much larger code size. |
@krwq What do you think is the right path forward for this PR? |
@jeffhandley I will take another go at the change to make it more similar to how |
@neon-sunset Might be worth a read: #84115. |
This pull request has been automatically marked |
This pull request will now be closed since it had been marked |
Loop body is almost the same to what LLVM produces for Rust's
let vec: Vec<i32> = (0..count).collect()
withx86-64-v3
target except the unroll factor it chooses is 4 while here we pick 2 instead because it saturates store throughput on Zen 3 anyways and has much smaller codegen size.NB: on ARM64 it emits a pair of
str
s vs LLVM's singlestp
but I'm not sure if that's an optimization .NET can do today out of two adjacentUnsafe.WriteUnaligned
.Codegen (.NET 8 daily build + FullPGO):
This is a port of https://github.com/neon-sunset/RangeExtensions/blob/main/RangeExtensions/RangeEnumerable.ICollection.cs#L144 except it doesn't need to support bi-directional scenario.