From ce1840f3b19d4cdf83c241fd3970a0c540007da0 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 30 Mar 2023 07:12:23 +0200 Subject: [PATCH 01/14] Introduction to vectorization with Vector128 and Vector256 --- .../vectorization-guidelines.md | 1060 +++++++++++++++++ 1 file changed, 1060 insertions(+) create mode 100644 docs/coding-guidelines/vectorization-guidelines.md diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md new file mode 100644 index 00000000000000..394b91af3ba0c5 --- /dev/null +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -0,0 +1,1060 @@ +- [Introduction to vectorization with Vector128 and Vector256](#introduction-to-vectorization-with-vector128-and-vector256) + * [Code structure](#code-structure) + + [Testing](#testing) + + [Benchmarking](#benchmarking) + - [Custom config](#custom-config) + - [Memory alignment](#memory-alignment) + * [Enforcing memory alignment](#enforcing-memory-alignment) + * [Memory randomization](#memory-randomization) + * [Loops](#loops) + + [Scalar remainder handling](#scalar-remainder-handling) + + [Vectorized remainder handling](#vectorized-remainder-handling) + + [AV testing](#av-testing) + * [Loading and storing vectors](#loading-and-storing-vectors) + + [Loading](#loading) + + [Storing](#storing) + + [Casting](#casting) + * [Mindset](#mindset) + + [Edge cases](#edge-cases) + + [Scalar solution](#scalar-solution) + + [Vectorized solution](#vectorized-solution) + * [Toolchain](#toolchain) + + [Creation](#creation) + + [Bit operations](#bit-operations) + + [Equality](#equality) + + [Comparison](#comparison) + + [Math](#math) + + [Conversion](#conversion) + + [Widening and Narrowing](#widening-and-narrowing) + + [Shuffle](#shuffle) + * [Summary](#summary) + + [Best practices](#best-practices) + +TL;DR: Go to [Summary](#summary) + +# Introduction to vectorization with Vector128 and Vector256 + +Vectorization is an art of converting an algorithm from operating on a single value at a time to operating on a set of values (vector). It can greatly improve performance at a cost of increased code complexity. + +In the recent releases, .NET has introduced plenty of APIs for vectorization. Vast majority of them were hardware specific. It required the users to provide implementation per processor architecture (x64 and/or arm64), with a possibility to use the most optimal instructions for hardware that is executing the code. + +.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` that aim for writing hardware-agnostic vectorized code. The purpose of this document is to introduce the readers to the new APIs and provide a set of best practices. + +## Code structure + +`Vector128` represents a 128-bit vector of type `T`. `T` is constrained to specific primitive types: + +* `byte` and `sbyte` (8 bits). +* `short` and `ushort` (16 bits). +* `int`, `uint` and `float` (32 bits). +* `long`, `ulong` and `double` (64 bits). +* `nint` and `unit` (32 or 64 bits, depending on the architecture) + +Each `Vector128` operation allows to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats and 2 (u)longs/double(s). 
+ +``` +------------------------------128-bits--------------------------- +| 64 | 64 | +----------------------------------------------------------------- +| 32 | 32 | 32 | 32 | +----------------------------------------------------------------| +| 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | +----------------------------------------------------------------- +| 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | +----------------------------------------------------------------- +``` + +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, we should prefer it over a `Vector128`. To check the acceleration, we need to use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. + +We also must account for the size of the input. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector128.Count` return the size of a vector of given type in bytes. +Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. It's not true for pre-compiled code (NativeAOT). + +That is why the code is very often structured like this: + +```cs +void CodeStructure(ReadOnlySpan buffer) +{ + if (Vector256.IsHardwareAccelerated && buffer.Length >= Vector256.Count) + { + // Vector256 code path + } + else if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count) + { + // Vector128 code path + } + else + { + // non-vectorized && small inputs code path + } +} +``` + +**Both vector types provide almost identical features**, but arm64 hardware does not support `Vector256` yet, so for the sake of simplicity we will be using `Vector128` in all examples and assuming **little endian** architecture. Which means that all examples used in this document assume that they are being executed as part of the following `if` block: + +```cs +else if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count) +{ + // Vector128 code path +} +``` + +### Testing + +Such a code structure requires us to **test all possible code paths**: + +* `Vector256` is accelerated: + * The input is large enough to benefit from vectorization with `Vector256`. + * The input is not large enough to benefit from vectorization with `Vector256`, but it can benefit from vectorization with `Vector128` (when `Vector256` is accelerated then `Vector128` and smaller vectors are also). + * The input is too small to benefit from any kind of vectorization. +* `Vector128` is accelerated + * The input is large enough to benefit from vectorization with `Vector128`. + * The input is too small to benefit from any kind of vectorization. +* Neither `Vector128` or `Vector256` are accelerated. + +It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration from unit test level. It can be controlled with environment variables before .NET process is started: + +* When `COMPlus_EnableAVX2` is set to `0`, `Vector256.IsHardwareAccelerated` returns `false`. +* When `COMPlus_EnableAVX` is set to `0`, `Vector128.IsHardwareAccelerated` returns `false`. +* When `COMPlus_EnableHWIntrinsic` is set to `0`, not only both mentioned APIs return `false`, but also `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. 
+ +Assuming that we run the tests on an `x64` machine that supports `Vector256` we need to write tests that cover all size scenarios and run them with: +* no custom settings +* `COMPlus_EnableAVX2=0` +* `COMPlus_EnableAVX=0` (it can be skipped if `Vector64` and `Vector` are not involved) +* `COMPlus_EnableHWIntrinsic=0` + +### Benchmarking + +All that complexity needs to pay off. We need to **benchmark the code to verify that the investment is beneficial**. We can do that with [BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet). + +#### Custom config + +It's possible to define a config that instructs the harness to run the benchmarks for all four scenarios: + +```cs +static void Main(string[] args) +{ + Job enough = Job.Default + .WithWarmupCount(1) + .WithIterationTime(TimeInterval.FromSeconds(0.25)) + .WithMaxIterationCount(20); + + IConfig config = DefaultConfig.Instance + .HideColumns(Column.EnvironmentVariables, Column.RatioSD, Column.Error) + .AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig + (exportGithubMarkdown: true, printInstructionAddresses: false))) + .AddJob(enough.WithEnvironmentVariable("COMPlus_EnableHWIntrinsic", "0").WithId("Scalar").AsBaseline()); + + if (Vector256.IsHardwareAccelerated) + { + config = config + .AddJob(enough.WithId("Vector256")) + .AddJob(enough.WithEnvironmentVariable("COMPlus_EnableAVX2", "0").WithId("Vector128")); + + } + else if (Vector128.IsHardwareAccelerated) + { + config = config.AddJob(enough.WithId("Vector128")); + } + + BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly) + .Run(args, config); +} +``` + +**Note:** the config defines a [disassembler](https://adamsitnik.com/Disassembly-Diagnoser/), which exports a disassembly in GitHub markdown format (supported on both x64 and arm64, Windows and Linux). It is very often an invaluable tool when working with high-performance code where inspecting generated assembly code is required. + +#### Memory alignment + +BenchmarkDotNet does a lot of heavy lifting for the end users, but it can not protect us from the random memory alignment which can be different per each benchmark run and affect the stability of the benchmarks. + +We have three possibilities: + +* We can enforce the alignment ourselves and have very stable results. +* We can ask the harness to try to randomize the memory and observe entire possible distribution with each run. +* We can do nothing and wonder why the results vary from time to time. + +##### Enforcing memory alignment + +We can allocate aligned unmanaged memory by using the [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). 
+ +```cs +public unsafe class Benchmarks +{ + private void* _pointer; + + [Params(6, 32, 1024)] // test various sizes + public uint Size; + + [GlobalSetup] + public void Setup() + { + _pointer = NativeMemory.AlignedAlloc(byteCount: Size * sizeof(int), alignment: 32); + new Span(_pointer, (int)Size).Fill(0); // ensure it's all zeros, so 1 is never found + } + + [Benchmark] + public bool Contains() + { + ReadOnlySpan buffer = new (_pointer, (int)Size); + return buffer.Contains(1); + } + + [GlobalCleanup] + public void Cleanup() => NativeMemory.AlignedFree(_pointer); +} +``` + +Sample results (please mind the AVX2, AVX and SSE4.2 information printed in the summary): + +```ini +BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1413/22H2/2022Update/SunValley2) +AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores +.NET SDK=8.0.100-alpha.1.22558.1 + [Host] : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2 + Scalar : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT + Vector128 : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX + Vector256 : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2 +``` + +``` +| Method | Job | Size | Mean | StdDev | Ratio | Code Size | +|--------- |---------- |----- |-----------:|----------:|------:|----------:| +| Contains | Scalar | 1024 | 143.844 ns | 0.6234 ns | 1.00 | 206 B | +| Contains | Vector128 | 1024 | 104.544 ns | 1.2792 ns | 0.73 | 335 B | +| Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | +``` + +The results should be very stable (flat distributions), but on the other hand we are measuring the performance of best case scenario (the input is large and it's entire content is searched for, as the value is never found). + +Explaining benchmark design guidelines is outside of the scope of this document, but we have a [dedicated document](https://github.com/dotnet/performance/blob/main/docs/microbenchmark-design-guidelines.md#benchmarks-are-not-unit-tests) about it. To make a long story short, **you should benchmark all scenarios that are realistic for your production environment**, so your customers can actually benefit from your improvements. + +##### Memory randomization + +The alternative is to enable memory randomization. Before every iteration, the harness is going to allocate random-size objects, keep them alive and re-run the setup that should allocate the actual memory. + +You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587), it requires understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) book has two chapters dedicated to statistics and can help you get a very good understanding of this subject. + +No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, you might not benefit from it, or even regress the performance. + +## Loops + +To work with inputs that are bigger than a single vector, we typically need to loop over the entire input. This should be split into two parts: + +* vectorized loop that operates on multiple values at a time +* handling of the remainder + +Example: our input is a buffer of ten integers, assuming that `Vector128` is accelerated, we handle the first four values in the first loop iteration, the next four in the second iteration and then we stop, as only two are left. 
Depending on how we can handle the remainder, we distinguish two approaches. + +### Scalar remainder handling + +Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128/32) integers in one iteration. In the second loop, we handle the remaining values. + + +```cs +int Sum(Span buffer) +{ + Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); + + // The initial sum is zero, so we need a vector with all elements initialized to zero. + Vector128 sum = Vector128.Zero; + + // We need to obtain the reference to first value in the buffer, it's used later for loading vectors from memory. + ref int searchSpace = ref MemoryMarshal.GetReference(buffer); + // And an offset, that is going to be used by vectorized and scalar loops. + nuint elementOffset = 0; + // And the last valid offset from which we can load the values + nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); + for (; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128.Count) + { + // We load a vector from given offset. + Vector128 loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); + // We add 4 integers at a time: + sum += loaded; + } + + // We sum all 4 integers from the vector to one + int result = Vector128.Sum(sum); + + // And handle the remaining elements, in a non-vectorized way: + while (elementOffset < (nuint)buffer.Length) + { + result += buffer[(int)elementOffset]; + elementOffset++; + } + + return result; +} +``` + +**Note:** Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffer scenarios. If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. It can be used for pinning but must never be dereferenced. + +**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! + +**Note:** Please keep in mind that `Vector128.Sum` is a static method. `Vectior128` and `Vector256` provide both instance and static methods (operators like `+` are just static methods in C#). `Vector128` and `Vector256` are non-generic static classes with static methods only. It's important to know about their existence when searching for methods. + +### Vectorized remainder handling + +Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. + +Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration, we need to handle the remaining two, but it's less than `Vector128` size, so we handle last four elements. Which means that two values in the middle get checked twice. + +```cs +bool Contains(Span buffer, int searched) +{ + Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); + + Vector128 loaded; + // We need a vector for storing the searched value. 
+ Vector128 values = Vector128.Create(searched); + + ref int searchSpace = ref MemoryMarshal.GetReference(buffer); + nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); + for (nuint elementOffset = 0; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128.Count) + { + loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); + // compare the loaded vector with searched value vector + if (Vector128.Equals(loaded, values) != Vector128.Zero) + { + return true; // return true if a difference was found + } + } + + // If any elements remain, process the last vector in the search space. + if ((uint)buffer.Length % Vector128.Count != 0) + { + loaded = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd); + if (Vector128.Equals(loaded, values) != Vector128.Zero) + { + return true; + } + } + + return false; +} +``` + +`Vector128.Create(value)` creates a new vector with all elements initialized to the specified value. So `Vector128.Zero` is an equivalent of `Vector128.Create(0)`. + +`Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector whose elements are all-bits-set or zero, depending on if the provided elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. + +### AV testing + +Handling the remainder in an invalid way, may lead to non-deterministic and hard to diagnose issues. + +Let's look at the following code: + +```diff +nuint elementOffset = 0; +while (elementOffset < (nuint)buffer.Length) +{ + loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); + + elementOffset += (nuint)Vector128.Count; +} +``` + +How many time the loop is going to execute for a buffer of six integers? Twice! The first time it's going to load the first four elements, the second time it's going to load the two last elements and turn random memory that is following the buffer into next two elements! + +Writing tests that detect such issues is hard, but not impossible. .NET Team uses a helper utility called [BoundedMemory](https://github.com/dotnet/runtime/blob/main/src/libraries/Common/tests/TestUtilities/System/Buffers/BoundedMemory.Creation.cs) that allocates memory region which is immediately preceded by or immediately followed by a poison (`MEM_NOACCESS`) page. Attempting to read the memory immediately before or after it results in `AccessViolationException`. + +## Loading and storing vectors + +### Loading + +Both `Vector128` and `Vector256` provide at least five ways of loading them from memory: + +```cs +public static class Vector128 +{ + public static Vector128 Load(T* source) where T : unmanaged; + public static Vector128 LoadAligned(T* source) where T : unmanaged; + public static Vector128 LoadAlignedNonTemporal(T* source) where T : unmanaged; + public static Vector128 LoadUnsafe(ref T source) where T : struct; + public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; +} +``` + +The first three overloads require a pointer to the source. To be able to use a pointer in a safe way, the buffer needs to be pinned first (the GC is not tracking unmanaged pointers, we have to ensure that the memory does not get moved by GC in the meantime, as the pointers would silently become invalid). 
That is simple, the problem is doing the pointer arithmetic right: + +```cs +unsafe int UnmanagedPointersSum(Span buffer) +{ + fixed (int* pBuffer = buffer) + { + int* pEnd = pBuffer + buffer.Length; + int* pOneVectorFromEnd = pEnd - Vector128.Count; + int* pCurrent = pBuffer; + + Vector128 sum = Vector128.Zero; + + while (pCurrent <= pOneVectorFromEnd) + { + sum += Vector128.Load(pCurrent); + + pCurrent += Vector128.Count; + } + + int result = Vector128.Sum(sum); + + while (pCurrent < pEnd) + { + result += *pCurrent; + + pCurrent++; + } + + return result; + } +} +``` + +The `LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. + +Currently .NET exposes only one API fo allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. + +The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code is hardly ever worth it and needs to be proved with proper benchmarking on various hardware. + +The fourth method expects only a managed reference (`ref T source`). We don't need to pin the buffer (GC is tracking managed references and updates them if memory gets moved), but it still requires us to properly handle managed pointer arithmetic: + +```cs +int ManagedReferencesSum(int[] buffer) +{ + ref int current = ref MemoryMarshal.GetArrayDataReference(buffer); + ref int end = ref Unsafe.Add(ref current, buffer.Length); + ref int oneVectorAwayFromEnd = ref Unsafe.Add(ref end, -Vector128.Count); + + Vector128 sum = Vector128.Zero; + + while (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd)) + { + sum += Vector128.LoadUnsafe(ref current); + + current = ref Unsafe.Add(ref current, Vector128.Count); + } + + int result = Vector128.Sum(sum); + + while (Unsafe.IsAddressLessThan(ref current, ref end)) + { + result += current; + + current = ref Unsafe.Add(ref current, 1); + } + + return result; +} +``` + +**Note:** `Unsafe` does not expose a method called "IsGreaterOrEqualThan", so we are using a negation of `Unsafe.IsAddressGreaterThan` to achieve desired effect. + +**Pointer arithmetic can always go wrong, even if you are an experienced engineer and get a very detailed code review from .NET architects**. In [#73768](https://github.com/dotnet/runtime/pull/73768) a GC hole was introduced. The code looked simple: + +```cs +ref TValue currentSearchSpace = ref Unsafe.Add(ref searchSpace, length - Vector128.Count); + +do +{ + equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref currentSearchSpace)); + if (equals == Vector128.Zero) + { + currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128.Count); + continue; + } + + return ...; +} +while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); +``` + +It was part of `LastIndexOf` implementation, where we were iterating from the end to the beginning of the buffer. 
In the last iteration of the loop, `currentSearchSpace` could become a pointer to unknown memory that lied before the beginning of the buffer: + +```cs +currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128.Count); +``` + +And it was fine until GC kicked right after that, moved objects in memory, updated all valid managed references and resumed the execution, which run following condition: + +```cs +while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); +``` + +Which could return true because `currentSearchSpace` was invalid and not updated. If you are interested in more details, you can check the [issue](https://github.com/dotnet/runtime/issues/75792#issuecomment-1249973858) and the [fix](https://github.com/dotnet/runtime/pull/75857). + +That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic!** + +```cs +public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; +``` + +**The only thing we need to keep in mind is potential `nuint` overflow when doing unsigned integer arithmetic.** + +```cs +Span buffer = new int[2] { 1, 2 }; +nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); +Console.WriteLine(oneVectorAwayFromEnd); +``` + +Can you guess the result? For a 64 bit process it's `FFFFFFFFFFFFFFFE` (a hex representation of `18446744073709551614`)! That is why the length of the buffer needs to be always checked before doing similar computations! + +### Storing + +Similarly to loading, both `Vector128` and `Vector256` provide at least five ways of storing them in memory: + +```cs +public static class Vector128 +{ + public static void Store(this Vector128 source, T* destination) where T : unmanaged; + public static void StoreAligned(this Vector128 source, T* destination) where T : unmanaged; + public static void StoreAlignedNonTemporal(this Vector128 source, T* destination) where T : unmanaged; + public static void StoreUnsafe(this Vector128 source, ref T destination) where T : struct; + public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; +} +``` + +For the reasons described for loading, we recommend using the overload that takes managed reference and element offset: + +```cs +public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; +``` + +**Note**: when loading values from one buffer and storing them into another, we need to consider whether they overlap or not. [MemoryExtensions.Overlap](https://learn.microsoft.com/dotnet/api/system.memoryextensions.overlaps#system-memoryextensions-overlaps-1(system-readonlyspan((-0))-system-readonlyspan((-0)))) is an API for doing that. + +### Casting + +As mentioned before, `Vector128` and `Vector256` are constrained to a specific set of primitive types. `char` is not one of them, but it does not mean that we can't implement vectorized text operations with the new APIs. For primitive types of the same size (and value types that don't contain references), casting is the solution. 
+ +[Unsafe.As](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.as#system-runtime-compilerservices-unsafe-as-2(-0@)) can be used to get a reference to supported type: + +```cs +void CastingReferences(Span buffer) +{ + ref char charSearchSpace = ref MemoryMarshal.GetReference(buffer); + ref short searchSpace = ref Unsafe.As(ref charSearchSpace); + // from now on we can use Vector128 or Vector256 +} +``` + +Or [MemoryMarshal.Cast](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.memorymarshal.cast#system-runtime-interopservices-memorymarshal-cast-2(system-readonlyspan((-0)))), which casts a span of one primitive type to a span of another primitive type: + +```cs +void CastingSpans(Span chars) +{ + Span shorts = MemoryMarshal.Cast(chars); +} +``` + +It's also possible to get managed references from unmanaged pointers: + +```cs +void PointerToReference(char* pUtf16Buffer, byte* pAsciiBuffer) +{ + // of the same type: + ref byte asciiBuffer = ref *pAsciiBuffer; + // of different types: + ref ushort utf16Buffer = ref *(ushort*)pUtf16Buffer; +} +``` + +We should avoid doing this in the opposite direction, as most engineers will assume that unmanaged pointers are already pinned. + +## Mindset + +Vectorizing real-world algorithms seems complex at the beginning. And what do software engineers do with complex problems? We break them down into sub-problems until these become simple enough to be solved directly. + +Let's implement a vectorized method for checking whether a given byte buffer consists only from valid ASCII characters to see how similar problems can be solved. + +### Edge cases + +Before we start working on the implementation, let's list all edge cases for our `IsAcii(ReadOnlySpan buffer)` method (and ideally write tests): + +* It does not need to throw any argument exceptions, as `ReadOnlySpan` is `struct` and it can never be `null` or invalid. +* It should return `true` for an empty buffer. +* It should detect invalid characters in the entire buffer, including the remainder. +* It should not read any bytes that don't belong to the provided buffer. + +### Scalar solution + +Once we know all edge cases, we need to understand our problem and find a scalar solution. + +ASCII characters are values in the range from `0` to `127` (inclusive). It means that we can find invalid ASCII bytes by just searching for values that are larger than `127`. If we treat `byte` (unsigned) as `sbyte` (signed), it's a matter of performing "is less than zero" check. + +The binary representation of 0-127 range is following: + +```log +00000000 +01111111 +^ +most significant bit +``` + +When we look at it, we can realize that another way is checking whether the most significant bit is equal `1`. For the scalar version, we could perform a logical AND: + +```cs +bool IsValidAscii(byte c) => (c & 0b1000_0000) == 0; +``` + +### Vectorized solution + +Another step is vectorizing our scalar solution and choosing the best way of doing that based on data. 
+ +If we reuse one of the loops presented in the previous sections, all we need to implement is a method that accepts `Vector128` and returns `bool` and does exactly the same thing that our scalar method did, but for a vector rather than single value: + +```cs +[MethodImpl(MethodImplOptions.AggressiveInlining)] +bool VectorContainsNonAsciiChar(Vector128 asciiVector) +{ + // to perform "> 127" check we can use GreaterThanAny method: + return Vector128.GreaterThanAny(asciiVector, Vector128.Create((byte)127)) + // to perform "< 0" check, we need to use AsSByte and LessThanAny methods: + return Vector128.LessThanAny(asciiVector.AsSByte(), Vector128.Zero) + // to perform an AND operation, we need to use & operator + return (asciiVector & Vector128.Create((byte)0b_1000_0000)) != Vector128.Zero; + // we can also just use ExtractMostSignificantBits method: + return asciiVector.ExtractMostSignificantBits() != 0; +} +``` + +We can also use the hardware-specific instructions if they are available: + +```cs +if (Sse41.IsSupported) +{ + return !Sse41.TestZ(asciiVector, Vector128.Create((byte)0b_1000_0000)); +} +else if (AdvSimd.Arm64.IsSupported) +{ + Vector128 maxBytes = AdvSimd.Arm64.MaxPairwise(asciiVector, asciiVector); + return (maxBytes.AsUInt64().ToScalar() & 0x8080808080808080) != 0; +} +``` + +Benchmark all available solutions, and choose the one that is the best for us. + +```ini +BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1413/22H2/2022Update/SunValley2) +AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores +.NET SDK=8.0.100-alpha.1.22558.1 + [Host] : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2 +``` + +``` +| Method | Size | Mean | Ratio | Code Size | +|--------------------------- |----- |----------:|------:|----------:| +| Scalar | 1024 | 252.13 ns | 1.00 | 69 B | +| GreaterThanAny | 1024 | 32.49 ns | 0.13 | 178 B | +| LessThanAny | 1024 | 29.33 ns | 0.12 | 146 B | +| And | 1024 | 26.13 ns | 0.10 | 138 B | +| TestZ | 1024 | 27.26 ns | 0.11 | 129 B | +| ExtractMostSignificantBits | 1024 | 27.33 ns | 0.11 | 141 B | +``` + +Even such a simple problem can be solved in at least 5 different ways. Using sophisticated hardware-specific instructions does not always provide the best performance, so **with the new `Vector128` and `Vector256` APIs we don't need to become assembly language experts to write fast, vectorized code**. + +## Toolchain + +`Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, it's important to remember what kind of operations they allow for and check the details when needed. + +### Creation + +Each of the vector types provides a `Create` method that accepts a single value and returns a vector with all elements initialized to this value. + +```cs +public static Vector128 Create(T value) where T : struct; +``` + +`CreateScalar` initializes first element to the specified value, and the remaining elements to zero. + +```cs +public static Vector128 CreateScalar(int value); +``` + +`CreateScalarUnsafe` is similar, but the remaining elements are left uninitialized. It's dangerous! 
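For illustration, a rough sketch of the difference between the first two (the comments show how `ToString` renders the resulting vectors for `int` elements):

```cs
Vector128<int> broadcast = Vector128.Create(5);    // every element is 5
Vector128<int> scalar = Vector128.CreateScalar(5); // first element is 5, the rest are zero
Console.WriteLine(broadcast); // <5, 5, 5, 5>
Console.WriteLine(scalar);    // <5, 0, 0, 0>
```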
+ + +We also have an overload that allows for specifying every value in given vector: + +```cs +public static Vector128 Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7) +``` + +And last, but not least a `Create` overload that accepts a buffer. It creates a vector with its elements set to the first `VectorXYZ.Count`-many elements of the buffer. It's not recommended to use it in a loop, where `Load` methods should be used instead (performance). + +```cs +public static Vector128 Create(ReadOnlySpan values) where T : struct +``` + +to perform a copy in the other direction, we can use one of the `CopyTo` extension methods: + +```cs +public static void CopyTo(this Vector128 vector, Span destination) where T : struct +``` + +### Bit operations + +All size-specific vector types provide a set of APIs for common bit operations. + +`BitwiseAnd` computes the bitwise-and of two vectors, `BitwiseOr` computes the bitwise-or of two vectors. They can both be expressed by using the corresponding operators (`&` and `|`). The same goes for `Xor` which can be expressed with `^` operator and `Negate` (`~`). + +```cs +public static Vector128 BitwiseAnd(Vector128 left, Vector128 right) where T : struct => left & right; +public static Vector128 BitwiseOr(Vector128 left, Vector128 right) where T : struct => left | right; +public static Vector128 Xor(Vector128 left, Vector128 right) => left ^ right; +public static Vector128 Negate(Vector128 vector) => ~vector; +``` + +`AndNot` computes the bitwise-and of a given vector and the ones complement of another vector. + +```cs +public static Vector128 AndNot(Vector128 left, Vector128 right) => left & ~right; +``` + +`ShiftLeft` shifts each element of a vector left by the specified number of bits. +`ShiftRightArithmetic` performs a **signed** shift right and `ShiftRightLogical` performs an **unsigned** shift: + +```cs +public static Vector128 ShiftLeft(Vector128 vector, int shiftCount); +public static Vector128 ShiftRightArithmetic(Vector128 vector, int shiftCount); +public static Vector128 ShiftRightLogical(Vector128 vector, int shiftCount); +``` + +### Equality + +`EqualsAll` compares two vectors to determine if all elements are equal. `EqualsAny` compares two vectors to determine if any elements are equal. + +```cs +public static bool EqualsAll(Vector128 left, Vector128 right) where T : struct => left == right; +public static bool EqualsAny(Vector128 left, Vector128 right) where T : struct +``` + +`Equals` compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on if the corresponding elements in `left` and `right` arguments were equal. + +```cs +public static Vector128 Equals(Vector128 left, Vector128 right) where T : struct +``` + +How to calculate the index of first match? Let's take a closer look at the result of following equality check: + +```cs +Vector128 left = Vector128.Create(1, 2, 3, 4); +Vector128 right = Vector128.Create(0, 0, 3, 0); +Vector128 equals = Vector128.Equals(left, right); +Console.WriteLine(equals); +``` + +```log +<0, 0, -1, 0> +``` + +`-1` is just `FFFFFFFF` (all-bits-set). We could use `GetElement` to get the first non-zero element. + +```cs +public static T GetElement(this Vector128 vector, int index) where T : struct +``` + +But it would not be an optimal solution. 
We should rather extract the most significant bits: + +```cs +uint mostSignificantBits = equals.ExtractMostSignificantBits(); +Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0')); +``` + +```log +00000000000000000000000000000100 +``` + +and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) to get trailing zero count. + +To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, and indexed from 0). + +If we were working with a buffer loaded from memory (example: searching for the last index of given character in a buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. + +```cs +int ComputeLastIndex(nint elementOffset, Vector128 equals) where T : struct +{ + uint mostSignificantBits = equals.ExtractMostSignificantBits(); + + int index = 31 - BitOperations.LeadingZeroCount(mostSignificantBits); // 31 = 32 (bits in UInt32) - 1 (indexing from zero) + + return (int)elementOffset + index; +} +``` + +If we were using the `Load` overload that takes only the managed reference, we could use [Unsafe.ByteOffset(ref T, ref T)](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.byteoffset) to calculate the element offset. + +```cs +unsafe int ComputeFirstIndex(ref T searchSpace, ref T current, Vector128 equals) where T : struct +{ + int elementOffset = (int)Unsafe.ByteOffset(ref searchSpace, ref current) / sizeof(T); + + uint mostSignificantBits = equals.ExtractMostSignificantBits(); + int index = BitOperations.TrailingZeroCount(mostSignificantBits); + + return elementOffset + index; +} +``` + +### Comparison + +Beside equality checks, vector APIs allow for comparison. The `bool` returning overload return `true` when given condition is true: + +```cs +public static bool GreaterThanAll(Vector128 left, Vector128 right) where T : struct +public static bool GreaterThanAny(Vector128 left, Vector128 right) where T : struct +public static bool GreaterThanOrEqualAll(Vector128 left, Vector128 right) where T : struct +public static bool GreaterThanOrEqualAny(Vector128 left, Vector128 right) where T : struct +public static bool LessThanAll(Vector128 left, Vector128 right) where T : struct +public static bool LessThanAny(Vector128 left, Vector128 right) where T : struct +public static bool LessThanOrEqualAll(Vector128 left, Vector128 right) where T : struct +public static bool LessThanOrEqualAny(Vector128 left, Vector128 right) where T : struct +``` + +Similarly to `Equals`, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on if the corresponding elements in `left` and `right` meet given condition. + +```cs +public static Vector128 GreaterThan(Vector128 left, Vector128 right) where T : struct +public static Vector128 GreaterThanOrEqual(Vector128 left, Vector128 right) where T : struct +public static Vector128 LessThan(Vector128 left, Vector128 right) where T : struct +public static Vector128 LessThanOrEqual(Vector128 left, Vector128 right) where T : struct +``` + +`ConditionalSelect` Conditionally selects a value from two vectors on a bitwise basis. 
+ +```cs +public static Vector128 ConditionalSelect(Vector128 condition, Vector128 left, Vector128 right) +``` + +This method deserves a self-describing example: + +```cs +Vector128 left = Vector128.Create(1.0f, 2, 3, 4); +Vector128 right = Vector128.Create(4.0f, 3, 2, 1); + +Vector128 result = Vector128.ConditionalSelect(Vector128.GreaterThan(left, right), left, right); + +Assert.Equal(Vector128.Create(4.0f, 3, 3, 4), result); +``` + +### Math + +Very simple math operations can be also expressed by using the operators: + +```cs +public static Vector128 Add(Vector128 left, Vector128 right) where T : struct => left + right; +public static Vector128 Divide(Vector128 left, Vector128 right) => left / right; +public static Vector128 Divide(Vector128 left, T right) => left / right; +public static Vector128 Multiply(Vector128 left, Vector128 right) => left * right; +public static Vector128 Multiply(Vector128 left, T right) => left * right; +public static Vector128 Subtract(Vector128 left, Vector128 right) => left - right; +``` + +**Note:** Some of the methods accept a single value as the second argument. + +`Abs`, `Ceiling`, `Floor`, `Max`, `Min`, `Sqrt` and `Sum` are also provided: + +```cs +public static Vector128 Abs(Vector128 vector) where T : struct +public static Vector128 Ceiling(Vector128 vector) +public static Vector128 Floor(Vector128 vector) +public static Vector128 Max(Vector128 left, Vector128 right) +public static Vector128 Min(Vector128 left, Vector128 right) +public static Vector128 Sqrt(Vector128 vector); +public static T Sum(Vector128 vector) where T : struct +``` + +### Conversion + +Vector types provide a set of methods dedicated to numbers conversion: + +```cs +public static unsafe Vector128 ConvertToDouble(Vector128 vector) +public static unsafe Vector128 ConvertToDouble(Vector128 vector) +public static unsafe Vector128 ConvertToInt32(Vector128 vector) +public static unsafe Vector128 ConvertToInt64(Vector128 vector) +public static unsafe Vector128 ConvertToSingle(Vector128 vector) +public static unsafe Vector128 ConvertToSingle(Vector128 vector) +public static unsafe Vector128 ConvertToUInt32(Vector128 vector) +public static unsafe Vector128 ConvertToUInt64(Vector128 vector) +``` + +And for reinterpretation (no values are being changed, they can be just used as if they were of a different type): + +```cs +public static Vector128 As(this Vector128 vector) +public static Vector128 AsByte(this Vector128 vector) +public static Vector128 AsDouble(this Vector128 vector) +public static Vector128 AsInt16(this Vector128 vector) +public static Vector128 AsInt32(this Vector128 vector) +public static Vector128 AsInt64(this Vector128 vector) +public static Vector128 AsNInt(this Vector128 vector) +public static Vector128 AsNUInt(this Vector128 vector) +public static Vector128 AsSByte(this Vector128 vector) +public static Vector128 AsSingle(this Vector128 vector) +public static Vector128 AsUInt16(this Vector128 vector) +public static Vector128 AsUInt32(this Vector128 vector) +public static Vector128 AsUInt64(this Vector128 vector) +``` + +### Widening and Narrowing + +The first half of every vector is called "lower", the second is "upper". 
+ +``` +------------------------------128-bits--------------------------- +| LOWER | UPPER | +----------------------------------------------------------------- +| 32 | 32 | 32 | 32 | +----------------------------------------------------------------| +| 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | +----------------------------------------------------------------- +| 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | +----------------------------------------------------------------- +``` + +In case of `Vector128`, `GetLower` gets the value of the lower 64-bits as a new `Vector64` and `GetUpper` gets the upper 64-bits. + +```cs +public static Vector64 GetLower(this Vector128 vector) +public static Vector64 GetUpper(this Vector128 vector) +``` + +Each vector type provides a `Create` method that allows for the creation from lower and upper: + +```cs +public static unsafe Vector128 Create(Vector64 lower, Vector64 upper) +public static Vector256 Create(Vector128 lower, Vector128 upper) +``` + +`Lower` and `Upper` are also used by `Widen`. This method widens a `Vector128` into two `Vector128` where `sizeof(T2) == 2 * sizeof(T1)`. + +```cs +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +``` + +It's also possible to widen only the lower or upper part: + +```cs +public static Vector128 WidenLower(Vector128 source) +public static Vector128 WidenUpper(Vector128 source) +``` + +An example of widening is converting a buffer of ASCII bytes into characters: + +```cs +byte[] byteBuffer = Enumerable.Range('A', 128 / 8).Select(i => (byte)i).ToArray(); +Vector128 byteVector = Vector128.Create(byteBuffer); +Console.WriteLine(byteVector); +(Vector128 Lower, Vector128 Upper) = Vector128.Widen(byteVector); +Console.Write(Lower.AsByte()); +Console.WriteLine(Upper.AsByte()); + +Vector256 ushortVector = Vector256.Create(Lower, Upper); +Span ushortBuffer = stackalloc ushort[256 / 16]; +ushortVector.CopyTo(ushortBuffer); +Span charBuffer = MemoryMarshal.Cast(ushortBuffer); +Console.WriteLine(new string(charBuffer)); +``` + +```log +<65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80> +<65, 0, 66, 0, 67, 0, 68, 0, 69, 0, 70, 0, 71, 0, 72, 0><73, 0, 74, 0, 75, 0, 76, 0, 77, 0, 78, 0, 79, 0, 80, 0> +ABCDEFGHIJKLMNOP +``` + +`Narrow` is the opposite of `Widen`. 
+ +```cs +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +``` + +In contrary to [Sse2.PackUnsignedSaturate](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.x86.sse2.packunsignedsaturate) and [AdvSimd.Arm64.UnzipEven](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.arm.advsimd.arm64.unzipeven), `Narrow` applies a mask via AND to cut anything above the max value of returned vector: + + +```cs +Vector256 ushortVector = Vector256.Create((ushort)300); +Console.WriteLine(ushortVector); +unchecked { Console.WriteLine((byte)300); } +Console.WriteLine(300 & byte.MaxValue); +Console.WriteLine(Vector128.Narrow(ushortVector.GetLower(), ushortVector.GetUpper())); + +if (Sse2.IsSupported) +{ + Console.WriteLine(Sse2.PackUnsignedSaturate(ushortVector.GetLower().AsInt16(), ushortVector.GetUpper().AsInt16())); +} +``` + +```log +<300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300> +44 +44 +<44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44> +<255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255> +``` + +### Shuffle + +`Shuffle` creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes if the input vector). + +```cs +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +``` + +It can be used for many things, including reversing the input: + +```cs +Vector128 intVector = Vector128.Create(100, 200, 300, 400); +Console.WriteLine(intVector); +Console.WriteLine(Vector128.Shuffle(intVector, Vector128.Create(3, 2, 1, 0))); +``` + +```log +<100, 200, 300, 400> +<400, 300, 200, 100> +``` + +#### Vector256.Shuffle vs Avx2.Shuffle + +`Vector256.Shuffle` and `Avx2.Shuffle` are not identical. + +`Avx2.Shuffle` is effectively `2x128-bit ops` and so if we do `Vector256.Shuffle(value, Vector256.Create(0L, 1L, 0L, 1L))` it is going to think we want `value[0], value[1], value[0], value[1]`. Where-as `Avx2.Shuffle` treats this as `value[0], value[1], value[2], value[3]`. + +While `Vector256.Shuffle` treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where `AVX-512` and `SVE` all operate on "full width". + +## Summary + +The main goal of the new `Vector128` and `Vector256` APIs is to make writing fast, vectorized code possible without becoming familiar with hardware-specific instructions and becoming an assembly language expert. 
Our recommendations depend on your current expertise level, software you maintain and the one you need to create: + +- If you are already an expert and you have vectorized your code for both `x64/x86` and `arm64/arm` code you can use the new APIs to simplify your code, but you most likely won't observe any performance gains. [#64451](https://github.com/dotnet/runtime/issues/64451) lists the places where it was/can be done in dotnet/runtime. You can use links to the merged PRs to see real-life examples. +- If you have already vectorized your code, but only for `x64/x86` or `arm64/arm`, you can use the new APIs to have a single, cross-platform implementation. +- If you have already vectorized your code with `Vector` you can use the new APIs to check if they can produce better codegen. +- If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new `Vector128` and `Vector256` APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths. + +### Best practices + +1. Implement tests that cover all code paths, including Acces Violation. +2. Run tests for all hardware acceleration scenarios, use the existing env vars to do that. +3. Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users. +4. Prefer managed references over unsafe pointers to avoid pinning and safety issues. +5. Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffers correctly. +6. Prefer `LoadUnsafe(ref T, nuint elementOffset)` and `StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset)` over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow! +7. Always handle the vectorized loop remainder. +8. When storing values in memory, be aware of a potential buffer overlap. +9. When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs. Over time, you may gain enough experience to skip the scalar step. +10. Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. It's also possible to perform equality checks, various bit and math operations. Don't try to memorize all the details, treat these APIs as a cookbook that you come back to when needed. 
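As a closing illustration, the pieces discussed above can be put together into the `IsAscii` example from the [Mindset](#mindset) section. This is only a sketch that follows the conventions used throughout this document (little endian, `Vector128` path only, vectorized remainder handling); it is not the tuned implementation shipped in the libraries:

```cs
static bool IsAscii(ReadOnlySpan<byte> buffer)
{
    if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<byte>.Count)
    {
        ref byte searchSpace = ref MemoryMarshal.GetReference(buffer);
        nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<byte>.Count);

        // Vectorized loop: check a full vector of bytes at a time via their most significant bits.
        nuint elementOffset = 0;
        for (; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128<byte>.Count)
        {
            if (Vector128.LoadUnsafe(ref searchSpace, elementOffset).ExtractMostSignificantBits() != 0)
            {
                return false;
            }
        }

        // Vectorized remainder handling: re-check the last full vector of the buffer.
        if (buffer.Length % Vector128<byte>.Count != 0 &&
            Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd).ExtractMostSignificantBits() != 0)
        {
            return false;
        }

        return true;
    }

    // Scalar path: non-accelerated hardware and inputs smaller than a single vector.
    foreach (byte value in buffer)
    {
        if ((value & 0b1000_0000) != 0)
        {
            return false;
        }
    }

    return true;
}
```

In production code, a `Vector256` branch like the one shown in the [Code structure](#code-structure) section would typically be placed in front of the `Vector128` path.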
From 5e89ed811d8710ac40701385982df23ac253fdf3 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 30 Mar 2023 14:21:09 +0200 Subject: [PATCH 02/14] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Günther Foidl --- .../vectorization-guidelines.md | 42 +++++++++---------- 1 file changed, 20 insertions(+), 22 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 394b91af3ba0c5..cc7d5b93eba8ab 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -38,7 +38,7 @@ Vectorization is an art of converting an algorithm from operating on a single va In the recent releases, .NET has introduced plenty of APIs for vectorization. Vast majority of them were hardware specific. It required the users to provide implementation per processor architecture (x64 and/or arm64), with a possibility to use the most optimal instructions for hardware that is executing the code. -.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` that aim for writing hardware-agnostic vectorized code. The purpose of this document is to introduce the readers to the new APIs and provide a set of best practices. +.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` that aim for writing hardware-agnostic, and cross platform vectorized code. The purpose of this document is to introduce the readers to the new APIs and provide a set of best practices. ## Code structure @@ -64,9 +64,9 @@ Each `Vector128` operation allows to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u) ----------------------------------------------------------------- ``` -`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, we should prefer it over a `Vector128`. To check the acceleration, we need to use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough we should prefer it over a `Vector128`. To check the acceleration, we need to use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. -We also must account for the size of the input. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector128.Count` return the size of a vector of given type in bytes. +We also must account for the size of the input. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the size of a vector of given type in bytes. Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. It's not true for pre-compiled code (NativeAOT). That is why the code is very often structured like this: @@ -113,15 +113,15 @@ Such a code structure requires us to **test all possible code paths**: It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration from unit test level. It can be controlled with environment variables before .NET process is started: -* When `COMPlus_EnableAVX2` is set to `0`, `Vector256.IsHardwareAccelerated` returns `false`. -* When `COMPlus_EnableAVX` is set to `0`, `Vector128.IsHardwareAccelerated` returns `false`. 
-* When `COMPlus_EnableHWIntrinsic` is set to `0`, not only both mentioned APIs return `false`, but also `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. +* When `DOTNET_EnableAVX2` is set to `0`, `Vector256.IsHardwareAccelerated` returns `false`. +* When `DOTNET_EnableAVX` is set to `0`, `Vector128.IsHardwareAccelerated` returns `false`. +* When `DOTNET_EnableHWIntrinsic` is set to `0`, not only both mentioned APIs return `false`, but also `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. Assuming that we run the tests on an `x64` machine that supports `Vector256` we need to write tests that cover all size scenarios and run them with: * no custom settings -* `COMPlus_EnableAVX2=0` -* `COMPlus_EnableAVX=0` (it can be skipped if `Vector64` and `Vector` are not involved) -* `COMPlus_EnableHWIntrinsic=0` +* `DOTNET_EnableAVX2=0` +* `DOTNET_EnableAVX=0` (it can be skipped if `Vector64` and `Vector` are not involved) +* `DOTNET_EnableHWIntrinsic=0` ### Benchmarking @@ -143,13 +143,13 @@ static void Main(string[] args) .HideColumns(Column.EnvironmentVariables, Column.RatioSD, Column.Error) .AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig (exportGithubMarkdown: true, printInstructionAddresses: false))) - .AddJob(enough.WithEnvironmentVariable("COMPlus_EnableHWIntrinsic", "0").WithId("Scalar").AsBaseline()); + .AddJob(enough.WithEnvironmentVariable("DOTNET_EnableHWIntrinsic", "0").WithId("Scalar").AsBaseline()); if (Vector256.IsHardwareAccelerated) { config = config .AddJob(enough.WithId("Vector256")) - .AddJob(enough.WithEnvironmentVariable("COMPlus_EnableAVX2", "0").WithId("Vector128")); + .AddJob(enough.WithEnvironmentVariable("DOTNET_EnableAVX2", "0").WithId("Vector128")); } else if (Vector128.IsHardwareAccelerated) @@ -287,7 +287,7 @@ int Sum(Span buffer) } ``` -**Note:** Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffer scenarios. If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. It can be used for pinning but must never be dereferenced. +**Note:** Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. It can be used for pinning but must never be dereferenced. **Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! @@ -321,7 +321,7 @@ bool Contains(Span buffer, int searched) } // If any elements remain, process the last vector in the search space. - if ((uint)buffer.Length % Vector128.Count != 0) + if (buffer.Length % Vector128.Count != 0) { loaded = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd); if (Vector128.Equals(loaded, values) != Vector128.Zero) @@ -338,7 +338,7 @@ bool Contains(Span buffer, int searched) `Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector whose elements are all-bits-set or zero, depending on if the provided elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. 
-### AV testing +### Access violation (AV) testing Handling the remainder in an invalid way, may lead to non-deterministic and hard to diagnose issues. @@ -422,7 +422,7 @@ int ManagedReferencesSum(int[] buffer) { ref int current = ref MemoryMarshal.GetArrayDataReference(buffer); ref int end = ref Unsafe.Add(ref current, buffer.Length); - ref int oneVectorAwayFromEnd = ref Unsafe.Add(ref end, -Vector128.Count); + ref int oneVectorAwayFromEnd = ref Unsafe.Subtract(ref end, Vector128.Count); Vector128 sum = Vector128.Zero; @@ -577,7 +577,7 @@ Before we start working on the implementation, let's list all edge cases for our Once we know all edge cases, we need to understand our problem and find a scalar solution. -ASCII characters are values in the range from `0` to `127` (inclusive). It means that we can find invalid ASCII bytes by just searching for values that are larger than `127`. If we treat `byte` (unsigned) as `sbyte` (signed), it's a matter of performing "is less than zero" check. +ASCII characters are values in the range from `0` to `127` (inclusive). It means that we can find invalid ASCII bytes by just searching for values that are larger than `127`. If we treat `byte` (unsigned, range from 0 to 255) as `sbyte` (signed, range from -128 to 127), it's a matter of performing "is less than zero" check. The binary representation of 0-127 range is following: @@ -746,7 +746,7 @@ Console.WriteLine(equals); <0, 0, -1, 0> ``` -`-1` is just `FFFFFFFF` (all-bits-set). We could use `GetElement` to get the first non-zero element. +`-1` is just `0xFFFFFFFF` (all-bits-set). We could use `GetElement` to get the first non-zero element. ```cs public static T GetElement(this Vector128 vector, int index) where T : struct @@ -1033,9 +1033,7 @@ Console.WriteLine(Vector128.Shuffle(intVector, Vector128.Create(3, 2, 1, 0))); `Vector256.Shuffle` and `Avx2.Shuffle` are not identical. -`Avx2.Shuffle` is effectively `2x128-bit ops` and so if we do `Vector256.Shuffle(value, Vector256.Create(0L, 1L, 0L, 1L))` it is going to think we want `value[0], value[1], value[0], value[1]`. Where-as `Avx2.Shuffle` treats this as `value[0], value[1], value[2], value[3]`. - -While `Vector256.Shuffle` treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where `AVX-512` and `SVE` all operate on "full width". +`Avx2.Shuffle` is effectively `2x128-bit ops` while `Vector256.Shuffle` treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where `AVX-512` and `SVE` all operate on "full width". ## Summary @@ -1048,8 +1046,8 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas ### Best practices -1. Implement tests that cover all code paths, including Acces Violation. -2. Run tests for all hardware acceleration scenarios, use the existing env vars to do that. +1. Implement tests that cover all code paths, including Acces Violations. +2. Run tests for all hardware acceleration scenarios, use the existing environment variables to do that. 3. Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users. 4. Prefer managed references over unsafe pointers to avoid pinning and safety issues. 5. 
Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffers correctly. From fe4aacac0195436a44c5ee8b02e583c142bc631b Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 30 Mar 2023 22:17:47 +0200 Subject: [PATCH 03/14] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Dan Moseley Co-authored-by: Günther Foidl --- .../vectorization-guidelines.md | 36 +++++++++---------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index cc7d5b93eba8ab..32e6dc9ced2415 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -9,7 +9,7 @@ * [Loops](#loops) + [Scalar remainder handling](#scalar-remainder-handling) + [Vectorized remainder handling](#vectorized-remainder-handling) - + [AV testing](#av-testing) + + [Access violation testing](#access-violation-av-testing) * [Loading and storing vectors](#loading-and-storing-vectors) + [Loading](#loading) + [Storing](#storing) @@ -34,11 +34,11 @@ TL;DR: Go to [Summary](#summary) # Introduction to vectorization with Vector128 and Vector256 -Vectorization is an art of converting an algorithm from operating on a single value at a time to operating on a set of values (vector). It can greatly improve performance at a cost of increased code complexity. +Vectorization is the art of converting an algorithm from operating on a single value at a time to operating on a set of values (vector). It can greatly improve performance at a cost of increased code complexity. -In the recent releases, .NET has introduced plenty of APIs for vectorization. Vast majority of them were hardware specific. It required the users to provide implementation per processor architecture (x64 and/or arm64), with a possibility to use the most optimal instructions for hardware that is executing the code. +In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (x64 and/or arm64), with the option of using the most optimal instructions for hardware that is executing the code. -.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` that aim for writing hardware-agnostic, and cross platform vectorized code. The purpose of this document is to introduce the readers to the new APIs and provide a set of best practices. +.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code. The purpose of this document is to introduce you to the new APIs and provide a set of best practices. ## Code structure @@ -50,7 +50,7 @@ In the recent releases, .NET has introduced plenty of APIs for vectorization. Va * `long`, `ulong` and `double` (64 bits). * `nint` and `unit` (32 or 64 bits, depending on the architecture) -Each `Vector128` operation allows to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats and 2 (u)longs/double(s). +A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats, or 2 (u)longs/double(s). 
``` ------------------------------128-bits--------------------------- @@ -64,10 +64,10 @@ Each `Vector128` operation allows to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u) ----------------------------------------------------------------- ``` -`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough we should prefer it over a `Vector128`. To check the acceleration, we need to use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough, you should use it instead of `Vector128`. To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. -We also must account for the size of the input. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the size of a vector of given type in bytes. -Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. It's not true for pre-compiled code (NativeAOT). +The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the size of a vector of given type in bytes. +Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. In case of pre-compiled code (NativeAOT) it's not true for `IsHardwareAccelerated` property, as the required information is not available at compile time. That is why the code is very often structured like this: @@ -190,7 +190,7 @@ public unsafe class Benchmarks public void Setup() { _pointer = NativeMemory.AlignedAlloc(byteCount: Size * sizeof(int), alignment: 32); - new Span(_pointer, (int)Size).Fill(0); // ensure it's all zeros, so 1 is never found + NativeMemory.Clear(_pointer, byteCount: Size * sizeof(int)); // ensure it's all zeros, so 1 is never found } [Benchmark] @@ -235,11 +235,11 @@ The alternative is to enable memory randomization. Before every iteration, the h You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587), it requires understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) book has two chapters dedicated to statistics and can help you get a very good understanding of this subject. -No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, you might not benefit from it, or even regress the performance. +No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, performance might even get worse. ## Loops -To work with inputs that are bigger than a single vector, we typically need to loop over the entire input. This should be split into two parts: +To work with inputs that are bigger than a single vector, you typically need to loop over the entire input. 
This should be split into two parts: * vectorized loop that operates on multiple values at a time * handling of the remainder @@ -248,7 +248,7 @@ Example: our input is a buffer of ten integers, assuming that `Vector128` is acc ### Scalar remainder handling -Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128/32) integers in one iteration. In the second loop, we handle the remaining values. +Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128 bits / 32 bits) integers in one iteration. In the second loop, we handle the remaining values. ```cs @@ -287,7 +287,7 @@ int Sum(Span buffer) } ``` -**Note:** Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. It can be used for pinning but must never be dereferenced. +**Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never dereference it. **Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! @@ -340,7 +340,7 @@ bool Contains(Span buffer, int searched) ### Access violation (AV) testing -Handling the remainder in an invalid way, may lead to non-deterministic and hard to diagnose issues. +Handling the remainder in an invalid way may lead to non-deterministic and hard to diagnose issues. Let's look at the following code: @@ -354,9 +354,9 @@ while (elementOffset < (nuint)buffer.Length) } ``` -How many time the loop is going to execute for a buffer of six integers? Twice! The first time it's going to load the first four elements, the second time it's going to load the two last elements and turn random memory that is following the buffer into next two elements! +How many times will the loop execute for a buffer of six integers? Twice! The first time it will load the first four elements, but the second time it will load the random content of the memory following the buffer! -Writing tests that detect such issues is hard, but not impossible. .NET Team uses a helper utility called [BoundedMemory](https://github.com/dotnet/runtime/blob/main/src/libraries/Common/tests/TestUtilities/System/Buffers/BoundedMemory.Creation.cs) that allocates memory region which is immediately preceded by or immediately followed by a poison (`MEM_NOACCESS`) page. Attempting to read the memory immediately before or after it results in `AccessViolationException`. +Writing tests that detect that issue is hard, but not impossible. 
The .NET Team uses a helper utility called [BoundedMemory](https://github.com/dotnet/runtime/blob/main/src/libraries/Common/tests/TestUtilities/System/Buffers/BoundedMemory.Creation.cs) that allocates a memory region which is immediately preceded by or immediately followed by a poison (`MEM_NOACCESS`) page. Attempting to read the memory immediately before or after it results in `AccessViolationException`. ## Loading and storing vectors @@ -375,7 +375,7 @@ public static class Vector128 } ``` -The first three overloads require a pointer to the source. To be able to use a pointer in a safe way, the buffer needs to be pinned first (the GC is not tracking unmanaged pointers, we have to ensure that the memory does not get moved by GC in the meantime, as the pointers would silently become invalid). That is simple, the problem is doing the pointer arithmetic right: +The first three overloads require a pointer to the source. To be able to use a pointer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right: ```cs unsafe int UnmanagedPointersSum(Span buffer) @@ -409,7 +409,7 @@ unsafe int UnmanagedPointersSum(Span buffer) } ``` -The `LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. +`LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. Currently .NET exposes only one API fo allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. From 1fa325fe565f15dc50c7d95cb2505f31b8f26603 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Fri, 31 Mar 2023 15:01:43 +0200 Subject: [PATCH 04/14] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Rob Hague Co-authored-by: Günther Foidl --- .../vectorization-guidelines.md | 20 +++++++++---------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 32e6dc9ced2415..32839aacd65614 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -66,8 +66,8 @@ A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)short `Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough, you should use it instead of `Vector128`. To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. -The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the size of a vector of given type in bytes. -Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. 
In case of pre-compiled code (NativeAOT) it's not true for `IsHardwareAccelerated` property, as the required information is not available at compile time. +The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. +Both APIs are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. That is why the code is very often structured like this: @@ -111,16 +111,14 @@ Such a code structure requires us to **test all possible code paths**: * The input is too small to benefit from any kind of vectorization. * Neither `Vector128` or `Vector256` are accelerated. -It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration from unit test level. It can be controlled with environment variables before .NET process is started: +It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration at the unit test level. It can be controlled with environment variables before .NET process is started: * When `DOTNET_EnableAVX2` is set to `0`, `Vector256.IsHardwareAccelerated` returns `false`. -* When `DOTNET_EnableAVX` is set to `0`, `Vector128.IsHardwareAccelerated` returns `false`. -* When `DOTNET_EnableHWIntrinsic` is set to `0`, not only both mentioned APIs return `false`, but also `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. +* When `DOTNET_EnableHWIntrinsic` is set to `0`, not only do both mentioned APIs return `false`, but so also do `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. -Assuming that we run the tests on an `x64` machine that supports `Vector256` we need to write tests that cover all size scenarios and run them with: +Assuming that we run the tests on an `x64` machine that supports `Vector256`, we need to write tests that cover all size scenarios and run them with: * no custom settings * `DOTNET_EnableAVX2=0` -* `DOTNET_EnableAVX=0` (it can be skipped if `Vector64` and `Vector` are not involved) * `DOTNET_EnableHWIntrinsic=0` ### Benchmarking @@ -166,12 +164,12 @@ static void Main(string[] args) #### Memory alignment -BenchmarkDotNet does a lot of heavy lifting for the end users, but it can not protect us from the random memory alignment which can be different per each benchmark run and affect the stability of the benchmarks. +BenchmarkDotNet does a lot of heavy lifting for the end users, but it cannot protect us from the random memory alignment which can be different per each benchmark run and can affect the stability of the benchmarks. We have three possibilities: * We can enforce the alignment ourselves and have very stable results. -* We can ask the harness to try to randomize the memory and observe entire possible distribution with each run. +* We can ask the harness to try to randomize the memory and observe the entire possible distribution with each run. * We can do nothing and wonder why the results vary from time to time. 
##### Enforcing memory alignment @@ -233,7 +231,7 @@ Explaining benchmark design guidelines is outside of the scope of this document, The alternative is to enable memory randomization. Before every iteration, the harness is going to allocate random-size objects, keep them alive and re-run the setup that should allocate the actual memory. -You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587), it requires understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) book has two chapters dedicated to statistics and can help you get a very good understanding of this subject. +You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587). It requires an understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) has two chapters dedicated to statistics and can help you get a very good understanding of the subject. No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, performance might even get worse. @@ -295,7 +293,7 @@ int Sum(Span buffer) ### Vectorized remainder handling -Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. +Now imagine that we need to check whether the given buffer contains a specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration, we need to handle the remaining two, but it's less than `Vector128` size, so we handle last four elements. Which means that two values in the middle get checked twice. From e7ab25095c1af29f3624496f21aaf7c2c4f07ba4 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Fri, 31 Mar 2023 15:03:43 +0200 Subject: [PATCH 05/14] Apply suggestions from code review Co-authored-by: Rob Hague --- .../vectorization-guidelines.md | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 32839aacd65614..afcd860d7d9edb 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -223,7 +223,7 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co | Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | ``` -The results should be very stable (flat distributions), but on the other hand we are measuring the performance of best case scenario (the input is large and it's entire content is searched for, as the value is never found). +The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large and its entire contents are searched through, as the value is never found). 
Explaining benchmark design guidelines is outside of the scope of this document, but we have a [dedicated document](https://github.com/dotnet/performance/blob/main/docs/microbenchmark-design-guidelines.md#benchmarks-are-not-unit-tests) about it. To make a long story short, **you should benchmark all scenarios that are realistic for your production environment**, so your customers can actually benefit from your improvements. @@ -295,7 +295,7 @@ int Sum(Span buffer) Now imagine that we need to check whether the given buffer contains a specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. -Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration, we need to handle the remaining two, but it's less than `Vector128` size, so we handle last four elements. Which means that two values in the middle get checked twice. +Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration we need to handle the remaining two elements. Since the remainder is smaller than one `Vector128` and we are not mutating the input, we perform a vectorized operation on a `Vector128` containing the last four elements. ```cs bool Contains(Span buffer, int searched) @@ -332,7 +332,7 @@ bool Contains(Span buffer, int searched) } ``` -`Vector128.Create(value)` creates a new vector with all elements initialized to the specified value. So `Vector128.Zero` is an equivalent of `Vector128.Create(0)`. +`Vector128.Create(value)` creates a new vector with all elements initialized to the specified value. So `Vector128.Zero` is equivalent to `Vector128.Create(0)`. `Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector whose elements are all-bits-set or zero, depending on if the provided elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. @@ -651,7 +651,7 @@ Even such a simple problem can be solved in at least 5 different ways. Using sop ## Toolchain -`Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, it's important to remember what kind of operations they allow for and check the details when needed. +`Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, but it's important to remember what kind of operations they allow for and check the details when needed. ### Creation @@ -676,7 +676,7 @@ We also have an overload that allows for specifying every value in given vector: public static Vector128 Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7) ``` -And last, but not least a `Create` overload that accepts a buffer. 
It creates a vector with its elements set to the first `VectorXYZ.Count`-many elements of the buffer. It's not recommended to use it in a loop, where `Load` methods should be used instead (performance). +And last but not least we have a `Create` overload which accepts a buffer. It creates a vector with its elements set to the first `VectorXYZ.Count` elements of the buffer. It's not recommended to use it in a loop, where `Load` methods should be used instead (for performance). ```cs public static Vector128 Create(ReadOnlySpan values) where T : struct @@ -725,13 +725,13 @@ public static bool EqualsAll(Vector128 left, Vector128 right) where T : public static bool EqualsAny(Vector128 left, Vector128 right) where T : struct ``` -`Equals` compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on if the corresponding elements in `left` and `right` arguments were equal. +`Equals` compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in the `left` and `right` arguments were equal. ```cs public static Vector128 Equals(Vector128 left, Vector128 right) where T : struct ``` -How to calculate the index of first match? Let's take a closer look at the result of following equality check: +How do we calculate the index of the first match? Let's take a closer look at the result of following equality check: ```cs Vector128 left = Vector128.Create(1, 2, 3, 4); @@ -750,7 +750,7 @@ Console.WriteLine(equals); public static T GetElement(this Vector128 vector, int index) where T : struct ``` -But it would not be an optimal solution. We should rather extract the most significant bits: +But it would not be an optimal solution. We should instead extract the most significant bits: ```cs uint mostSignificantBits = equals.ExtractMostSignificantBits(); @@ -761,11 +761,11 @@ Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0')); 00000000000000000000000000000100 ``` -and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) to get trailing zero count. +and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) to get the trailing zero count. -To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, and indexed from 0). +To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, indexed from 0). -If we were working with a buffer loaded from memory (example: searching for the last index of given character in a buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. +If we were working with a buffer loaded from memory (example: searching for the last index of a given character in the buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. 
```cs int ComputeLastIndex(nint elementOffset, Vector128 equals) where T : struct @@ -794,7 +794,7 @@ unsafe int ComputeFirstIndex(ref T searchSpace, ref T current, Vector128 e ### Comparison -Beside equality checks, vector APIs allow for comparison. The `bool` returning overload return `true` when given condition is true: +Beside equality checks, vector APIs allow for comparison. The `bool`-returning overloads return `true` when the given condition is true: ```cs public static bool GreaterThanAll(Vector128 left, Vector128 right) where T : struct @@ -807,7 +807,7 @@ public static bool LessThanOrEqualAll(Vector128 left, Vector128 right) public static bool LessThanOrEqualAny(Vector128 left, Vector128 right) where T : struct ``` -Similarly to `Equals`, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on if the corresponding elements in `left` and `right` meet given condition. +Similarly to `Equals`, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in `left` and `right` meet the given condition. ```cs public static Vector128 GreaterThan(Vector128 left, Vector128 right) where T : struct @@ -862,7 +862,7 @@ public static T Sum(Vector128 vector) where T : struct ### Conversion -Vector types provide a set of methods dedicated to numbers conversion: +Vector types provide a set of methods dedicated to number conversions: ```cs public static unsafe Vector128 ConvertToDouble(Vector128 vector) @@ -1003,7 +1003,7 @@ if (Sse2.IsSupported) ### Shuffle -`Shuffle` creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes if the input vector). +`Shuffle` creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes of the input vector). ```cs public static Vector128 Shuffle(Vector128 vector, Vector128 indices) @@ -1044,7 +1044,7 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas ### Best practices -1. Implement tests that cover all code paths, including Acces Violations. +1. Implement tests that cover all code paths, including Access Violations. 2. Run tests for all hardware acceleration scenarios, use the existing environment variables to do that. 3. Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users. 4. Prefer managed references over unsafe pointers to avoid pinning and safety issues. From 16c1819fc9be0b61ff7fd33f28f16b8dd9738dc3 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Mon, 3 Apr 2023 10:05:32 +0200 Subject: [PATCH 06/14] Apply suggestions from code review Co-authored-by: Rob Hague --- docs/coding-guidelines/vectorization-guidelines.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index afcd860d7d9edb..520b465fdc071c 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -701,7 +701,7 @@ public static Vector128 Xor(Vector128 left, Vector128 right) => left public static Vector128 Negate(Vector128 vector) => ~vector; ``` -`AndNot` computes the bitwise-and of a given vector and the ones complement of another vector. +`AndNot` computes the bitwise-and of a given vector and the ones' complement of another vector. 
```cs public static Vector128 AndNot(Vector128 left, Vector128 right) => left & ~right; From ebc7da6e9ce6e7965dcc9d56cd29c81caf7ba296 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Wed, 5 Apr 2023 11:09:10 +0200 Subject: [PATCH 07/14] Apply suggestions from code review Co-authored-by: Tanner Gooding Co-authored-by: Stephen Toub --- docs/coding-guidelines/vectorization-guidelines.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 520b465fdc071c..42623ebe5c0556 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -34,9 +34,9 @@ TL;DR: Go to [Summary](#summary) # Introduction to vectorization with Vector128 and Vector256 -Vectorization is the art of converting an algorithm from operating on a single value at a time to operating on a set of values (vector). It can greatly improve performance at a cost of increased code complexity. +Vectorization is the art of converting an algorithm from operating on a single value per iteration to operating on a set of values (vector) per iteration. It can greatly improve performance at a cost of increased code complexity. -In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (x64 and/or arm64), with the option of using the most optimal instructions for hardware that is executing the code. +In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (such as x86, x64, Arm64, WASM, or other platforms), with the option of using the most optimal instructions for hardware that is executing the code. .NET 7 introduced a set of new APIs for `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code. The purpose of this document is to introduce you to the new APIs and provide a set of best practices. @@ -67,7 +67,7 @@ A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)short `Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough, you should use it instead of `Vector128`. To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. -Both APIs are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. +Both `Count` and `IsHardwareAccelerated` are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. 
That is why the code is very often structured like this: @@ -346,7 +346,7 @@ Let's look at the following code: nuint elementOffset = 0; while (elementOffset < (nuint)buffer.Length) { - loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); + loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); // BUG! elementOffset += (nuint)Vector128.Count; } From f679f763fd2a42dac323869f17bafe80ff7abff5 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Wed, 5 Apr 2023 11:28:03 +0200 Subject: [PATCH 08/14] address code review feedback from @tannergooding and @stephentoub --- .../vectorization-guidelines.md | 65 ++++++++++++++----- 1 file changed, 50 insertions(+), 15 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 42623ebe5c0556..72efccf6ca0b2a 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -38,17 +38,21 @@ Vectorization is the art of converting an algorithm from operating on a single v In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (such as x86, x64, Arm64, WASM, or other platforms), with the option of using the most optimal instructions for hardware that is executing the code. -.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code. The purpose of this document is to introduce you to the new APIs and provide a set of best practices. +.NET 7 introduced a set of new APIs for `Vector64`, `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code (`Vector512` is being introduced in .NET 8). The purpose of this document is to introduce you to the new APIs and provide a set of best practices. ## Code structure -`Vector128` represents a 128-bit vector of type `T`. `T` is constrained to specific primitive types: +`Vector128` is the "common denominator" across all platforms that support vectorization (and this is expected to always be the case). It represents a 128-bit vector of type `T`. + +`T` is constrained to specific primitive types: * `byte` and `sbyte` (8 bits). * `short` and `ushort` (16 bits). * `int`, `uint` and `float` (32 bits). * `long`, `ulong` and `double` (64 bits). -* `nint` and `unit` (32 or 64 bits, depending on the architecture) +* `nint` and `unit` (32 or 64 bits, depending on the architecture, available in .NET 7+) + +.NET 8 is introducing a `Vector128.IsSupported` that helps identify whether a given `T` will throw or not to help identify what works per runtime, including from generic contexts. A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats, or 2 (u)longs/double(s). @@ -64,11 +68,15 @@ A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)short ----------------------------------------------------------------- ``` -`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough, you should use it instead of `Vector128`. To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough and the benchmarks prove that it offers better performance, you should use it instead of `Vector128`. 
Namely, `Vector256` on x86/x64 is mostly treated as `2x Vector128` and while there are some operations that can "cross lanes", they can sometimes be more expensive or have other hidden costs. + +To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. -The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. +The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path (there are some advanced tricks that can allow you to operate on smaller inputs, but we won't describe them here). `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. Both `Count` and `IsHardwareAccelerated` are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. +**Note:** When `Vector256` is accelerated then `Vector128` and `Vector64` are also accelerated. + That is why the code is very often structured like this: ```cs @@ -89,6 +97,26 @@ void CodeStructure(ReadOnlySpan buffer) } ``` +To reduce the number of comparisons for small inputs, we can re-arrange it in the following way: + +```cs +void OptimalCodeStructure(ReadOnlySpan buffer) +{ + if (!Vector128.IsHardwareAccelerated || buffer.Length < Vector128.Count) + { + // scalar code path + } + else if (!Vector256.IsHardwareAccelerated || buffer.Length < Vector256.Count) + { + // Vector128 code path + } + else + { + // Vector256 code path + } +} +``` + **Both vector types provide almost identical features**, but arm64 hardware does not support `Vector256` yet, so for the sake of simplicity we will be using `Vector128` in all examples and assuming **little endian** architecture. Which means that all examples used in this document assume that they are being executed as part of the following `if` block: ```cs @@ -104,7 +132,7 @@ Such a code structure requires us to **test all possible code paths**: * `Vector256` is accelerated: * The input is large enough to benefit from vectorization with `Vector256`. - * The input is not large enough to benefit from vectorization with `Vector256`, but it can benefit from vectorization with `Vector128` (when `Vector256` is accelerated then `Vector128` and smaller vectors are also). + * The input is not large enough to benefit from vectorization with `Vector256`, but it can benefit from vectorization with `Vector128`. * The input is too small to benefit from any kind of vectorization. * `Vector128` is accelerated * The input is large enough to benefit from vectorization with `Vector128`. @@ -121,6 +149,8 @@ Assuming that we run the tests on an `x64` machine that supports `Vector256`, we * `DOTNET_EnableAVX2=0` * `DOTNET_EnableHWIntrinsic=0` +The alternative is running tests on enough variation of hardware to cover all the paths. + ### Benchmarking All that complexity needs to pay off. We need to **benchmark the code to verify that the investment is beneficial**. We can do that with [BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet). 
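For a first measurement, a minimal benchmark can be as simple as the sketch below. It is only a sketch: the class name and the sizes are arbitrary, and the vectorized benchmark mirrors the `Sum` example developed in the Loops section, compared against a plain scalar loop.

```cs
using System;
using System.Linq;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;

public class SumBenchmarks
{
    private int[] _data;

    [Params(64, 16 * 1024)] // arbitrary sizes, both larger than a single Vector128<int>
    public int Size;

    [GlobalSetup]
    public void Setup() => _data = Enumerable.Range(0, Size).ToArray();

    [Benchmark(Baseline = true)]
    public int Scalar()
    {
        int sum = 0;
        foreach (int value in _data)
        {
            sum += value;
        }
        return sum;
    }

    [Benchmark]
    public int Vectorized()
    {
        // Mirrors the Sum example from the Loops section: vectorized loop plus a scalar remainder.
        Span<int> buffer = _data;
        ref int first = ref MemoryMarshal.GetReference(buffer);
        Vector128<int> sums = Vector128<int>.Zero;

        nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<int>.Count);
        nuint offset = 0;
        for (; offset <= oneVectorAwayFromEnd; offset += (nuint)Vector128<int>.Count)
        {
            sums += Vector128.LoadUnsafe(ref first, offset);
        }

        int sum = Vector128.Sum(sums);
        for (; offset < (nuint)buffer.Length; offset++)
        {
            sum += buffer[(int)offset];
        }

        return sum;
    }
}
```

Running it with `BenchmarkRunner.Run<SumBenchmarks>()` (or through a custom config like the one shown earlier) reports each size against the scalar baseline.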
@@ -231,7 +261,7 @@ Explaining benchmark design guidelines is outside of the scope of this document, The alternative is to enable memory randomization. Before every iteration, the harness is going to allocate random-size objects, keep them alive and re-run the setup that should allocate the actual memory. -You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587). It requires an understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) has two chapters dedicated to statistics and can help you get a very good understanding of the subject. +You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587). It requires an understanding of what distribution is and how to read it. It's also out of scope of this document, but a book on statistics, such as [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) can help you get a very good understanding of the subject. No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, performance might even get worse. @@ -287,7 +317,7 @@ int Sum(Span buffer) **Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never dereference it. -**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! +**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! To get a `readonly` reference, you need to use [ReadOnlySpan.GetPinnableReference](https://learn.microsoft.com/dotnet/api/system.readonlyspan-1.getpinnablereference). **Note:** Please keep in mind that `Vector128.Sum` is a static method. `Vectior128` and `Vector256` provide both instance and static methods (operators like `+` are just static methods in C#). `Vector128` and `Vector256` are non-generic static classes with static methods only. It's important to know about their existence when searching for methods. @@ -308,7 +338,8 @@ bool Contains(Span buffer, int searched) ref int searchSpace = ref MemoryMarshal.GetReference(buffer); nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); - for (nuint elementOffset = 0; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128.Count) + nuint elementOffset = 0; + for (; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128.Count) { loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); // compare the loaded vector with searched value vector @@ -319,7 +350,7 @@ bool Contains(Span buffer, int searched) } // If any elements remain, process the last vector in the search space. 
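    // At this point elementOffset is the first unprocessed index; loading one full vector that ends
    // at the buffer's end may re-check a few elements, which is harmless for a read-only search.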
- if (buffer.Length % Vector128.Count != 0) + if (elementOffset != (uint)buffer.Length) { loaded = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd); if (Vector128.Equals(loaded, values) != Vector128.Zero) @@ -373,7 +404,7 @@ public static class Vector128 } ``` -The first three overloads require a pointer to the source. To be able to use a pointer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right: +The first three overloads require a pointer to the source. To be able to use a pointer to a managed buffer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right: ```cs unsafe int UnmanagedPointersSum(Span buffer) @@ -418,6 +449,8 @@ The fourth method expects only a managed reference (`ref T source`). We don't ne ```cs int ManagedReferencesSum(int[] buffer) { + Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); + ref int current = ref MemoryMarshal.GetArrayDataReference(buffer); ref int end = ref Unsafe.Add(ref current, buffer.Length); ref int oneVectorAwayFromEnd = ref Unsafe.Subtract(ref end, Vector128.Count); @@ -444,7 +477,7 @@ int ManagedReferencesSum(int[] buffer) } ``` -**Note:** `Unsafe` does not expose a method called "IsGreaterOrEqualThan", so we are using a negation of `Unsafe.IsAddressGreaterThan` to achieve desired effect. +**Note:** `Unsafe` does not expose a method called `IsLessThanOrEqualTo`, so we are using a negation of `Unsafe.IsAddressGreaterThan` to achieve desired effect. **Pointer arithmetic can always go wrong, even if you are an experienced engineer and get a very detailed code review from .NET architects**. In [#73768](https://github.com/dotnet/runtime/pull/73768) a GC hole was introduced. The code looked simple: @@ -479,7 +512,7 @@ while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); Which could return true because `currentSearchSpace` was invalid and not updated. If you are interested in more details, you can check the [issue](https://github.com/dotnet/runtime/issues/75792#issuecomment-1249973858) and the [fix](https://github.com/dotnet/runtime/pull/75857). -That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic!** +That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic. It still requires care as passing an incorrect offset results in a GC hole.** ```cs public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; @@ -520,7 +553,7 @@ public static void StoreUnsafe(this Vector128 source, ref T destination, n ### Casting -As mentioned before, `Vector128` and `Vector256` are constrained to a specific set of primitive types. `char` is not one of them, but it does not mean that we can't implement vectorized text operations with the new APIs. For primitive types of the same size (and value types that don't contain references), casting is the solution. 
+As mentioned before, `Vector128` and `Vector256` are constrained to a specific set of primitive types. Currently, `char` is not one of them, but it does not mean that we can't implement vectorized text operations with the new APIs. For primitive types of the same size (and value types that don't contain references), casting is the solution. [Unsafe.As](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.as#system-runtime-compilerservices-unsafe-as-2(-0@)) can be used to get a reference to supported type: @@ -554,7 +587,7 @@ void PointerToReference(char* pUtf16Buffer, byte* pAsciiBuffer) } ``` -We should avoid doing this in the opposite direction, as most engineers will assume that unmanaged pointers are already pinned. +It's only safe to convert a managed reference to a pointer if it's known that the reference is already pinned. If it's not, the moment after you get the pointer it could be invalid. ## Mindset @@ -653,6 +686,8 @@ Even such a simple problem can be solved in at least 5 different ways. Using sop `Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, but it's important to remember what kind of operations they allow for and check the details when needed. +**Note:** all of these methods have "software fallbacks", which are executed when they cannot be vectorized on given platform. + ### Creation Each of the vector types provides a `Create` method that accepts a single value and returns a vector with all elements initialized to this value. From 06105b791e1308aad7b69bc37af87a2d23500776 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 6 Apr 2023 17:02:04 +0200 Subject: [PATCH 09/14] Apply suggestions from code review Co-authored-by: Jeff Handley Co-authored-by: Stephen Toub --- docs/coding-guidelines/vectorization-guidelines.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 72efccf6ca0b2a..a369deeac055ae 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -38,7 +38,7 @@ Vectorization is the art of converting an algorithm from operating on a single v In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (such as x86, x64, Arm64, WASM, or other platforms), with the option of using the most optimal instructions for hardware that is executing the code. -.NET 7 introduced a set of new APIs for `Vector64`, `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code (`Vector512` is being introduced in .NET 8). The purpose of this document is to introduce you to the new APIs and provide a set of best practices. +.NET 7 introduced a set of new APIs for `Vector64`, `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code. Similarly, .NET 8 introduced `Vector512`. The purpose of this document is to introduce you to the new APIs and provide a set of best practices. ## Code structure @@ -50,9 +50,9 @@ In recent releases, .NET has introduced many new APIs for vectorization. 
The vas * `short` and `ushort` (16 bits). * `int`, `uint` and `float` (32 bits). * `long`, `ulong` and `double` (64 bits). -* `nint` and `unit` (32 or 64 bits, depending on the architecture, available in .NET 7+) +* `nint` and `nuint` (32 or 64 bits, depending on the architecture, available in .NET 7+) -.NET 8 is introducing a `Vector128.IsSupported` that helps identify whether a given `T` will throw or not to help identify what works per runtime, including from generic contexts. +.NET 8 introduced a `Vector128.IsSupported` that indicates whether a given `T` will throw to help identify what works per runtime, including from generic contexts. A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats, or 2 (u)longs/double(s). @@ -601,7 +601,7 @@ Before we start working on the implementation, let's list all edge cases for our * It does not need to throw any argument exceptions, as `ReadOnlySpan` is `struct` and it can never be `null` or invalid. * It should return `true` for an empty buffer. -* It should detect invalid characters in the entire buffer, including the remainder. +* It should detect invalid characters in the entire buffer, regardless of the buffer's length or whether its length is an even multiple of a vector width. * It should not read any bytes that don't belong to the provided buffer. ### Scalar solution @@ -1012,7 +1012,7 @@ public static unsafe Vector128 Narrow(Vector128 lower, Vector128 Narrow(Vector128 lower, Vector128 upper) ``` -In contrary to [Sse2.PackUnsignedSaturate](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.x86.sse2.packunsignedsaturate) and [AdvSimd.Arm64.UnzipEven](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.arm.advsimd.arm64.unzipeven), `Narrow` applies a mask via AND to cut anything above the max value of returned vector: +In contrast to [Sse2.PackUnsignedSaturate](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.x86.sse2.packunsignedsaturate) and [AdvSimd.Arm64.UnzipEven](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.arm.advsimd.arm64.unzipeven), `Narrow` applies a mask via AND to cut anything above the max value of returned vector: ```cs From 647b02f8f5527b51b73d515370218d5606b2fc26 Mon Sep 17 00:00:00 2001 From: Jeff Handley Date: Thu, 6 Apr 2023 18:12:41 -0700 Subject: [PATCH 10/14] Address some of the review feedback --- .../vectorization-guidelines.md | 193 ++++++++++++------ 1 file changed, 125 insertions(+), 68 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index a369deeac055ae..94463579b42d66 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -42,7 +42,7 @@ In recent releases, .NET has introduced many new APIs for vectorization. The vas ## Code structure -`Vector128` is the "common denominator" across all platforms that support vectorization (and this is expected to always be the case). It represents a 128-bit vector of type `T`. +`Vector128` is the "common denominator" across all platforms that support vectorization (and this is expected to always be the case). It represents a 128-bit vector containing elements of type `T`. 
`T` is constrained to specific primitive types: @@ -68,18 +68,73 @@ A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)short ----------------------------------------------------------------- ``` -`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough and the benchmarks prove that it offers better performance, you should use it instead of `Vector128`. Namely, `Vector256` on x86/x64 is mostly treated as `2x Vector128` and while there are some operations that can "cross lanes", they can sometimes be more expensive or have other hidden costs. +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, the data is large enough, and the benchmarks prove that it offers better performance, you should consider using it instead of `Vector128`. Benchmarking your code can be important as not all platforms treat larger vectors the same. -To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. +For example, `Vector256` on x86/x64 is mostly treated as `2x Vector128` rather than `1x Vector256`, where each `Vector128` is considered a "lane". For most operations, this doesn't present any additional considerations they only operate on individual elements of the vector. However, some operations could "cross lanes" such as shuffling or pairwise operations and that may require additional overhead to handle. -The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path (there are some advanced tricks that can allow you to operate on smaller inputs, but we won't describe them here). `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. -Both `Count` and `IsHardwareAccelerated` are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. +As an example, consider `Add(Vector128 lhs, Vector128 rhs)` where you end up effectively doing (pseudo-code): +```csharp +result[0] = lhs[0] + rhs[0]; +result[1] = lhs[1] + rhs[1]; +result[2] = lhs[2] + rhs[2]; +result[3] = lhs[3] + rhs[3]; +``` + +With this algorithm it doesn't matter what size vector we have as we're accessing the same index of the input vectors and only one at a time. So regardless of whether we have `Vector128` or `Vector256` or `Vector512`, it all operates the same. + +However, if you then consider `AddPairwise(Vector128 lhs, Vector128 rhs)` (sometimes called `HorizontalAdd`) where you instead end up effectively doing: +```csharp +// process left +result[0] = lhs[0] + lhs[1]; +result[1] = lhs[2] + lhs[3]; +// process right +result[2] = rhs[0] + rhs[1]; +result[3] = rhs[2] + rhs[3]; +``` -**Note:** When `Vector256` is accelerated then `Vector128` and `Vector64` are also accelerated. 
+You may notice that this algorithm would change behavior if expanded up to operate on a single 256-bit vector (note `result[2]` is now `lhs[4] + lhs[6]` and not `rhs[0] + rhs[1]`): +```csharp +// process left +result[0] = lhs[0] + lhs[1]; +result[1] = lhs[2] + lhs[3]; +result[2] = lhs[4] + lhs[5]; +result[3] = lhs[6] + lhs[7]; +// process right +result[4] = rhs[0] + rhs[1]; +result[5] = rhs[2] + rhs[3]; +result[6] = rhs[4] + rhs[5]; +result[7] = rhs[6] + rhs[7]; +``` + +Because this behavior would change, the x86/x64 platform opted to treat the operation as `2x Vector128` inputs giving you instead: +```csharp +// process lower left +result[0] = lhs[0] + lhs[1]; +result[1] = lhs[2] + lhs[3]; +// process lower right +result[2] = rhs[0] + rhs[1]; +result[3] = rhs[2] + rhs[3]; +// process upper left +result[4] = lhs[4] + lhs[5]; +result[5] = lhs[6] + lhs[7]; +// process upper right +result[6] = rhs[4] + rhs[5]; +result[7] = rhs[6] + rhs[7]; +``` -That is why the code is very often structured like this: +This ends up preserving behavior and making it much easier to transition from `128-bit` to `256-bit` or higher as you're effectively just unrolling the loop again. It does, however, mean that some algorithms may need additional handling if you need to truly do anything involving the upper and lower lanes together. The exact additional expense here depends on what is being done, what the underlying hardware supports, and several other factors covered in more detail later. -```cs +### Checking for Hardware Acceleration + +To check if a given vector size is hardware accelerated, use the `IsHardwareAccelerated` property on the relevant non-generic vector class. For example, `Vector128.IsHardwareAccelerated` or `Vector256.IsHardwareAccelerated`. Note that even when a vector size is accelerated, there may still be some operations that are not hardware-accelerated; e.g. floating-point division can be accelerated on some hardware while integer division is not. + +The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path (there are some advanced tricks that can allow you to operate on smaller inputs, but we won't describe them here). The `Count` properties (for example `Vector128.Count` or `Vector256.Count`) return the number of elements of the given type T in a single vector. + +When `Vector256` is accelerated, `Vector128` generally will be as well, but there's no guarantee of that. The best practice is to always check `IsHardwareAccelerated` explicitly. You may be tempted to cache the values from the `IsHardwareAccelerated` and `Count` properties, but this is not needed or recommended. Both `IsHardwareAccelerated` and `Count` are turned into constants by the Just-In-Time compiler and no method call is required to retrieve the information. 
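+
+For example, the element counts can be inspected directly; the values below simply follow from dividing the vector size by the element size:
+
+```csharp
+Console.WriteLine(Vector128<byte>.Count);  // 16
+Console.WriteLine(Vector128<short>.Count); // 8
+Console.WriteLine(Vector128<int>.Count);   // 4
+Console.WriteLine(Vector128<long>.Count);  // 2
+Console.WriteLine(Vector256<int>.Count);   // 8
+```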
+ +### Example Code Structure + +```csharp void CodeStructure(ReadOnlySpan buffer) { if (Vector256.IsHardwareAccelerated && buffer.Length >= Vector256.Count) @@ -99,7 +154,7 @@ void CodeStructure(ReadOnlySpan buffer) To reduce the number of comparisons for small inputs, we can re-arrange it in the following way: -```cs +```csharp void OptimalCodeStructure(ReadOnlySpan buffer) { if (!Vector128.IsHardwareAccelerated || buffer.Length < Vector128.Count) @@ -117,9 +172,11 @@ void OptimalCodeStructure(ReadOnlySpan buffer) } ``` -**Both vector types provide almost identical features**, but arm64 hardware does not support `Vector256` yet, so for the sake of simplicity we will be using `Vector128` in all examples and assuming **little endian** architecture. Which means that all examples used in this document assume that they are being executed as part of the following `if` block: +**Both vector types provide the same functionality**, but arm64 hardware does not support `Vector256`, so for the sake of simplicity we will be using `Vector128` in all examples. All examples shown also assume **little endian** architecture and/or do not need to deal with endianness. `BitConverter.IsLittleEndian` is available (and turned into a constant by the JIT) for algorithms that need to consider endianness. + +With these assumptions, all examples shown in the document assume that they are being executed as part of the following `if` block: -```cs +```csharp else if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count) { // Vector128 code path @@ -159,7 +216,7 @@ All that complexity needs to pay off. We need to **benchmark the code to verify It's possible to define a config that instructs the harness to run the benchmarks for all four scenarios: -```cs +```csharp static void Main(string[] args) { Job enough = Job.Default @@ -200,13 +257,13 @@ We have three possibilities: * We can enforce the alignment ourselves and have very stable results. * We can ask the harness to try to randomize the memory and observe the entire possible distribution with each run. -* We can do nothing and wonder why the results vary from time to time. +* We can do nothing and wonder why the results have additional noise across many runs. ##### Enforcing memory alignment We can allocate aligned unmanaged memory by using the [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). -```cs +```csharp public unsafe class Benchmarks { private void* _pointer; @@ -279,7 +336,7 @@ Example: our input is a buffer of ten integers, assuming that `Vector128` is acc Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128 bits / 32 bits) integers in one iteration. In the second loop, we handle the remaining values. -```cs +```csharp int Sum(Span buffer) { Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); @@ -323,11 +380,11 @@ int Sum(Span buffer) ### Vectorized remainder handling -Now imagine that we need to check whether the given buffer contains a specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. +There are scenarios and advanced techniques that can allow for vectorized remainder handling instead of resorting to the non-vectorized approach illustrated above. 
Some algorithms could use an approach of backtracking to load one more vector's worth of elements and masking off elements that have already been processed. For idempotent algorithms, it is preferable to simply backtrack and process one last vector, repeating the operation for elements as needed. -Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration we need to handle the remaining two elements. Since the remainder is smaller than one `Vector128` and we are not mutating the input, we perform a vectorized operation on a `Vector128` containing the last four elements. +In the example below, we need to check whether the given buffer contains a specific number; processing values more than once is completely acceptable. The buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration we need to handle the remaining two elements. Since the remainder is smaller than one `Vector128` and we are not mutating the input, we perform a vectorized operation on a `Vector128` containing the last four elements. -```cs +```csharp bool Contains(Span buffer, int searched) { Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); @@ -365,7 +422,7 @@ bool Contains(Span buffer, int searched) `Vector128.Create(value)` creates a new vector with all elements initialized to the specified value. So `Vector128.Zero` is equivalent to `Vector128.Create(0)`. -`Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector whose elements are all-bits-set or zero, depending on if the provided elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. +`Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector where each element is either all-bits-set or zero, depending on if the corresponding elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. ### Access violation (AV) testing @@ -393,7 +450,7 @@ Writing tests that detect that issue is hard, but not impossible. The .NET Team Both `Vector128` and `Vector256` provide at least five ways of loading them from memory: -```cs +```csharp public static class Vector128 { public static Vector128 Load(T* source) where T : unmanaged; @@ -406,7 +463,7 @@ public static class Vector128 The first three overloads require a pointer to the source. To be able to use a pointer to a managed buffer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right: -```cs +```csharp unsafe int UnmanagedPointersSum(Span buffer) { fixed (int* pBuffer = buffer) @@ -438,7 +495,7 @@ unsafe int UnmanagedPointersSum(Span buffer) } ``` -`LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. +`LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. 
Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. "NonTemporal" means that the hardware is allowed (but not required) to bypass the cache. Non-temporal reads provide a speedup when working with very large amounts of data as it avoids repeatedly filling the cache with values that will never be used again. Currently .NET exposes only one API fo allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. @@ -446,7 +503,7 @@ The alternative to creating aligned buffers (we don't always have the control ov The fourth method expects only a managed reference (`ref T source`). We don't need to pin the buffer (GC is tracking managed references and updates them if memory gets moved), but it still requires us to properly handle managed pointer arithmetic: -```cs +```csharp int ManagedReferencesSum(int[] buffer) { Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); @@ -481,7 +538,7 @@ int ManagedReferencesSum(int[] buffer) **Pointer arithmetic can always go wrong, even if you are an experienced engineer and get a very detailed code review from .NET architects**. In [#73768](https://github.com/dotnet/runtime/pull/73768) a GC hole was introduced. The code looked simple: -```cs +```csharp ref TValue currentSearchSpace = ref Unsafe.Add(ref searchSpace, length - Vector128.Count); do @@ -500,13 +557,13 @@ while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); It was part of `LastIndexOf` implementation, where we were iterating from the end to the beginning of the buffer. In the last iteration of the loop, `currentSearchSpace` could become a pointer to unknown memory that lied before the beginning of the buffer: -```cs +```csharp currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128.Count); ``` And it was fine until GC kicked right after that, moved objects in memory, updated all valid managed references and resumed the execution, which run following condition: -```cs +```csharp while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); ``` @@ -514,13 +571,13 @@ Which could return true because `currentSearchSpace` was invalid and not updated That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic. It still requires care as passing an incorrect offset results in a GC hole.** -```cs +```csharp public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; ``` **The only thing we need to keep in mind is potential `nuint` overflow when doing unsigned integer arithmetic.** -```cs +```csharp Span buffer = new int[2] { 1, 2 }; nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); Console.WriteLine(oneVectorAwayFromEnd); @@ -532,7 +589,7 @@ Can you guess the result? 
For a 64 bit process it's `FFFFFFFFFFFFFFFE` (a hex re Similarly to loading, both `Vector128` and `Vector256` provide at least five ways of storing them in memory: -```cs +```csharp public static class Vector128 { public static void Store(this Vector128 source, T* destination) where T : unmanaged; @@ -545,7 +602,7 @@ public static class Vector128 For the reasons described for loading, we recommend using the overload that takes managed reference and element offset: -```cs +```csharp public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; ``` @@ -557,7 +614,7 @@ As mentioned before, `Vector128` and `Vector256` are constrained to a spec [Unsafe.As](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.as#system-runtime-compilerservices-unsafe-as-2(-0@)) can be used to get a reference to supported type: -```cs +```csharp void CastingReferences(Span buffer) { ref char charSearchSpace = ref MemoryMarshal.GetReference(buffer); @@ -568,7 +625,7 @@ void CastingReferences(Span buffer) Or [MemoryMarshal.Cast](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.memorymarshal.cast#system-runtime-interopservices-memorymarshal-cast-2(system-readonlyspan((-0)))), which casts a span of one primitive type to a span of another primitive type: -```cs +```csharp void CastingSpans(Span chars) { Span shorts = MemoryMarshal.Cast(chars); @@ -577,7 +634,7 @@ void CastingSpans(Span chars) It's also possible to get managed references from unmanaged pointers: -```cs +```csharp void PointerToReference(char* pUtf16Buffer, byte* pAsciiBuffer) { // of the same type: @@ -621,7 +678,7 @@ most significant bit When we look at it, we can realize that another way is checking whether the most significant bit is equal `1`. For the scalar version, we could perform a logical AND: -```cs +```csharp bool IsValidAscii(byte c) => (c & 0b1000_0000) == 0; ``` @@ -631,7 +688,7 @@ Another step is vectorizing our scalar solution and choosing the best way of doi If we reuse one of the loops presented in the previous sections, all we need to implement is a method that accepts `Vector128` and returns `bool` and does exactly the same thing that our scalar method did, but for a vector rather than single value: -```cs +```csharp [MethodImpl(MethodImplOptions.AggressiveInlining)] bool VectorContainsNonAsciiChar(Vector128 asciiVector) { @@ -648,7 +705,7 @@ bool VectorContainsNonAsciiChar(Vector128 asciiVector) We can also use the hardware-specific instructions if they are available: -```cs +```csharp if (Sse41.IsSupported) { return !Sse41.TestZ(asciiVector, Vector128.Create((byte)0b_1000_0000)); @@ -692,13 +749,13 @@ Even such a simple problem can be solved in at least 5 different ways. Using sop Each of the vector types provides a `Create` method that accepts a single value and returns a vector with all elements initialized to this value. -```cs +```csharp public static Vector128 Create(T value) where T : struct; ``` `CreateScalar` initializes first element to the specified value, and the remaining elements to zero. -```cs +```csharp public static Vector128 CreateScalar(int value); ``` @@ -707,19 +764,19 @@ public static Vector128 CreateScalar(int value); We also have an overload that allows for specifying every value in given vector: -```cs +```csharp public static Vector128 Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7) ``` And last but not least we have a `Create` overload which accepts a buffer. 
It creates a vector with its elements set to the first `VectorXYZ.Count` elements of the buffer. It's not recommended to use it in a loop, where `Load` methods should be used instead (for performance). -```cs +```csharp public static Vector128 Create(ReadOnlySpan values) where T : struct ``` to perform a copy in the other direction, we can use one of the `CopyTo` extension methods: -```cs +```csharp public static void CopyTo(this Vector128 vector, Span destination) where T : struct ``` @@ -729,7 +786,7 @@ All size-specific vector types provide a set of APIs for common bit operations. `BitwiseAnd` computes the bitwise-and of two vectors, `BitwiseOr` computes the bitwise-or of two vectors. They can both be expressed by using the corresponding operators (`&` and `|`). The same goes for `Xor` which can be expressed with `^` operator and `Negate` (`~`). -```cs +```csharp public static Vector128 BitwiseAnd(Vector128 left, Vector128 right) where T : struct => left & right; public static Vector128 BitwiseOr(Vector128 left, Vector128 right) where T : struct => left | right; public static Vector128 Xor(Vector128 left, Vector128 right) => left ^ right; @@ -738,14 +795,14 @@ public static Vector128 Negate(Vector128 vector) => ~vector; `AndNot` computes the bitwise-and of a given vector and the ones' complement of another vector. -```cs +```csharp public static Vector128 AndNot(Vector128 left, Vector128 right) => left & ~right; ``` `ShiftLeft` shifts each element of a vector left by the specified number of bits. `ShiftRightArithmetic` performs a **signed** shift right and `ShiftRightLogical` performs an **unsigned** shift: -```cs +```csharp public static Vector128 ShiftLeft(Vector128 vector, int shiftCount); public static Vector128 ShiftRightArithmetic(Vector128 vector, int shiftCount); public static Vector128 ShiftRightLogical(Vector128 vector, int shiftCount); @@ -755,20 +812,20 @@ public static Vector128 ShiftRightLogical(Vector128 vector, int shif `EqualsAll` compares two vectors to determine if all elements are equal. `EqualsAny` compares two vectors to determine if any elements are equal. -```cs +```csharp public static bool EqualsAll(Vector128 left, Vector128 right) where T : struct => left == right; public static bool EqualsAny(Vector128 left, Vector128 right) where T : struct ``` `Equals` compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in the `left` and `right` arguments were equal. -```cs +```csharp public static Vector128 Equals(Vector128 left, Vector128 right) where T : struct ``` How do we calculate the index of the first match? Let's take a closer look at the result of following equality check: -```cs +```csharp Vector128 left = Vector128.Create(1, 2, 3, 4); Vector128 right = Vector128.Create(0, 0, 3, 0); Vector128 equals = Vector128.Equals(left, right); @@ -781,13 +838,13 @@ Console.WriteLine(equals); `-1` is just `0xFFFFFFFF` (all-bits-set). We could use `GetElement` to get the first non-zero element. -```cs +```csharp public static T GetElement(this Vector128 vector, int index) where T : struct ``` But it would not be an optimal solution. 
We should instead extract the most significant bits: -```cs +```csharp uint mostSignificantBits = equals.ExtractMostSignificantBits(); Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0')); ``` @@ -802,7 +859,7 @@ To calculate the last index, we should use [BitOperations.LeadingZeroCount](http If we were working with a buffer loaded from memory (example: searching for the last index of a given character in the buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. -```cs +```csharp int ComputeLastIndex(nint elementOffset, Vector128 equals) where T : struct { uint mostSignificantBits = equals.ExtractMostSignificantBits(); @@ -815,7 +872,7 @@ int ComputeLastIndex(nint elementOffset, Vector128 equals) where T : struc If we were using the `Load` overload that takes only the managed reference, we could use [Unsafe.ByteOffset(ref T, ref T)](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.byteoffset) to calculate the element offset. -```cs +```csharp unsafe int ComputeFirstIndex(ref T searchSpace, ref T current, Vector128 equals) where T : struct { int elementOffset = (int)Unsafe.ByteOffset(ref searchSpace, ref current) / sizeof(T); @@ -831,7 +888,7 @@ unsafe int ComputeFirstIndex(ref T searchSpace, ref T current, Vector128 e Beside equality checks, vector APIs allow for comparison. The `bool`-returning overloads return `true` when the given condition is true: -```cs +```csharp public static bool GreaterThanAll(Vector128 left, Vector128 right) where T : struct public static bool GreaterThanAny(Vector128 left, Vector128 right) where T : struct public static bool GreaterThanOrEqualAll(Vector128 left, Vector128 right) where T : struct @@ -844,7 +901,7 @@ public static bool LessThanOrEqualAny(Vector128 left, Vector128 right) Similarly to `Equals`, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in `left` and `right` meet the given condition. -```cs +```csharp public static Vector128 GreaterThan(Vector128 left, Vector128 right) where T : struct public static Vector128 GreaterThanOrEqual(Vector128 left, Vector128 right) where T : struct public static Vector128 LessThan(Vector128 left, Vector128 right) where T : struct @@ -853,13 +910,13 @@ public static Vector128 LessThanOrEqual(Vector128 left, Vector128 ri `ConditionalSelect` Conditionally selects a value from two vectors on a bitwise basis. 
-```cs +```csharp public static Vector128 ConditionalSelect(Vector128 condition, Vector128 left, Vector128 right) ``` This method deserves a self-describing example: -```cs +```csharp Vector128 left = Vector128.Create(1.0f, 2, 3, 4); Vector128 right = Vector128.Create(4.0f, 3, 2, 1); @@ -872,7 +929,7 @@ Assert.Equal(Vector128.Create(4.0f, 3, 3, 4), result); Very simple math operations can be also expressed by using the operators: -```cs +```csharp public static Vector128 Add(Vector128 left, Vector128 right) where T : struct => left + right; public static Vector128 Divide(Vector128 left, Vector128 right) => left / right; public static Vector128 Divide(Vector128 left, T right) => left / right; @@ -885,7 +942,7 @@ public static Vector128 Subtract(Vector128 left, Vector128 right) => `Abs`, `Ceiling`, `Floor`, `Max`, `Min`, `Sqrt` and `Sum` are also provided: -```cs +```csharp public static Vector128 Abs(Vector128 vector) where T : struct public static Vector128 Ceiling(Vector128 vector) public static Vector128 Floor(Vector128 vector) @@ -899,7 +956,7 @@ public static T Sum(Vector128 vector) where T : struct Vector types provide a set of methods dedicated to number conversions: -```cs +```csharp public static unsafe Vector128 ConvertToDouble(Vector128 vector) public static unsafe Vector128 ConvertToDouble(Vector128 vector) public static unsafe Vector128 ConvertToInt32(Vector128 vector) @@ -912,7 +969,7 @@ public static unsafe Vector128 ConvertToUInt64(Vector128 vector) And for reinterpretation (no values are being changed, they can be just used as if they were of a different type): -```cs +```csharp public static Vector128 As(this Vector128 vector) public static Vector128 AsByte(this Vector128 vector) public static Vector128 AsDouble(this Vector128 vector) @@ -946,21 +1003,21 @@ The first half of every vector is called "lower", the second is "upper". In case of `Vector128`, `GetLower` gets the value of the lower 64-bits as a new `Vector64` and `GetUpper` gets the upper 64-bits. -```cs +```csharp public static Vector64 GetLower(this Vector128 vector) public static Vector64 GetUpper(this Vector128 vector) ``` Each vector type provides a `Create` method that allows for the creation from lower and upper: -```cs +```csharp public static unsafe Vector128 Create(Vector64 lower, Vector64 upper) public static Vector256 Create(Vector128 lower, Vector128 upper) ``` `Lower` and `Upper` are also used by `Widen`. This method widens a `Vector128` into two `Vector128` where `sizeof(T2) == 2 * sizeof(T1)`. -```cs +```csharp public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) @@ -972,14 +1029,14 @@ public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vect It's also possible to widen only the lower or upper part: -```cs +```csharp public static Vector128 WidenLower(Vector128 source) public static Vector128 WidenUpper(Vector128 source) ``` An example of widening is converting a buffer of ASCII bytes into characters: -```cs +```csharp byte[] byteBuffer = Enumerable.Range('A', 128 / 8).Select(i => (byte)i).ToArray(); Vector128 byteVector = Vector128.Create(byteBuffer); Console.WriteLine(byteVector); @@ -1002,7 +1059,7 @@ ABCDEFGHIJKLMNOP `Narrow` is the opposite of `Widen`. 
-```cs +```csharp public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) @@ -1015,7 +1072,7 @@ public static unsafe Vector128 Narrow(Vector128 lower, Vector128
    ushortVector = Vector256.Create((ushort)300); Console.WriteLine(ushortVector); unchecked { Console.WriteLine((byte)300); } @@ -1040,7 +1097,7 @@ if (Sse2.IsSupported) `Shuffle` creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes of the input vector). -```cs +```csharp public static Vector128 Shuffle(Vector128 vector, Vector128 indices) public static Vector128 Shuffle(Vector128 vector, Vector128 indices) public static Vector128 Shuffle(Vector128 vector, Vector128 indices) @@ -1051,7 +1108,7 @@ public static Vector128 Shuffle(Vector128 vector, Vector128 intVector = Vector128.Create(100, 200, 300, 400); Console.WriteLine(intVector); Console.WriteLine(Vector128.Shuffle(intVector, Vector128.Create(3, 2, 1, 0))); From a4278c40a261b42be98ff20e8a6cebc389b140cd Mon Sep 17 00:00:00 2001 From: Jeff Handley Date: Thu, 6 Apr 2023 18:23:34 -0700 Subject: [PATCH 11/14] Fix spelling/hyphenization in a couple places --- docs/coding-guidelines/vectorization-guidelines.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 94463579b42d66..7700c792750a0f 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -372,7 +372,7 @@ int Sum(Span buffer) } ``` -**Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never dereference it. +**Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never de-reference it. **Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! To get a `readonly` reference, you need to use [ReadOnlySpan.GetPinnableReference](https://learn.microsoft.com/dotnet/api/system.readonlyspan-1.getpinnablereference). @@ -497,7 +497,7 @@ unsafe int UnmanagedPointersSum(Span buffer) `LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. "NonTemporal" means that the hardware is allowed (but not required) to bypass the cache. Non-temporal reads provide a speedup when working with very large amounts of data as it avoids repeatedly filling the cache with values that will never be used again. -Currently .NET exposes only one API fo allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. 
+Currently .NET exposes only one API for allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code is hardly ever worth it and needs to be proved with proper benchmarking on various hardware. @@ -739,7 +739,7 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co Even such a simple problem can be solved in at least 5 different ways. Using sophisticated hardware-specific instructions does not always provide the best performance, so **with the new `Vector128` and `Vector256` APIs we don't need to become assembly language experts to write fast, vectorized code**. -## Toolchain +## Tool-Chain `Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, but it's important to remember what kind of operations they allow for and check the details when needed. @@ -1131,7 +1131,7 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas - If you are already an expert and you have vectorized your code for both `x64/x86` and `arm64/arm` code you can use the new APIs to simplify your code, but you most likely won't observe any performance gains. [#64451](https://github.com/dotnet/runtime/issues/64451) lists the places where it was/can be done in dotnet/runtime. You can use links to the merged PRs to see real-life examples. - If you have already vectorized your code, but only for `x64/x86` or `arm64/arm`, you can use the new APIs to have a single, cross-platform implementation. -- If you have already vectorized your code with `Vector` you can use the new APIs to check if they can produce better codegen. +- If you have already vectorized your code with `Vector` you can use the new APIs to check if they can produce better code-gen. - If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new `Vector128` and `Vector256` APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths. 
### Best practices From 3c46ba492c6d16e7a039078dd8ca6dc04c159262 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Fri, 19 May 2023 18:45:39 +0200 Subject: [PATCH 12/14] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Günther Foidl --- docs/coding-guidelines/vectorization-guidelines.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 7700c792750a0f..bd8499bc36f0e9 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -499,7 +499,7 @@ unsafe int UnmanagedPointersSum(Span buffer) Currently .NET exposes only one API for allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. -The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code is hardly ever worth it and needs to be proved with proper benchmarking on various hardware. +The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code may not always be worth it and needs to be proved with proper benchmarking on various hardware. The fourth method expects only a managed reference (`ref T source`). We don't need to pin the buffer (GC is tracking managed references and updates them if memory gets moved), but it still requires us to properly handle managed pointer arithmetic: @@ -912,6 +912,7 @@ public static Vector128 LessThanOrEqual(Vector128 left, Vector128 ri ```csharp public static Vector128 ConditionalSelect(Vector128 condition, Vector128 left, Vector128 right) + => (left & condition) | (right & ~condition); ``` This method deserves a self-describing example: @@ -1144,5 +1145,5 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas 6. Prefer `LoadUnsafe(ref T, nuint elementOffset)` and `StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset)` over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow! 7. Always handle the vectorized loop remainder. 8. When storing values in memory, be aware of a potential buffer overlap. -9. When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs. Over time, you may gain enough experience to skip the scalar step. +9. When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs. 10. Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. 
It's also possible to perform equality checks, various bit and math operations. Don't try to memorize all the details, treat these APIs as a cookbook that you come back to when needed. From 556e64b9049354969e0576666f6b40843ad92d94 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 25 May 2023 17:20:53 +0200 Subject: [PATCH 13/14] addressing the code review comments that don't require the reordering of introduced concepts --- .../vectorization-guidelines.md | 78 ++++++++++--------- 1 file changed, 43 insertions(+), 35 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index bd8499bc36f0e9..bec4035c7cde66 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -214,7 +214,7 @@ All that complexity needs to pay off. We need to **benchmark the code to verify #### Custom config -It's possible to define a config that instructs the harness to run the benchmarks for all four scenarios: +It's possible to define a config that instructs the harness to run the benchmarks for all three scenarios: ```csharp static void Main(string[] args) @@ -310,7 +310,7 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co | Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | ``` -The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large and its entire contents are searched through, as the value is never found). +The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large, aligned and its entire contents are searched through, as the value is never found). Explaining benchmark design guidelines is outside of the scope of this document, but we have a [dedicated document](https://github.com/dotnet/performance/blob/main/docs/microbenchmark-design-guidelines.md#benchmarks-are-not-unit-tests) about it. To make a long story short, **you should benchmark all scenarios that are realistic for your production environment**, so your customers can actually benefit from your improvements. @@ -374,7 +374,11 @@ int Sum(Span buffer) **Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never de-reference it. -**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! To get a `readonly` reference, you need to use [ReadOnlySpan.GetPinnableReference](https://learn.microsoft.com/dotnet/api/system.readonlyspan-1.getpinnablereference). +**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! 
To get a `readonly` reference, you can use [ReadOnlySpan.GetPinnableReference](https://learn.microsoft.com/dotnet/api/system.readonlyspan-1.getpinnablereference) or just do the following: + +```csharp +ref readonly T searchSpace = ref MemoryMarshal.GetReference(buffer); +``` **Note:** Please keep in mind that `Vector128.Sum` is a static method. `Vectior128` and `Vector256` provide both instance and static methods (operators like `+` are just static methods in C#). `Vector128` and `Vector256` are non-generic static classes with static methods only. It's important to know about their existence when searching for methods. @@ -453,11 +457,11 @@ Both `Vector128` and `Vector256` provide at least five ways of loading them from ```csharp public static class Vector128 { - public static Vector128 Load(T* source) where T : unmanaged; - public static Vector128 LoadAligned(T* source) where T : unmanaged; - public static Vector128 LoadAlignedNonTemporal(T* source) where T : unmanaged; - public static Vector128 LoadUnsafe(ref T source) where T : struct; - public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; + public static Vector128 Load(T* source) where T : unmanaged + public static Vector128 LoadAligned(T* source) where T : unmanaged + public static Vector128 LoadAlignedNonTemporal(T* source) where T : unmanaged + public static Vector128 LoadUnsafe(ref T source) where T : struct + public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct } ``` @@ -572,7 +576,7 @@ Which could return true because `currentSearchSpace` was invalid and not updated That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic. 
It still requires care as passing an incorrect offset results in a GC hole.** ```csharp -public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; +public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct ``` **The only thing we need to keep in mind is potential `nuint` overflow when doing unsigned integer arithmetic.** @@ -592,18 +596,18 @@ Similarly to loading, both `Vector128` and `Vector256` provide at least five way ```csharp public static class Vector128 { - public static void Store(this Vector128 source, T* destination) where T : unmanaged; - public static void StoreAligned(this Vector128 source, T* destination) where T : unmanaged; - public static void StoreAlignedNonTemporal(this Vector128 source, T* destination) where T : unmanaged; - public static void StoreUnsafe(this Vector128 source, ref T destination) where T : struct; - public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; + public static void Store(this Vector128 source, T* destination) where T : unmanaged + public static void StoreAligned(this Vector128 source, T* destination) where T : unmanaged + public static void StoreAlignedNonTemporal(this Vector128 source, T* destination) where T : unmanaged + public static void StoreUnsafe(this Vector128 source, ref T destination) where T : struct + public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct } ``` For the reasons described for loading, we recommend using the overload that takes managed reference and element offset: ```csharp -public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; +public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct ``` **Note**: when loading values from one buffer and storing them into another, we need to consider whether they overlap or not. [MemoryExtensions.Overlap](https://learn.microsoft.com/dotnet/api/system.memoryextensions.overlaps#system-memoryextensions-overlaps-1(system-readonlyspan((-0))-system-readonlyspan((-0)))) is an API for doing that. 
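+
+As a rough sketch of how the two recommended overloads combine in practice (the `ClampNegativesToZero` helper below is hypothetical and only illustrative; the remainder is handled by re-processing the last vector, which is safe here because the operation is idempotent):
+
+```csharp
+void ClampNegativesToZero(Span<int> buffer)
+{
+    Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<int>.Count);
+
+    ref int searchSpace = ref MemoryMarshal.GetReference(buffer);
+    nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<int>.Count);
+
+    nuint elementOffset = 0;
+    for (; elementOffset < oneVectorAwayFromEnd; elementOffset += (nuint)Vector128<int>.Count)
+    {
+        Vector128<int> loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset);
+        Vector128.Max(loaded, Vector128<int>.Zero).StoreUnsafe(ref searchSpace, elementOffset);
+    }
+
+    // Re-process the last vector's worth of elements; it may overlap with elements that were
+    // already clamped, which is fine because clamping them twice gives the same result.
+    Vector128<int> last = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd);
+    Vector128.Max(last, Vector128<int>.Zero).StoreUnsafe(ref searchSpace, oneVectorAwayFromEnd);
+}
+```
+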
@@ -690,16 +694,16 @@ If we reuse one of the loops presented in the previous sections, all we need to ```csharp [MethodImpl(MethodImplOptions.AggressiveInlining)] -bool VectorContainsNonAsciiChar(Vector128 asciiVector) +bool IsValidAscii(Vector128 vector) { // to perform "> 127" check we can use GreaterThanAny method: - return Vector128.GreaterThanAny(asciiVector, Vector128.Create((byte)127)) + return !Vector128.GreaterThanAny(vector, Vector128.Create((byte)127)) // to perform "< 0" check, we need to use AsSByte and LessThanAny methods: - return Vector128.LessThanAny(asciiVector.AsSByte(), Vector128.Zero) + return !Vector128.LessThanAny(vector.AsSByte(), Vector128.Zero) // to perform an AND operation, we need to use & operator - return (asciiVector & Vector128.Create((byte)0b_1000_0000)) != Vector128.Zero; + return (vector & Vector128.Create((byte)0b_1000_0000)) == Vector128.Zero; // we can also just use ExtractMostSignificantBits method: - return asciiVector.ExtractMostSignificantBits() != 0; + return vector.ExtractMostSignificantBits() == 0; } ``` @@ -708,12 +712,12 @@ We can also use the hardware-specific instructions if they are available: ```csharp if (Sse41.IsSupported) { - return !Sse41.TestZ(asciiVector, Vector128.Create((byte)0b_1000_0000)); + return Sse41.TestZ(vector, Vector128.Create((byte)0b_1000_0000)); } else if (AdvSimd.Arm64.IsSupported) { - Vector128 maxBytes = AdvSimd.Arm64.MaxPairwise(asciiVector, asciiVector); - return (maxBytes.AsUInt64().ToScalar() & 0x8080808080808080) != 0; + Vector128 maxBytes = AdvSimd.Arm64.MaxPairwise(vector, vector); + return (maxBytes.AsUInt64().ToScalar() & 0x8080808080808080) == 0; } ``` @@ -737,7 +741,7 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co | ExtractMostSignificantBits | 1024 | 27.33 ns | 0.11 | 141 B | ``` -Even such a simple problem can be solved in at least 5 different ways. Using sophisticated hardware-specific instructions does not always provide the best performance, so **with the new `Vector128` and `Vector256` APIs we don't need to become assembly language experts to write fast, vectorized code**. +Even such a simple problem can be solved in at least 5 different ways and each of them can perform significantly different on different hardware. Using sophisticated hardware-specific instructions does not always provide the best performance, so **with the new `Vector128` and `Vector256` APIs we don't need to become assembly language experts to write fast, vectorized code**. ## Tool-Chain @@ -750,13 +754,13 @@ Even such a simple problem can be solved in at least 5 different ways. Using sop Each of the vector types provides a `Create` method that accepts a single value and returns a vector with all elements initialized to this value. ```csharp -public static Vector128 Create(T value) where T : struct; +public static Vector128 Create(T value) where T : struct ``` `CreateScalar` initializes first element to the specified value, and the remaining elements to zero. ```csharp -public static Vector128 CreateScalar(int value); +public static Vector128 CreateScalar(int value) ``` `CreateScalarUnsafe` is similar, but the remaining elements are left uninitialized. It's dangerous! @@ -786,6 +790,8 @@ All size-specific vector types provide a set of APIs for common bit operations. `BitwiseAnd` computes the bitwise-and of two vectors, `BitwiseOr` computes the bitwise-or of two vectors. They can both be expressed by using the corresponding operators (`&` and `|`). 
The same goes for `Xor` which can be expressed with `^` operator and `Negate` (`~`). +**Note:** The **operators should be preferred where possible**, as it helps avoid bugs around operator precedence and can improve readability. + ```csharp public static Vector128 BitwiseAnd(Vector128 left, Vector128 right) where T : struct => left & right; public static Vector128 BitwiseOr(Vector128 left, Vector128 right) where T : struct => left | right; @@ -803,9 +809,9 @@ public static Vector128 AndNot(Vector128 left, Vector128 right) => l `ShiftRightArithmetic` performs a **signed** shift right and `ShiftRightLogical` performs an **unsigned** shift: ```csharp -public static Vector128 ShiftLeft(Vector128 vector, int shiftCount); -public static Vector128 ShiftRightArithmetic(Vector128 vector, int shiftCount); -public static Vector128 ShiftRightLogical(Vector128 vector, int shiftCount); +public static Vector128 ShiftLeft(Vector128 vector, int shiftCount) => vector << shiftCount; +public static Vector128 ShiftRightArithmetic(Vector128 vector, int shiftCount) => vector >> shiftCount; +public static Vector128 ShiftRightLogical(Vector128 vector, int shiftCount) => vector >>> shiftCount; ``` ### Equality @@ -853,9 +859,9 @@ Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0')); 00000000000000000000000000000100 ``` -and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) to get the trailing zero count. +and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) or [uint.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.uint32.trailingzerocount) (introduced in .NET 7) to get the trailing zero count. -To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, indexed from 0). +To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount) or [uint.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.uint32.leadingzerocount) (introduced in .NET 7). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, indexed from 0). If we were working with a buffer loaded from memory (example: searching for the last index of a given character in the buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. @@ -928,7 +934,7 @@ Assert.Equal(Vector128.Create(4.0f, 3, 3, 4), result); ### Math -Very simple math operations can be also expressed by using the operators: +Very simple math operations can be also expressed by using the operators. The operators should be preferred where possible, as it helps avoid bugs around operator precedence and can improve readability. 
```csharp public static Vector128 Add(Vector128 left, Vector128 right) where T : struct => left + right; @@ -946,10 +952,12 @@ public static Vector128 Subtract(Vector128 left, Vector128 right) => ```csharp public static Vector128 Abs(Vector128 vector) where T : struct public static Vector128 Ceiling(Vector128 vector) +public static Vector128 Ceiling(Vector128 vector) +public static Vector128 Floor(Vector128 vector) public static Vector128 Floor(Vector128 vector) -public static Vector128 Max(Vector128 left, Vector128 right) -public static Vector128 Min(Vector128 left, Vector128 right) -public static Vector128 Sqrt(Vector128 vector); +public static Vector128 Max(Vector128 left, Vector128 right) where T : struct +public static Vector128 Min(Vector128 left, Vector128 right) where T : struct +public static Vector128 Sqrt(Vector128 vector) where T : struct public static T Sum(Vector128 vector) where T : struct ``` From 281bbb484ac467d205ee590e6f387e606abfd571 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Tue, 30 May 2023 19:43:00 +0200 Subject: [PATCH 14/14] more polishing: * update TOC * add note about imperfect perf boost * don't recommend managed references over unsafe pointers, as they can both be dangerous when used incorrectly --- .../vectorization-guidelines.md | 21 ++++++++++++------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index bec4035c7cde66..ab85676263afd6 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -1,5 +1,7 @@ - [Introduction to vectorization with Vector128 and Vector256](#introduction-to-vectorization-with-vector128-and-vector256) * [Code structure](#code-structure) + + [Checking for Hardware Acceleration](#checking-for-hardware-acceleration) + + [Example Code Structure](#example-code-structure) + [Testing](#testing) + [Benchmarking](#benchmarking) - [Custom config](#custom-config) @@ -18,7 +20,7 @@ + [Edge cases](#edge-cases) + [Scalar solution](#scalar-solution) + [Vectorized solution](#vectorized-solution) - * [Toolchain](#toolchain) + * [Tool-Chain](#tool-chain) + [Creation](#creation) + [Bit operations](#bit-operations) + [Equality](#equality) @@ -27,6 +29,7 @@ + [Conversion](#conversion) + [Widening and Narrowing](#widening-and-narrowing) + [Shuffle](#shuffle) + - [Vector256.Shuffle vs Avx2.Shuffle](#vector256shuffle-vs-avx2shuffle) * [Summary](#summary) + [Best practices](#best-practices) @@ -310,6 +313,8 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co | Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | ``` +**Note:** as you can see, even such simple method like [Contains](https://learn.microsoft.com/dotnet/api/system.memoryextensions.contains) **did not observe a perfect performance boost**: x8 for `Vector256` (256/32) and x4 for `Vector128` (128/32). To understand why, we would need to use a profiler that provides information on CPU instruction level, which depending on the hardware could be [Intel VTune](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html) or [amd uprof](https://developer.amd.com/amd-uprof/). + The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large, aligned and its entire contents are searched through, as the value is never found). 
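+
+One way to cover more than this best case is to parameterize the measured size. A rough BenchmarkDotNet sketch (the `ContainsBenchmarks` type and the sizes below are only illustrative):
+
+```csharp
+public class ContainsBenchmarks
+{
+    // Arbitrary sizes: smaller than one Vector128<int>, a length that leaves a remainder, and a large input.
+    [Params(3, 33, 1024)]
+    public int Size;
+
+    private int[] _buffer;
+
+    [GlobalSetup]
+    public void Setup() => _buffer = new int[Size];
+
+    [Benchmark]
+    public bool Contains() => _buffer.AsSpan().Contains(-1); // -1 is never present, so the whole buffer is scanned
+}
+```
+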
Explaining benchmark design guidelines is outside of the scope of this document, but we have a [dedicated document](https://github.com/dotnet/performance/blob/main/docs/microbenchmark-design-guidelines.md#benchmarks-are-not-unit-tests) about it. To make a long story short, **you should benchmark all scenarios that are realistic for your production environment**, so your customers can actually benefit from your improvements. @@ -1142,16 +1147,16 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas - If you have already vectorized your code, but only for `x64/x86` or `arm64/arm`, you can use the new APIs to have a single, cross-platform implementation. - If you have already vectorized your code with `Vector` you can use the new APIs to check if they can produce better code-gen. - If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new `Vector128` and `Vector256` APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths. +- Both managed references and unsafe pointers are dangerous to use incorrectly and each comes with their own tradeoff. ### Best practices 1. Implement tests that cover all code paths, including Access Violations. 2. Run tests for all hardware acceleration scenarios, use the existing environment variables to do that. 3. Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users. -4. Prefer managed references over unsafe pointers to avoid pinning and safety issues. -5. Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffers correctly. -6. Prefer `LoadUnsafe(ref T, nuint elementOffset)` and `StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset)` over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow! -7. Always handle the vectorized loop remainder. -8. When storing values in memory, be aware of a potential buffer overlap. -9. When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs. -10. Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. It's also possible to perform equality checks, various bit and math operations. Don't try to memorize all the details, treat these APIs as a cookbook that you come back to when needed. +4. Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffers correctly. +5. Prefer `LoadUnsafe(ref T, nuint elementOffset)` and `StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset)` over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow! +6. Always handle the vectorized loop remainder. +7. When storing values in memory, be aware of a potential buffer overlap. +8. 
When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution, and afterwards try to express what the scalar code is doing with Vector128/256 APIs.
+9. Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. It's also possible to perform equality checks as well as various bit and math operations. Don't try to memorize all the details; treat these APIs as a cookbook that you come back to when needed.