From ce1840f3b19d4cdf83c241fd3970a0c540007da0 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 30 Mar 2023 07:12:23 +0200 Subject: [PATCH 01/14] Introduction to vectorization with Vector128 and Vector256 --- .../vectorization-guidelines.md | 1060 +++++++++++++++++ 1 file changed, 1060 insertions(+) create mode 100644 docs/coding-guidelines/vectorization-guidelines.md diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md new file mode 100644 index 00000000000000..394b91af3ba0c5 --- /dev/null +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -0,0 +1,1060 @@ +- [Introduction to vectorization with Vector128 and Vector256](#introduction-to-vectorization-with-vector128-and-vector256) + * [Code structure](#code-structure) + + [Testing](#testing) + + [Benchmarking](#benchmarking) + - [Custom config](#custom-config) + - [Memory alignment](#memory-alignment) + * [Enforcing memory alignment](#enforcing-memory-alignment) + * [Memory randomization](#memory-randomization) + * [Loops](#loops) + + [Scalar remainder handling](#scalar-remainder-handling) + + [Vectorized remainder handling](#vectorized-remainder-handling) + + [AV testing](#av-testing) + * [Loading and storing vectors](#loading-and-storing-vectors) + + [Loading](#loading) + + [Storing](#storing) + + [Casting](#casting) + * [Mindset](#mindset) + + [Edge cases](#edge-cases) + + [Scalar solution](#scalar-solution) + + [Vectorized solution](#vectorized-solution) + * [Toolchain](#toolchain) + + [Creation](#creation) + + [Bit operations](#bit-operations) + + [Equality](#equality) + + [Comparison](#comparison) + + [Math](#math) + + [Conversion](#conversion) + + [Widening and Narrowing](#widening-and-narrowing) + + [Shuffle](#shuffle) + * [Summary](#summary) + + [Best practices](#best-practices) + +TL;DR: Go to [Summary](#summary) + +# Introduction to vectorization with Vector128 and Vector256 + +Vectorization is an art of converting an algorithm from operating on a single value at a time to operating on a set of values (vector). It can greatly improve performance at a cost of increased code complexity. + +In the recent releases, .NET has introduced plenty of APIs for vectorization. Vast majority of them were hardware specific. It required the users to provide implementation per processor architecture (x64 and/or arm64), with a possibility to use the most optimal instructions for hardware that is executing the code. + +.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` that aim for writing hardware-agnostic vectorized code. The purpose of this document is to introduce the readers to the new APIs and provide a set of best practices. + +## Code structure + +`Vector128` represents a 128-bit vector of type `T`. `T` is constrained to specific primitive types: + +* `byte` and `sbyte` (8 bits). +* `short` and `ushort` (16 bits). +* `int`, `uint` and `float` (32 bits). +* `long`, `ulong` and `double` (64 bits). +* `nint` and `unit` (32 or 64 bits, depending on the architecture) + +Each `Vector128` operation allows to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats and 2 (u)longs/double(s). 
+ +``` +------------------------------128-bits--------------------------- +| 64 | 64 | +----------------------------------------------------------------- +| 32 | 32 | 32 | 32 | +----------------------------------------------------------------| +| 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | +----------------------------------------------------------------- +| 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | +----------------------------------------------------------------- +``` + +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, we should prefer it over a `Vector128`. To check the acceleration, we need to use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. + +We also must account for the size of the input. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector128.Count` return the size of a vector of given type in bytes. +Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. It's not true for pre-compiled code (NativeAOT). + +That is why the code is very often structured like this: + +```cs +void CodeStructure(ReadOnlySpan buffer) +{ + if (Vector256.IsHardwareAccelerated && buffer.Length >= Vector256.Count) + { + // Vector256 code path + } + else if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count) + { + // Vector128 code path + } + else + { + // non-vectorized && small inputs code path + } +} +``` + +**Both vector types provide almost identical features**, but arm64 hardware does not support `Vector256` yet, so for the sake of simplicity we will be using `Vector128` in all examples and assuming **little endian** architecture. Which means that all examples used in this document assume that they are being executed as part of the following `if` block: + +```cs +else if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count) +{ + // Vector128 code path +} +``` + +### Testing + +Such a code structure requires us to **test all possible code paths**: + +* `Vector256` is accelerated: + * The input is large enough to benefit from vectorization with `Vector256`. + * The input is not large enough to benefit from vectorization with `Vector256`, but it can benefit from vectorization with `Vector128` (when `Vector256` is accelerated then `Vector128` and smaller vectors are also). + * The input is too small to benefit from any kind of vectorization. +* `Vector128` is accelerated + * The input is large enough to benefit from vectorization with `Vector128`. + * The input is too small to benefit from any kind of vectorization. +* Neither `Vector128` or `Vector256` are accelerated. + +It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration from unit test level. It can be controlled with environment variables before .NET process is started: + +* When `COMPlus_EnableAVX2` is set to `0`, `Vector256.IsHardwareAccelerated` returns `false`. +* When `COMPlus_EnableAVX` is set to `0`, `Vector128.IsHardwareAccelerated` returns `false`. +* When `COMPlus_EnableHWIntrinsic` is set to `0`, not only both mentioned APIs return `false`, but also `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. 
+ +Assuming that we run the tests on an `x64` machine that supports `Vector256` we need to write tests that cover all size scenarios and run them with: +* no custom settings +* `COMPlus_EnableAVX2=0` +* `COMPlus_EnableAVX=0` (it can be skipped if `Vector64` and `Vector` are not involved) +* `COMPlus_EnableHWIntrinsic=0` + +### Benchmarking + +All that complexity needs to pay off. We need to **benchmark the code to verify that the investment is beneficial**. We can do that with [BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet). + +#### Custom config + +It's possible to define a config that instructs the harness to run the benchmarks for all four scenarios: + +```cs +static void Main(string[] args) +{ + Job enough = Job.Default + .WithWarmupCount(1) + .WithIterationTime(TimeInterval.FromSeconds(0.25)) + .WithMaxIterationCount(20); + + IConfig config = DefaultConfig.Instance + .HideColumns(Column.EnvironmentVariables, Column.RatioSD, Column.Error) + .AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig + (exportGithubMarkdown: true, printInstructionAddresses: false))) + .AddJob(enough.WithEnvironmentVariable("COMPlus_EnableHWIntrinsic", "0").WithId("Scalar").AsBaseline()); + + if (Vector256.IsHardwareAccelerated) + { + config = config + .AddJob(enough.WithId("Vector256")) + .AddJob(enough.WithEnvironmentVariable("COMPlus_EnableAVX2", "0").WithId("Vector128")); + + } + else if (Vector128.IsHardwareAccelerated) + { + config = config.AddJob(enough.WithId("Vector128")); + } + + BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly) + .Run(args, config); +} +``` + +**Note:** the config defines a [disassembler](https://adamsitnik.com/Disassembly-Diagnoser/), which exports a disassembly in GitHub markdown format (supported on both x64 and arm64, Windows and Linux). It is very often an invaluable tool when working with high-performance code where inspecting generated assembly code is required. + +#### Memory alignment + +BenchmarkDotNet does a lot of heavy lifting for the end users, but it can not protect us from the random memory alignment which can be different per each benchmark run and affect the stability of the benchmarks. + +We have three possibilities: + +* We can enforce the alignment ourselves and have very stable results. +* We can ask the harness to try to randomize the memory and observe entire possible distribution with each run. +* We can do nothing and wonder why the results vary from time to time. + +##### Enforcing memory alignment + +We can allocate aligned unmanaged memory by using the [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). 
+ +```cs +public unsafe class Benchmarks +{ + private void* _pointer; + + [Params(6, 32, 1024)] // test various sizes + public uint Size; + + [GlobalSetup] + public void Setup() + { + _pointer = NativeMemory.AlignedAlloc(byteCount: Size * sizeof(int), alignment: 32); + new Span(_pointer, (int)Size).Fill(0); // ensure it's all zeros, so 1 is never found + } + + [Benchmark] + public bool Contains() + { + ReadOnlySpan buffer = new (_pointer, (int)Size); + return buffer.Contains(1); + } + + [GlobalCleanup] + public void Cleanup() => NativeMemory.AlignedFree(_pointer); +} +``` + +Sample results (please mind the AVX2, AVX and SSE4.2 information printed in the summary): + +```ini +BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1413/22H2/2022Update/SunValley2) +AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores +.NET SDK=8.0.100-alpha.1.22558.1 + [Host] : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2 + Scalar : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT + Vector128 : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX + Vector256 : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2 +``` + +``` +| Method | Job | Size | Mean | StdDev | Ratio | Code Size | +|--------- |---------- |----- |-----------:|----------:|------:|----------:| +| Contains | Scalar | 1024 | 143.844 ns | 0.6234 ns | 1.00 | 206 B | +| Contains | Vector128 | 1024 | 104.544 ns | 1.2792 ns | 0.73 | 335 B | +| Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | +``` + +The results should be very stable (flat distributions), but on the other hand we are measuring the performance of best case scenario (the input is large and it's entire content is searched for, as the value is never found). + +Explaining benchmark design guidelines is outside of the scope of this document, but we have a [dedicated document](https://github.com/dotnet/performance/blob/main/docs/microbenchmark-design-guidelines.md#benchmarks-are-not-unit-tests) about it. To make a long story short, **you should benchmark all scenarios that are realistic for your production environment**, so your customers can actually benefit from your improvements. + +##### Memory randomization + +The alternative is to enable memory randomization. Before every iteration, the harness is going to allocate random-size objects, keep them alive and re-run the setup that should allocate the actual memory. + +You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587), it requires understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) book has two chapters dedicated to statistics and can help you get a very good understanding of this subject. + +No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, you might not benefit from it, or even regress the performance. + +## Loops + +To work with inputs that are bigger than a single vector, we typically need to loop over the entire input. This should be split into two parts: + +* vectorized loop that operates on multiple values at a time +* handling of the remainder + +Example: our input is a buffer of ten integers, assuming that `Vector128` is accelerated, we handle the first four values in the first loop iteration, the next four in the second iteration and then we stop, as only two are left. 
Depending on how we can handle the remainder, we distinguish two approaches. + +### Scalar remainder handling + +Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128/32) integers in one iteration. In the second loop, we handle the remaining values. + + +```cs +int Sum(Span buffer) +{ + Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); + + // The initial sum is zero, so we need a vector with all elements initialized to zero. + Vector128 sum = Vector128.Zero; + + // We need to obtain the reference to first value in the buffer, it's used later for loading vectors from memory. + ref int searchSpace = ref MemoryMarshal.GetReference(buffer); + // And an offset, that is going to be used by vectorized and scalar loops. + nuint elementOffset = 0; + // And the last valid offset from which we can load the values + nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); + for (; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128.Count) + { + // We load a vector from given offset. + Vector128 loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); + // We add 4 integers at a time: + sum += loaded; + } + + // We sum all 4 integers from the vector to one + int result = Vector128.Sum(sum); + + // And handle the remaining elements, in a non-vectorized way: + while (elementOffset < (nuint)buffer.Length) + { + result += buffer[(int)elementOffset]; + elementOffset++; + } + + return result; +} +``` + +**Note:** Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffer scenarios. If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. It can be used for pinning but must never be dereferenced. + +**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! + +**Note:** Please keep in mind that `Vector128.Sum` is a static method. `Vectior128` and `Vector256` provide both instance and static methods (operators like `+` are just static methods in C#). `Vector128` and `Vector256` are non-generic static classes with static methods only. It's important to know about their existence when searching for methods. + +### Vectorized remainder handling + +Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. + +Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration, we need to handle the remaining two, but it's less than `Vector128` size, so we handle last four elements. Which means that two values in the middle get checked twice. + +```cs +bool Contains(Span buffer, int searched) +{ + Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); + + Vector128 loaded; + // We need a vector for storing the searched value. 
+ Vector128 values = Vector128.Create(searched); + + ref int searchSpace = ref MemoryMarshal.GetReference(buffer); + nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); + for (nuint elementOffset = 0; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128.Count) + { + loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); + // compare the loaded vector with searched value vector + if (Vector128.Equals(loaded, values) != Vector128.Zero) + { + return true; // return true if a difference was found + } + } + + // If any elements remain, process the last vector in the search space. + if ((uint)buffer.Length % Vector128.Count != 0) + { + loaded = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd); + if (Vector128.Equals(loaded, values) != Vector128.Zero) + { + return true; + } + } + + return false; +} +``` + +`Vector128.Create(value)` creates a new vector with all elements initialized to the specified value. So `Vector128.Zero` is an equivalent of `Vector128.Create(0)`. + +`Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector whose elements are all-bits-set or zero, depending on if the provided elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. + +### AV testing + +Handling the remainder in an invalid way, may lead to non-deterministic and hard to diagnose issues. + +Let's look at the following code: + +```diff +nuint elementOffset = 0; +while (elementOffset < (nuint)buffer.Length) +{ + loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); + + elementOffset += (nuint)Vector128.Count; +} +``` + +How many time the loop is going to execute for a buffer of six integers? Twice! The first time it's going to load the first four elements, the second time it's going to load the two last elements and turn random memory that is following the buffer into next two elements! + +Writing tests that detect such issues is hard, but not impossible. .NET Team uses a helper utility called [BoundedMemory](https://github.com/dotnet/runtime/blob/main/src/libraries/Common/tests/TestUtilities/System/Buffers/BoundedMemory.Creation.cs) that allocates memory region which is immediately preceded by or immediately followed by a poison (`MEM_NOACCESS`) page. Attempting to read the memory immediately before or after it results in `AccessViolationException`. + +## Loading and storing vectors + +### Loading + +Both `Vector128` and `Vector256` provide at least five ways of loading them from memory: + +```cs +public static class Vector128 +{ + public static Vector128 Load(T* source) where T : unmanaged; + public static Vector128 LoadAligned(T* source) where T : unmanaged; + public static Vector128 LoadAlignedNonTemporal(T* source) where T : unmanaged; + public static Vector128 LoadUnsafe(ref T source) where T : struct; + public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; +} +``` + +The first three overloads require a pointer to the source. To be able to use a pointer in a safe way, the buffer needs to be pinned first (the GC is not tracking unmanaged pointers, we have to ensure that the memory does not get moved by GC in the meantime, as the pointers would silently become invalid). 
That is simple, the problem is doing the pointer arithmetic right: + +```cs +unsafe int UnmanagedPointersSum(Span buffer) +{ + fixed (int* pBuffer = buffer) + { + int* pEnd = pBuffer + buffer.Length; + int* pOneVectorFromEnd = pEnd - Vector128.Count; + int* pCurrent = pBuffer; + + Vector128 sum = Vector128.Zero; + + while (pCurrent <= pOneVectorFromEnd) + { + sum += Vector128.Load(pCurrent); + + pCurrent += Vector128.Count; + } + + int result = Vector128.Sum(sum); + + while (pCurrent < pEnd) + { + result += *pCurrent; + + pCurrent++; + } + + return result; + } +} +``` + +The `LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. + +Currently .NET exposes only one API fo allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. + +The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code is hardly ever worth it and needs to be proved with proper benchmarking on various hardware. + +The fourth method expects only a managed reference (`ref T source`). We don't need to pin the buffer (GC is tracking managed references and updates them if memory gets moved), but it still requires us to properly handle managed pointer arithmetic: + +```cs +int ManagedReferencesSum(int[] buffer) +{ + ref int current = ref MemoryMarshal.GetArrayDataReference(buffer); + ref int end = ref Unsafe.Add(ref current, buffer.Length); + ref int oneVectorAwayFromEnd = ref Unsafe.Add(ref end, -Vector128.Count); + + Vector128 sum = Vector128.Zero; + + while (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd)) + { + sum += Vector128.LoadUnsafe(ref current); + + current = ref Unsafe.Add(ref current, Vector128.Count); + } + + int result = Vector128.Sum(sum); + + while (Unsafe.IsAddressLessThan(ref current, ref end)) + { + result += current; + + current = ref Unsafe.Add(ref current, 1); + } + + return result; +} +``` + +**Note:** `Unsafe` does not expose a method called "IsGreaterOrEqualThan", so we are using a negation of `Unsafe.IsAddressGreaterThan` to achieve desired effect. + +**Pointer arithmetic can always go wrong, even if you are an experienced engineer and get a very detailed code review from .NET architects**. In [#73768](https://github.com/dotnet/runtime/pull/73768) a GC hole was introduced. The code looked simple: + +```cs +ref TValue currentSearchSpace = ref Unsafe.Add(ref searchSpace, length - Vector128.Count); + +do +{ + equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref currentSearchSpace)); + if (equals == Vector128.Zero) + { + currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128.Count); + continue; + } + + return ...; +} +while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); +``` + +It was part of `LastIndexOf` implementation, where we were iterating from the end to the beginning of the buffer. 
In the last iteration of the loop, `currentSearchSpace` could become a pointer to unknown memory that lied before the beginning of the buffer: + +```cs +currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128.Count); +``` + +And it was fine until GC kicked right after that, moved objects in memory, updated all valid managed references and resumed the execution, which run following condition: + +```cs +while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); +``` + +Which could return true because `currentSearchSpace` was invalid and not updated. If you are interested in more details, you can check the [issue](https://github.com/dotnet/runtime/issues/75792#issuecomment-1249973858) and the [fix](https://github.com/dotnet/runtime/pull/75857). + +That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic!** + +```cs +public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; +``` + +**The only thing we need to keep in mind is potential `nuint` overflow when doing unsigned integer arithmetic.** + +```cs +Span buffer = new int[2] { 1, 2 }; +nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); +Console.WriteLine(oneVectorAwayFromEnd); +``` + +Can you guess the result? For a 64 bit process it's `FFFFFFFFFFFFFFFE` (a hex representation of `18446744073709551614`)! That is why the length of the buffer needs to be always checked before doing similar computations! + +### Storing + +Similarly to loading, both `Vector128` and `Vector256` provide at least five ways of storing them in memory: + +```cs +public static class Vector128 +{ + public static void Store(this Vector128 source, T* destination) where T : unmanaged; + public static void StoreAligned(this Vector128 source, T* destination) where T : unmanaged; + public static void StoreAlignedNonTemporal(this Vector128 source, T* destination) where T : unmanaged; + public static void StoreUnsafe(this Vector128 source, ref T destination) where T : struct; + public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; +} +``` + +For the reasons described for loading, we recommend using the overload that takes managed reference and element offset: + +```cs +public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; +``` + +**Note**: when loading values from one buffer and storing them into another, we need to consider whether they overlap or not. [MemoryExtensions.Overlap](https://learn.microsoft.com/dotnet/api/system.memoryextensions.overlaps#system-memoryextensions-overlaps-1(system-readonlyspan((-0))-system-readonlyspan((-0)))) is an API for doing that. + +### Casting + +As mentioned before, `Vector128` and `Vector256` are constrained to a specific set of primitive types. `char` is not one of them, but it does not mean that we can't implement vectorized text operations with the new APIs. For primitive types of the same size (and value types that don't contain references), casting is the solution. 
+ +[Unsafe.As](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.as#system-runtime-compilerservices-unsafe-as-2(-0@)) can be used to get a reference to supported type: + +```cs +void CastingReferences(Span buffer) +{ + ref char charSearchSpace = ref MemoryMarshal.GetReference(buffer); + ref short searchSpace = ref Unsafe.As(ref charSearchSpace); + // from now on we can use Vector128 or Vector256 +} +``` + +Or [MemoryMarshal.Cast](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.memorymarshal.cast#system-runtime-interopservices-memorymarshal-cast-2(system-readonlyspan((-0)))), which casts a span of one primitive type to a span of another primitive type: + +```cs +void CastingSpans(Span chars) +{ + Span shorts = MemoryMarshal.Cast(chars); +} +``` + +It's also possible to get managed references from unmanaged pointers: + +```cs +void PointerToReference(char* pUtf16Buffer, byte* pAsciiBuffer) +{ + // of the same type: + ref byte asciiBuffer = ref *pAsciiBuffer; + // of different types: + ref ushort utf16Buffer = ref *(ushort*)pUtf16Buffer; +} +``` + +We should avoid doing this in the opposite direction, as most engineers will assume that unmanaged pointers are already pinned. + +## Mindset + +Vectorizing real-world algorithms seems complex at the beginning. And what do software engineers do with complex problems? We break them down into sub-problems until these become simple enough to be solved directly. + +Let's implement a vectorized method for checking whether a given byte buffer consists only from valid ASCII characters to see how similar problems can be solved. + +### Edge cases + +Before we start working on the implementation, let's list all edge cases for our `IsAcii(ReadOnlySpan buffer)` method (and ideally write tests): + +* It does not need to throw any argument exceptions, as `ReadOnlySpan` is `struct` and it can never be `null` or invalid. +* It should return `true` for an empty buffer. +* It should detect invalid characters in the entire buffer, including the remainder. +* It should not read any bytes that don't belong to the provided buffer. + +### Scalar solution + +Once we know all edge cases, we need to understand our problem and find a scalar solution. + +ASCII characters are values in the range from `0` to `127` (inclusive). It means that we can find invalid ASCII bytes by just searching for values that are larger than `127`. If we treat `byte` (unsigned) as `sbyte` (signed), it's a matter of performing "is less than zero" check. + +The binary representation of 0-127 range is following: + +```log +00000000 +01111111 +^ +most significant bit +``` + +When we look at it, we can realize that another way is checking whether the most significant bit is equal `1`. For the scalar version, we could perform a logical AND: + +```cs +bool IsValidAscii(byte c) => (c & 0b1000_0000) == 0; +``` + +### Vectorized solution + +Another step is vectorizing our scalar solution and choosing the best way of doing that based on data. 
+ +If we reuse one of the loops presented in the previous sections, all we need to implement is a method that accepts `Vector128` and returns `bool` and does exactly the same thing that our scalar method did, but for a vector rather than single value: + +```cs +[MethodImpl(MethodImplOptions.AggressiveInlining)] +bool VectorContainsNonAsciiChar(Vector128 asciiVector) +{ + // to perform "> 127" check we can use GreaterThanAny method: + return Vector128.GreaterThanAny(asciiVector, Vector128.Create((byte)127)) + // to perform "< 0" check, we need to use AsSByte and LessThanAny methods: + return Vector128.LessThanAny(asciiVector.AsSByte(), Vector128.Zero) + // to perform an AND operation, we need to use & operator + return (asciiVector & Vector128.Create((byte)0b_1000_0000)) != Vector128.Zero; + // we can also just use ExtractMostSignificantBits method: + return asciiVector.ExtractMostSignificantBits() != 0; +} +``` + +We can also use the hardware-specific instructions if they are available: + +```cs +if (Sse41.IsSupported) +{ + return !Sse41.TestZ(asciiVector, Vector128.Create((byte)0b_1000_0000)); +} +else if (AdvSimd.Arm64.IsSupported) +{ + Vector128 maxBytes = AdvSimd.Arm64.MaxPairwise(asciiVector, asciiVector); + return (maxBytes.AsUInt64().ToScalar() & 0x8080808080808080) != 0; +} +``` + +Benchmark all available solutions, and choose the one that is the best for us. + +```ini +BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1413/22H2/2022Update/SunValley2) +AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores +.NET SDK=8.0.100-alpha.1.22558.1 + [Host] : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2 +``` + +``` +| Method | Size | Mean | Ratio | Code Size | +|--------------------------- |----- |----------:|------:|----------:| +| Scalar | 1024 | 252.13 ns | 1.00 | 69 B | +| GreaterThanAny | 1024 | 32.49 ns | 0.13 | 178 B | +| LessThanAny | 1024 | 29.33 ns | 0.12 | 146 B | +| And | 1024 | 26.13 ns | 0.10 | 138 B | +| TestZ | 1024 | 27.26 ns | 0.11 | 129 B | +| ExtractMostSignificantBits | 1024 | 27.33 ns | 0.11 | 141 B | +``` + +Even such a simple problem can be solved in at least 5 different ways. Using sophisticated hardware-specific instructions does not always provide the best performance, so **with the new `Vector128` and `Vector256` APIs we don't need to become assembly language experts to write fast, vectorized code**. + +## Toolchain + +`Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, it's important to remember what kind of operations they allow for and check the details when needed. + +### Creation + +Each of the vector types provides a `Create` method that accepts a single value and returns a vector with all elements initialized to this value. + +```cs +public static Vector128 Create(T value) where T : struct; +``` + +`CreateScalar` initializes first element to the specified value, and the remaining elements to zero. + +```cs +public static Vector128 CreateScalar(int value); +``` + +`CreateScalarUnsafe` is similar, but the remaining elements are left uninitialized. It's dangerous! 
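For illustration, a rough sketch of the difference between the first two (the comments show how `ToString` renders the resulting vectors for `int` elements):

```cs
Vector128<int> broadcast = Vector128.Create(5);    // every element is 5
Vector128<int> scalar = Vector128.CreateScalar(5); // first element is 5, the rest are zero
Console.WriteLine(broadcast); // <5, 5, 5, 5>
Console.WriteLine(scalar);    // <5, 0, 0, 0>
```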
+ + +We also have an overload that allows for specifying every value in given vector: + +```cs +public static Vector128 Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7) +``` + +And last, but not least a `Create` overload that accepts a buffer. It creates a vector with its elements set to the first `VectorXYZ.Count`-many elements of the buffer. It's not recommended to use it in a loop, where `Load` methods should be used instead (performance). + +```cs +public static Vector128 Create(ReadOnlySpan values) where T : struct +``` + +to perform a copy in the other direction, we can use one of the `CopyTo` extension methods: + +```cs +public static void CopyTo(this Vector128 vector, Span destination) where T : struct +``` + +### Bit operations + +All size-specific vector types provide a set of APIs for common bit operations. + +`BitwiseAnd` computes the bitwise-and of two vectors, `BitwiseOr` computes the bitwise-or of two vectors. They can both be expressed by using the corresponding operators (`&` and `|`). The same goes for `Xor` which can be expressed with `^` operator and `Negate` (`~`). + +```cs +public static Vector128 BitwiseAnd(Vector128 left, Vector128 right) where T : struct => left & right; +public static Vector128 BitwiseOr(Vector128 left, Vector128 right) where T : struct => left | right; +public static Vector128 Xor(Vector128 left, Vector128 right) => left ^ right; +public static Vector128 Negate(Vector128 vector) => ~vector; +``` + +`AndNot` computes the bitwise-and of a given vector and the ones complement of another vector. + +```cs +public static Vector128 AndNot(Vector128 left, Vector128 right) => left & ~right; +``` + +`ShiftLeft` shifts each element of a vector left by the specified number of bits. +`ShiftRightArithmetic` performs a **signed** shift right and `ShiftRightLogical` performs an **unsigned** shift: + +```cs +public static Vector128 ShiftLeft(Vector128 vector, int shiftCount); +public static Vector128 ShiftRightArithmetic(Vector128 vector, int shiftCount); +public static Vector128 ShiftRightLogical(Vector128 vector, int shiftCount); +``` + +### Equality + +`EqualsAll` compares two vectors to determine if all elements are equal. `EqualsAny` compares two vectors to determine if any elements are equal. + +```cs +public static bool EqualsAll(Vector128 left, Vector128 right) where T : struct => left == right; +public static bool EqualsAny(Vector128 left, Vector128 right) where T : struct +``` + +`Equals` compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on if the corresponding elements in `left` and `right` arguments were equal. + +```cs +public static Vector128 Equals(Vector128 left, Vector128 right) where T : struct +``` + +How to calculate the index of first match? Let's take a closer look at the result of following equality check: + +```cs +Vector128 left = Vector128.Create(1, 2, 3, 4); +Vector128 right = Vector128.Create(0, 0, 3, 0); +Vector128 equals = Vector128.Equals(left, right); +Console.WriteLine(equals); +``` + +```log +<0, 0, -1, 0> +``` + +`-1` is just `FFFFFFFF` (all-bits-set). We could use `GetElement` to get the first non-zero element. + +```cs +public static T GetElement(this Vector128 vector, int index) where T : struct +``` + +But it would not be an optimal solution. 
We should rather extract the most significant bits: + +```cs +uint mostSignificantBits = equals.ExtractMostSignificantBits(); +Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0')); +``` + +```log +00000000000000000000000000000100 +``` + +and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) to get trailing zero count. + +To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, and indexed from 0). + +If we were working with a buffer loaded from memory (example: searching for the last index of given character in a buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. + +```cs +int ComputeLastIndex(nint elementOffset, Vector128 equals) where T : struct +{ + uint mostSignificantBits = equals.ExtractMostSignificantBits(); + + int index = 31 - BitOperations.LeadingZeroCount(mostSignificantBits); // 31 = 32 (bits in UInt32) - 1 (indexing from zero) + + return (int)elementOffset + index; +} +``` + +If we were using the `Load` overload that takes only the managed reference, we could use [Unsafe.ByteOffset(ref T, ref T)](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.byteoffset) to calculate the element offset. + +```cs +unsafe int ComputeFirstIndex(ref T searchSpace, ref T current, Vector128 equals) where T : struct +{ + int elementOffset = (int)Unsafe.ByteOffset(ref searchSpace, ref current) / sizeof(T); + + uint mostSignificantBits = equals.ExtractMostSignificantBits(); + int index = BitOperations.TrailingZeroCount(mostSignificantBits); + + return elementOffset + index; +} +``` + +### Comparison + +Beside equality checks, vector APIs allow for comparison. The `bool` returning overload return `true` when given condition is true: + +```cs +public static bool GreaterThanAll(Vector128 left, Vector128 right) where T : struct +public static bool GreaterThanAny(Vector128 left, Vector128 right) where T : struct +public static bool GreaterThanOrEqualAll(Vector128 left, Vector128 right) where T : struct +public static bool GreaterThanOrEqualAny(Vector128 left, Vector128 right) where T : struct +public static bool LessThanAll(Vector128 left, Vector128 right) where T : struct +public static bool LessThanAny(Vector128 left, Vector128 right) where T : struct +public static bool LessThanOrEqualAll(Vector128 left, Vector128 right) where T : struct +public static bool LessThanOrEqualAny(Vector128 left, Vector128 right) where T : struct +``` + +Similarly to `Equals`, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on if the corresponding elements in `left` and `right` meet given condition. + +```cs +public static Vector128 GreaterThan(Vector128 left, Vector128 right) where T : struct +public static Vector128 GreaterThanOrEqual(Vector128 left, Vector128 right) where T : struct +public static Vector128 LessThan(Vector128 left, Vector128 right) where T : struct +public static Vector128 LessThanOrEqual(Vector128 left, Vector128 right) where T : struct +``` + +`ConditionalSelect` Conditionally selects a value from two vectors on a bitwise basis. 
+ +```cs +public static Vector128 ConditionalSelect(Vector128 condition, Vector128 left, Vector128 right) +``` + +This method deserves a self-describing example: + +```cs +Vector128 left = Vector128.Create(1.0f, 2, 3, 4); +Vector128 right = Vector128.Create(4.0f, 3, 2, 1); + +Vector128 result = Vector128.ConditionalSelect(Vector128.GreaterThan(left, right), left, right); + +Assert.Equal(Vector128.Create(4.0f, 3, 3, 4), result); +``` + +### Math + +Very simple math operations can be also expressed by using the operators: + +```cs +public static Vector128 Add(Vector128 left, Vector128 right) where T : struct => left + right; +public static Vector128 Divide(Vector128 left, Vector128 right) => left / right; +public static Vector128 Divide(Vector128 left, T right) => left / right; +public static Vector128 Multiply(Vector128 left, Vector128 right) => left * right; +public static Vector128 Multiply(Vector128 left, T right) => left * right; +public static Vector128 Subtract(Vector128 left, Vector128 right) => left - right; +``` + +**Note:** Some of the methods accept a single value as the second argument. + +`Abs`, `Ceiling`, `Floor`, `Max`, `Min`, `Sqrt` and `Sum` are also provided: + +```cs +public static Vector128 Abs(Vector128 vector) where T : struct +public static Vector128 Ceiling(Vector128 vector) +public static Vector128 Floor(Vector128 vector) +public static Vector128 Max(Vector128 left, Vector128 right) +public static Vector128 Min(Vector128 left, Vector128 right) +public static Vector128 Sqrt(Vector128 vector); +public static T Sum(Vector128 vector) where T : struct +``` + +### Conversion + +Vector types provide a set of methods dedicated to numbers conversion: + +```cs +public static unsafe Vector128 ConvertToDouble(Vector128 vector) +public static unsafe Vector128 ConvertToDouble(Vector128 vector) +public static unsafe Vector128 ConvertToInt32(Vector128 vector) +public static unsafe Vector128 ConvertToInt64(Vector128 vector) +public static unsafe Vector128 ConvertToSingle(Vector128 vector) +public static unsafe Vector128 ConvertToSingle(Vector128 vector) +public static unsafe Vector128 ConvertToUInt32(Vector128 vector) +public static unsafe Vector128 ConvertToUInt64(Vector128 vector) +``` + +And for reinterpretation (no values are being changed, they can be just used as if they were of a different type): + +```cs +public static Vector128 As(this Vector128 vector) +public static Vector128 AsByte(this Vector128 vector) +public static Vector128 AsDouble(this Vector128 vector) +public static Vector128 AsInt16(this Vector128 vector) +public static Vector128 AsInt32(this Vector128 vector) +public static Vector128 AsInt64(this Vector128 vector) +public static Vector128 AsNInt(this Vector128 vector) +public static Vector128 AsNUInt(this Vector128 vector) +public static Vector128 AsSByte(this Vector128 vector) +public static Vector128 AsSingle(this Vector128 vector) +public static Vector128 AsUInt16(this Vector128 vector) +public static Vector128 AsUInt32(this Vector128 vector) +public static Vector128 AsUInt64(this Vector128 vector) +``` + +### Widening and Narrowing + +The first half of every vector is called "lower", the second is "upper". 
+ +``` +------------------------------128-bits--------------------------- +| LOWER | UPPER | +----------------------------------------------------------------- +| 32 | 32 | 32 | 32 | +----------------------------------------------------------------| +| 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | +----------------------------------------------------------------- +| 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | +----------------------------------------------------------------- +``` + +In case of `Vector128`, `GetLower` gets the value of the lower 64-bits as a new `Vector64` and `GetUpper` gets the upper 64-bits. + +```cs +public static Vector64 GetLower(this Vector128 vector) +public static Vector64 GetUpper(this Vector128 vector) +``` + +Each vector type provides a `Create` method that allows for the creation from lower and upper: + +```cs +public static unsafe Vector128 Create(Vector64 lower, Vector64 upper) +public static Vector256 Create(Vector128 lower, Vector128 upper) +``` + +`Lower` and `Upper` are also used by `Widen`. This method widens a `Vector128` into two `Vector128` where `sizeof(T2) == 2 * sizeof(T1)`. + +```cs +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) +``` + +It's also possible to widen only the lower or upper part: + +```cs +public static Vector128 WidenLower(Vector128 source) +public static Vector128 WidenUpper(Vector128 source) +``` + +An example of widening is converting a buffer of ASCII bytes into characters: + +```cs +byte[] byteBuffer = Enumerable.Range('A', 128 / 8).Select(i => (byte)i).ToArray(); +Vector128 byteVector = Vector128.Create(byteBuffer); +Console.WriteLine(byteVector); +(Vector128 Lower, Vector128 Upper) = Vector128.Widen(byteVector); +Console.Write(Lower.AsByte()); +Console.WriteLine(Upper.AsByte()); + +Vector256 ushortVector = Vector256.Create(Lower, Upper); +Span ushortBuffer = stackalloc ushort[256 / 16]; +ushortVector.CopyTo(ushortBuffer); +Span charBuffer = MemoryMarshal.Cast(ushortBuffer); +Console.WriteLine(new string(charBuffer)); +``` + +```log +<65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80> +<65, 0, 66, 0, 67, 0, 68, 0, 69, 0, 70, 0, 71, 0, 72, 0><73, 0, 74, 0, 75, 0, 76, 0, 77, 0, 78, 0, 79, 0, 80, 0> +ABCDEFGHIJKLMNOP +``` + +`Narrow` is the opposite of `Widen`. 
+ +```cs +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) +``` + +In contrary to [Sse2.PackUnsignedSaturate](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.x86.sse2.packunsignedsaturate) and [AdvSimd.Arm64.UnzipEven](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.arm.advsimd.arm64.unzipeven), `Narrow` applies a mask via AND to cut anything above the max value of returned vector: + + +```cs +Vector256 ushortVector = Vector256.Create((ushort)300); +Console.WriteLine(ushortVector); +unchecked { Console.WriteLine((byte)300); } +Console.WriteLine(300 & byte.MaxValue); +Console.WriteLine(Vector128.Narrow(ushortVector.GetLower(), ushortVector.GetUpper())); + +if (Sse2.IsSupported) +{ + Console.WriteLine(Sse2.PackUnsignedSaturate(ushortVector.GetLower().AsInt16(), ushortVector.GetUpper().AsInt16())); +} +``` + +```log +<300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300> +44 +44 +<44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44> +<255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255> +``` + +### Shuffle + +`Shuffle` creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes if the input vector). + +```cs +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +public static Vector128 Shuffle(Vector128 vector, Vector128 indices) +``` + +It can be used for many things, including reversing the input: + +```cs +Vector128 intVector = Vector128.Create(100, 200, 300, 400); +Console.WriteLine(intVector); +Console.WriteLine(Vector128.Shuffle(intVector, Vector128.Create(3, 2, 1, 0))); +``` + +```log +<100, 200, 300, 400> +<400, 300, 200, 100> +``` + +#### Vector256.Shuffle vs Avx2.Shuffle + +`Vector256.Shuffle` and `Avx2.Shuffle` are not identical. + +`Avx2.Shuffle` is effectively `2x128-bit ops` and so if we do `Vector256.Shuffle(value, Vector256.Create(0L, 1L, 0L, 1L))` it is going to think we want `value[0], value[1], value[0], value[1]`. Where-as `Avx2.Shuffle` treats this as `value[0], value[1], value[2], value[3]`. + +While `Vector256.Shuffle` treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where `AVX-512` and `SVE` all operate on "full width". + +## Summary + +The main goal of the new `Vector128` and `Vector256` APIs is to make writing fast, vectorized code possible without becoming familiar with hardware-specific instructions and becoming an assembly language expert. 
Our recommendations depend on your current expertise level, software you maintain and the one you need to create: + +- If you are already an expert and you have vectorized your code for both `x64/x86` and `arm64/arm` code you can use the new APIs to simplify your code, but you most likely won't observe any performance gains. [#64451](https://github.com/dotnet/runtime/issues/64451) lists the places where it was/can be done in dotnet/runtime. You can use links to the merged PRs to see real-life examples. +- If you have already vectorized your code, but only for `x64/x86` or `arm64/arm`, you can use the new APIs to have a single, cross-platform implementation. +- If you have already vectorized your code with `Vector` you can use the new APIs to check if they can produce better codegen. +- If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new `Vector128` and `Vector256` APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths. + +### Best practices + +1. Implement tests that cover all code paths, including Acces Violation. +2. Run tests for all hardware acceleration scenarios, use the existing env vars to do that. +3. Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users. +4. Prefer managed references over unsafe pointers to avoid pinning and safety issues. +5. Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffers correctly. +6. Prefer `LoadUnsafe(ref T, nuint elementOffset)` and `StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset)` over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow! +7. Always handle the vectorized loop remainder. +8. When storing values in memory, be aware of a potential buffer overlap. +9. When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs. Over time, you may gain enough experience to skip the scalar step. +10. Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. It's also possible to perform equality checks, various bit and math operations. Don't try to memorize all the details, treat these APIs as a cookbook that you come back to when needed. 
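As a closing illustration, the pieces discussed above can be put together into the `IsAscii` example from the [Mindset](#mindset) section. This is only a sketch that follows the conventions used throughout this document (little endian, `Vector128` path only, vectorized remainder handling); it is not the tuned implementation shipped in the libraries:

```cs
static bool IsAscii(ReadOnlySpan<byte> buffer)
{
    if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<byte>.Count)
    {
        ref byte searchSpace = ref MemoryMarshal.GetReference(buffer);
        nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<byte>.Count);

        // Vectorized loop: check a full vector of bytes at a time via their most significant bits.
        nuint elementOffset = 0;
        for (; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128<byte>.Count)
        {
            if (Vector128.LoadUnsafe(ref searchSpace, elementOffset).ExtractMostSignificantBits() != 0)
            {
                return false;
            }
        }

        // Vectorized remainder handling: re-check the last full vector of the buffer.
        if (buffer.Length % Vector128<byte>.Count != 0 &&
            Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd).ExtractMostSignificantBits() != 0)
        {
            return false;
        }

        return true;
    }

    // Scalar path: non-accelerated hardware and inputs smaller than a single vector.
    foreach (byte value in buffer)
    {
        if ((value & 0b1000_0000) != 0)
        {
            return false;
        }
    }

    return true;
}
```

In production code, a `Vector256` branch like the one shown in the [Code structure](#code-structure) section would typically be placed in front of the `Vector128` path.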
From 5e89ed811d8710ac40701385982df23ac253fdf3 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 30 Mar 2023 14:21:09 +0200 Subject: [PATCH 02/14] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Günther Foidl --- .../vectorization-guidelines.md | 42 +++++++++---------- 1 file changed, 20 insertions(+), 22 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 394b91af3ba0c5..cc7d5b93eba8ab 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -38,7 +38,7 @@ Vectorization is an art of converting an algorithm from operating on a single va In the recent releases, .NET has introduced plenty of APIs for vectorization. Vast majority of them were hardware specific. It required the users to provide implementation per processor architecture (x64 and/or arm64), with a possibility to use the most optimal instructions for hardware that is executing the code. -.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` that aim for writing hardware-agnostic vectorized code. The purpose of this document is to introduce the readers to the new APIs and provide a set of best practices. +.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` that aim for writing hardware-agnostic, and cross platform vectorized code. The purpose of this document is to introduce the readers to the new APIs and provide a set of best practices. ## Code structure @@ -64,9 +64,9 @@ Each `Vector128` operation allows to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u) ----------------------------------------------------------------- ``` -`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, we should prefer it over a `Vector128`. To check the acceleration, we need to use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough we should prefer it over a `Vector128`. To check the acceleration, we need to use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. -We also must account for the size of the input. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector128.Count` return the size of a vector of given type in bytes. +We also must account for the size of the input. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the size of a vector of given type in bytes. Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. It's not true for pre-compiled code (NativeAOT). That is why the code is very often structured like this: @@ -113,15 +113,15 @@ Such a code structure requires us to **test all possible code paths**: It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration from unit test level. It can be controlled with environment variables before .NET process is started: -* When `COMPlus_EnableAVX2` is set to `0`, `Vector256.IsHardwareAccelerated` returns `false`. -* When `COMPlus_EnableAVX` is set to `0`, `Vector128.IsHardwareAccelerated` returns `false`. 
-* When `COMPlus_EnableHWIntrinsic` is set to `0`, not only both mentioned APIs return `false`, but also `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. +* When `DOTNET_EnableAVX2` is set to `0`, `Vector256.IsHardwareAccelerated` returns `false`. +* When `DOTNET_EnableAVX` is set to `0`, `Vector128.IsHardwareAccelerated` returns `false`. +* When `DOTNET_EnableHWIntrinsic` is set to `0`, not only both mentioned APIs return `false`, but also `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. Assuming that we run the tests on an `x64` machine that supports `Vector256` we need to write tests that cover all size scenarios and run them with: * no custom settings -* `COMPlus_EnableAVX2=0` -* `COMPlus_EnableAVX=0` (it can be skipped if `Vector64` and `Vector` are not involved) -* `COMPlus_EnableHWIntrinsic=0` +* `DOTNET_EnableAVX2=0` +* `DOTNET_EnableAVX=0` (it can be skipped if `Vector64` and `Vector` are not involved) +* `DOTNET_EnableHWIntrinsic=0` ### Benchmarking @@ -143,13 +143,13 @@ static void Main(string[] args) .HideColumns(Column.EnvironmentVariables, Column.RatioSD, Column.Error) .AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig (exportGithubMarkdown: true, printInstructionAddresses: false))) - .AddJob(enough.WithEnvironmentVariable("COMPlus_EnableHWIntrinsic", "0").WithId("Scalar").AsBaseline()); + .AddJob(enough.WithEnvironmentVariable("DOTNET_EnableHWIntrinsic", "0").WithId("Scalar").AsBaseline()); if (Vector256.IsHardwareAccelerated) { config = config .AddJob(enough.WithId("Vector256")) - .AddJob(enough.WithEnvironmentVariable("COMPlus_EnableAVX2", "0").WithId("Vector128")); + .AddJob(enough.WithEnvironmentVariable("DOTNET_EnableAVX2", "0").WithId("Vector128")); } else if (Vector128.IsHardwareAccelerated) @@ -287,7 +287,7 @@ int Sum(Span buffer) } ``` -**Note:** Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffer scenarios. If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. It can be used for pinning but must never be dereferenced. +**Note:** Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. It can be used for pinning but must never be dereferenced. **Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! @@ -321,7 +321,7 @@ bool Contains(Span buffer, int searched) } // If any elements remain, process the last vector in the search space. - if ((uint)buffer.Length % Vector128.Count != 0) + if (buffer.Length % Vector128.Count != 0) { loaded = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd); if (Vector128.Equals(loaded, values) != Vector128.Zero) @@ -338,7 +338,7 @@ bool Contains(Span buffer, int searched) `Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector whose elements are all-bits-set or zero, depending on if the provided elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. 
-### AV testing +### Access violation (AV) testing Handling the remainder in an invalid way, may lead to non-deterministic and hard to diagnose issues. @@ -422,7 +422,7 @@ int ManagedReferencesSum(int[] buffer) { ref int current = ref MemoryMarshal.GetArrayDataReference(buffer); ref int end = ref Unsafe.Add(ref current, buffer.Length); - ref int oneVectorAwayFromEnd = ref Unsafe.Add(ref end, -Vector128.Count); + ref int oneVectorAwayFromEnd = ref Unsafe.Subtract(ref end, Vector128.Count); Vector128 sum = Vector128.Zero; @@ -577,7 +577,7 @@ Before we start working on the implementation, let's list all edge cases for our Once we know all edge cases, we need to understand our problem and find a scalar solution. -ASCII characters are values in the range from `0` to `127` (inclusive). It means that we can find invalid ASCII bytes by just searching for values that are larger than `127`. If we treat `byte` (unsigned) as `sbyte` (signed), it's a matter of performing "is less than zero" check. +ASCII characters are values in the range from `0` to `127` (inclusive). It means that we can find invalid ASCII bytes by just searching for values that are larger than `127`. If we treat `byte` (unsigned, range from 0 to 255) as `sbyte` (signed, range from -128 to 127), it's a matter of performing "is less than zero" check. The binary representation of 0-127 range is following: @@ -746,7 +746,7 @@ Console.WriteLine(equals); <0, 0, -1, 0> ``` -`-1` is just `FFFFFFFF` (all-bits-set). We could use `GetElement` to get the first non-zero element. +`-1` is just `0xFFFFFFFF` (all-bits-set). We could use `GetElement` to get the first non-zero element. ```cs public static T GetElement(this Vector128 vector, int index) where T : struct @@ -1033,9 +1033,7 @@ Console.WriteLine(Vector128.Shuffle(intVector, Vector128.Create(3, 2, 1, 0))); `Vector256.Shuffle` and `Avx2.Shuffle` are not identical. -`Avx2.Shuffle` is effectively `2x128-bit ops` and so if we do `Vector256.Shuffle(value, Vector256.Create(0L, 1L, 0L, 1L))` it is going to think we want `value[0], value[1], value[0], value[1]`. Where-as `Avx2.Shuffle` treats this as `value[0], value[1], value[2], value[3]`. - -While `Vector256.Shuffle` treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where `AVX-512` and `SVE` all operate on "full width". +`Avx2.Shuffle` is effectively `2x128-bit ops` while `Vector256.Shuffle` treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where `AVX-512` and `SVE` all operate on "full width". ## Summary @@ -1048,8 +1046,8 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas ### Best practices -1. Implement tests that cover all code paths, including Acces Violation. -2. Run tests for all hardware acceleration scenarios, use the existing env vars to do that. +1. Implement tests that cover all code paths, including Acces Violations. +2. Run tests for all hardware acceleration scenarios, use the existing environment variables to do that. 3. Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users. 4. Prefer managed references over unsafe pointers to avoid pinning and safety issues. 5. 
Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffers correctly. From fe4aacac0195436a44c5ee8b02e583c142bc631b Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 30 Mar 2023 22:17:47 +0200 Subject: [PATCH 03/14] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Dan Moseley Co-authored-by: Günther Foidl --- .../vectorization-guidelines.md | 36 +++++++++---------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index cc7d5b93eba8ab..32e6dc9ced2415 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -9,7 +9,7 @@ * [Loops](#loops) + [Scalar remainder handling](#scalar-remainder-handling) + [Vectorized remainder handling](#vectorized-remainder-handling) - + [AV testing](#av-testing) + + [Access violation testing](#access-violation-av-testing) * [Loading and storing vectors](#loading-and-storing-vectors) + [Loading](#loading) + [Storing](#storing) @@ -34,11 +34,11 @@ TL;DR: Go to [Summary](#summary) # Introduction to vectorization with Vector128 and Vector256 -Vectorization is an art of converting an algorithm from operating on a single value at a time to operating on a set of values (vector). It can greatly improve performance at a cost of increased code complexity. +Vectorization is the art of converting an algorithm from operating on a single value at a time to operating on a set of values (vector). It can greatly improve performance at a cost of increased code complexity. -In the recent releases, .NET has introduced plenty of APIs for vectorization. Vast majority of them were hardware specific. It required the users to provide implementation per processor architecture (x64 and/or arm64), with a possibility to use the most optimal instructions for hardware that is executing the code. +In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (x64 and/or arm64), with the option of using the most optimal instructions for hardware that is executing the code. -.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` that aim for writing hardware-agnostic, and cross platform vectorized code. The purpose of this document is to introduce the readers to the new APIs and provide a set of best practices. +.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code. The purpose of this document is to introduce you to the new APIs and provide a set of best practices. ## Code structure @@ -50,7 +50,7 @@ In the recent releases, .NET has introduced plenty of APIs for vectorization. Va * `long`, `ulong` and `double` (64 bits). * `nint` and `unit` (32 or 64 bits, depending on the architecture) -Each `Vector128` operation allows to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats and 2 (u)longs/double(s). +A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats, or 2 (u)longs/double(s). 
``` ------------------------------128-bits--------------------------- @@ -64,10 +64,10 @@ Each `Vector128` operation allows to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u) ----------------------------------------------------------------- ``` -`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough we should prefer it over a `Vector128`. To check the acceleration, we need to use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough, you should use it instead of `Vector128`. To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. -We also must account for the size of the input. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the size of a vector of given type in bytes. -Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. It's not true for pre-compiled code (NativeAOT). +The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the size of a vector of given type in bytes. +Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. In case of pre-compiled code (NativeAOT) it's not true for `IsHardwareAccelerated` property, as the required information is not available at compile time. That is why the code is very often structured like this: @@ -190,7 +190,7 @@ public unsafe class Benchmarks public void Setup() { _pointer = NativeMemory.AlignedAlloc(byteCount: Size * sizeof(int), alignment: 32); - new Span(_pointer, (int)Size).Fill(0); // ensure it's all zeros, so 1 is never found + NativeMemory.Clear(_pointer, byteCount: Size * sizeof(int)); // ensure it's all zeros, so 1 is never found } [Benchmark] @@ -235,11 +235,11 @@ The alternative is to enable memory randomization. Before every iteration, the h You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587), it requires understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) book has two chapters dedicated to statistics and can help you get a very good understanding of this subject. -No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, you might not benefit from it, or even regress the performance. +No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, performance might even get worse. ## Loops -To work with inputs that are bigger than a single vector, we typically need to loop over the entire input. This should be split into two parts: +To work with inputs that are bigger than a single vector, you typically need to loop over the entire input. 
This should be split into two parts: * vectorized loop that operates on multiple values at a time * handling of the remainder @@ -248,7 +248,7 @@ Example: our input is a buffer of ten integers, assuming that `Vector128` is acc ### Scalar remainder handling -Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128/32) integers in one iteration. In the second loop, we handle the remaining values. +Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128 bits / 32 bits) integers in one iteration. In the second loop, we handle the remaining values. ```cs @@ -287,7 +287,7 @@ int Sum(Span buffer) } ``` -**Note:** Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. It can be used for pinning but must never be dereferenced. +**Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never dereference it. **Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! @@ -340,7 +340,7 @@ bool Contains(Span buffer, int searched) ### Access violation (AV) testing -Handling the remainder in an invalid way, may lead to non-deterministic and hard to diagnose issues. +Handling the remainder in an invalid way may lead to non-deterministic and hard to diagnose issues. Let's look at the following code: @@ -354,9 +354,9 @@ while (elementOffset < (nuint)buffer.Length) } ``` -How many time the loop is going to execute for a buffer of six integers? Twice! The first time it's going to load the first four elements, the second time it's going to load the two last elements and turn random memory that is following the buffer into next two elements! +How many times will the loop execute for a buffer of six integers? Twice! The first time it will load the first four elements, but the second time it will load the random content of the memory following the buffer! -Writing tests that detect such issues is hard, but not impossible. .NET Team uses a helper utility called [BoundedMemory](https://github.com/dotnet/runtime/blob/main/src/libraries/Common/tests/TestUtilities/System/Buffers/BoundedMemory.Creation.cs) that allocates memory region which is immediately preceded by or immediately followed by a poison (`MEM_NOACCESS`) page. Attempting to read the memory immediately before or after it results in `AccessViolationException`. +Writing tests that detect that issue is hard, but not impossible. 
The .NET Team uses a helper utility called [BoundedMemory](https://github.com/dotnet/runtime/blob/main/src/libraries/Common/tests/TestUtilities/System/Buffers/BoundedMemory.Creation.cs) that allocates a memory region which is immediately preceded by or immediately followed by a poison (`MEM_NOACCESS`) page. Attempting to read the memory immediately before or after it results in `AccessViolationException`. ## Loading and storing vectors @@ -375,7 +375,7 @@ public static class Vector128 } ``` -The first three overloads require a pointer to the source. To be able to use a pointer in a safe way, the buffer needs to be pinned first (the GC is not tracking unmanaged pointers, we have to ensure that the memory does not get moved by GC in the meantime, as the pointers would silently become invalid). That is simple, the problem is doing the pointer arithmetic right: +The first three overloads require a pointer to the source. To be able to use a pointer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right: ```cs unsafe int UnmanagedPointersSum(Span buffer) @@ -409,7 +409,7 @@ unsafe int UnmanagedPointersSum(Span buffer) } ``` -The `LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. +`LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. Currently .NET exposes only one API fo allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. From 1fa325fe565f15dc50c7d95cb2505f31b8f26603 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Fri, 31 Mar 2023 15:01:43 +0200 Subject: [PATCH 04/14] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Rob Hague Co-authored-by: Günther Foidl --- .../vectorization-guidelines.md | 20 +++++++++---------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 32e6dc9ced2415..32839aacd65614 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -66,8 +66,8 @@ A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)short `Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough, you should use it instead of `Vector128`. To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. -The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the size of a vector of given type in bytes. -Both APIs are turned into constants (no method call is required to retrieve the information) by the Just-In-Time compiler. 
In case of pre-compiled code (NativeAOT) it's not true for `IsHardwareAccelerated` property, as the required information is not available at compile time. +The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. +Both APIs are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. That is why the code is very often structured like this: @@ -111,16 +111,14 @@ Such a code structure requires us to **test all possible code paths**: * The input is too small to benefit from any kind of vectorization. * Neither `Vector128` or `Vector256` are accelerated. -It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration from unit test level. It can be controlled with environment variables before .NET process is started: +It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration at the unit test level. It can be controlled with environment variables before .NET process is started: * When `DOTNET_EnableAVX2` is set to `0`, `Vector256.IsHardwareAccelerated` returns `false`. -* When `DOTNET_EnableAVX` is set to `0`, `Vector128.IsHardwareAccelerated` returns `false`. -* When `DOTNET_EnableHWIntrinsic` is set to `0`, not only both mentioned APIs return `false`, but also `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. +* When `DOTNET_EnableHWIntrinsic` is set to `0`, not only do both mentioned APIs return `false`, but so also do `Vector64.IsHardwareAccelerated` and `Vector.IsHardwareAccelerated`. -Assuming that we run the tests on an `x64` machine that supports `Vector256` we need to write tests that cover all size scenarios and run them with: +Assuming that we run the tests on an `x64` machine that supports `Vector256`, we need to write tests that cover all size scenarios and run them with: * no custom settings * `DOTNET_EnableAVX2=0` -* `DOTNET_EnableAVX=0` (it can be skipped if `Vector64` and `Vector` are not involved) * `DOTNET_EnableHWIntrinsic=0` ### Benchmarking @@ -166,12 +164,12 @@ static void Main(string[] args) #### Memory alignment -BenchmarkDotNet does a lot of heavy lifting for the end users, but it can not protect us from the random memory alignment which can be different per each benchmark run and affect the stability of the benchmarks. +BenchmarkDotNet does a lot of heavy lifting for the end users, but it cannot protect us from the random memory alignment which can be different per each benchmark run and can affect the stability of the benchmarks. We have three possibilities: * We can enforce the alignment ourselves and have very stable results. -* We can ask the harness to try to randomize the memory and observe entire possible distribution with each run. +* We can ask the harness to try to randomize the memory and observe the entire possible distribution with each run. * We can do nothing and wonder why the results vary from time to time. 
##### Enforcing memory alignment @@ -233,7 +231,7 @@ Explaining benchmark design guidelines is outside of the scope of this document, The alternative is to enable memory randomization. Before every iteration, the harness is going to allocate random-size objects, keep them alive and re-run the setup that should allocate the actual memory. -You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587), it requires understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) book has two chapters dedicated to statistics and can help you get a very good understanding of this subject. +You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587). It requires an understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) has two chapters dedicated to statistics and can help you get a very good understanding of the subject. No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, performance might even get worse. @@ -295,7 +293,7 @@ int Sum(Span buffer) ### Vectorized remainder handling -Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. +Now imagine that we need to check whether the given buffer contains a specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration, we need to handle the remaining two, but it's less than `Vector128` size, so we handle last four elements. Which means that two values in the middle get checked twice. From e7ab25095c1af29f3624496f21aaf7c2c4f07ba4 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Fri, 31 Mar 2023 15:03:43 +0200 Subject: [PATCH 05/14] Apply suggestions from code review Co-authored-by: Rob Hague --- .../vectorization-guidelines.md | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 32839aacd65614..afcd860d7d9edb 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -223,7 +223,7 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co | Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | ``` -The results should be very stable (flat distributions), but on the other hand we are measuring the performance of best case scenario (the input is large and it's entire content is searched for, as the value is never found). +The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large and its entire contents are searched through, as the value is never found). 
Explaining benchmark design guidelines is outside of the scope of this document, but we have a [dedicated document](https://github.com/dotnet/performance/blob/main/docs/microbenchmark-design-guidelines.md#benchmarks-are-not-unit-tests) about it. To make a long story short, **you should benchmark all scenarios that are realistic for your production environment**, so your customers can actually benefit from your improvements. @@ -295,7 +295,7 @@ int Sum(Span buffer) Now imagine that we need to check whether the given buffer contains a specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. -Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration, we need to handle the remaining two, but it's less than `Vector128` size, so we handle last four elements. Which means that two values in the middle get checked twice. +Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration we need to handle the remaining two elements. Since the remainder is smaller than one `Vector128` and we are not mutating the input, we perform a vectorized operation on a `Vector128` containing the last four elements. ```cs bool Contains(Span buffer, int searched) @@ -332,7 +332,7 @@ bool Contains(Span buffer, int searched) } ``` -`Vector128.Create(value)` creates a new vector with all elements initialized to the specified value. So `Vector128.Zero` is an equivalent of `Vector128.Create(0)`. +`Vector128.Create(value)` creates a new vector with all elements initialized to the specified value. So `Vector128.Zero` is equivalent to `Vector128.Create(0)`. `Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector whose elements are all-bits-set or zero, depending on if the provided elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. @@ -651,7 +651,7 @@ Even such a simple problem can be solved in at least 5 different ways. Using sop ## Toolchain -`Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, it's important to remember what kind of operations they allow for and check the details when needed. +`Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, but it's important to remember what kind of operations they allow for and check the details when needed. ### Creation @@ -676,7 +676,7 @@ We also have an overload that allows for specifying every value in given vector: public static Vector128 Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7) ``` -And last, but not least a `Create` overload that accepts a buffer. 
It creates a vector with its elements set to the first `VectorXYZ.Count`-many elements of the buffer. It's not recommended to use it in a loop, where `Load` methods should be used instead (performance). +And last but not least we have a `Create` overload which accepts a buffer. It creates a vector with its elements set to the first `VectorXYZ.Count` elements of the buffer. It's not recommended to use it in a loop, where `Load` methods should be used instead (for performance). ```cs public static Vector128 Create(ReadOnlySpan values) where T : struct @@ -725,13 +725,13 @@ public static bool EqualsAll(Vector128 left, Vector128 right) where T : public static bool EqualsAny(Vector128 left, Vector128 right) where T : struct ``` -`Equals` compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on if the corresponding elements in `left` and `right` arguments were equal. +`Equals` compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in the `left` and `right` arguments were equal. ```cs public static Vector128 Equals(Vector128 left, Vector128 right) where T : struct ``` -How to calculate the index of first match? Let's take a closer look at the result of following equality check: +How do we calculate the index of the first match? Let's take a closer look at the result of following equality check: ```cs Vector128 left = Vector128.Create(1, 2, 3, 4); @@ -750,7 +750,7 @@ Console.WriteLine(equals); public static T GetElement(this Vector128 vector, int index) where T : struct ``` -But it would not be an optimal solution. We should rather extract the most significant bits: +But it would not be an optimal solution. We should instead extract the most significant bits: ```cs uint mostSignificantBits = equals.ExtractMostSignificantBits(); @@ -761,11 +761,11 @@ Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0')); 00000000000000000000000000000100 ``` -and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) to get trailing zero count. +and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) to get the trailing zero count. -To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, and indexed from 0). +To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, indexed from 0). -If we were working with a buffer loaded from memory (example: searching for the last index of given character in a buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. +If we were working with a buffer loaded from memory (example: searching for the last index of a given character in the buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. 
```cs int ComputeLastIndex(nint elementOffset, Vector128 equals) where T : struct @@ -794,7 +794,7 @@ unsafe int ComputeFirstIndex(ref T searchSpace, ref T current, Vector128 e ### Comparison -Beside equality checks, vector APIs allow for comparison. The `bool` returning overload return `true` when given condition is true: +Beside equality checks, vector APIs allow for comparison. The `bool`-returning overloads return `true` when the given condition is true: ```cs public static bool GreaterThanAll(Vector128 left, Vector128 right) where T : struct @@ -807,7 +807,7 @@ public static bool LessThanOrEqualAll(Vector128 left, Vector128 right) public static bool LessThanOrEqualAny(Vector128 left, Vector128 right) where T : struct ``` -Similarly to `Equals`, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on if the corresponding elements in `left` and `right` meet given condition. +Similarly to `Equals`, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in `left` and `right` meet the given condition. ```cs public static Vector128 GreaterThan(Vector128 left, Vector128 right) where T : struct @@ -862,7 +862,7 @@ public static T Sum(Vector128 vector) where T : struct ### Conversion -Vector types provide a set of methods dedicated to numbers conversion: +Vector types provide a set of methods dedicated to number conversions: ```cs public static unsafe Vector128 ConvertToDouble(Vector128 vector) @@ -1003,7 +1003,7 @@ if (Sse2.IsSupported) ### Shuffle -`Shuffle` creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes if the input vector). +`Shuffle` creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes of the input vector). ```cs public static Vector128 Shuffle(Vector128 vector, Vector128 indices) @@ -1044,7 +1044,7 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas ### Best practices -1. Implement tests that cover all code paths, including Acces Violations. +1. Implement tests that cover all code paths, including Access Violations. 2. Run tests for all hardware acceleration scenarios, use the existing environment variables to do that. 3. Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users. 4. Prefer managed references over unsafe pointers to avoid pinning and safety issues. From 16c1819fc9be0b61ff7fd33f28f16b8dd9738dc3 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Mon, 3 Apr 2023 10:05:32 +0200 Subject: [PATCH 06/14] Apply suggestions from code review Co-authored-by: Rob Hague --- docs/coding-guidelines/vectorization-guidelines.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index afcd860d7d9edb..520b465fdc071c 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -701,7 +701,7 @@ public static Vector128 Xor(Vector128 left, Vector128 right) => left public static Vector128 Negate(Vector128 vector) => ~vector; ``` -`AndNot` computes the bitwise-and of a given vector and the ones complement of another vector. +`AndNot` computes the bitwise-and of a given vector and the ones' complement of another vector. 
```cs public static Vector128 AndNot(Vector128 left, Vector128 right) => left & ~right; From ebc7da6e9ce6e7965dcc9d56cd29c81caf7ba296 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Wed, 5 Apr 2023 11:09:10 +0200 Subject: [PATCH 07/14] Apply suggestions from code review Co-authored-by: Tanner Gooding Co-authored-by: Stephen Toub --- docs/coding-guidelines/vectorization-guidelines.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 520b465fdc071c..42623ebe5c0556 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -34,9 +34,9 @@ TL;DR: Go to [Summary](#summary) # Introduction to vectorization with Vector128 and Vector256 -Vectorization is the art of converting an algorithm from operating on a single value at a time to operating on a set of values (vector). It can greatly improve performance at a cost of increased code complexity. +Vectorization is the art of converting an algorithm from operating on a single value per iteration to operating on a set of values (vector) per iteration. It can greatly improve performance at a cost of increased code complexity. -In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (x64 and/or arm64), with the option of using the most optimal instructions for hardware that is executing the code. +In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (such as x86, x64, Arm64, WASM, or other platforms), with the option of using the most optimal instructions for hardware that is executing the code. .NET 7 introduced a set of new APIs for `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code. The purpose of this document is to introduce you to the new APIs and provide a set of best practices. @@ -67,7 +67,7 @@ A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)short `Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough, you should use it instead of `Vector128`. To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. -Both APIs are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. +Both `Count` and `IsHardwareAccelerated` are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. 
That is why the code is very often structured like this: @@ -346,7 +346,7 @@ Let's look at the following code: nuint elementOffset = 0; while (elementOffset < (nuint)buffer.Length) { - loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); + loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); // BUG! elementOffset += (nuint)Vector128.Count; } From f679f763fd2a42dac323869f17bafe80ff7abff5 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Wed, 5 Apr 2023 11:28:03 +0200 Subject: [PATCH 08/14] address code review feedback from @tannergooding and @stephentoub --- .../vectorization-guidelines.md | 65 ++++++++++++++----- 1 file changed, 50 insertions(+), 15 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 42623ebe5c0556..72efccf6ca0b2a 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -38,17 +38,21 @@ Vectorization is the art of converting an algorithm from operating on a single v In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (such as x86, x64, Arm64, WASM, or other platforms), with the option of using the most optimal instructions for hardware that is executing the code. -.NET 7 introduced a set of new APIs for `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code. The purpose of this document is to introduce you to the new APIs and provide a set of best practices. +.NET 7 introduced a set of new APIs for `Vector64`, `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code (`Vector512` is being introduced in .NET 8). The purpose of this document is to introduce you to the new APIs and provide a set of best practices. ## Code structure -`Vector128` represents a 128-bit vector of type `T`. `T` is constrained to specific primitive types: +`Vector128` is the "common denominator" across all platforms that support vectorization (and this is expected to always be the case). It represents a 128-bit vector of type `T`. + +`T` is constrained to specific primitive types: * `byte` and `sbyte` (8 bits). * `short` and `ushort` (16 bits). * `int`, `uint` and `float` (32 bits). * `long`, `ulong` and `double` (64 bits). -* `nint` and `unit` (32 or 64 bits, depending on the architecture) +* `nint` and `unit` (32 or 64 bits, depending on the architecture, available in .NET 7+) + +.NET 8 is introducing a `Vector128.IsSupported` that helps identify whether a given `T` will throw or not to help identify what works per runtime, including from generic contexts. A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats, or 2 (u)longs/double(s). @@ -64,11 +68,15 @@ A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)short ----------------------------------------------------------------- ``` -`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough, you should use it instead of `Vector128`. To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough and the benchmarks prove that it offers better performance, you should use it instead of `Vector128`. 
Namely, `Vector256` on x86/x64 is mostly treated as `2x Vector128` and while there are some operations that can "cross lanes", they can sometimes be more expensive or have other hidden costs. + +To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. -The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path. `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. +The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path (there are some advanced tricks that can allow you to operate on smaller inputs, but we won't describe them here). `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. Both `Count` and `IsHardwareAccelerated` are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. +**Note:** When `Vector256` is accelerated then `Vector128` and `Vector64` are also accelerated. + That is why the code is very often structured like this: ```cs @@ -89,6 +97,26 @@ void CodeStructure(ReadOnlySpan buffer) } ``` +To reduce the number of comparisons for small inputs, we can re-arrange it in the following way: + +```cs +void OptimalCodeStructure(ReadOnlySpan buffer) +{ + if (!Vector128.IsHardwareAccelerated || buffer.Length < Vector128.Count) + { + // scalar code path + } + else if (!Vector256.IsHardwareAccelerated || buffer.Length < Vector256.Count) + { + // Vector128 code path + } + else + { + // Vector256 code path + } +} +``` + **Both vector types provide almost identical features**, but arm64 hardware does not support `Vector256` yet, so for the sake of simplicity we will be using `Vector128` in all examples and assuming **little endian** architecture. Which means that all examples used in this document assume that they are being executed as part of the following `if` block: ```cs @@ -104,7 +132,7 @@ Such a code structure requires us to **test all possible code paths**: * `Vector256` is accelerated: * The input is large enough to benefit from vectorization with `Vector256`. - * The input is not large enough to benefit from vectorization with `Vector256`, but it can benefit from vectorization with `Vector128` (when `Vector256` is accelerated then `Vector128` and smaller vectors are also). + * The input is not large enough to benefit from vectorization with `Vector256`, but it can benefit from vectorization with `Vector128`. * The input is too small to benefit from any kind of vectorization. * `Vector128` is accelerated * The input is large enough to benefit from vectorization with `Vector128`. @@ -121,6 +149,8 @@ Assuming that we run the tests on an `x64` machine that supports `Vector256`, we * `DOTNET_EnableAVX2=0` * `DOTNET_EnableHWIntrinsic=0` +The alternative is running tests on enough variation of hardware to cover all the paths. + ### Benchmarking All that complexity needs to pay off. We need to **benchmark the code to verify that the investment is beneficial**. We can do that with [BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet). 
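For a first measurement, a minimal benchmark can be as simple as the sketch below. It is only a sketch: the class name and the sizes are arbitrary, and the vectorized benchmark mirrors the `Sum` example developed in the Loops section, compared against a plain scalar loop.

```cs
using System;
using System.Linq;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;

public class SumBenchmarks
{
    private int[] _data;

    [Params(64, 16 * 1024)] // arbitrary sizes, both larger than a single Vector128<int>
    public int Size;

    [GlobalSetup]
    public void Setup() => _data = Enumerable.Range(0, Size).ToArray();

    [Benchmark(Baseline = true)]
    public int Scalar()
    {
        int sum = 0;
        foreach (int value in _data)
        {
            sum += value;
        }
        return sum;
    }

    [Benchmark]
    public int Vectorized()
    {
        // Mirrors the Sum example from the Loops section: vectorized loop plus a scalar remainder.
        Span<int> buffer = _data;
        ref int first = ref MemoryMarshal.GetReference(buffer);
        Vector128<int> sums = Vector128<int>.Zero;

        nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<int>.Count);
        nuint offset = 0;
        for (; offset <= oneVectorAwayFromEnd; offset += (nuint)Vector128<int>.Count)
        {
            sums += Vector128.LoadUnsafe(ref first, offset);
        }

        int sum = Vector128.Sum(sums);
        for (; offset < (nuint)buffer.Length; offset++)
        {
            sum += buffer[(int)offset];
        }

        return sum;
    }
}
```

Running it with `BenchmarkRunner.Run<SumBenchmarks>()` (or through a custom config like the one shown earlier) reports each size against the scalar baseline.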
@@ -231,7 +261,7 @@ Explaining benchmark design guidelines is outside of the scope of this document, The alternative is to enable memory randomization. Before every iteration, the harness is going to allocate random-size objects, keep them alive and re-run the setup that should allocate the actual memory. -You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587). It requires an understanding of what distribution is and how to read it. It's also out of scope of this document, but [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) has two chapters dedicated to statistics and can help you get a very good understanding of the subject. +You can read more about it [here](https://github.com/dotnet/BenchmarkDotNet/pull/1587). It requires an understanding of what distribution is and how to read it. It's also out of scope of this document, but a book on statistics, such as [Pro .NET Benchmarking](https://aakinshin.net/prodotnetbenchmarking/) can help you get a very good understanding of the subject. No matter how you are going to benchmark your code, you need to keep in mind that **the larger the input, the more you can benefit from vectorization**. If your code uses small buffers, performance might even get worse. @@ -287,7 +317,7 @@ int Sum(Span buffer) **Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never dereference it. -**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! +**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! To get a `readonly` reference, you need to use [ReadOnlySpan.GetPinnableReference](https://learn.microsoft.com/dotnet/api/system.readonlyspan-1.getpinnablereference). **Note:** Please keep in mind that `Vector128.Sum` is a static method. `Vectior128` and `Vector256` provide both instance and static methods (operators like `+` are just static methods in C#). `Vector128` and `Vector256` are non-generic static classes with static methods only. It's important to know about their existence when searching for methods. @@ -308,7 +338,8 @@ bool Contains(Span buffer, int searched) ref int searchSpace = ref MemoryMarshal.GetReference(buffer); nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); - for (nuint elementOffset = 0; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128.Count) + nuint elementOffset = 0; + for (; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128.Count) { loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); // compare the loaded vector with searched value vector @@ -319,7 +350,7 @@ bool Contains(Span buffer, int searched) } // If any elements remain, process the last vector in the search space. 
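    // At this point elementOffset is the first unprocessed index; loading one full vector that ends
    // at the buffer's end may re-check a few elements, which is harmless for a read-only search.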
- if (buffer.Length % Vector128.Count != 0) + if (elementOffset != (uint)buffer.Length) { loaded = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd); if (Vector128.Equals(loaded, values) != Vector128.Zero) @@ -373,7 +404,7 @@ public static class Vector128 } ``` -The first three overloads require a pointer to the source. To be able to use a pointer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right: +The first three overloads require a pointer to the source. To be able to use a pointer to a managed buffer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right: ```cs unsafe int UnmanagedPointersSum(Span buffer) @@ -418,6 +449,8 @@ The fourth method expects only a managed reference (`ref T source`). We don't ne ```cs int ManagedReferencesSum(int[] buffer) { + Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); + ref int current = ref MemoryMarshal.GetArrayDataReference(buffer); ref int end = ref Unsafe.Add(ref current, buffer.Length); ref int oneVectorAwayFromEnd = ref Unsafe.Subtract(ref end, Vector128.Count); @@ -444,7 +477,7 @@ int ManagedReferencesSum(int[] buffer) } ``` -**Note:** `Unsafe` does not expose a method called "IsGreaterOrEqualThan", so we are using a negation of `Unsafe.IsAddressGreaterThan` to achieve desired effect. +**Note:** `Unsafe` does not expose a method called `IsLessThanOrEqualTo`, so we are using a negation of `Unsafe.IsAddressGreaterThan` to achieve desired effect. **Pointer arithmetic can always go wrong, even if you are an experienced engineer and get a very detailed code review from .NET architects**. In [#73768](https://github.com/dotnet/runtime/pull/73768) a GC hole was introduced. The code looked simple: @@ -479,7 +512,7 @@ while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); Which could return true because `currentSearchSpace` was invalid and not updated. If you are interested in more details, you can check the [issue](https://github.com/dotnet/runtime/issues/75792#issuecomment-1249973858) and the [fix](https://github.com/dotnet/runtime/pull/75857). -That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic!** +That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic. It still requires care as passing an incorrect offset results in a GC hole.** ```cs public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; @@ -520,7 +553,7 @@ public static void StoreUnsafe(this Vector128 source, ref T destination, n ### Casting -As mentioned before, `Vector128` and `Vector256` are constrained to a specific set of primitive types. `char` is not one of them, but it does not mean that we can't implement vectorized text operations with the new APIs. For primitive types of the same size (and value types that don't contain references), casting is the solution. 
+As mentioned before, `Vector128` and `Vector256` are constrained to a specific set of primitive types. Currently, `char` is not one of them, but it does not mean that we can't implement vectorized text operations with the new APIs. For primitive types of the same size (and value types that don't contain references), casting is the solution. [Unsafe.As](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.as#system-runtime-compilerservices-unsafe-as-2(-0@)) can be used to get a reference to supported type: @@ -554,7 +587,7 @@ void PointerToReference(char* pUtf16Buffer, byte* pAsciiBuffer) } ``` -We should avoid doing this in the opposite direction, as most engineers will assume that unmanaged pointers are already pinned. +It's only safe to convert a managed reference to a pointer if it's known that the reference is already pinned. If it's not, the moment after you get the pointer it could be invalid. ## Mindset @@ -653,6 +686,8 @@ Even such a simple problem can be solved in at least 5 different ways. Using sop `Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, but it's important to remember what kind of operations they allow for and check the details when needed. +**Note:** all of these methods have "software fallbacks", which are executed when they cannot be vectorized on given platform. + ### Creation Each of the vector types provides a `Create` method that accepts a single value and returns a vector with all elements initialized to this value. From 06105b791e1308aad7b69bc37af87a2d23500776 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 6 Apr 2023 17:02:04 +0200 Subject: [PATCH 09/14] Apply suggestions from code review Co-authored-by: Jeff Handley Co-authored-by: Stephen Toub --- docs/coding-guidelines/vectorization-guidelines.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 72efccf6ca0b2a..a369deeac055ae 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -38,7 +38,7 @@ Vectorization is the art of converting an algorithm from operating on a single v In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (such as x86, x64, Arm64, WASM, or other platforms), with the option of using the most optimal instructions for hardware that is executing the code. -.NET 7 introduced a set of new APIs for `Vector64`, `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code (`Vector512` is being introduced in .NET 8). The purpose of this document is to introduce you to the new APIs and provide a set of best practices. +.NET 7 introduced a set of new APIs for `Vector64`, `Vector128` and `Vector256` for writing hardware-agnostic, cross platform vectorized code. Similarly, .NET 8 introduced `Vector512`. The purpose of this document is to introduce you to the new APIs and provide a set of best practices. ## Code structure @@ -50,9 +50,9 @@ In recent releases, .NET has introduced many new APIs for vectorization. 
The vas * `short` and `ushort` (16 bits). * `int`, `uint` and `float` (32 bits). * `long`, `ulong` and `double` (64 bits). -* `nint` and `unit` (32 or 64 bits, depending on the architecture, available in .NET 7+) +* `nint` and `nuint` (32 or 64 bits, depending on the architecture, available in .NET 7+) -.NET 8 is introducing a `Vector128.IsSupported` that helps identify whether a given `T` will throw or not to help identify what works per runtime, including from generic contexts. +.NET 8 introduced a `Vector128.IsSupported` that indicates whether a given `T` will throw to help identify what works per runtime, including from generic contexts. A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats, or 2 (u)longs/double(s). @@ -601,7 +601,7 @@ Before we start working on the implementation, let's list all edge cases for our * It does not need to throw any argument exceptions, as `ReadOnlySpan` is `struct` and it can never be `null` or invalid. * It should return `true` for an empty buffer. -* It should detect invalid characters in the entire buffer, including the remainder. +* It should detect invalid characters in the entire buffer, regardless of the buffer's length or whether its length is an even multiple of a vector width. * It should not read any bytes that don't belong to the provided buffer. ### Scalar solution @@ -1012,7 +1012,7 @@ public static unsafe Vector128 Narrow(Vector128 lower, Vector128 Narrow(Vector128 lower, Vector128 upper) ``` -In contrary to [Sse2.PackUnsignedSaturate](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.x86.sse2.packunsignedsaturate) and [AdvSimd.Arm64.UnzipEven](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.arm.advsimd.arm64.unzipeven), `Narrow` applies a mask via AND to cut anything above the max value of returned vector: +In contrast to [Sse2.PackUnsignedSaturate](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.x86.sse2.packunsignedsaturate) and [AdvSimd.Arm64.UnzipEven](https://learn.microsoft.com/dotnet/api/system.runtime.intrinsics.arm.advsimd.arm64.unzipeven), `Narrow` applies a mask via AND to cut anything above the max value of returned vector: ```cs From 647b02f8f5527b51b73d515370218d5606b2fc26 Mon Sep 17 00:00:00 2001 From: Jeff Handley Date: Thu, 6 Apr 2023 18:12:41 -0700 Subject: [PATCH 10/14] Address some of the review feedback --- .../vectorization-guidelines.md | 193 ++++++++++++------ 1 file changed, 125 insertions(+), 68 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index a369deeac055ae..94463579b42d66 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -42,7 +42,7 @@ In recent releases, .NET has introduced many new APIs for vectorization. The vas ## Code structure -`Vector128` is the "common denominator" across all platforms that support vectorization (and this is expected to always be the case). It represents a 128-bit vector of type `T`. +`Vector128` is the "common denominator" across all platforms that support vectorization (and this is expected to always be the case). It represents a 128-bit vector containing elements of type `T`. 
`T` is constrained to specific primitive types: @@ -68,18 +68,73 @@ A single `Vector128` operation allows you to operate on: 16 (s)bytes, 8 (u)short ----------------------------------------------------------------- ``` -`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, and the data is large enough and the benchmarks prove that it offers better performance, you should use it instead of `Vector128`. Namely, `Vector256` on x86/x64 is mostly treated as `2x Vector128` and while there are some operations that can "cross lanes", they can sometimes be more expensive or have other hidden costs. +`Vector256` is twice as big as `Vector128`, so when it is hardware accelerated, the data is large enough, and the benchmarks prove that it offers better performance, you should consider using it instead of `Vector128`. Benchmarking your code can be important as not all platforms treat larger vectors the same. -To check the acceleration, use `Vector128.IsHardwareAccelerated` and `Vector256.IsHardwareAccelerated` properties. +For example, `Vector256` on x86/x64 is mostly treated as `2x Vector128` rather than `1x Vector256`, where each `Vector128` is considered a "lane". For most operations, this doesn't present any additional considerations they only operate on individual elements of the vector. However, some operations could "cross lanes" such as shuffling or pairwise operations and that may require additional overhead to handle. -The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path (there are some advanced tricks that can allow you to operate on smaller inputs, but we won't describe them here). `Vector128.Count` and `Vector256.Count` return the number of elements of the given type T in a single vector. -Both `Count` and `IsHardwareAccelerated` are turned into constants by the Just-In-Time compiler (i.e. no method call is required to retrieve the information). In the case of pre-compiled code (NativeAOT), this is not true for the `IsHardwareAccelerated` property, as the required information is not available at compile time. +As an example, consider `Add(Vector128 lhs, Vector128 rhs)` where you end up effectively doing (pseudo-code): +```csharp +result[0] = lhs[0] + rhs[0]; +result[1] = lhs[1] + rhs[1]; +result[2] = lhs[2] + rhs[2]; +result[3] = lhs[3] + rhs[3]; +``` + +With this algorithm it doesn't matter what size vector we have as we're accessing the same index of the input vectors and only one at a time. So regardless of whether we have `Vector128` or `Vector256` or `Vector512`, it all operates the same. + +However, if you then consider `AddPairwise(Vector128 lhs, Vector128 rhs)` (sometimes called `HorizontalAdd`) where you instead end up effectively doing: +```csharp +// process left +result[0] = lhs[0] + lhs[1]; +result[1] = lhs[2] + lhs[3]; +// process right +result[2] = rhs[0] + rhs[1]; +result[3] = rhs[2] + rhs[3]; +``` -**Note:** When `Vector256` is accelerated then `Vector128` and `Vector64` are also accelerated. 
+You may notice that this algorithm would change behavior if expanded up to operate on a single 256-bit vector (note `result[2]` is now `lhs[4] + lhs[6]` and not `rhs[0] + rhs[1]`): +```csharp +// process left +result[0] = lhs[0] + lhs[1]; +result[1] = lhs[2] + lhs[3]; +result[2] = lhs[4] + lhs[5]; +result[3] = lhs[6] + lhs[7]; +// process right +result[4] = rhs[0] + rhs[1]; +result[5] = rhs[2] + rhs[3]; +result[6] = rhs[4] + rhs[5]; +result[7] = rhs[6] + rhs[7]; +``` + +Because this behavior would change, the x86/x64 platform opted to treat the operation as `2x Vector128` inputs giving you instead: +```csharp +// process lower left +result[0] = lhs[0] + lhs[1]; +result[1] = lhs[2] + lhs[3]; +// process lower right +result[2] = rhs[0] + rhs[1]; +result[3] = rhs[2] + rhs[3]; +// process upper left +result[4] = lhs[4] + lhs[5]; +result[5] = lhs[6] + lhs[7]; +// process upper right +result[6] = rhs[4] + rhs[5]; +result[7] = rhs[6] + rhs[7]; +``` -That is why the code is very often structured like this: +This ends up preserving behavior and making it much easier to transition from `128-bit` to `256-bit` or higher as you're effectively just unrolling the loop again. It does, however, mean that some algorithms may need additional handling if you need to truly do anything involving the upper and lower lanes together. The exact additional expense here depends on what is being done, what the underlying hardware supports, and several other factors covered in more detail later. -```cs +### Checking for Hardware Acceleration + +To check if a given vector size is hardware accelerated, use the `IsHardwareAccelerated` property on the relevant non-generic vector class. For example, `Vector128.IsHardwareAccelerated` or `Vector256.IsHardwareAccelerated`. Note that even when a vector size is accelerated, there may still be some operations that are not hardware-accelerated; e.g. floating-point division can be accelerated on some hardware while integer division is not. + +The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path (there are some advanced tricks that can allow you to operate on smaller inputs, but we won't describe them here). The `Count` properties (for example `Vector128.Count` or `Vector256.Count`) return the number of elements of the given type T in a single vector. + +When `Vector256` is accelerated, `Vector128` generally will be as well, but there's no guarantee of that. The best practice is to always check `IsHardwareAccelerated` explicitly. You may be tempted to cache the values from the `IsHardwareAccelerated` and `Count` properties, but this is not needed or recommended. Both `IsHardwareAccelerated` and `Count` are turned into constants by the Just-In-Time compiler and no method call is required to retrieve the information. 
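+
+For example, the element counts can be inspected directly; the values below simply follow from dividing the vector size by the element size:
+
+```csharp
+Console.WriteLine(Vector128<byte>.Count);  // 16
+Console.WriteLine(Vector128<short>.Count); // 8
+Console.WriteLine(Vector128<int>.Count);   // 4
+Console.WriteLine(Vector128<long>.Count);  // 2
+Console.WriteLine(Vector256<int>.Count);   // 8
+```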
+ +### Example Code Structure + +```csharp void CodeStructure(ReadOnlySpan buffer) { if (Vector256.IsHardwareAccelerated && buffer.Length >= Vector256.Count) @@ -99,7 +154,7 @@ void CodeStructure(ReadOnlySpan buffer) To reduce the number of comparisons for small inputs, we can re-arrange it in the following way: -```cs +```csharp void OptimalCodeStructure(ReadOnlySpan buffer) { if (!Vector128.IsHardwareAccelerated || buffer.Length < Vector128.Count) @@ -117,9 +172,11 @@ void OptimalCodeStructure(ReadOnlySpan buffer) } ``` -**Both vector types provide almost identical features**, but arm64 hardware does not support `Vector256` yet, so for the sake of simplicity we will be using `Vector128` in all examples and assuming **little endian** architecture. Which means that all examples used in this document assume that they are being executed as part of the following `if` block: +**Both vector types provide the same functionality**, but arm64 hardware does not support `Vector256`, so for the sake of simplicity we will be using `Vector128` in all examples. All examples shown also assume **little endian** architecture and/or do not need to deal with endianness. `BitConverter.IsLittleEndian` is available (and turned into a constant by the JIT) for algorithms that need to consider endianness. + +With these assumptions, all examples shown in the document assume that they are being executed as part of the following `if` block: -```cs +```csharp else if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count) { // Vector128 code path @@ -159,7 +216,7 @@ All that complexity needs to pay off. We need to **benchmark the code to verify It's possible to define a config that instructs the harness to run the benchmarks for all four scenarios: -```cs +```csharp static void Main(string[] args) { Job enough = Job.Default @@ -200,13 +257,13 @@ We have three possibilities: * We can enforce the alignment ourselves and have very stable results. * We can ask the harness to try to randomize the memory and observe the entire possible distribution with each run. -* We can do nothing and wonder why the results vary from time to time. +* We can do nothing and wonder why the results have additional noise across many runs. ##### Enforcing memory alignment We can allocate aligned unmanaged memory by using the [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). -```cs +```csharp public unsafe class Benchmarks { private void* _pointer; @@ -279,7 +336,7 @@ Example: our input is a buffer of ten integers, assuming that `Vector128` is acc Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128 bits / 32 bits) integers in one iteration. In the second loop, we handle the remaining values. -```cs +```csharp int Sum(Span buffer) { Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); @@ -323,11 +380,11 @@ int Sum(Span buffer) ### Vectorized remainder handling -Now imagine that we need to check whether the given buffer contains a specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. +There are scenarios and advanced techniques that can allow for vectorized remainder handling instead of resorting to the non-vectorized approach illustrated above. 
Some algorithms could use an approach of backtracking to load one more vector's worth of elements and masking off elements that have already been processed. For idempotent algorithms, it is preferable to simply backtrack and process one last vector, repeating the operation for elements as needed. -Example: a buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration we need to handle the remaining two elements. Since the remainder is smaller than one `Vector128` and we are not mutating the input, we perform a vectorized operation on a `Vector128` containing the last four elements. +In the example below, we need to check whether the given buffer contains a specific number; processing values more than once is completely acceptable. The buffer contains six 32-bit integers, `Vector128` is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration we need to handle the remaining two elements. Since the remainder is smaller than one `Vector128` and we are not mutating the input, we perform a vectorized operation on a `Vector128` containing the last four elements. -```cs +```csharp bool Contains(Span buffer, int searched) { Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); @@ -365,7 +422,7 @@ bool Contains(Span buffer, int searched) `Vector128.Create(value)` creates a new vector with all elements initialized to the specified value. So `Vector128.Zero` is equivalent to `Vector128.Create(0)`. -`Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector whose elements are all-bits-set or zero, depending on if the provided elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. +`Vector128.Equals(Vector128 left, Vector128 right)` compares two vectors and returns a vector where each element is either all-bits-set or zero, depending on if the corresponding elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match. ### Access violation (AV) testing @@ -393,7 +450,7 @@ Writing tests that detect that issue is hard, but not impossible. The .NET Team Both `Vector128` and `Vector256` provide at least five ways of loading them from memory: -```cs +```csharp public static class Vector128 { public static Vector128 Load(T* source) where T : unmanaged; @@ -406,7 +463,7 @@ public static class Vector128 The first three overloads require a pointer to the source. To be able to use a pointer to a managed buffer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right: -```cs +```csharp unsafe int UnmanagedPointersSum(Span buffer) { fixed (int* pBuffer = buffer) @@ -438,7 +495,7 @@ unsafe int UnmanagedPointersSum(Span buffer) } ``` -`LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. +`LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. 
Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. "NonTemporal" means that the hardware is allowed (but not required) to bypass the cache. Non-temporal reads provide a speedup when working with very large amounts of data as it avoids repeatedly filling the cache with values that will never be used again. Currently .NET exposes only one API fo allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. @@ -446,7 +503,7 @@ The alternative to creating aligned buffers (we don't always have the control ov The fourth method expects only a managed reference (`ref T source`). We don't need to pin the buffer (GC is tracking managed references and updates them if memory gets moved), but it still requires us to properly handle managed pointer arithmetic: -```cs +```csharp int ManagedReferencesSum(int[] buffer) { Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128.Count); @@ -481,7 +538,7 @@ int ManagedReferencesSum(int[] buffer) **Pointer arithmetic can always go wrong, even if you are an experienced engineer and get a very detailed code review from .NET architects**. In [#73768](https://github.com/dotnet/runtime/pull/73768) a GC hole was introduced. The code looked simple: -```cs +```csharp ref TValue currentSearchSpace = ref Unsafe.Add(ref searchSpace, length - Vector128.Count); do @@ -500,13 +557,13 @@ while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); It was part of `LastIndexOf` implementation, where we were iterating from the end to the beginning of the buffer. In the last iteration of the loop, `currentSearchSpace` could become a pointer to unknown memory that lied before the beginning of the buffer: -```cs +```csharp currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128.Count); ``` And it was fine until GC kicked right after that, moved objects in memory, updated all valid managed references and resumed the execution, which run following condition: -```cs +```csharp while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace)); ``` @@ -514,13 +571,13 @@ Which could return true because `currentSearchSpace` was invalid and not updated That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic. It still requires care as passing an incorrect offset results in a GC hole.** -```cs +```csharp public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; ``` **The only thing we need to keep in mind is potential `nuint` overflow when doing unsigned integer arithmetic.** -```cs +```csharp Span buffer = new int[2] { 1, 2 }; nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128.Count); Console.WriteLine(oneVectorAwayFromEnd); @@ -532,7 +589,7 @@ Can you guess the result? 
For a 64 bit process it's `FFFFFFFFFFFFFFFE` (a hex re Similarly to loading, both `Vector128` and `Vector256` provide at least five ways of storing them in memory: -```cs +```csharp public static class Vector128 { public static void Store(this Vector128 source, T* destination) where T : unmanaged; @@ -545,7 +602,7 @@ public static class Vector128 For the reasons described for loading, we recommend using the overload that takes managed reference and element offset: -```cs +```csharp public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; ``` @@ -557,7 +614,7 @@ As mentioned before, `Vector128` and `Vector256` are constrained to a spec [Unsafe.As](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.as#system-runtime-compilerservices-unsafe-as-2(-0@)) can be used to get a reference to supported type: -```cs +```csharp void CastingReferences(Span buffer) { ref char charSearchSpace = ref MemoryMarshal.GetReference(buffer); @@ -568,7 +625,7 @@ void CastingReferences(Span buffer) Or [MemoryMarshal.Cast](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.memorymarshal.cast#system-runtime-interopservices-memorymarshal-cast-2(system-readonlyspan((-0)))), which casts a span of one primitive type to a span of another primitive type: -```cs +```csharp void CastingSpans(Span chars) { Span shorts = MemoryMarshal.Cast(chars); @@ -577,7 +634,7 @@ void CastingSpans(Span chars) It's also possible to get managed references from unmanaged pointers: -```cs +```csharp void PointerToReference(char* pUtf16Buffer, byte* pAsciiBuffer) { // of the same type: @@ -621,7 +678,7 @@ most significant bit When we look at it, we can realize that another way is checking whether the most significant bit is equal `1`. For the scalar version, we could perform a logical AND: -```cs +```csharp bool IsValidAscii(byte c) => (c & 0b1000_0000) == 0; ``` @@ -631,7 +688,7 @@ Another step is vectorizing our scalar solution and choosing the best way of doi If we reuse one of the loops presented in the previous sections, all we need to implement is a method that accepts `Vector128` and returns `bool` and does exactly the same thing that our scalar method did, but for a vector rather than single value: -```cs +```csharp [MethodImpl(MethodImplOptions.AggressiveInlining)] bool VectorContainsNonAsciiChar(Vector128 asciiVector) { @@ -648,7 +705,7 @@ bool VectorContainsNonAsciiChar(Vector128 asciiVector) We can also use the hardware-specific instructions if they are available: -```cs +```csharp if (Sse41.IsSupported) { return !Sse41.TestZ(asciiVector, Vector128.Create((byte)0b_1000_0000)); @@ -692,13 +749,13 @@ Even such a simple problem can be solved in at least 5 different ways. Using sop Each of the vector types provides a `Create` method that accepts a single value and returns a vector with all elements initialized to this value. -```cs +```csharp public static Vector128 Create(T value) where T : struct; ``` `CreateScalar` initializes first element to the specified value, and the remaining elements to zero. -```cs +```csharp public static Vector128 CreateScalar(int value); ``` @@ -707,19 +764,19 @@ public static Vector128 CreateScalar(int value); We also have an overload that allows for specifying every value in given vector: -```cs +```csharp public static Vector128 Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7) ``` And last but not least we have a `Create` overload which accepts a buffer. 
It creates a vector with its elements set to the first `VectorXYZ.Count` elements of the buffer. It's not recommended to use it in a loop, where `Load` methods should be used instead (for performance). -```cs +```csharp public static Vector128 Create(ReadOnlySpan values) where T : struct ``` to perform a copy in the other direction, we can use one of the `CopyTo` extension methods: -```cs +```csharp public static void CopyTo(this Vector128 vector, Span destination) where T : struct ``` @@ -729,7 +786,7 @@ All size-specific vector types provide a set of APIs for common bit operations. `BitwiseAnd` computes the bitwise-and of two vectors, `BitwiseOr` computes the bitwise-or of two vectors. They can both be expressed by using the corresponding operators (`&` and `|`). The same goes for `Xor` which can be expressed with `^` operator and `Negate` (`~`). -```cs +```csharp public static Vector128 BitwiseAnd(Vector128 left, Vector128 right) where T : struct => left & right; public static Vector128 BitwiseOr(Vector128 left, Vector128 right) where T : struct => left | right; public static Vector128 Xor(Vector128 left, Vector128 right) => left ^ right; @@ -738,14 +795,14 @@ public static Vector128 Negate(Vector128 vector) => ~vector; `AndNot` computes the bitwise-and of a given vector and the ones' complement of another vector. -```cs +```csharp public static Vector128 AndNot(Vector128 left, Vector128 right) => left & ~right; ``` `ShiftLeft` shifts each element of a vector left by the specified number of bits. `ShiftRightArithmetic` performs a **signed** shift right and `ShiftRightLogical` performs an **unsigned** shift: -```cs +```csharp public static Vector128 ShiftLeft(Vector128 vector, int shiftCount); public static Vector128 ShiftRightArithmetic(Vector128 vector, int shiftCount); public static Vector128 ShiftRightLogical(Vector128 vector, int shiftCount); @@ -755,20 +812,20 @@ public static Vector128 ShiftRightLogical(Vector128 vector, int shif `EqualsAll` compares two vectors to determine if all elements are equal. `EqualsAny` compares two vectors to determine if any elements are equal. -```cs +```csharp public static bool EqualsAll(Vector128 left, Vector128 right) where T : struct => left == right; public static bool EqualsAny(Vector128 left, Vector128 right) where T : struct ``` `Equals` compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in the `left` and `right` arguments were equal. -```cs +```csharp public static Vector128 Equals(Vector128 left, Vector128 right) where T : struct ``` How do we calculate the index of the first match? Let's take a closer look at the result of following equality check: -```cs +```csharp Vector128 left = Vector128.Create(1, 2, 3, 4); Vector128 right = Vector128.Create(0, 0, 3, 0); Vector128 equals = Vector128.Equals(left, right); @@ -781,13 +838,13 @@ Console.WriteLine(equals); `-1` is just `0xFFFFFFFF` (all-bits-set). We could use `GetElement` to get the first non-zero element. -```cs +```csharp public static T GetElement(this Vector128 vector, int index) where T : struct ``` But it would not be an optimal solution. 
We should instead extract the most significant bits: -```cs +```csharp uint mostSignificantBits = equals.ExtractMostSignificantBits(); Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0')); ``` @@ -802,7 +859,7 @@ To calculate the last index, we should use [BitOperations.LeadingZeroCount](http If we were working with a buffer loaded from memory (example: searching for the last index of a given character in the buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. -```cs +```csharp int ComputeLastIndex(nint elementOffset, Vector128 equals) where T : struct { uint mostSignificantBits = equals.ExtractMostSignificantBits(); @@ -815,7 +872,7 @@ int ComputeLastIndex(nint elementOffset, Vector128 equals) where T : struc If we were using the `Load` overload that takes only the managed reference, we could use [Unsafe.ByteOffset(ref T, ref T)](https://learn.microsoft.com/dotnet/api/system.runtime.compilerservices.unsafe.byteoffset) to calculate the element offset. -```cs +```csharp unsafe int ComputeFirstIndex(ref T searchSpace, ref T current, Vector128 equals) where T : struct { int elementOffset = (int)Unsafe.ByteOffset(ref searchSpace, ref current) / sizeof(T); @@ -831,7 +888,7 @@ unsafe int ComputeFirstIndex(ref T searchSpace, ref T current, Vector128 e Beside equality checks, vector APIs allow for comparison. The `bool`-returning overloads return `true` when the given condition is true: -```cs +```csharp public static bool GreaterThanAll(Vector128 left, Vector128 right) where T : struct public static bool GreaterThanAny(Vector128 left, Vector128 right) where T : struct public static bool GreaterThanOrEqualAll(Vector128 left, Vector128 right) where T : struct @@ -844,7 +901,7 @@ public static bool LessThanOrEqualAny(Vector128 left, Vector128 right) Similarly to `Equals`, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in `left` and `right` meet the given condition. -```cs +```csharp public static Vector128 GreaterThan(Vector128 left, Vector128 right) where T : struct public static Vector128 GreaterThanOrEqual(Vector128 left, Vector128 right) where T : struct public static Vector128 LessThan(Vector128 left, Vector128 right) where T : struct @@ -853,13 +910,13 @@ public static Vector128 LessThanOrEqual(Vector128 left, Vector128 ri `ConditionalSelect` Conditionally selects a value from two vectors on a bitwise basis. 
-```cs +```csharp public static Vector128 ConditionalSelect(Vector128 condition, Vector128 left, Vector128 right) ``` This method deserves a self-describing example: -```cs +```csharp Vector128 left = Vector128.Create(1.0f, 2, 3, 4); Vector128 right = Vector128.Create(4.0f, 3, 2, 1); @@ -872,7 +929,7 @@ Assert.Equal(Vector128.Create(4.0f, 3, 3, 4), result); Very simple math operations can be also expressed by using the operators: -```cs +```csharp public static Vector128 Add(Vector128 left, Vector128 right) where T : struct => left + right; public static Vector128 Divide(Vector128 left, Vector128 right) => left / right; public static Vector128 Divide(Vector128 left, T right) => left / right; @@ -885,7 +942,7 @@ public static Vector128 Subtract(Vector128 left, Vector128 right) => `Abs`, `Ceiling`, `Floor`, `Max`, `Min`, `Sqrt` and `Sum` are also provided: -```cs +```csharp public static Vector128 Abs(Vector128 vector) where T : struct public static Vector128 Ceiling(Vector128 vector) public static Vector128 Floor(Vector128 vector) @@ -899,7 +956,7 @@ public static T Sum(Vector128 vector) where T : struct Vector types provide a set of methods dedicated to number conversions: -```cs +```csharp public static unsafe Vector128 ConvertToDouble(Vector128 vector) public static unsafe Vector128 ConvertToDouble(Vector128 vector) public static unsafe Vector128 ConvertToInt32(Vector128 vector) @@ -912,7 +969,7 @@ public static unsafe Vector128 ConvertToUInt64(Vector128 vector) And for reinterpretation (no values are being changed, they can be just used as if they were of a different type): -```cs +```csharp public static Vector128 As(this Vector128 vector) public static Vector128 AsByte(this Vector128 vector) public static Vector128 AsDouble(this Vector128 vector) @@ -946,21 +1003,21 @@ The first half of every vector is called "lower", the second is "upper". In case of `Vector128`, `GetLower` gets the value of the lower 64-bits as a new `Vector64` and `GetUpper` gets the upper 64-bits. -```cs +```csharp public static Vector64 GetLower(this Vector128 vector) public static Vector64 GetUpper(this Vector128 vector) ``` Each vector type provides a `Create` method that allows for the creation from lower and upper: -```cs +```csharp public static unsafe Vector128 Create(Vector64 lower, Vector64 upper) public static Vector256 Create(Vector128 lower, Vector128 upper) ``` `Lower` and `Upper` are also used by `Widen`. This method widens a `Vector128` into two `Vector128` where `sizeof(T2) == 2 * sizeof(T1)`. -```cs +```csharp public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vector128 source) @@ -972,14 +1029,14 @@ public static unsafe (Vector128 Lower, Vector128 Upper) Widen(Vect It's also possible to widen only the lower or upper part: -```cs +```csharp public static Vector128 WidenLower(Vector128 source) public static Vector128 WidenUpper(Vector128 source) ``` An example of widening is converting a buffer of ASCII bytes into characters: -```cs +```csharp byte[] byteBuffer = Enumerable.Range('A', 128 / 8).Select(i => (byte)i).ToArray(); Vector128 byteVector = Vector128.Create(byteBuffer); Console.WriteLine(byteVector); @@ -1002,7 +1059,7 @@ ABCDEFGHIJKLMNOP `Narrow` is the opposite of `Widen`. 
-```cs +```csharp public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) public static unsafe Vector128 Narrow(Vector128 lower, Vector128 upper) @@ -1015,7 +1072,7 @@ public static unsafe Vector128 Narrow(Vector128 lower, Vector128
    ushortVector = Vector256.Create((ushort)300); Console.WriteLine(ushortVector); unchecked { Console.WriteLine((byte)300); } @@ -1040,7 +1097,7 @@ if (Sse2.IsSupported) `Shuffle` creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes of the input vector). -```cs +```csharp public static Vector128 Shuffle(Vector128 vector, Vector128 indices) public static Vector128 Shuffle(Vector128 vector, Vector128 indices) public static Vector128 Shuffle(Vector128 vector, Vector128 indices) @@ -1051,7 +1108,7 @@ public static Vector128 Shuffle(Vector128 vector, Vector128 intVector = Vector128.Create(100, 200, 300, 400); Console.WriteLine(intVector); Console.WriteLine(Vector128.Shuffle(intVector, Vector128.Create(3, 2, 1, 0))); From a4278c40a261b42be98ff20e8a6cebc389b140cd Mon Sep 17 00:00:00 2001 From: Jeff Handley Date: Thu, 6 Apr 2023 18:23:34 -0700 Subject: [PATCH 11/14] Fix spelling/hyphenization in a couple places --- docs/coding-guidelines/vectorization-guidelines.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 94463579b42d66..7700c792750a0f 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -372,7 +372,7 @@ int Sum(Span buffer) } ``` -**Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never dereference it. +**Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never de-reference it. **Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! To get a `readonly` reference, you need to use [ReadOnlySpan.GetPinnableReference](https://learn.microsoft.com/dotnet/api/system.readonlyspan-1.getpinnablereference). @@ -497,7 +497,7 @@ unsafe int UnmanagedPointersSum(Span buffer) `LoadAligned` and `LoadAlignedNonTemporal` require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. "NonTemporal" means that the hardware is allowed (but not required) to bypass the cache. Non-temporal reads provide a speedup when working with very large amounts of data as it avoids repeatedly filling the cache with values that will never be used again. -Currently .NET exposes only one API fo allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. 
+Currently .NET exposes only one API for allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code is hardly ever worth it and needs to be proved with proper benchmarking on various hardware. @@ -739,7 +739,7 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co Even such a simple problem can be solved in at least 5 different ways. Using sophisticated hardware-specific instructions does not always provide the best performance, so **with the new `Vector128` and `Vector256` APIs we don't need to become assembly language experts to write fast, vectorized code**. -## Toolchain +## Tool-Chain `Vector128`, `Vector128`, `Vector256` and `Vector256` expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, but it's important to remember what kind of operations they allow for and check the details when needed. @@ -1131,7 +1131,7 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas - If you are already an expert and you have vectorized your code for both `x64/x86` and `arm64/arm` code you can use the new APIs to simplify your code, but you most likely won't observe any performance gains. [#64451](https://github.com/dotnet/runtime/issues/64451) lists the places where it was/can be done in dotnet/runtime. You can use links to the merged PRs to see real-life examples. - If you have already vectorized your code, but only for `x64/x86` or `arm64/arm`, you can use the new APIs to have a single, cross-platform implementation. -- If you have already vectorized your code with `Vector` you can use the new APIs to check if they can produce better codegen. +- If you have already vectorized your code with `Vector` you can use the new APIs to check if they can produce better code-gen. - If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new `Vector128` and `Vector256` APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths. 
### Best practices From 3c46ba492c6d16e7a039078dd8ca6dc04c159262 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Fri, 19 May 2023 18:45:39 +0200 Subject: [PATCH 12/14] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Günther Foidl --- docs/coding-guidelines/vectorization-guidelines.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index 7700c792750a0f..bd8499bc36f0e9 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -499,7 +499,7 @@ unsafe int UnmanagedPointersSum(Span buffer) Currently .NET exposes only one API for allocating unmanaged aligned memory: [NativeMemory.AlignedAlloc](https://learn.microsoft.com/dotnet/api/system.runtime.interopservices.nativememory.alignedalloc). In the future, we might provide [a dedicated API](https://github.com/dotnet/runtime/issues/27146) for allocating managed, aligned and hence pinned memory buffers. -The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code is hardly ever worth it and needs to be proved with proper benchmarking on various hardware. +The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code may not always be worth it and needs to be proved with proper benchmarking on various hardware. The fourth method expects only a managed reference (`ref T source`). We don't need to pin the buffer (GC is tracking managed references and updates them if memory gets moved), but it still requires us to properly handle managed pointer arithmetic: @@ -912,6 +912,7 @@ public static Vector128 LessThanOrEqual(Vector128 left, Vector128 ri ```csharp public static Vector128 ConditionalSelect(Vector128 condition, Vector128 left, Vector128 right) + => (left & condition) | (right & ~condition); ``` This method deserves a self-describing example: @@ -1144,5 +1145,5 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas 6. Prefer `LoadUnsafe(ref T, nuint elementOffset)` and `StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset)` over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow! 7. Always handle the vectorized loop remainder. 8. When storing values in memory, be aware of a potential buffer overlap. -9. When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs. Over time, you may gain enough experience to skip the scalar step. +9. When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs. 10. Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. 
It's also possible to perform equality checks, various bit and math operations. Don't try to memorize all the details, treat these APIs as a cookbook that you come back to when needed. From 556e64b9049354969e0576666f6b40843ad92d94 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Thu, 25 May 2023 17:20:53 +0200 Subject: [PATCH 13/14] addressing the code review comments that don't require the reordering of introduced concepts --- .../vectorization-guidelines.md | 78 ++++++++++--------- 1 file changed, 43 insertions(+), 35 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index bd8499bc36f0e9..bec4035c7cde66 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -214,7 +214,7 @@ All that complexity needs to pay off. We need to **benchmark the code to verify #### Custom config -It's possible to define a config that instructs the harness to run the benchmarks for all four scenarios: +It's possible to define a config that instructs the harness to run the benchmarks for all three scenarios: ```csharp static void Main(string[] args) @@ -310,7 +310,7 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co | Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | ``` -The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large and its entire contents are searched through, as the value is never found). +The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large, aligned and its entire contents are searched through, as the value is never found). Explaining benchmark design guidelines is outside of the scope of this document, but we have a [dedicated document](https://github.com/dotnet/performance/blob/main/docs/microbenchmark-design-guidelines.md#benchmarks-are-not-unit-tests) about it. To make a long story short, **you should benchmark all scenarios that are realistic for your production environment**, so your customers can actually benefit from your improvements. @@ -374,7 +374,11 @@ int Sum(Span buffer) **Note:** Use `ref MemoryMarshal.GetReference(span)` instead of `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead of `ref array[0]` to handle empty buffer scenarios (which would throw `IndexOutOfRangeException`). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never de-reference it. -**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! To get a `readonly` reference, you need to use [ReadOnlySpan.GetPinnableReference](https://learn.microsoft.com/dotnet/api/system.readonlyspan-1.getpinnablereference). +**Note:** The `GetReference` method has an overload that accepts a `ReadOnlySpan` and returns mutable reference. Please use it with caution! 
To get a `readonly` reference, you can use [ReadOnlySpan.GetPinnableReference](https://learn.microsoft.com/dotnet/api/system.readonlyspan-1.getpinnablereference) or just do the following: + +```csharp +ref readonly T searchSpace = ref MemoryMarshal.GetReference(buffer); +``` **Note:** Please keep in mind that `Vector128.Sum` is a static method. `Vectior128` and `Vector256` provide both instance and static methods (operators like `+` are just static methods in C#). `Vector128` and `Vector256` are non-generic static classes with static methods only. It's important to know about their existence when searching for methods. @@ -453,11 +457,11 @@ Both `Vector128` and `Vector256` provide at least five ways of loading them from ```csharp public static class Vector128 { - public static Vector128 Load(T* source) where T : unmanaged; - public static Vector128 LoadAligned(T* source) where T : unmanaged; - public static Vector128 LoadAlignedNonTemporal(T* source) where T : unmanaged; - public static Vector128 LoadUnsafe(ref T source) where T : struct; - public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; + public static Vector128 Load(T* source) where T : unmanaged + public static Vector128 LoadAligned(T* source) where T : unmanaged + public static Vector128 LoadAlignedNonTemporal(T* source) where T : unmanaged + public static Vector128 LoadUnsafe(ref T source) where T : struct + public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct } ``` @@ -572,7 +576,7 @@ Which could return true because `currentSearchSpace` was invalid and not updated That is why **we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic. 
It still requires care as passing an incorrect offset results in a GC hole.** ```csharp -public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct; +public static Vector128 LoadUnsafe(ref T source, nuint elementOffset) where T : struct ``` **The only thing we need to keep in mind is potential `nuint` overflow when doing unsigned integer arithmetic.** @@ -592,18 +596,18 @@ Similarly to loading, both `Vector128` and `Vector256` provide at least five way ```csharp public static class Vector128 { - public static void Store(this Vector128 source, T* destination) where T : unmanaged; - public static void StoreAligned(this Vector128 source, T* destination) where T : unmanaged; - public static void StoreAlignedNonTemporal(this Vector128 source, T* destination) where T : unmanaged; - public static void StoreUnsafe(this Vector128 source, ref T destination) where T : struct; - public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; + public static void Store(this Vector128 source, T* destination) where T : unmanaged + public static void StoreAligned(this Vector128 source, T* destination) where T : unmanaged + public static void StoreAlignedNonTemporal(this Vector128 source, T* destination) where T : unmanaged + public static void StoreUnsafe(this Vector128 source, ref T destination) where T : struct + public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct } ``` For the reasons described for loading, we recommend using the overload that takes managed reference and element offset: ```csharp -public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct; +public static void StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset) where T : struct ``` **Note**: when loading values from one buffer and storing them into another, we need to consider whether they overlap or not. [MemoryExtensions.Overlap](https://learn.microsoft.com/dotnet/api/system.memoryextensions.overlaps#system-memoryextensions-overlaps-1(system-readonlyspan((-0))-system-readonlyspan((-0)))) is an API for doing that. 
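+
+As a rough sketch of how the two recommended overloads combine in practice (the `ClampNegativesToZero` helper below is hypothetical and only illustrative; the remainder is handled by re-processing the last vector, which is safe here because the operation is idempotent):
+
+```csharp
+void ClampNegativesToZero(Span<int> buffer)
+{
+    Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<int>.Count);
+
+    ref int searchSpace = ref MemoryMarshal.GetReference(buffer);
+    nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<int>.Count);
+
+    nuint elementOffset = 0;
+    for (; elementOffset < oneVectorAwayFromEnd; elementOffset += (nuint)Vector128<int>.Count)
+    {
+        Vector128<int> loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset);
+        Vector128.Max(loaded, Vector128<int>.Zero).StoreUnsafe(ref searchSpace, elementOffset);
+    }
+
+    // Re-process the last vector's worth of elements; it may overlap with elements that were
+    // already clamped, which is fine because clamping them twice gives the same result.
+    Vector128<int> last = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd);
+    Vector128.Max(last, Vector128<int>.Zero).StoreUnsafe(ref searchSpace, oneVectorAwayFromEnd);
+}
+```
+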
@@ -690,16 +694,16 @@ If we reuse one of the loops presented in the previous sections, all we need to ```csharp [MethodImpl(MethodImplOptions.AggressiveInlining)] -bool VectorContainsNonAsciiChar(Vector128 asciiVector) +bool IsValidAscii(Vector128 vector) { // to perform "> 127" check we can use GreaterThanAny method: - return Vector128.GreaterThanAny(asciiVector, Vector128.Create((byte)127)) + return !Vector128.GreaterThanAny(vector, Vector128.Create((byte)127)) // to perform "< 0" check, we need to use AsSByte and LessThanAny methods: - return Vector128.LessThanAny(asciiVector.AsSByte(), Vector128.Zero) + return !Vector128.LessThanAny(vector.AsSByte(), Vector128.Zero) // to perform an AND operation, we need to use & operator - return (asciiVector & Vector128.Create((byte)0b_1000_0000)) != Vector128.Zero; + return (vector & Vector128.Create((byte)0b_1000_0000)) == Vector128.Zero; // we can also just use ExtractMostSignificantBits method: - return asciiVector.ExtractMostSignificantBits() != 0; + return vector.ExtractMostSignificantBits() == 0; } ``` @@ -708,12 +712,12 @@ We can also use the hardware-specific instructions if they are available: ```csharp if (Sse41.IsSupported) { - return !Sse41.TestZ(asciiVector, Vector128.Create((byte)0b_1000_0000)); + return Sse41.TestZ(vector, Vector128.Create((byte)0b_1000_0000)); } else if (AdvSimd.Arm64.IsSupported) { - Vector128 maxBytes = AdvSimd.Arm64.MaxPairwise(asciiVector, asciiVector); - return (maxBytes.AsUInt64().ToScalar() & 0x8080808080808080) != 0; + Vector128 maxBytes = AdvSimd.Arm64.MaxPairwise(vector, vector); + return (maxBytes.AsUInt64().ToScalar() & 0x8080808080808080) == 0; } ``` @@ -737,7 +741,7 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co | ExtractMostSignificantBits | 1024 | 27.33 ns | 0.11 | 141 B | ``` -Even such a simple problem can be solved in at least 5 different ways. Using sophisticated hardware-specific instructions does not always provide the best performance, so **with the new `Vector128` and `Vector256` APIs we don't need to become assembly language experts to write fast, vectorized code**. +Even such a simple problem can be solved in at least 5 different ways and each of them can perform significantly different on different hardware. Using sophisticated hardware-specific instructions does not always provide the best performance, so **with the new `Vector128` and `Vector256` APIs we don't need to become assembly language experts to write fast, vectorized code**. ## Tool-Chain @@ -750,13 +754,13 @@ Even such a simple problem can be solved in at least 5 different ways. Using sop Each of the vector types provides a `Create` method that accepts a single value and returns a vector with all elements initialized to this value. ```csharp -public static Vector128 Create(T value) where T : struct; +public static Vector128 Create(T value) where T : struct ``` `CreateScalar` initializes first element to the specified value, and the remaining elements to zero. ```csharp -public static Vector128 CreateScalar(int value); +public static Vector128 CreateScalar(int value) ``` `CreateScalarUnsafe` is similar, but the remaining elements are left uninitialized. It's dangerous! @@ -786,6 +790,8 @@ All size-specific vector types provide a set of APIs for common bit operations. `BitwiseAnd` computes the bitwise-and of two vectors, `BitwiseOr` computes the bitwise-or of two vectors. They can both be expressed by using the corresponding operators (`&` and `|`). 
The same goes for `Xor` which can be expressed with `^` operator and `Negate` (`~`). +**Note:** The **operators should be preferred where possible**, as it helps avoid bugs around operator precedence and can improve readability. + ```csharp public static Vector128 BitwiseAnd(Vector128 left, Vector128 right) where T : struct => left & right; public static Vector128 BitwiseOr(Vector128 left, Vector128 right) where T : struct => left | right; @@ -803,9 +809,9 @@ public static Vector128 AndNot(Vector128 left, Vector128 right) => l `ShiftRightArithmetic` performs a **signed** shift right and `ShiftRightLogical` performs an **unsigned** shift: ```csharp -public static Vector128 ShiftLeft(Vector128 vector, int shiftCount); -public static Vector128 ShiftRightArithmetic(Vector128 vector, int shiftCount); -public static Vector128 ShiftRightLogical(Vector128 vector, int shiftCount); +public static Vector128 ShiftLeft(Vector128 vector, int shiftCount) => vector << shiftCount; +public static Vector128 ShiftRightArithmetic(Vector128 vector, int shiftCount) => vector >> shiftCount; +public static Vector128 ShiftRightLogical(Vector128 vector, int shiftCount) => vector >>> shiftCount; ``` ### Equality @@ -853,9 +859,9 @@ Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0')); 00000000000000000000000000000100 ``` -and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) to get the trailing zero count. +and use [BitOperations.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.trailingzerocount) or [uint.TrailingZeroCount](https://learn.microsoft.com/dotnet/api/system.uint32.trailingzerocount) (introduced in .NET 7) to get the trailing zero count. -To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, indexed from 0). +To calculate the last index, we should use [BitOperations.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.numerics.bitoperations.leadingzerocount) or [uint.LeadingZeroCount](https://learn.microsoft.com/dotnet/api/system.uint32.leadingzerocount) (introduced in .NET 7). But the returned value needs to be subtracted from 31 (32 bits in an `unit`, indexed from 0). If we were working with a buffer loaded from memory (example: searching for the last index of a given character in the buffer) both results would be relative to the `elementOffset` provided to the `Load` method that was used to load the vector from the buffer. @@ -928,7 +934,7 @@ Assert.Equal(Vector128.Create(4.0f, 3, 3, 4), result); ### Math -Very simple math operations can be also expressed by using the operators: +Very simple math operations can be also expressed by using the operators. The operators should be preferred where possible, as it helps avoid bugs around operator precedence and can improve readability. 
```csharp public static Vector128 Add(Vector128 left, Vector128 right) where T : struct => left + right; @@ -946,10 +952,12 @@ public static Vector128 Subtract(Vector128 left, Vector128 right) => ```csharp public static Vector128 Abs(Vector128 vector) where T : struct public static Vector128 Ceiling(Vector128 vector) +public static Vector128 Ceiling(Vector128 vector) +public static Vector128 Floor(Vector128 vector) public static Vector128 Floor(Vector128 vector) -public static Vector128 Max(Vector128 left, Vector128 right) -public static Vector128 Min(Vector128 left, Vector128 right) -public static Vector128 Sqrt(Vector128 vector); +public static Vector128 Max(Vector128 left, Vector128 right) where T : struct +public static Vector128 Min(Vector128 left, Vector128 right) where T : struct +public static Vector128 Sqrt(Vector128 vector) where T : struct public static T Sum(Vector128 vector) where T : struct ``` From 281bbb484ac467d205ee590e6f387e606abfd571 Mon Sep 17 00:00:00 2001 From: Adam Sitnik Date: Tue, 30 May 2023 19:43:00 +0200 Subject: [PATCH 14/14] more polishing: * update TOC * add note about imperfect perf boost * don't recommend managed references over unsafe pointers, as they can both be dangerous when used incorrectly --- .../vectorization-guidelines.md | 21 ++++++++++++------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/docs/coding-guidelines/vectorization-guidelines.md b/docs/coding-guidelines/vectorization-guidelines.md index bec4035c7cde66..ab85676263afd6 100644 --- a/docs/coding-guidelines/vectorization-guidelines.md +++ b/docs/coding-guidelines/vectorization-guidelines.md @@ -1,5 +1,7 @@ - [Introduction to vectorization with Vector128 and Vector256](#introduction-to-vectorization-with-vector128-and-vector256) * [Code structure](#code-structure) + + [Checking for Hardware Acceleration](#checking-for-hardware-acceleration) + + [Example Code Structure](#example-code-structure) + [Testing](#testing) + [Benchmarking](#benchmarking) - [Custom config](#custom-config) @@ -18,7 +20,7 @@ + [Edge cases](#edge-cases) + [Scalar solution](#scalar-solution) + [Vectorized solution](#vectorized-solution) - * [Toolchain](#toolchain) + * [Tool-Chain](#tool-chain) + [Creation](#creation) + [Bit operations](#bit-operations) + [Equality](#equality) @@ -27,6 +29,7 @@ + [Conversion](#conversion) + [Widening and Narrowing](#widening-and-narrowing) + [Shuffle](#shuffle) + - [Vector256.Shuffle vs Avx2.Shuffle](#vector256shuffle-vs-avx2shuffle) * [Summary](#summary) + [Best practices](#best-practices) @@ -310,6 +313,8 @@ AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical co | Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | ``` +**Note:** as you can see, even such simple method like [Contains](https://learn.microsoft.com/dotnet/api/system.memoryextensions.contains) **did not observe a perfect performance boost**: x8 for `Vector256` (256/32) and x4 for `Vector128` (128/32). To understand why, we would need to use a profiler that provides information on CPU instruction level, which depending on the hardware could be [Intel VTune](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html) or [amd uprof](https://developer.amd.com/amd-uprof/). + The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large, aligned and its entire contents are searched through, as the value is never found). 
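+
+One way to cover more than this best case is to parameterize the measured size. A rough BenchmarkDotNet sketch (the `ContainsBenchmarks` type and the sizes below are only illustrative):
+
+```csharp
+public class ContainsBenchmarks
+{
+    // Arbitrary sizes: smaller than one Vector128<int>, a length that leaves a remainder, and a large input.
+    [Params(3, 33, 1024)]
+    public int Size;
+
+    private int[] _buffer;
+
+    [GlobalSetup]
+    public void Setup() => _buffer = new int[Size];
+
+    [Benchmark]
+    public bool Contains() => _buffer.AsSpan().Contains(-1); // -1 is never present, so the whole buffer is scanned
+}
+```
+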
Explaining benchmark design guidelines is outside of the scope of this document, but we have a [dedicated document](https://github.com/dotnet/performance/blob/main/docs/microbenchmark-design-guidelines.md#benchmarks-are-not-unit-tests) about it. To make a long story short, **you should benchmark all scenarios that are realistic for your production environment**, so your customers can actually benefit from your improvements. @@ -1142,16 +1147,16 @@ The main goal of the new `Vector128` and `Vector256` APIs is to make writing fas - If you have already vectorized your code, but only for `x64/x86` or `arm64/arm`, you can use the new APIs to have a single, cross-platform implementation. - If you have already vectorized your code with `Vector` you can use the new APIs to check if they can produce better code-gen. - If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new `Vector128` and `Vector256` APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths. +- Both managed references and unsafe pointers are dangerous to use incorrectly and each comes with their own tradeoff. ### Best practices 1. Implement tests that cover all code paths, including Access Violations. 2. Run tests for all hardware acceleration scenarios, use the existing environment variables to do that. 3. Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users. -4. Prefer managed references over unsafe pointers to avoid pinning and safety issues. -5. Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffers correctly. -6. Prefer `LoadUnsafe(ref T, nuint elementOffset)` and `StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset)` over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow! -7. Always handle the vectorized loop remainder. -8. When storing values in memory, be aware of a potential buffer overlap. -9. When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs. -10. Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. It's also possible to perform equality checks, various bit and math operations. Don't try to memorize all the details, treat these APIs as a cookbook that you come back to when needed. +4. Use `ref MemoryMarshal.GetReference(span)` instead `ref span[0]` and `ref MemoryMarshal.GetArrayDataReference(array)` instead `ref array[0]` to handle empty buffers correctly. +5. Prefer `LoadUnsafe(ref T, nuint elementOffset)` and `StoreUnsafe(this Vector128 source, ref T destination, nuint elementOffset)` over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow! +6. Always handle the vectorized loop remainder. +7. When storing values in memory, be aware of a potential buffer overlap. +8. 
When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution, and afterwards try to express what the scalar code is doing with Vector128/256 APIs.
+9. Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. It's also possible to perform equality checks as well as various bit and math operations. Don't try to memorize all the details; treat these APIs as a cookbook that you come back to when needed.