-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduction to vectorization with Vector128 and Vector256 #84115
Introduction to vectorization with Vector128 and Vector256 #84115
Conversation
Tagging subscribers to this area: @dotnet/area-meta Issue DetailsDisclaimer:
|
848cc7e
to
ce1840f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice write-up.
Co-authored-by: Günther Foidl <gue@korporal.at>
@gfoidl big thanks for a very detailed review!! 🥇 |
Does Vector512 need a coverage in the doc? It seems to be already usable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read half and added minor suggestions.. great doc
Co-authored-by: Dan Moseley <danmose@microsoft.com> Co-authored-by: Günther Foidl <gue@korporal.at>
@danmoseley Thank you very much! |
|
||
### Vectorized remainder handling | ||
|
||
Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. | |
Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. This can be done for all idempotent operations (e.g. also copying). |
To make this a bit more general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar suggestions, as you asked for it 🙂
Co-authored-by: Rob Hague <rob.hague00@gmail.com> Co-authored-by: Günther Foidl <gue@korporal.at>
Co-authored-by: Rob Hague <rob.hague00@gmail.com>
@Rob-Hague thank you very much! It's very motivating to get such detailed and helpful review! |
Co-authored-by: Rob Hague <rob.hague00@gmail.com>
Will give this a pass this afternoon. |
Vector128<int> sum = Vector128<int>.Zero; | ||
|
||
while (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd)) | ||
{ | ||
sum += Vector128.LoadUnsafe(ref current); | ||
|
||
current = ref Unsafe.Add(ref current, Vector128<int>.Count); | ||
} | ||
|
||
int result = Vector128.Sum(sum); | ||
|
||
while (Unsafe.IsAddressLessThan(ref current, ref end)) | ||
{ | ||
result += current; | ||
|
||
current = ref Unsafe.Add(ref current, 1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It is much preferred to use LoadUnsafe(ref bufferStart, index)
over using Unsafe.Add
It is very easy to get Unsafe.Add
wrong, particularly with moving the ref
outside the GC tracked region for the buffer. We have written a few bugs in that space ourselves.
LoadUnsafe(ref T address, nuint index)
pushes users towards writing code that is less prone to this general issue and is overall easier to adapt to from a simpler for
loop that the scalar logic may be using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- I see we touch on this below, I think it would be much clearer to cover the "appropriate" way to do it here first in the normal reading order and then to cover things like this in a separate "what not to do and why" section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we touch on this below, I think it would be much clearer to cover the "appropriate" way to do it here first in the normal reading order and then to cover things like this in a separate "what not to do and why" section.
That is why in the paragraph above (Loops
), where I need to load the values, I am using the recommended way (so it's the first thing the users see). In this paragraph, I start on purpose with the "wrong" way, just to point out the disadvantages, move on & repat and at the end I mention the recommended approach.
## Mindset | ||
|
||
Vectorizing real-world algorithms seems complex at the beginning. And what do software engineers do with complex problems? We break them down into sub-problems until these become simple enough to be solved directly. | ||
|
||
Let's implement a vectorized method for checking whether a given byte buffer consists only from valid ASCII characters to see how similar problems can be solved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to have the example of vectorization earlier in the doc.
Basically:
- Brief introduction to what vectors are and what .NET provides at a high level
- An example of taking a scalar algorithm and vectorizing it, showing perf wins and touching briefly on what we did
- A more in depth coverage of the APIs we provide, best practices, etc
- A more in depth coverage of what not to do and why
- Optionally some coverage of advanced topics/scenarios, such as handling trailing elements or alignment efficiently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This different ordering would walk users through a very natural process introducing key terms, giving an example of what can be done, and then breaking it apart and going into more detail.
As we have it now, there is a lot of detail up front before we ever get into showing a user the "simple example" of taking a scalar algorithm and vectorizing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to have the example of vectorization earlier in the doc.
In a way we already have that, as in the "Loops" paragraph I provide a working Sum
and Contains
method implementations with comments.
I've spent a lot of time thinking about your recommendation and I believe it's a good one, but I would prefer to keep the existing order.
My main motivation is to prevent the readers from running into various traps, so I really want them to read the whole document. I am hoping that with the current structure, our target readers are going to learn something interesting in every paragraph, which is going to keep them focused and reach the end without missing anything.
With your proposal, I am afraid that some of the readers could skip some parts of it and run into issues later.
According to this #84635 (comment) each method must protect its own use of hardware intrinsics, it cannot just have a caller check and then assume (or assert) that some ISA is available. If that's the case it should perhaps be mentioned in here. edit: oh apparently that's just in corelib |
BTW, will this go on learn.microsoft.com? Something like this probably should be there. |
Co-authored-by: Günther Foidl <gue@korporal.at>
… of introduced concepts
* update TOC * add note about imperfect perf boost * don't recommend managed references over unsafe pointers, as they can both be dangerous when used incorrectly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding @stephentoub I believe that the PR is ready for another round of review. PTAL at the unresolved comments and let me know what do you think
| Contains | Scalar | 1024 | 143.844 ns | 0.6234 ns | 1.00 | 206 B | | ||
| Contains | Vector128 | 1024 | 104.544 ns | 1.2792 ns | 0.73 | 335 B | | ||
| Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about it a little bit more and I got to the following conclusion: it's not easy to find a built-in .NET method that provides a perfect perf boost thanks to vectorization. Even simple methods like IndexOf
and Contains
don't provide it for buffers of small primitive types.
That is why such boost might be rare and we should simply point the users to the right profilers which can tell them where the CPU time is spent.
I've added following sentece below the table with the results:
Note: as you can see, even such simple method like Contains did not observe a perfect performance boost: x8 for
Vector256
(256/32) and x4 forVector128
(128/32). To understand why, we would need to use a profiler that provides information on CPU instruction level, which depending on the hardware could be Intel VTune or amd uprof.
@stephentoub @tannergooding thoughts?
Vector128<int> sum = Vector128<int>.Zero; | ||
|
||
while (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd)) | ||
{ | ||
sum += Vector128.LoadUnsafe(ref current); | ||
|
||
current = ref Unsafe.Add(ref current, Vector128<int>.Count); | ||
} | ||
|
||
int result = Vector128.Sum(sum); | ||
|
||
while (Unsafe.IsAddressLessThan(ref current, ref end)) | ||
{ | ||
result += current; | ||
|
||
current = ref Unsafe.Add(ref current, 1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we touch on this below, I think it would be much clearer to cover the "appropriate" way to do it here first in the normal reading order and then to cover things like this in a separate "what not to do and why" section.
That is why in the paragraph above (Loops
), where I need to load the values, I am using the recommended way (so it's the first thing the users see). In this paragraph, I start on purpose with the "wrong" way, just to point out the disadvantages, move on & repat and at the end I mention the recommended approach.
## Mindset | ||
|
||
Vectorizing real-world algorithms seems complex at the beginning. And what do software engineers do with complex problems? We break them down into sub-problems until these become simple enough to be solved directly. | ||
|
||
Let's implement a vectorized method for checking whether a given byte buffer consists only from valid ASCII characters to see how similar problems can be solved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to have the example of vectorization earlier in the doc.
In a way we already have that, as in the "Loops" paragraph I provide a working Sum
and Contains
method implementations with comments.
I've spent a lot of time thinking about your recommendation and I believe it's a good one, but I would prefer to keep the existing order.
My main motivation is to prevent the readers from running into various traps, so I really want them to read the whole document. I am hoping that with the current structure, our target readers are going to learn something interesting in every paragraph, which is going to keep them focused and reach the end without missing anything.
With your proposal, I am afraid that some of the readers could skip some parts of it and run into issues later.
- If you have already vectorized your code, but only for `x64/x86` or `arm64/arm`, you can use the new APIs to have a single, cross-platform implementation. | ||
- If you have already vectorized your code with `Vector<T>` you can use the new APIs to check if they can produce better code-gen. | ||
- If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new `Vector128` and `Vector256` APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths. | ||
- Both managed references and unsafe pointers are dangerous to use incorrectly and each comes with their own tradeoff. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer managed references over unsafe pointers to avoid pinning and safety issues.
I've removed this point from the Best practices and added following sentence to the Summary:
Both managed references and unsafe pointers are dangerous to use incorrectly and each comes with their own tradeoff.
I don't think that question got answered yet, @danmoseley. This document won't go out there, but this was a step toward being able to do that. The general audience will need a slightly different format/flow/ordering that can build the concepts up likely more in the order that @tannergooding had suggested in some of his feedback. We wanted to tackle this smaller audience of dotnet/runtime contributors/maintainers first. |
Fair enough. With respect to reframing it for docs use, that seems like something someone in the docs team might be open to doing, and then it can be iterated on in the docs repo. I'm reminded of how we moved many of our interop guidelines out of a markdown in this repo to the docs and they benefited everyone then. |
way too late for this comment, but in reading it fresh, I realized that we don't establish up front how these relate to |
I presume the doc will need to be updated to include |
That's still internal only (and even only to corelib right now). More pressing would be updating for Vector512 :) |
Disclaimer: