Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduction to vectorization with Vector128 and Vector256 #84115

Merged
merged 14 commits into from
Jun 29, 2023

Conversation

adamsitnik
Copy link
Member

Disclaimer:

  • I am not a vectorization expert, that is why I was asked to get familiar with the APIs and write an introduction for other beginers. Please don't treate this document as source of truth until it becomes reviewed by the experts, approved and merged.
  • I am not a native english speaker and I am always more than happy to improve my skills, so please don't hesitate to point out grammar mistakes or places for improvements. Suggestions are preffered ;)

@ghost
Copy link

ghost commented Mar 30, 2023

Tagging subscribers to this area: @dotnet/area-meta
See info in area-owners.md if you want to be subscribed.

Issue Details

Disclaimer:

  • I am not a vectorization expert, that is why I was asked to get familiar with the APIs and write an introduction for other beginers. Please don't treate this document as source of truth until it becomes reviewed by the experts, approved and merged.
  • I am not a native english speaker and I am always more than happy to improve my skills, so please don't hesitate to point out grammar mistakes or places for improvements. Suggestions are preffered ;)
Author: adamsitnik
Assignees: -
Labels:

area-Meta

Milestone: -

@adamsitnik adamsitnik force-pushed the introductionToVectorization branch from 848cc7e to ce1840f Compare March 30, 2023 08:43
Copy link
Member

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice write-up.

Co-authored-by: Günther Foidl <gue@korporal.at>
@adamsitnik
Copy link
Member Author

@gfoidl big thanks for a very detailed review!! 🥇

@EgorBo
Copy link
Member

EgorBo commented Mar 30, 2023

Does Vector512 need a coverage in the doc? It seems to be already usable

Copy link
Member

@danmoseley danmoseley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read half and added minor suggestions.. great doc

Co-authored-by: Dan Moseley <danmose@microsoft.com>
Co-authored-by: Günther Foidl <gue@korporal.at>
@adamsitnik
Copy link
Member Author

Read half and added minor suggestions.. great doc

@danmoseley Thank you very much!


### Vectorized remainder handling

Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion.
Now imagine that we need to check whether the given buffer contains specific number. In this case, processing some values more than once is acceptable, we don't need to handle the remainder in a non-vectorized fashion. This can be done for all idempotent operations (e.g. also copying).

To make this a bit more general?

Copy link
Contributor

@Rob-Hague Rob-Hague left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar suggestions, as you asked for it 🙂

adamsitnik and others added 2 commits March 31, 2023 15:01
Co-authored-by: Rob Hague <rob.hague00@gmail.com>
Co-authored-by: Günther Foidl <gue@korporal.at>
Co-authored-by: Rob Hague <rob.hague00@gmail.com>
@adamsitnik
Copy link
Member Author

Grammar suggestions, as you asked for it

@Rob-Hague thank you very much! It's very motivating to get such detailed and helpful review!

Co-authored-by: Rob Hague <rob.hague00@gmail.com>
@tannergooding
Copy link
Member

Will give this a pass this afternoon.

Comment on lines +458 to +474
Vector128<int> sum = Vector128<int>.Zero;

while (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd))
{
sum += Vector128.LoadUnsafe(ref current);

current = ref Unsafe.Add(ref current, Vector128<int>.Count);
}

int result = Vector128.Sum(sum);

while (Unsafe.IsAddressLessThan(ref current, ref end))
{
result += current;

current = ref Unsafe.Add(ref current, 1);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It is much preferred to use LoadUnsafe(ref bufferStart, index) over using Unsafe.Add

It is very easy to get Unsafe.Add wrong, particularly with moving the ref outside the GC tracked region for the buffer. We have written a few bugs in that space ourselves.

LoadUnsafe(ref T address, nuint index) pushes users towards writing code that is less prone to this general issue and is overall easier to adapt to from a simpler for loop that the scalar logic may be using.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-- I see we touch on this below, I think it would be much clearer to cover the "appropriate" way to do it here first in the normal reading order and then to cover things like this in a separate "what not to do and why" section.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we touch on this below, I think it would be much clearer to cover the "appropriate" way to do it here first in the normal reading order and then to cover things like this in a separate "what not to do and why" section.

That is why in the paragraph above (Loops), where I need to load the values, I am using the recommended way (so it's the first thing the users see). In this paragraph, I start on purpose with the "wrong" way, just to point out the disadvantages, move on & repat and at the end I mention the recommended approach.

Comment on lines +592 to +596
## Mindset

Vectorizing real-world algorithms seems complex at the beginning. And what do software engineers do with complex problems? We break them down into sub-problems until these become simple enough to be solved directly.

Let's implement a vectorized method for checking whether a given byte buffer consists only from valid ASCII characters to see how similar problems can be solved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to have the example of vectorization earlier in the doc.

Basically:

  1. Brief introduction to what vectors are and what .NET provides at a high level
  2. An example of taking a scalar algorithm and vectorizing it, showing perf wins and touching briefly on what we did
  3. A more in depth coverage of the APIs we provide, best practices, etc
  4. A more in depth coverage of what not to do and why
  5. Optionally some coverage of advanced topics/scenarios, such as handling trailing elements or alignment efficiently

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This different ordering would walk users through a very natural process introducing key terms, giving an example of what can be done, and then breaking it apart and going into more detail.

As we have it now, there is a lot of detail up front before we ever get into showing a user the "simple example" of taking a scalar algorithm and vectorizing it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to have the example of vectorization earlier in the doc.

In a way we already have that, as in the "Loops" paragraph I provide a working Sum and Contains method implementations with comments.

I've spent a lot of time thinking about your recommendation and I believe it's a good one, but I would prefer to keep the existing order.

My main motivation is to prevent the readers from running into various traps, so I really want them to read the whole document. I am hoping that with the current structure, our target readers are going to learn something interesting in every paragraph, which is going to keep them focused and reach the end without missing anything.

With your proposal, I am afraid that some of the readers could skip some parts of it and run into issues later.

@danmoseley
Copy link
Member

danmoseley commented Apr 21, 2023

According to this #84635 (comment) each method must protect its own use of hardware intrinsics, it cannot just have a caller check and then assume (or assert) that some ISA is available. If that's the case it should perhaps be mentioned in here.

edit: oh apparently that's just in corelib

@danmoseley
Copy link
Member

BTW, will this go on learn.microsoft.com? Something like this probably should be there.

adamsitnik and others added 3 commits May 19, 2023 18:45
Co-authored-by: Günther Foidl <gue@korporal.at>
* update TOC
* add note about imperfect perf boost
* don't recommend managed references over unsafe pointers, as they can both be dangerous when used incorrectly
Copy link
Member Author

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding @stephentoub I believe that the PR is ready for another round of review. PTAL at the unresolved comments and let me know what do you think

Comment on lines +221 to +223
| Contains | Scalar | 1024 | 143.844 ns | 0.6234 ns | 1.00 | 206 B |
| Contains | Vector128 | 1024 | 104.544 ns | 1.2792 ns | 0.73 | 335 B |
| Contains | Vector256 | 1024 | 55.769 ns | 0.6720 ns | 0.39 | 391 B |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about it a little bit more and I got to the following conclusion: it's not easy to find a built-in .NET method that provides a perfect perf boost thanks to vectorization. Even simple methods like IndexOf and Contains don't provide it for buffers of small primitive types.

That is why such boost might be rare and we should simply point the users to the right profilers which can tell them where the CPU time is spent.

I've added following sentece below the table with the results:

Note: as you can see, even such simple method like Contains did not observe a perfect performance boost: x8 for Vector256 (256/32) and x4 for Vector128 (128/32). To understand why, we would need to use a profiler that provides information on CPU instruction level, which depending on the hardware could be Intel VTune or amd uprof.

@stephentoub @tannergooding thoughts?

Comment on lines +458 to +474
Vector128<int> sum = Vector128<int>.Zero;

while (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd))
{
sum += Vector128.LoadUnsafe(ref current);

current = ref Unsafe.Add(ref current, Vector128<int>.Count);
}

int result = Vector128.Sum(sum);

while (Unsafe.IsAddressLessThan(ref current, ref end))
{
result += current;

current = ref Unsafe.Add(ref current, 1);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we touch on this below, I think it would be much clearer to cover the "appropriate" way to do it here first in the normal reading order and then to cover things like this in a separate "what not to do and why" section.

That is why in the paragraph above (Loops), where I need to load the values, I am using the recommended way (so it's the first thing the users see). In this paragraph, I start on purpose with the "wrong" way, just to point out the disadvantages, move on & repat and at the end I mention the recommended approach.

Comment on lines +592 to +596
## Mindset

Vectorizing real-world algorithms seems complex at the beginning. And what do software engineers do with complex problems? We break them down into sub-problems until these become simple enough to be solved directly.

Let's implement a vectorized method for checking whether a given byte buffer consists only from valid ASCII characters to see how similar problems can be solved.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to have the example of vectorization earlier in the doc.

In a way we already have that, as in the "Loops" paragraph I provide a working Sum and Contains method implementations with comments.

I've spent a lot of time thinking about your recommendation and I believe it's a good one, but I would prefer to keep the existing order.

My main motivation is to prevent the readers from running into various traps, so I really want them to read the whole document. I am hoping that with the current structure, our target readers are going to learn something interesting in every paragraph, which is going to keep them focused and reach the end without missing anything.

With your proposal, I am afraid that some of the readers could skip some parts of it and run into issues later.

- If you have already vectorized your code, but only for `x64/x86` or `arm64/arm`, you can use the new APIs to have a single, cross-platform implementation.
- If you have already vectorized your code with `Vector<T>` you can use the new APIs to check if they can produce better code-gen.
- If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new `Vector128` and `Vector256` APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths.
- Both managed references and unsafe pointers are dangerous to use incorrectly and each comes with their own tradeoff.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer managed references over unsafe pointers to avoid pinning and safety issues.

I've removed this point from the Best practices and added following sentence to the Summary:

Both managed references and unsafe pointers are dangerous to use incorrectly and each comes with their own tradeoff.

@jeffhandley
Copy link
Member

BTW, will this go on learn.microsoft.com? Something like this probably should be there.

I don't think that question got answered yet, @danmoseley. This document won't go out there, but this was a step toward being able to do that. The general audience will need a slightly different format/flow/ordering that can build the concepts up likely more in the order that @tannergooding had suggested in some of his feedback. We wanted to tackle this smaller audience of dotnet/runtime contributors/maintainers first.

@danmoseley
Copy link
Member

BTW, will this go on learn.microsoft.com? Something like this probably should be there.

I don't think that question got answered yet, @danmoseley. This document won't go out there, but this was a step toward being able to do that. The general audience will need a slightly different format/flow/ordering that can build the concepts up likely more in the order that @tannergooding had suggested in some of his feedback. We wanted to tackle this smaller audience of dotnet/runtime contributors/maintainers first.

Fair enough. With respect to reframing it for docs use, that seems like something someone in the docs team might be open to doing, and then it can be iterated on in the docs repo. I'm reminded of how we moved many of our interop guidelines out of a markdown in this repo to the docs and they benefited everyone then.

@adamsitnik adamsitnik merged commit 1b37389 into dotnet:main Jun 29, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 29, 2023
@danmoseley
Copy link
Member

way too late for this comment, but in reading it fresh, I realized that we don't establish up front how these relate to Vector<T>. There's only one mention of that, and it's at the end. Eg., is Vector<T> ever worth using in new code.

@EgorBo
Copy link
Member

EgorBo commented Oct 3, 2023

way too late for this comment, but in reading it fresh, I realized that we don't establish up front how these relate to Vector<T>. There's only one mention of that, and it's at the end. Eg., is Vector<T> ever worth using in new code.

I presume the doc will need to be updated to include ISimdVector as well once it's production-ready cc @tannergooding

@stephentoub
Copy link
Member

way too late for this comment, but in reading it fresh, I realized that we don't establish up front how these relate to Vector<T>. There's only one mention of that, and it's at the end. Eg., is Vector<T> ever worth using in new code.

I presume the doc will need to be updated to include ISimdVector as well once it's production-ready cc @tannergooding

That's still internal only (and even only to corelib right now).

More pressing would be updating for Vector512 :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants