
[ESIMD] Optimize the simd stride constructor #12553

Merged: 3 commits, Feb 5, 2024

Conversation

v-klochkov (Contributor)

`simd(base, stride)` calls were previously lowered into a long sequence of INSERT and ADD operations. That sequence is now replaced with a vector equivalent:

```
vbase = broadcast base
vstride = broadcast stride
vstride_coef = {0, 1, 2, 3, ..., N-1}
vec_result = vbase + vstride * vstride_coef
```
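The vector formulation above can be sketched in plain C++ with a pack expansion over `std::index_sequence`. This is a hedged illustration, not the PR's actual code: `make_stride_vector` is a hypothetical name, and `std::array` stands in for the real ESIMD `vector_type_t<T, N>`.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Sketch of the vectorized formulation: result[i] = base + i * stride.
// std::array stands in for the ESIMD vector_type_t<T, N>.
template <typename T, std::size_t... Is>
constexpr std::array<T, sizeof...(Is)>
make_stride_vector_impl(T base, T stride, std::index_sequence<Is...>) {
  // Expands to {base + 0*stride, base + 1*stride, ...}: conceptually one
  // broadcast of base, one broadcast of stride, and a multiply by the
  // compile-time constant vector {0, 1, ..., N-1}.
  return {static_cast<T>(base + static_cast<T>(Is) * stride)...};
}

template <typename T, std::size_t N>
constexpr std::array<T, N> make_stride_vector(T base, T stride) {
  return make_stride_vector_impl(base, stride, std::make_index_sequence<N>{});
}
```

For example, `make_stride_vector<int, 4>(10, 3)` yields `{10, 13, 16, 19}`. Because the expansion is a single braced initializer rather than a chain of per-element inserts, a backend can lower it as one vector expression.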

github-actions bot commented Feb 1, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
@v-klochkov v-klochkov marked this pull request as ready for review February 3, 2024 03:20
@v-klochkov v-klochkov requested a review from a team as a code owner February 3, 2024 03:20
```diff
-                      std::index_sequence<Is...>) {
-  return vector_type_t<T, N>{(T)(Base + ((T)Is) * Stride)...};
+constexpr auto make_vector_impl(T Base, T Stride, std::index_sequence<Is...>) {
+  using CppT = typename element_type_traits<T>::EnclosingCppT;
```
Contributor (reviewer):
I remember you considering optimizing this for low values of N, did that end up not being worth it?

v-klochkov (Contributor, Author) commented Feb 5, 2024:

I did the initial research for float types and found such tuning not worthwhile.

To answer your question and show the IR, I used the int type this time, and found 2 cases where the old code is 1 instruction faster/shorter:

| Type | Old (num math ops : ops) | New (num math ops : ops) |
|---|---|---|
| `simd<int, 1>` | 0 | 0 |
| `simd<int, 2>` * | 1: 1×ADD | 2: 1×ADD, 1×MUL |
| `simd<int, 3>` * | 3: 2×ADD, 1×SHL | 4: 2×ADD, 2×MUL (the 3-elem vector is split into a 2-elem + 1-elem vector) |
| `simd<int, 4>` | 5: 3×ADD, 1×SHL, 1×MUL | 2: 1×ADD, 1×MUL |
| `simd<float, 1>` | 0 | 0 |
| `simd<float, 2>` | 1: 1×ADD | 1: 1×MAD |
| `simd<float, 3>` | 2: 2×ADD | 2: 2×MAD (3-elem vector ops split into 2-elem + 1-elem) |
| `simd<float, 4>` | 3: 3×ADD | 1: 1×MAD |

(`*` marks the two cases where the old code wins by one instruction.)

v-klochkov (Contributor, Author) commented:

I added a few lines of code to tune for integral types and N <= 3: ea002b5

The old sequence produces one fewer instruction in the final GPU code for those cases.
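The tuning described above can be sketched as a compile-time dispatch: keep the scalar add chain for small integral vectors, and use the vector formulation otherwise. This is a minimal illustration under assumed names (`make_stride_vector_tuned` is hypothetical; `std::array` stands in for the ESIMD vector type), not the code from commit ea002b5.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Hypothetical dispatcher mirroring the tuning described in the comment:
// for integral element types with N <= 3, the old scalar expansion
// (a short chain of ADDs) was measured to be one GPU instruction shorter,
// so a tuned implementation could keep the old path there.
template <typename T, std::size_t N>
constexpr std::array<T, N> make_stride_vector_tuned(T base, T stride) {
  std::array<T, N> out{};
  if constexpr (std::is_integral_v<T> && N <= 3) {
    // Old-style scalar expansion: out[i] = out[i-1] + stride.
    T v = base;
    for (std::size_t i = 0; i < N; ++i) {
      out[i] = v;
      v = static_cast<T>(v + stride);
    }
  } else {
    // Vector formulation: base + {0, 1, ..., N-1} * stride.
    for (std::size_t i = 0; i < N; ++i)
      out[i] = static_cast<T>(base + static_cast<T>(i) * stride);
  }
  return out;
}
```

Both branches compute the same values; the split only matters for which instruction sequence the backend emits.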

Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
@v-klochkov v-klochkov merged commit e9a1ace into intel:sycl Feb 5, 2024
12 checks passed
@v-klochkov v-klochkov deleted the esimd_fix_insert_inefficiency branch February 5, 2024 23:46