Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector algorithms with AVX2 masked stores and AMD processors #5068

Closed
AlexGuteniev opened this issue Nov 6, 2024 · 2 comments
Closed

Vector algorithms with AVX2 masked stores and AMD processors #5068

AlexGuteniev opened this issue Nov 6, 2024 · 2 comments
Labels
question Further information is requested resolved Successfully resolved without a commit

Comments

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Nov 6, 2024

The benchmark results #5062 (comment) seem to confirm that #5062 is a pessimization for AMD.

AVX2 mask store timings are bad on recent AMDs.

In addition to the currently in review algorithm, we have one accepted already.

Questions:

  • Should Vectorize remove_copy for 4 and 8 byte elements #5062 be closed? Or optimizing one vendor somewhat higher than pessimizing the other is still fine?
  • Should vectorize replace 🎭 #4554 be reevaluated on an AMD? (can use #define _USE_STD_VECTOR_ALGORITHMS 0 escape to simulate the "before" state). It is less likely that it makes things worse, as the vectorization advantage was bigger there.
  • Is this right that we don't do vendor detection using cpuid instruction?

Note that we also use masked loads, but I don't have concerns for them:

  • They are bad only on AMDs before Zen 2, see timings
  • They are used to process tails, not the whole range
@AlexGuteniev AlexGuteniev added the question Further information is requested label Nov 6, 2024
@muellerj2
Copy link
Contributor

Should #4554 be reevaluated on an AMD?

Benchmark results on a Ryzen 7840HS laptop:

unvectorized (_USE_STD_VECTOR_ALGORITHMS=0):

Benchmark Time CPU Iterations
r<std::uint32_t> 1577 ns 1430 ns 448000
r<std::uint64_t> 1779 ns 1744 ns 448000

vectorized (_USE_STD_VECTOR_ALGORITHMS=1):

Benchmark Time CPU Iterations
r<std::uint32_t> 1079 ns 1046 ns 896000
r<std::uint64_t> 1289 ns 1235 ns 746667

@StephanTLavavej StephanTLavavej added the resolved Successfully resolved without a commit label Nov 13, 2024
@StephanTLavavej
Copy link
Member

We talked about this at the weekly maintainer meeting:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested resolved Successfully resolved without a commit
Projects
None yet
Development

No branches or pull requests

3 participants