Vector algorithms with AVX2 masked stores and AMD processors #5068

AlexGuteniev · 2024-11-06T20:24:48Z

The benchmark results in a comment on #5062 seem to confirm that #5062 is a pessimization on AMD processors.

AVX2 masked store timings are bad on recent AMD processors.

In addition to the algorithm currently under review, we already have one accepted algorithm that uses masked stores.
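To make the concern concrete, here is a minimal sketch of how a replace loop can use an AVX2 masked store: compare 8 elements at once, then write the new value only into the lanes that matched. This is not the STL's actual implementation. `replace_u32` and `replace_u32_avx2` are hypothetical names, and the sketch assumes GCC or Clang on x86 (for `__attribute__((target("avx2")))` and `__builtin_cpu_supports`), falling back to a scalar loop elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

// Hypothetical AVX2 kernel (GCC/Clang target attribute; not the STL's code).
// Processes whole 8-element blocks and returns how many elements it handled.
__attribute__((target("avx2")))
static std::size_t replace_u32_avx2(std::uint32_t* p, std::size_t n,
                                    std::uint32_t old_val, std::uint32_t new_val) {
    const __m256i old_vec = _mm256_set1_epi32(static_cast<int>(old_val));
    const __m256i new_vec = _mm256_set1_epi32(static_cast<int>(new_val));
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256i data =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + i));
        // Lanes equal to old_val become all-ones; the others become zero.
        const __m256i mask = _mm256_cmpeq_epi32(data, old_vec);
        // Write new_val only into lanes whose mask top bit is set;
        // non-matching elements are left untouched in memory.
        _mm256_maskstore_epi32(reinterpret_cast<int*>(p + i), mask, new_vec);
    }
    return i;
}
#endif

// Hypothetical replace for 32-bit elements: AVX2 for full blocks when the
// CPU supports it, scalar loop for the remainder (or for everything otherwise).
void replace_u32(std::uint32_t* p, std::size_t n,
                 std::uint32_t old_val, std::uint32_t new_val) {
    assert(p != nullptr || n == 0);
    std::size_t i = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        i = replace_u32_avx2(p, n, old_val, new_val);
    }
#endif
    for (; i < n; ++i) {  // scalar tail
        if (p[i] == old_val) {
            p[i] = new_val;
        }
    }
}
```

Calling `replace_u32(v, 11, 1, 9)` on `{1, 2, 1, 3, 1, 4, 5, 1, 1, 6, 1}` yields `{9, 2, 9, 3, 9, 4, 5, 9, 9, 6, 9}`: the first 8 elements go through `_mm256_maskstore_epi32`, the last 3 through the scalar tail. Writing only the matching lanes keeps the scalar behavior of replace, which never assigns to non-matching elements; that per-lane masked write is the operation whose cost is in question here.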

Questions:

Should "Vectorize remove_copy for 4 and 8 byte elements" #5062 be closed? Or is it still fine if the gain for one vendor is somewhat larger than the loss for the other?
Should "vectorize replace 🎭" #4554 be re-evaluated on an AMD processor? (The #define _USE_STD_VECTOR_ALGORITHMS 0 escape hatch can be used to simulate the "before" state.) A regression is less likely there, because the vectorization advantage was bigger.
Is it right that we don't do vendor detection using the cpuid instruction?
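For context on the last question, vendor detection with `cpuid` usually means reading leaf 0, whose EBX, EDX and ECX registers hold a 12-byte vendor string such as `GenuineIntel` or `AuthenticAMD`. A minimal sketch, assuming GCC or Clang's `<cpuid.h>` (`__get_cpuid`); MSVC would use `__cpuid` from `<intrin.h>` instead. `cpu_vendor` and `is_amd` are hypothetical helpers, not STL code.

```cpp
#include <cassert>
#include <cstring>
#include <string>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang; MSVC would use __cpuid from <intrin.h>
#endif

// Hypothetical helper: returns the 12-character vendor string from CPUID
// leaf 0 (e.g. "GenuineIntel" or "AuthenticAMD"), or "" if unavailable.
std::string cpu_vendor() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx) == 0) {
        return "";
    }
    // The vendor string is stored in EBX, EDX, ECX, in that order.
    char vendor[12];
    std::memcpy(vendor + 0, &ebx, 4);
    std::memcpy(vendor + 4, &edx, 4);
    std::memcpy(vendor + 8, &ecx, 4);
    return std::string(vendor, 12);
#else
    return "";  // not an x86 target
#endif
}

// Hypothetical vendor check of the kind the question asks about.
bool is_amd() { return cpu_vendor() == "AuthenticAMD"; }
```

Branching on `is_amd()` is what vendor-specific dispatch would look like. A feature check such as `__builtin_cpu_supports("avx2")` can say whether masked stores exist, but not whether they are fast on the current CPU.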

Note that we also use masked loads, but I don't have concerns about them.
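Masked loads commonly show up in tail handling: the last partial block is read with `_mm256_maskload_epi32`, which zeroes the disabled lanes and does not fault on the memory behind them, so the loop never reads past the end of the range. The thread does not show how the STL itself uses masked loads; this is only a sketch of the general pattern, assuming GCC or Clang on x86, with a hypothetical `count_u32` that counts 32-bit elements equal to a value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

// Hypothetical AVX2 count; the final partial block uses a masked load,
// so no element past p + n is accessed.
__attribute__((target("avx2")))
static std::size_t count_u32_avx2(const std::uint32_t* p, std::size_t n,
                                  std::uint32_t val) {
    const __m256i needle = _mm256_set1_epi32(static_cast<int>(val));
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256i data =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + i));
        const __m256i eq = _mm256_cmpeq_epi32(data, needle);
        // One bit per lane (its sign bit), then count the set bits.
        count += __builtin_popcount(_mm256_movemask_ps(_mm256_castsi256_ps(eq)));
    }
    const std::size_t rest = n - i;  // 0..7 leftover elements
    if (rest != 0) {
        // Lane k is enabled (all-ones) when k < rest.
        const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        const __m256i mask =
            _mm256_cmpgt_epi32(_mm256_set1_epi32(static_cast<int>(rest)), lane);
        // Disabled lanes are not read and come back as zero.
        const __m256i data =
            _mm256_maskload_epi32(reinterpret_cast<const int*>(p + i), mask);
        // AND with the mask so zero-filled lanes never match val == 0.
        const __m256i eq =
            _mm256_and_si256(_mm256_cmpeq_epi32(data, needle), mask);
        count += __builtin_popcount(_mm256_movemask_ps(_mm256_castsi256_ps(eq)));
    }
    return count;
}
#endif

// Hypothetical front end: AVX2 path when available, scalar loop otherwise.
std::size_t count_u32(const std::uint32_t* p, std::size_t n, std::uint32_t val) {
    assert(p != nullptr || n == 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        return count_u32_avx2(p, n, val);
    }
#endif
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += (p[i] == val);
    }
    return count;
}
```

The final `_mm256_and_si256` matters: because disabled lanes load as zero, searching for `0` would otherwise count them as matches.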


muellerj2 · 2024-11-07T07:33:58Z

Should #4554 be reevaluated on an AMD?

Benchmark results on a Ryzen 7840HS laptop:

unvectorized (_USE_STD_VECTOR_ALGORITHMS=0):

Benchmark	Time	CPU	Iterations
r<std::uint32_t>	1577 ns	1430 ns	448000
r<std::uint64_t>	1779 ns	1744 ns	448000

vectorized (_USE_STD_VECTOR_ALGORITHMS=1):

Benchmark	Time	CPU	Iterations
r<std::uint32_t>	1079 ns	1046 ns	896000
r<std::uint64_t>	1289 ns	1235 ns	746667
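Reading the table, the vectorized build is about 1.46x faster in wall time for `std::uint32_t` (1577 / 1079 ns) and about 1.38x for `std::uint64_t` (1779 / 1289 ns). To try a similar A/B comparison locally without a benchmarking library, a rough `std::chrono` sketch like the following could be built twice with MSVC, once with `/D_USE_STD_VECTOR_ALGORITHMS=0` and once with `/D_USE_STD_VECTOR_ALGORITHMS=1`; other standard libraries ignore the macro. `time_replace_ns` is a hypothetical helper, and this kind of single-loop timing is much noisier than a proper benchmark harness.

```cpp
// Assumption: the macro is set on the compiler command line so that it is
// visible before any standard header is included.
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: average nanoseconds per std::replace call.
// Each repetition replaces 0 -> 4 and then 4 -> 0, so for an input that
// contains no 4s every call has matches and the vector ends up unchanged.
template <class T>
double time_replace_ns(std::vector<T>& v, int reps) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < reps; ++r) {
        std::replace(v.begin(), v.end(), T{0}, T{4});
        std::replace(v.begin(), v.end(), T{4}, T{0});
    }
    const std::chrono::duration<double, std::nano> elapsed = clock::now() - start;
    return elapsed.count() / (2.0 * reps);
}
```

Passing the vector by reference keeps the writes observable by the caller, which makes it harder for the optimizer to discard the loop.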

StephanTLavavej · 2024-11-13T22:15:50Z

We talked about this at the weekly maintainer meeting:

We decided to close Vectorize remove_copy for 4 and 8 byte elements #5062 without merging.
Thanks @muellerj2 for checking vectorize replace 🎭 #4554 on Zen 4. Although I was a bad kitty and didn't benchmark that PR on my Zen 3 before merging, it looks like this isn't a pessimization, so I got away with it 😹
At this time, we prefer to avoid vendor-specific logic in the STL.

AlexGuteniev added the question Further information is requested label Nov 6, 2024

StephanTLavavej added the resolved Successfully resolved without a commit label Nov 13, 2024

StephanTLavavej closed this as completed Nov 13, 2024
