AVX2 mask store timings are bad on recent AMDs. The benchmark results #5062 (comment) seem to confirm that #5062 is a pessimization for AMD. In addition to the algorithm currently in review, we have one that is already accepted.

Questions:

- Should remove_copy for 4 and 8 byte elements #5062 be closed? Or is it still fine to optimize one vendor while somewhat pessimizing the other?
- Should vectorize replace 🎭 #4554 be reevaluated on an AMD? (The #define _USE_STD_VECTOR_ALGORITHMS 0 escape hatch can be used to simulate the "before" state; see the sketch after this list.) It is less likely that this one makes things worse, as the vectorization advantage was bigger there.
- Is it right that we don't do vendor detection using the cpuid instruction? (A sketch of what such detection would involve also follows the list.)
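For concreteness, here is a minimal sketch of the escape hatch. The benchmark body is hypothetical; the point is only that _USE_STD_VECTOR_ALGORITHMS has to be defined before any standard header is included (or passed on the compiler command line):

```cpp
// Simulate the pre-vectorization ("before") state by opting out of the STL's
// vector algorithms. The macro must be defined before any standard header is
// included; /D_USE_STD_VECTOR_ALGORITHMS=0 on the command line works too.
#define _USE_STD_VECTOR_ALGORITHMS 0

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical micro-benchmark body: time std::replace over a large buffer.
    std::vector<int> v(1 << 24, 0);
    for (std::size_t i = 0; i < v.size(); i += 2) {
        v[i] = 1;
    }

    const auto start = std::chrono::steady_clock::now();
    std::replace(v.begin(), v.end(), 1, 2);
    const auto stop = std::chrono::steady_clock::now();

    std::printf("std::replace took %lld us\n",
        static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()));
}
```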
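For context on the cpuid question: vendor detection is only a few lines with the MSVC __cpuid intrinsic. A sketch of what such vendor-specific logic would involve, shown purely to illustrate what the question is asking about:

```cpp
// Vendor detection via cpuid (MSVC intrinsic). Leaf 0 returns the vendor
// string across EBX, EDX, ECX ("AuthenticAMD" / "GenuineIntel").
#include <intrin.h>

#include <cstdio>
#include <cstring>

bool is_amd() {
    int regs[4]; // EAX, EBX, ECX, EDX
    __cpuid(regs, 0);

    char vendor[13];
    std::memcpy(vendor, &regs[1], 4);     // EBX
    std::memcpy(vendor + 4, &regs[3], 4); // EDX
    std::memcpy(vendor + 8, &regs[2], 4); // ECX
    vendor[12] = '\0';

    return std::strcmp(vendor, "AuthenticAMD") == 0;
}

int main() {
    std::printf("Running on AMD: %s\n", is_amd() ? "yes" : "no");
}
```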
Note that we also use masked loads, but I don't have concerns about them:

- They are bad only on AMDs before Zen 2, see the timings.
- They are used to process tails, not the whole range (see the sketch after this list).
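To illustrate the tail point: a masked load/store only touches the last few elements that don't fill a whole vector, so its per-element cost matters far less than it would in the main loop. Here is a sketch of that pattern with AVX2 intrinsics (just the shape of the technique, not the STL's actual code; needs /arch:AVX2 or -mavx2):

```cpp
// Process a range of 32-bit ints: full-width unmasked vectors in the main
// loop, a single masked load + masked store for the tail.
#include <immintrin.h>

#include <cstddef>

void add_one(int* data, std::size_t count) {
    std::size_t i = 0;

    // Main loop: whole 8-element vectors, no masking involved.
    for (; i + 8 <= count; i += 8) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        v = _mm256_add_epi32(v, _mm256_set1_epi32(1));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(data + i), v);
    }

    // Tail: activate only the remaining lanes (high bit of the mask element
    // set), then do one masked load and one masked store.
    const std::size_t tail = count - i;
    if (tail != 0) {
        alignas(32) int mask_bits[8] = {};
        for (std::size_t lane = 0; lane < tail; ++lane) {
            mask_bits[lane] = -1;
        }
        const __m256i mask = _mm256_load_si256(reinterpret_cast<const __m256i*>(mask_bits));

        __m256i v = _mm256_maskload_epi32(data + i, mask);
        v = _mm256_add_epi32(v, _mm256_set1_epi32(1));
        _mm256_maskstore_epi32(data + i, mask, v);
    }
}
```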
Thanks @muellerj2 for checking vectorize replace 🎭 #4554 on Zen 4. Although I was a bad kitty and didn't benchmark that PR on my Zen 3 before merging, it looks like this isn't a pessimization, so I got away with it 😹
At this time, we prefer to avoid vendor-specific logic in the STL.