Use AVX2 in minmax_element
vectorization
#4659
I also dropped the
With the extra dispatcher, the inlining decisions are different. Now the dispatcher is inlined into the exported functions, along with the scalar implementation. The vector implementations are tail-called, and signature variations are unlikely to prevent that.
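To make the dispatcher arrangement described above concrete, here is a minimal sketch, not the code from this PR. All names (`MinMax32`, `minmax_scalar`, `minmax_avx2`, `minmax_i32`) are hypothetical, it returns min/max values for `int32_t` only rather than element positions, and it assumes GCC or Clang on x86 for `__builtin_cpu_supports` and the `target` attribute; on other toolchains only the scalar path is compiled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: the AVX2 path is only built with GCC/Clang on x86.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAS_AVX2_PATH 1
#else
#define SKETCH_HAS_AVX2_PATH 0
#endif

// Hypothetical result type: smallest and largest value in the range.
struct MinMax32 {
    std::int32_t min;
    std::int32_t max;
};

// Scalar implementation; small enough to be inlined into its callers.
inline MinMax32 minmax_scalar(const std::int32_t* first, const std::int32_t* last, MinMax32 acc) {
    for (; first != last; ++first) {
        if (*first < acc.min) acc.min = *first;
        if (*first > acc.max) acc.max = *first;
    }
    return acc;
}

#if SKETCH_HAS_AVX2_PATH
// Vector implementation, kept out of line. Precondition: at least 8 elements.
__attribute__((target("avx2"), noinline))
MinMax32 minmax_avx2(const std::int32_t* first, const std::int32_t* last) {
    __m256i vmin = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(first));
    __m256i vmax = vmin;
    first += 8;
    for (; last - first >= 8; first += 8) {
        const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(first));
        vmin = _mm256_min_epi32(vmin, v);
        vmax = _mm256_max_epi32(vmax, v);
    }
    // Horizontal reduction through memory, kept simple for the sketch.
    alignas(32) std::int32_t lo[8];
    alignas(32) std::int32_t hi[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lo), vmin);
    _mm256_store_si256(reinterpret_cast<__m256i*>(hi), vmax);
    MinMax32 acc{lo[0], hi[0]};
    for (int i = 1; i < 8; ++i) {
        if (lo[i] < acc.min) acc.min = lo[i];
        if (hi[i] > acc.max) acc.max = hi[i];
    }
    // Fewer than 8 elements remain; finish them with the scalar loop.
    return minmax_scalar(first, last, acc);
}
#endif

// "Exported" entry point: the dispatch check and the scalar loop live here,
// while the vector implementation is called in tail position.
MinMax32 minmax_i32(const std::int32_t* first, const std::int32_t* last) {
    assert(first != last); // the sketch requires a non-empty range
#if SKETCH_HAS_AVX2_PATH
    if (last - first >= 8 && __builtin_cpu_supports("avx2")) {
        return minmax_avx2(first, last);
    }
#endif
    return minmax_scalar(first + 1, last, MinMax32{*first, *first});
}
```

Because the call to `minmax_avx2` is the last thing the dispatcher does on that path, a compiler is free to emit it as a jump instead of a call. Whether it actually does so depends on the compiler and optimization level.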
for at least minmax
Results as a table
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
AVX2 fast, AVX2 furious! 🚗 🚙 🏎️
Resolves #2803
This is not the final optimization. At the very least, we should use AVX masks here too.
But this one is complex enough already, so the rest would be follow-up PR(s).
I also notice that the Both_val 8-bit and 16-bit cases are slow. The vectorization is not engaged for them; that is a separate issue from the AVX work.
Benchmark results
Before:
After: