Use AVX2 in `minmax_element` vectorization #4659

AlexGuteniev · 2024-05-07T07:24:34Z

Resolves #2803

This is not the final optimization: at a minimum, AVX masks should be used here too.
But this change is already complex enough, so the rest is left for follow-up PR(s).

I also noticed that the 8-bit and 16-bit `Both_val` cases are slow.
Vectorization is not engaged for them at all, which is a separate issue from the AVX work.
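
For readers who have not seen this kind of kernel, here is a minimal sketch of an AVX2 horizontal minimum for `int32_t`, roughly the `Min_val` case. It is not the code from this PR: `min_value_i32` is a hypothetical name, and the real implementation also tracks element positions, covers other element widths and signedness, and selects the code path at run time rather than at compile time. With eight 32-bit lanes per 256-bit register, the benchmark's 8021 elements split as 8021 = 1002 × 8 + 5, so five elements fall into the scalar tail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper, not the STL's internal function: returns the smallest
// value in [data, data + n). Requires n > 0.
inline std::int32_t min_value_i32(const std::int32_t* data, std::size_t n) {
    assert(n > 0);
    std::int32_t result = data[0];
    std::size_t i = 0;
#if defined(__AVX2__)
    if (n >= 8) {
        // Keep 8 running minimums, one per 32-bit lane.
        __m256i acc = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data));
        for (i = 8; i + 8 <= n; i += 8) {
            const __m256i cur =
                _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
            acc = _mm256_min_epi32(acc, cur);
        }
        // Horizontal reduction: 256 -> 128 -> 64 -> 32 bits.
        __m128i lo = _mm_min_epi32(_mm256_castsi256_si128(acc),
                                   _mm256_extracti128_si256(acc, 1));
        lo = _mm_min_epi32(lo, _mm_shuffle_epi32(lo, _MM_SHUFFLE(1, 0, 3, 2)));
        lo = _mm_min_epi32(lo, _mm_shuffle_epi32(lo, _MM_SHUFFLE(2, 3, 0, 1)));
        result = _mm_cvtsi128_si32(lo);
    }
#endif
    // Scalar tail (or the whole range when AVX2 is not enabled at compile time).
    for (; i < n; ++i) {
        result = std::min(result, data[i]);
    }
    return result;
}
```

The compile-time `__AVX2__` guard is only there to keep the sketch buildable everywhere; without it the function degrades to the plain scalar loop.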

Benchmark results

Before:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>              248 ns         96.3 ns      7466667
bm<uint8_t, 8021, Op::Max>              238 ns         96.9 ns      6291661
bm<uint8_t, 8021, Op::Both>             371 ns          124 ns      8432941
bm<uint8_t, 8021, Op::Min_val>          131 ns         56.6 ns     16000000
bm<uint8_t, 8021, Op::Max_val>          128 ns         67.0 ns      7466667
bm<uint8_t, 8021, Op::Both_val>        4197 ns         1611 ns       407273
bm<uint16_t, 8021, Op::Min>             459 ns          226 ns      4072727
bm<uint16_t, 8021, Op::Max>             445 ns          200 ns      3521123
bm<uint16_t, 8021, Op::Both>            685 ns          331 ns      2357895
bm<uint16_t, 8021, Op::Min_val>         247 ns          122 ns      4480000
bm<uint16_t, 8021, Op::Max_val>         252 ns          105 ns      7466667
bm<uint16_t, 8021, Op::Both_val>       4239 ns         3299 ns       203636
bm<uint32_t, 8021, Op::Min>             979 ns          364 ns      1544828
bm<uint32_t, 8021, Op::Max>             932 ns          586 ns      1120000
bm<uint32_t, 8021, Op::Both>           1439 ns          921 ns       746667
bm<uint32_t, 8021, Op::Min_val>         501 ns          321 ns      1947826
bm<uint32_t, 8021, Op::Max_val>         494 ns          414 ns      2036364
bm<uint32_t, 8021, Op::Both_val>        673 ns          500 ns      1000000
bm<uint64_t, 8021, Op::Min>            4252 ns         3128 ns       344615
bm<uint64_t, 8021, Op::Max>            4360 ns         2567 ns       280000
bm<uint64_t, 8021, Op::Both>           4397 ns         2783 ns       213333
bm<uint64_t, 8021, Op::Min_val>        3844 ns         2441 ns       320000
bm<uint64_t, 8021, Op::Max_val>        3857 ns         2518 ns       235789
bm<uint64_t, 8021, Op::Both_val>       3862 ns         2319 ns       235789
bm<int8_t, 8021, Op::Min>               246 ns          176 ns      3733333
bm<int8_t, 8021, Op::Max>               235 ns          179 ns      4977778
bm<int8_t, 8021, Op::Both>              361 ns          239 ns      3200000
bm<int8_t, 8021, Op::Min_val>           126 ns         89.3 ns     11200000
bm<int8_t, 8021, Op::Max_val>           128 ns         94.2 ns      7466667
bm<int8_t, 8021, Op::Both_val>         3842 ns         2441 ns       224000
bm<int16_t, 8021, Op::Min>              460 ns          321 ns      2968994
bm<int16_t, 8021, Op::Max>              445 ns          308 ns      2133333
bm<int16_t, 8021, Op::Both>             683 ns          517 ns      1723077
bm<int16_t, 8021, Op::Min_val>          251 ns          176 ns      6400000
bm<int16_t, 8021, Op::Max_val>          249 ns          174 ns      5575111
bm<int16_t, 8021, Op::Both_val>        3318 ns         1709 ns       320000
bm<int32_t, 8021, Op::Min>              965 ns          719 ns      1000000
bm<int32_t, 8021, Op::Max>              903 ns          628 ns      1120000
bm<int32_t, 8021, Op::Both>            1405 ns          893 ns      1120000
bm<int32_t, 8021, Op::Min_val>          497 ns          307 ns      2036364
bm<int32_t, 8021, Op::Max_val>          505 ns          377 ns      1947826
bm<int32_t, 8021, Op::Both_val>         690 ns          401 ns      1792000
bm<int64_t, 8021, Op::Min>             4466 ns         3024 ns       263529
bm<int64_t, 8021, Op::Max>             4385 ns         2860 ns       224000
bm<int64_t, 8021, Op::Both>            4845 ns         2567 ns       280000
bm<int64_t, 8021, Op::Min_val>         5156 ns         1883 ns      1120000
bm<int64_t, 8021, Op::Max_val>         4003 ns         1664 ns       497778
bm<int64_t, 8021, Op::Both_val>        3847 ns         2052 ns       464593
bm<float, 8021, Op::Min>               1965 ns          928 ns       640000
bm<float, 8021, Op::Max>               2014 ns          949 ns       560000
bm<float, 8021, Op::Both>              2254 ns         1067 ns       746667
bm<float, 8021, Op::Min_val>           1870 ns          767 ns      1120000
bm<float, 8021, Op::Max_val>           1838 ns          984 ns       746667
bm<float, 8021, Op::Both_val>          1886 ns          894 ns       768981
bm<double, 8021, Op::Min>              3931 ns         2009 ns       373333
bm<double, 8021, Op::Max>              4004 ns         1646 ns       560000
bm<double, 8021, Op::Both>             4809 ns         1842 ns       280000
bm<double, 8021, Op::Min_val>          3776 ns         1855 ns       320000
bm<double, 8021, Op::Max_val>          3769 ns         1807 ns       320000
bm<double, 8021, Op::Both_val>         3850 ns         1674 ns       448000

After:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>              170 ns         80.2 ns      8960000
bm<uint8_t, 8021, Op::Max>              170 ns         82.3 ns     11200000
bm<uint8_t, 8021, Op::Both>             295 ns         85.0 ns      7719385
bm<uint8_t, 8021, Op::Min_val>         76.7 ns         26.2 ns     29866667
bm<uint8_t, 8021, Op::Max_val>         75.4 ns         28.0 ns     40727273
bm<uint8_t, 8021, Op::Both_val>        4312 ns         1283 ns       560000
bm<uint16_t, 8021, Op::Min>             336 ns          157 ns      8960000
bm<uint16_t, 8021, Op::Max>             318 ns          146 ns      7466667
bm<uint16_t, 8021, Op::Both>            553 ns          231 ns      3446154
bm<uint16_t, 8021, Op::Min_val>         142 ns         69.8 ns     22400000
bm<uint16_t, 8021, Op::Max_val>         139 ns         68.8 ns     10000000
bm<uint16_t, 8021, Op::Both_val>       4306 ns         2574 ns       248889
bm<uint32_t, 8021, Op::Min>             615 ns          363 ns      3009055
bm<uint32_t, 8021, Op::Max>             623 ns          325 ns      2357895
bm<uint32_t, 8021, Op::Both>           1071 ns          609 ns      1000000
bm<uint32_t, 8021, Op::Min_val>         258 ns         64.3 ns     12389136
bm<uint32_t, 8021, Op::Max_val>         258 ns         58.6 ns     11200000
bm<uint32_t, 8021, Op::Both_val>        374 ns          100 ns     10000000
bm<uint64_t, 8021, Op::Min>            3540 ns          663 ns       896000
bm<uint64_t, 8021, Op::Max>            3468 ns          922 ns      1000000
bm<uint64_t, 8021, Op::Both>           4271 ns          907 ns      1120000
bm<uint64_t, 8021, Op::Min_val>        2917 ns          680 ns       896000
bm<uint64_t, 8021, Op::Max_val>        2974 ns          797 ns      1000000
bm<uint64_t, 8021, Op::Both_val>       3090 ns          893 ns      1120000
bm<int8_t, 8021, Op::Min>               177 ns         38.1 ns     16000000
bm<int8_t, 8021, Op::Max>               177 ns         48.1 ns     14933333
bm<int8_t, 8021, Op::Both>              288 ns         85.4 ns      6400000
bm<int8_t, 8021, Op::Min_val>          77.4 ns         29.9 ns     20363636
bm<int8_t, 8021, Op::Max_val>          74.1 ns         22.6 ns     37333333
bm<int8_t, 8021, Op::Both_val>         3843 ns         1086 ns       719369
bm<int16_t, 8021, Op::Min>              339 ns         69.8 ns      8960000
bm<int16_t, 8021, Op::Max>              321 ns         73.4 ns     10000000
bm<int16_t, 8021, Op::Both>             553 ns          141 ns     11200000
bm<int16_t, 8021, Op::Min_val>          140 ns         49.8 ns     16000000
bm<int16_t, 8021, Op::Max_val>          139 ns         45.6 ns     20906668
bm<int16_t, 8021, Op::Both_val>        3377 ns          628 ns       746667
bm<int32_t, 8021, Op::Min>              620 ns          178 ns      7466667
bm<int32_t, 8021, Op::Max>              625 ns          159 ns      5120000
bm<int32_t, 8021, Op::Both>            1059 ns          261 ns      2635294
bm<int32_t, 8021, Op::Min_val>          254 ns         62.5 ns     10000000
bm<int32_t, 8021, Op::Max_val>          254 ns         64.5 ns      8960000
bm<int32_t, 8021, Op::Both_val>         379 ns          100 ns     11200000
bm<int64_t, 8021, Op::Min>             3532 ns         1287 ns       497778
bm<int64_t, 8021, Op::Max>             3462 ns         1221 ns       448000
bm<int64_t, 8021, Op::Both>            4130 ns         1953 ns       640000
bm<int64_t, 8021, Op::Min_val>         3011 ns         1726 ns       298667
bm<int64_t, 8021, Op::Max_val>         2945 ns         1632 ns       497778
bm<int64_t, 8021, Op::Both_val>        3264 ns         1378 ns       669014
bm<float, 8021, Op::Min>               1176 ns          401 ns      1947826
bm<float, 8021, Op::Max>               1208 ns          750 ns      1000000
bm<float, 8021, Op::Both>              1358 ns         1151 ns       746667
bm<float, 8021, Op::Min_val>            894 ns          802 ns       896000
bm<float, 8021, Op::Max_val>            891 ns          310 ns      4683093
bm<float, 8021, Op::Both_val>           949 ns          844 ns      1000000
bm<double, 8021, Op::Min>              2230 ns         1918 ns       448000
bm<double, 8021, Op::Max>              2421 ns         2134 ns       373333
bm<double, 8021, Op::Both>             2765 ns         2532 ns       407273
bm<double, 8021, Op::Min_val>          1860 ns         1674 ns       448000
bm<double, 8021, Op::Max_val>          1869 ns         1500 ns       448000
bm<double, 8021, Op::Both_val>         1939 ns         1381 ns       407273

AlexGuteniev · 2024-05-07T10:52:58Z

I also dropped the `bool _Unused` parameter.

With the extra dispatcher, the inlining decisions are different: the dispatcher is now inlined into the exported functions, along with the scalar implementation. The vector implementations are tail-called, and signature variations are unlikely to prevent that.

for at least minmax
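
The call structure described in this comment can be sketched roughly as follows. Every name here (`MinMax`, `CpuFeatures`, `minmax_dispatch`, the `*_stub` functions, `exported_minmax_i32`) is hypothetical, the "vector" functions are scalar placeholders, and whether a given compiler really inlines the dispatcher and emits tail calls is not guaranteed by the language. The point is only the shape: a small dispatcher whose last action on each vector path is a call to a separate function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Hypothetical result and feature-flag types, for illustration only.
struct MinMax {
    std::int32_t min;
    std::int32_t max;
};
struct CpuFeatures {
    bool avx2;
    bool sse42;
};

// Scalar implementation: small enough to be inlined into the caller.
inline MinMax minmax_scalar(const std::int32_t* first, const std::int32_t* last) {
    MinMax r{*first, *first};
    for (const std::int32_t* p = first + 1; p != last; ++p) {
        if (*p < r.min) r.min = *p;
        if (*p > r.max) r.max = *p;
    }
    return r;
}

// Placeholders standing in for the out-of-line vector implementations.
SKETCH_NOINLINE MinMax minmax_avx2_stub(const std::int32_t* first, const std::int32_t* last) {
    return minmax_scalar(first, last);
}
SKETCH_NOINLINE MinMax minmax_sse_stub(const std::int32_t* first, const std::int32_t* last) {
    return minmax_scalar(first, last);
}

// Dispatcher: each vector path ends in a call, which makes it a tail-call candidate.
inline MinMax minmax_dispatch(const std::int32_t* first, const std::int32_t* last,
                              const CpuFeatures& cpu) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (cpu.avx2 && n >= 8) return minmax_avx2_stub(first, last);
    if (cpu.sse42 && n >= 4) return minmax_sse_stub(first, last);
    return minmax_scalar(first, last);
}

// Stand-in for an exported entry point. Requires a non-empty range.
MinMax exported_minmax_i32(const std::int32_t* first, const std::int32_t* last,
                           const CpuFeatures& cpu) {
    assert(first != last);
    return minmax_dispatch(first, last, cpu);
}
```

Passing the CPU features as a parameter is a simplification for the sketch; it avoids depending on any real feature-detection API.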

AlexGuteniev · 2024-05-09T08:36:01Z

Results as a table

Benchmark	Before	After
bm<uint8_t, 8021, Op::Min>	248 ns	170 ns
bm<uint8_t, 8021, Op::Max>	238 ns	170 ns
bm<uint8_t, 8021, Op::Both>	371 ns	295 ns
bm<uint8_t, 8021, Op::Min_val>	131 ns	76.7 ns
bm<uint8_t, 8021, Op::Max_val>	128 ns	75.4 ns
bm<uint8_t, 8021, Op::Both_val>	4197 ns	4312 ns
bm<uint16_t, 8021, Op::Min>	459 ns	336 ns
bm<uint16_t, 8021, Op::Max>	445 ns	318 ns
bm<uint16_t, 8021, Op::Both>	685 ns	553 ns
bm<uint16_t, 8021, Op::Min_val>	247 ns	142 ns
bm<uint16_t, 8021, Op::Max_val>	252 ns	139 ns
bm<uint16_t, 8021, Op::Both_val>	4239 ns	4306 ns
bm<uint32_t, 8021, Op::Min>	979 ns	615 ns
bm<uint32_t, 8021, Op::Max>	932 ns	623 ns
bm<uint32_t, 8021, Op::Both>	1439 ns	1071 ns
bm<uint32_t, 8021, Op::Min_val>	501 ns	258 ns
bm<uint32_t, 8021, Op::Max_val>	494 ns	258 ns
bm<uint32_t, 8021, Op::Both_val>	673 ns	374 ns
bm<uint64_t, 8021, Op::Min>	4252 ns	3540 ns
bm<uint64_t, 8021, Op::Max>	4360 ns	3468 ns
bm<uint64_t, 8021, Op::Both>	4397 ns	4271 ns
bm<uint64_t, 8021, Op::Min_val>	3844 ns	2917 ns
bm<uint64_t, 8021, Op::Max_val>	3857 ns	2974 ns
bm<uint64_t, 8021, Op::Both_val>	3862 ns	3090 ns
bm<int8_t, 8021, Op::Min>	246 ns	177 ns
bm<int8_t, 8021, Op::Max>	235 ns	177 ns
bm<int8_t, 8021, Op::Both>	361 ns	288 ns
bm<int8_t, 8021, Op::Min_val>	126 ns	77.4 ns
bm<int8_t, 8021, Op::Max_val>	128 ns	74.1 ns
bm<int8_t, 8021, Op::Both_val>	3842 ns	3843 ns
bm<int16_t, 8021, Op::Min>	460 ns	339 ns
bm<int16_t, 8021, Op::Max>	445 ns	321 ns
bm<int16_t, 8021, Op::Both>	683 ns	553 ns
bm<int16_t, 8021, Op::Min_val>	251 ns	140 ns
bm<int16_t, 8021, Op::Max_val>	249 ns	139 ns
bm<int16_t, 8021, Op::Both_val>	3318 ns	3377 ns
bm<int32_t, 8021, Op::Min>	965 ns	620 ns
bm<int32_t, 8021, Op::Max>	903 ns	625 ns
bm<int32_t, 8021, Op::Both>	1405 ns	1059 ns
bm<int32_t, 8021, Op::Min_val>	497 ns	254 ns
bm<int32_t, 8021, Op::Max_val>	505 ns	254 ns
bm<int32_t, 8021, Op::Both_val>	690 ns	379 ns
bm<int64_t, 8021, Op::Min>	4466 ns	3532 ns
bm<int64_t, 8021, Op::Max>	4385 ns	3462 ns
bm<int64_t, 8021, Op::Both>	4845 ns	4130 ns
bm<int64_t, 8021, Op::Min_val>	5156 ns	3011 ns
bm<int64_t, 8021, Op::Max_val>	4003 ns	2945 ns
bm<int64_t, 8021, Op::Both_val>	3847 ns	3264 ns
bm<float, 8021, Op::Min>	1965 ns	1176 ns
bm<float, 8021, Op::Max>	2014 ns	1208 ns
bm<float, 8021, Op::Both>	2254 ns	1358 ns
bm<float, 8021, Op::Min_val>	1870 ns	894 ns
bm<float, 8021, Op::Max_val>	1838 ns	891 ns
bm<float, 8021, Op::Both_val>	1886 ns	949 ns
bm<double, 8021, Op::Min>	3931 ns	2230 ns
bm<double, 8021, Op::Max>	4004 ns	2421 ns
bm<double, 8021, Op::Both>	4809 ns	2765 ns
bm<double, 8021, Op::Min_val>	3776 ns	1860 ns
bm<double, 8021, Op::Max_val>	3769 ns	1869 ns
bm<double, 8021, Op::Both_val>	3850 ns	1939 ns
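
To read the table as speedups, divide the Before time by the After time. The sketch below does that for four rows copied from the table above; `Row`, `speedup`, and `print_speedups` are hypothetical helper names. A ratio above 1 means the After build is faster, and the 8-bit `Both_val` row comes out slightly below 1, which matches the earlier note that vectorization is not engaged for that case.

```cpp
#include <cassert>
#include <cstdio>

// One table row: benchmark name plus wall-clock times in nanoseconds.
struct Row {
    const char* name;
    double before_ns;
    double after_ns;
};

// Values copied from the "Results as a table" rows above.
constexpr Row rows[] = {
    {"bm<uint8_t, 8021, Op::Min_val>", 131.0, 76.7},
    {"bm<uint8_t, 8021, Op::Both_val>", 4197.0, 4312.0},
    {"bm<int32_t, 8021, Op::Both_val>", 690.0, 379.0},
    {"bm<double, 8021, Op::Min_val>", 3776.0, 1860.0},
};

// Speedup factor: how many times faster the After build is.
constexpr double speedup(const Row& r) {
    return r.before_ns / r.after_ns;
}

inline void print_speedups() {
    for (const Row& r : rows) {
        std::printf("%-34s %.2fx\n", r.name, speedup(r));
    }
}
```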

stl/src/vector_algorithms.cpp

StephanTLavavej · 2024-06-14T03:13:09Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-06-18T05:01:04Z

AVX2 fast, AVX2 furious! 🚗 🚙 🏎️

AlexGuteniev added 5 commits May 6, 2024 20:51

additional dispatch level, separate SSE traits

ff5e762

additional dispatch does dispatch, scalar traits

ff019c1

move sse specifics to sse traits

cd2b29f

implement AVX optimization

560da3d

float indices reuse integer indices

a21d2df

AlexGuteniev requested a review from a team as a code owner May 7, 2024 07:24

template scalar traits

18a919f

AlexGuteniev mentioned this pull request May 7, 2024

minmax 8 and 16 bit elements are not vectorized #4660

Closed

Drop the fuse, it no longer works

0ec33d9

AlexGuteniev added 5 commits May 7, 2024 14:05

-newline

e05e57f

-const unused

e860b8b

This works for a wrong reason

f7d813b

More reuse!

2bf5931

Nearly forgot that one again!

5f045b2

StephanTLavavej added the performance Must go faster label May 7, 2024

StephanTLavavej self-assigned this May 7, 2024

AlexGuteniev added 2 commits May 8, 2024 08:24

fix ARM64EC build

5380995

make preprocessor comments consistent

6bee3f0

for at least minmax

AlexGuteniev mentioned this pull request May 13, 2024

vector_algorithms.cpp: minmax for 64-bit elements: replace ugly x86 workaround with a nice one #4661

Merged

AlexGuteniev added 3 commits May 13, 2024 09:14

Avoid extra variable

08e84f8

fix up previous change

7aa4da0

Merge remote-tracking branch 'upstream/main' into max_avx

7cb364f

StephanTLavavej requested changes Jun 7, 2024

StephanTLavavej removed their assignment Jun 7, 2024

StephanTLavavej mentioned this pull request Jun 7, 2024

Maintainer priorities #4700

Open

AlexGuteniev added 3 commits June 7, 2024 14:10

right AVX2 vpermq

8a024d6

-newline

6db0f49

unrolled and unchained

fbbad86

AlexGuteniev added 3 commits June 7, 2024 14:12

unused constant

97e670e

+newline

567eb8c

const

7bf105d

AlexGuteniev requested a review from StephanTLavavej June 7, 2024 11:15

Two more newlines.

d8df820

StephanTLavavej approved these changes Jun 7, 2024

StephanTLavavej self-assigned this Jun 14, 2024

StephanTLavavej merged commit f608853 into microsoft:main Jun 18, 2024
39 checks passed

AlexGuteniev deleted the max_avx branch June 18, 2024 05:30

StephanTLavavej mentioned this pull request Jun 19, 2024

Work around a compiler back-end assertion in vectorized minmax #4739

Merged

AlexGuteniev mentioned this pull request Jun 21, 2024

Floating minmax: fix negative zero handling and dedicated test coverage for arrays of +0.0 and -0.0 only #4734

Merged

hiraditya mentioned this pull request Oct 15, 2024

Vectorize minmax_element. llvm/llvm-project#112397

Open
