Optimize AVX512 parallel DGEMM performance #2646

wjc404 · 2020-06-06T07:01:52Z

Reduce the number of vector load instructions per iteration from "12 vbroadcastsd + 2 vmovupd" to "6 vbroadcastf32x4 + 4 vmovddup" (14 down to 10) to minimize power consumption and thermal throttling.
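
For readers unfamiliar with the four instructions named above, here is a minimal sketch that emulates their lane semantics on plain Python lists (8 doubles per 512-bit register). It is not the kernel code from this PR: the helper names, the `LOADS_OLD`/`LOADS_NEW` dictionaries, and the sample data are illustrative assumptions, and the sketch only shows what each load puts in a register and how the per-iteration counts compare.

```python
# Emulation of AVX-512 load semantics on Python lists (8 doubles = one zmm register).
# Helper names and sample data are hypothetical; this is not the OpenBLAS kernel.

def vbroadcastsd(mem, i):
    """Load one double and replicate it across all 8 lanes."""
    return [mem[i]] * 8

def vmovupd(mem, i):
    """Load 8 consecutive doubles unchanged."""
    return list(mem[i:i + 8])

def vbroadcastf32x4(mem, i):
    """Load a 128-bit chunk (two doubles) and replicate it four times."""
    return [mem[i], mem[i + 1]] * 4

def vmovddup(mem, i):
    """Load 8 doubles and duplicate each even-indexed element into the odd lane."""
    chunk = mem[i:i + 8]
    return [chunk[j - (j % 2)] for j in range(8)]

def lanewise_mul(x, y):
    """Lane-by-lane product, standing in for the multiply part of an FMA."""
    return [p * q for p, q in zip(x, y)]

# Per-iteration load counts quoted in the PR description.
LOADS_OLD = {"vbroadcastsd": 12, "vmovupd": 2}
LOADS_NEW = {"vbroadcastf32x4": 6, "vmovddup": 4}
old_total = sum(LOADS_OLD.values())
new_total = sum(LOADS_NEW.values())

# Sample data (assumed): eight values from one operand, two from the other.
a = [float(k) for k in range(1, 9)]
b = [10.0, 20.0]

dup = vmovddup(a, 0)           # a0, a0, a2, a2, a4, a4, a6, a6
pair = vbroadcastf32x4(b, 0)   # b0, b1, b0, b1, b0, b1, b0, b1
prod = lanewise_mul(dup, pair) # a0*b0, a0*b1, a2*b0, a2*b1, ...

print(old_total, new_total)
print(prod)
```

In this emulation, a single `vbroadcastf32x4` delivers two distinct operand values per register where `vbroadcastsd` delivers one, which is one plausible reason fewer loads are needed per iteration. How the actual kernel arranges and stores the resulting products is not described in this PR text.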

Performance of DGEMM with 26 threads on Intel Xeon Platinum 8269CY (transa = transb = N):

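| Dimension | GFLOPS-OpenBLAS-old | GFLOPS-OpenBLAS-new | GFLOPS-MKL2019 |
|-----------|---------------------|---------------------|----------------|
| 10000 | 1348 | 1396 | 1527 |
| 20000 | 1401 | 1443 | 1551 |
| 30000 | 1400 | 1449 | 1555 |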

Add files via upload

0e3ac4a

martin-frbg added this to the 0.3.10 milestone Jun 7, 2020

martin-frbg merged commit c3574ff into OpenMathLib:develop Jun 7, 2020

martin-frbg mentioned this pull request Jul 3, 2020

PyPy + NumPy + SVX causes a segfault #2705

Closed

martin-frbg mentioned this pull request Aug 27, 2020

LU and eigen routines slower than MKL #2795

Open

loveshack mentioned this pull request Sep 1, 2020

SKX throttling possible improvement flame/blis#441

Closed
