A C++ implementation of the BLAS general matrix-matrix multiplication routines (xgemm).
The library is header-only: to use it, just copy the files located in the include folder.
To build the tests and benchmarks, you will need OpenBLAS installed on your system.
The tests and benchmarks can then be built and run with CMake:
$ cmake -S . -B build -DGEMM_BUILD_TEST=ON -DGEMM_BUILD_BENCHMARK=ON # -DGEMM_USE_CTEST=ON
$ cmake --build build --target test --target benchmark
$ ./build/test/test
$ ./build/benchmark/benchmark --reporter=benchmark
or with xmake:
$ xmake
$ xmake build
$ xmake run test
$ xmake run benchmark --reporter=benchmark
Some benchmark results are available here. They were run on an AMD Ryzen 5 3500U CPU locked at 2100 MHz. The program was compiled with GCC 13.1.1 and the -O3 -march=native options.
In the benchmarks, the single-precision version gemm<float> is run against OpenBLAS' sgemm and, for small enough matrices, against a naive algorithm (with the loop swapping optimization).
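As an illustration of what such a naive baseline could look like, here is a minimal sketch of a row-major multiplication with the two inner loops swapped (i-k-j order instead of i-j-k). The function name `naive_gemm` and its signature are assumptions made for this example; the actual baseline used in the benchmarks may differ.

```cpp
#include <cstddef>

// Naive row-major GEMM sketch: C = A * B,
// with A of size m x k, B of size k x n and C of size m x n.
// Hypothetical helper, not part of the library's interface.
void naive_gemm(std::size_t m, std::size_t n, std::size_t k,
                const float* a, const float* b, float* c)
{
    for (std::size_t i = 0; i < m; ++i) {
        // Clear the current row of C before accumulating into it.
        for (std::size_t j = 0; j < n; ++j)
            c[i * n + j] = 0.0f;

        // Loop swapping: p (the reduction index) is the middle loop, so the
        // innermost loop walks contiguous rows of both B and C.
        for (std::size_t p = 0; p < k; ++p) {
            const float a_ip = a[i * k + p];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a_ip * b[p * n + j];
        }
    }
}
```

With the textbook i-j-k order, the innermost loop would read B with a stride of n elements. Swapping the loops keeps the inner accesses contiguous, which is usually friendlier to the cache and to auto-vectorization.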
The number of cycles per computed matrix entry can be plotted by running the following commands:
$ ./build/benchmark/benchmark --reporter=plot::out=metrics.json
$ python3 benchmark/plot.py metrics.json -o plot.svg
This gives the following plots for the benchmark run mentioned above:
- Integrate the kernels into the main matrix multiplication function
  - Have a working implementation
  - Fix performance issues
- Microkernels
  - Kernel composition function
  - 1x(1, 2, 4, 8)x(1, 2, 4, 8) kernels
  - 2x(1, 2, 4, 8)x(1, 2, 4, 8) kernels
  - 4x(1, 2, 4, 8)x(1, 2, 4, 8) kernels
  - 8x(1, 2, 4, 8)x(1, 2, 4, 8) kernels
- Tests
  - Kernels
  - Big matrices
  - Fix precision issues
  - Double
- Benchmarks
  - Small matrices
  - Big matrices
  - Double
  - Plots
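
For readers unfamiliar with the microkernel items in the list above, here is a rough sketch of what a fixed-size kernel could look like. It assumes that the three sizes in names such as 2x4x8 denote the tile dimensions M, N and K, in that order. That reading, the name `microkernel` and the leading-dimension parameters are all assumptions for this example, not the library's actual interface.

```cpp
#include <cstddef>

// Hypothetical fixed-size microkernel: C (M x N) += A (M x K) * B (K x N).
// All matrices are row-major; lda, ldb and ldc are the row strides of the
// larger matrices that the tiles are taken from.
template <std::size_t M, std::size_t N, std::size_t K>
void microkernel(const float* a, std::size_t lda,
                 const float* b, std::size_t ldb,
                 float* c, std::size_t ldc)
{
    // Accumulate in a small local tile; since M and N are compile-time
    // constants, the compiler is free to unroll and keep it in registers.
    float acc[M][N] = {};

    for (std::size_t p = 0; p < K; ++p)
        for (std::size_t i = 0; i < M; ++i)
            for (std::size_t j = 0; j < N; ++j)
                acc[i][j] += a[i * lda + p] * b[p * ldb + j];

    // Add the accumulated tile to the output.
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            c[i * ldc + j] += acc[i][j];
}
```

A call such as `microkernel<2, 4, 8>(a, lda, b, ldb, c, ldc)` would then update one 2x4 tile of C. A full multiplication would cover C with such tiles and pick smaller sizes (1, 2, 4) for the leftover edges.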