Add alternative strategy for batched matrix multiplication #51

robertknight · 2024-02-04T23:00:10Z

Previously, batched matrix multiplication was handled by prepacking one or neither of the inputs, depending on how often each is re-used, and then performing one `gemm` call per matrix in the output shape. This is inefficient if the A input has only a small number of rows (as in #50). This PR implements a new strategy for the MatMul operator for the case where A is a batch and B is a single matrix: the A input is reshaped so that, instead of many `gemm` calls with low arithmetic intensity, a single call with higher arithmetic intensity is performed. The output is then reshaped to restore the batch dimensions.
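The two strategies can be compared with a minimal sketch. It assumes contiguous row-major `f32` buffers and uses a naive `gemm` as a stand-in for the optimized kernel; the function names `gemm`, `batched_matmul_per_matrix` and `batched_matmul_reshaped` are hypothetical and are not the names used in this PR.

```rust
/// Naive row-major GEMM: out (m x n) = a (m x k) * b (k x n).
/// Hypothetical stand-in for an optimized, prepacking kernel.
fn gemm(out: &mut [f32], a: &[f32], b: &[f32], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(out.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
}

/// Baseline strategy: one gemm call per matrix in the batch.
/// `a` has shape (batch, m, k) and `b` has shape (k, n).
fn batched_matmul_per_matrix(
    a: &[f32], b: &[f32], batch: usize, m: usize, k: usize, n: usize,
) -> Vec<f32> {
    assert!(m * k > 0 && m * n > 0);
    assert_eq!(a.len(), batch * m * k);
    let mut out = vec![0.0f32; batch * m * n];
    for (a_mat, out_mat) in a.chunks(m * k).zip(out.chunks_mut(m * n)) {
        gemm(out_mat, a_mat, b, m, k, n);
    }
    out
}

/// Alternative strategy: treat `a` as a single (batch * m, k) matrix and
/// make one gemm call. Because the buffers are contiguous and row-major,
/// the "reshape" is only a change in how the dimensions are interpreted,
/// and the (batch * m, n) result is already laid out as (batch, m, n).
fn batched_matmul_reshaped(
    a: &[f32], b: &[f32], batch: usize, m: usize, k: usize, n: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), batch * m * k);
    let mut out = vec![0.0f32; batch * m * n];
    gemm(&mut out, a, b, batch * m, k, n);
    out
}
```

Both functions produce the same values. The difference is that the second hands the kernel one matrix with `batch * m` rows rather than `batch` matrices with `m` rows each, which matters most when `m` is small.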

In tests with the benchmark added here and slight variations of it, the new method is a big improvement when M <= 8, a modest win for M of roughly 8 to 24, and roughly even or a very slight win beyond that. The AVX kernel has MR=6, so this seems as expected.
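One way to see why MR=6 fits these results is to estimate how many of the kernel's rows do useful work. The sketch below assumes that the kernel processes A in tiles of MR rows and that a partial tile costs as much as a full one. That is common in GEMM implementations, but it is an assumption here and is not confirmed by this PR. The function name `row_tile_utilization` is hypothetical.

```rust
/// Fraction of kernel rows doing useful work when `m` rows are processed
/// in tiles of `mr` rows and the last, partial tile is padded to `mr`.
/// Assumption: a partial tile costs the same as a full tile.
fn row_tile_utilization(m: usize, mr: usize) -> f64 {
    assert!(m > 0 && mr > 0);
    // Number of tiles needed to cover `m` rows, rounding up.
    let tiles = (m + mr - 1) / mr;
    m as f64 / (tiles * mr) as f64
}
```

Under that assumption, a row vector (M=1) uses 1 of 6 rows per call, while merging a batch of 16 such vectors into one M=16 call uses 16 of 18. As M grows well past MR the per-matrix calls already keep most tiles full, so the gain from merging shrinks.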

See #50

Batched matrix multiplication was handled by prepacking one or neither of the inputs, depending on how often each is re-used, and then performing one `gemm` call per matrix in the output shape. This can be inefficient if the LHS input has a small number of rows. For example, in [1] the LHS / "A" input is a row vector. In the case where the "A" input is a batch and the "B" input is a single matrix, the "A" input can be reshaped so that a single `gemm` call can be used, with the output reshaped afterwards to restore the batch dimensions. Implement this alternate approach and add a simple benchmark for batched matmul.

[1] #50

Refactor MatMul tests into a single table-driven test that has cases for when neither, one or both of the inputs is a batch. Also add tests for various invalid inputs.
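A table-driven test of this kind might look roughly like the sketch below. It is illustrative only and is not the test added in this PR: it checks a hypothetical `matmul_output_shape` helper rather than the operator itself, and for simplicity it assumes both inputs have at least two dimensions. The `Case` struct, the `ShapeError` enum and `run_cases` are likewise made up for the example.

```rust
#[derive(Debug, PartialEq)]
enum ShapeError {
    RankTooLow,
    InnerDimMismatch,
    BatchMismatch,
}

/// Hypothetical helper: output shape of a matmul with broadcast batch dims.
/// Simplification: inputs with fewer than two dimensions are rejected.
fn matmul_output_shape(a: &[usize], b: &[usize]) -> Result<Vec<usize>, ShapeError> {
    if a.len() < 2 || b.len() < 2 {
        return Err(ShapeError::RankTooLow);
    }
    let (a_batch, a_mat) = a.split_at(a.len() - 2);
    let (b_batch, b_mat) = b.split_at(b.len() - 2);
    if a_mat[1] != b_mat[0] {
        return Err(ShapeError::InnerDimMismatch);
    }
    // Broadcast batch dims, aligned from the right; missing dims count as 1.
    let rank = a_batch.len().max(b_batch.len());
    let dim = |dims: &[usize], i: usize| {
        if i + dims.len() >= rank { dims[i + dims.len() - rank] } else { 1 }
    };
    let mut out = Vec::with_capacity(rank + 2);
    for i in 0..rank {
        match (dim(a_batch, i), dim(b_batch, i)) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return Err(ShapeError::BatchMismatch),
        }
    }
    out.push(a_mat[0]);
    out.push(b_mat[1]);
    Ok(out)
}

/// One row of the test table.
struct Case {
    a: &'static [usize],
    b: &'static [usize],
    expected: Result<Vec<usize>, ShapeError>,
}

/// Runs every case in the table and returns how many were checked.
fn run_cases() -> usize {
    let cases = [
        // Neither input is a batch.
        Case { a: &[3, 4], b: &[4, 5], expected: Ok(vec![3, 5]) },
        // Only A is a batch.
        Case { a: &[2, 3, 4], b: &[4, 5], expected: Ok(vec![2, 3, 5]) },
        // Only B is a batch.
        Case { a: &[3, 4], b: &[2, 4, 5], expected: Ok(vec![2, 3, 5]) },
        // Both inputs are batches.
        Case { a: &[2, 3, 4], b: &[2, 4, 5], expected: Ok(vec![2, 3, 5]) },
        // Invalid inputs.
        Case { a: &[3, 4], b: &[5, 6], expected: Err(ShapeError::InnerDimMismatch) },
        Case { a: &[4], b: &[4, 5], expected: Err(ShapeError::RankTooLow) },
        Case { a: &[2, 3, 4], b: &[3, 4, 5], expected: Err(ShapeError::BatchMismatch) },
    ];
    for case in &cases {
        let actual = matmul_output_shape(case.a, case.b);
        assert_eq!(actual, case.expected, "a={:?} b={:?}", case.a, case.b);
    }
    cases.len()
}
```

Keeping the cases in one table makes it easy to see which combinations of batch and non-batch inputs are covered, and adding a new invalid input is a one-line change.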

robertknight force-pushed the optimize-batched-matmul branch from b142348 to 0a7344f Compare February 5, 2024 08:58

robertknight mentioned this pull request Feb 5, 2024

Comparison with tract #50

Closed

Pre-alloc result buffer in benchmark helper

13772c3

robertknight force-pushed the optimize-batched-matmul branch from 0a7344f to 74f13d9 Compare February 6, 2024 09:37

robertknight marked this pull request as ready for review February 6, 2024 20:22

robertknight added 2 commits February 6, 2024 20:30

Improve MatMul tests

e33f6a8


robertknight force-pushed the optimize-batched-matmul branch from abeda07 to e33f6a8 Compare February 6, 2024 20:33

robertknight merged commit 9047942 into main Feb 6, 2024
2 checks passed

robertknight deleted the optimize-batched-matmul branch February 6, 2024 20:55
