use thread-local register file for matmul speedups #205
Conversation
adding back the BK check
and some template param tuning
A is laid out column major in global memory
it's now outside the kernel, so we check only once instead of however many times. However, this bloats binary size. Also changed template parameters a bit
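The hoisted check might look something like the following sketch; the helper name and the exact condition are assumptions for illustration, not the actual llamafile code:

```cpp
#include <cassert>

// Hypothetical sketch: the BK sanity check runs once on the host before
// the kernel launch, instead of executing inside the kernel body.
template <int BK>
bool bk_ok(int k) {
    // the kernel walks the reduction dimension in BK-sized steps
    return BK > 0 && k % BK == 0;
}
```

Because the check depends on a template parameter, each tuned configuration gets its own instantiation, which is one way a change like this can grow the binary.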
Since we know that As and Bs are laid out one after another in memory (i.e. they are basically svals), and the overall dimension is (BM + BN) * BK, we just write one "nested" loop that does the zeroing
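A host-side sketch of the idea, with `float` standing in for `half` and illustrative tile sizes:

```cpp
#include <array>
#include <cassert>

// Illustrative tile sizes; the real values come from template parameters.
constexpr int BM = 4, BN = 8, BK = 2;

// As and Bs modeled as one contiguous buffer of (BM + BN) * BK values,
// mirroring how the two shared-memory tiles sit back to back ("svals").
std::array<float, (BM + BN) * BK> svals;

// One flattened loop zeroes both tiles at once. In the kernel each thread
// would take a strided slice, e.g. for (i = tid; i < n; i += blockDim.x).
void zero_tiles() {
    for (int i = 0; i < (BM + BN) * BK; ++i)
        svals[i] = 0.0f;
}
```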
they're not used in the __global__ functions anyway
it's now a specialization of matmul_block2d
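One way such a specialization can work is sketched below; the `TN == 1` choice and all names besides `matmul_block2d` are illustrative assumptions, not the actual llamafile code:

```cpp
#include <cassert>

// Hypothetical sketch: the earlier sub-row kernel corresponds to the
// general 2D-blocktiling template with a degenerate tile dimension.
template <int BM, int BN, int BK, int TM, int TN>
struct matmul_block2d {
    static constexpr int regs_per_thread = TM * TN;
};

// The old kernel becomes the TN == 1 instance of the general template:
// each thread accumulates a TM x 1 column instead of a TM x TN tile.
template <int BM, int BN, int BK, int TM>
using matmul_block1d = matmul_block2d<BM, BN, BK, TM, 1>;
```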
Summary: there appears to be a nice performance improvement, but the synchronization error needs to be found (and, if within tinyBLAS, fixed).
With this change, tinyBLAS is now outperforming rocBLAS on the graphics card for which we chose the tuning parameters earlier in d7cbaf7. See https://justine.lol/tinyblas-testing.txt, which shows some quick testing across platforms, versions, and hardware.
I understand that a determinism issue slipped through in a previous change, but that text file should show that the output still appears to be coherent, that we're crushing CPU at inference, and that determinism issues are nothing out of the ordinary for llama.cpp.
It's hard to find an objective yardstick with LLMs, especially given our resources, but we're doing the best we can. If there's a subtle issue with undefined behavior in the tinyBLAS code, then we'll nail it soon enough, and I think our users will be happy to have the performance in the meantime while we figure out the subtleties.
With this change we now more closely follow the 2D Blocktiling kernel from https://siboehm.com/articles/22/CUDA-MMM, but with additional bounds checking and parameter tuning for our use case. Other details:

- New template parameters `TM` and `TN` for allocating thread-local storage of size `TM x TN`, rather than a sub-row of size `BN`
- `static_assert`s to ensure the template parameters are OK
- `matmul_block2d` operates completely with `half` values
- `As` is now loaded as column major because that helps with filling out `At`

When tested on examples, this change speeds up `prompt eval time` by around 1.5x and slightly improves `eval time` for tinyBLAS (on CUDA).

However, while writing this change I found that there is a small synchronization error with tinyBLAS (on CUDA), possibly two threads writing to the same spot, which causes non-determinism in the output even when `--temp 0` is set and the seed is fixed. Hence I've opened it as a draft PR. If someone can help figure out where this error is happening, that would be nice.
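The register-file idea can be modeled on the host roughly as follows, with `float` standing in for `half`, illustrative tile sizes, and one function call playing the role of one CUDA thread:

```cpp
#include <cassert>

// Illustrative sizes: each "thread" owns a TM x TN register tile.
constexpr int BM = 4, BN = 4, BK = 3, TM = 2, TN = 2;

// As is BM x BK stored column major (element (row, k) at k * BM + row), so
// loading a thread's TM-long column fragment walks contiguous memory;
// Bs is BK x BN with rows contiguous.
void compute_tile(const float* As, const float* Bs,
                  int rowBase, int colBase, float* Ctile) {
    float regA[TM], regB[TN];   // thread-local "register file"
    float acc[TM][TN] = {};
    for (int k = 0; k < BK; ++k) {
        for (int i = 0; i < TM; ++i) regA[i] = As[k * BM + rowBase + i];
        for (int j = 0; j < TN; ++j) regB[j] = Bs[k * BN + colBase + j];
        // rank-1 update: TM * TN multiply-adds per pair of fragment loads
        for (int i = 0; i < TM; ++i)
            for (int j = 0; j < TN; ++j)
                acc[i][j] += regA[i] * regB[j];
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            Ctile[i * TN + j] = acc[i][j];
}
```

Each value fetched from shared memory is reused `TN` (or `TM`) times out of registers, which is where the gain over the sub-row scheme comes from.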