use thread-local register file for matmul speedups #205
Conversation
adding back the BK check
and some template param tuning
A is laid out column major in global memory
it's now outside the kernel, so we check only once instead of however many times. However, this bloats binary size. Also changed template parameters a bit
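The hoisted check might look something like the following sketch; the helper name and the exact condition are assumptions for illustration, not the actual llamafile code:

```cpp
#include <cassert>

// Hypothetical sketch: the BK sanity check runs once on the host before
// the kernel launch, instead of executing inside the kernel body.
template <int BK>
bool bk_ok(int k) {
    // the kernel walks the reduction dimension in BK-sized steps
    return BK > 0 && k % BK == 0;
}
```

Because the check depends on a template parameter, each tuned configuration gets its own instantiation, which is one way a change like this can grow the binary.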
Since we know that As and Bs are laid out one after another in memory (i.e. they are basically svals), and the overall dimension is (BM + BN) * BK, we just write one "nested" loop that does the zeroing
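A host-side sketch of the idea, with `float` standing in for `half` and illustrative tile sizes:

```cpp
#include <array>
#include <cassert>

// Illustrative tile sizes; the real values come from template parameters.
constexpr int BM = 4, BN = 8, BK = 2;

// As and Bs modeled as one contiguous buffer of (BM + BN) * BK values,
// mirroring how the two shared-memory tiles sit back to back ("svals").
std::array<float, (BM + BN) * BK> svals;

// One flattened loop zeroes both tiles at once. In the kernel each thread
// would take a strided slice, e.g. for (i = tid; i < n; i += blockDim.x).
void zero_tiles() {
    for (int i = 0; i < (BM + BN) * BK; ++i)
        svals[i] = 0.0f;
}
```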
they're not used in the __global__ functions anyway
it's now a specialization of matmul_block2d
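One way such a specialization can work is sketched below; the `TN == 1` choice and all names besides `matmul_block2d` are illustrative assumptions, not the actual llamafile code:

```cpp
#include <cassert>

// Hypothetical sketch: the earlier sub-row kernel corresponds to the
// general 2D-blocktiling template with a degenerate tile dimension.
template <int BM, int BN, int BK, int TM, int TN>
struct matmul_block2d {
    static constexpr int regs_per_thread = TM * TN;
};

// The old kernel becomes the TN == 1 instance of the general template:
// each thread accumulates a TM x 1 column instead of a TM x TN tile.
template <int BM, int BN, int BK, int TM>
using matmul_block1d = matmul_block2d<BM, BN, BK, TM, 1>;
```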
Summary: there appears to be a nice performance improvement, but the synchronization error needs to be found (and, if within tinyBLAS, fixed).
With this change, tinyBLAS is now outperforming rocBLAS on the graphics card for which we chose the tuning parameters earlier in d7cbaf7. See https://justine.lol/tinyblas-testing.txt, which shows some quick testing across platforms, versions, and hardware.
I understand that a determinism issue slipped through in a previous change, but that text file should show that the output still appears to be coherent, that we're crushing CPU at inference, and that determinism issues are nothing out of the ordinary for llama.cpp.
It's hard to find an objective yardstick with LLMs, especially given our resources, but we're doing the best we can. If there's a subtle issue with undefined behavior in the tinyBLAS code, then we'll nail it soon enough, and I think our users will be happy to have the performance in the meantime while we figure out the subtleties.
With this change we now more closely follow the 2D Blocktiling kernel from https://siboehm.com/articles/22/CUDA-MMM, but with additional bounds checking and parameter tuning for our use case. Other details:

- New template parameters `TM` and `TN` for allocating thread-local storage of size `TM x TN`, rather than a sub-row of size `BN`
- `static_assert`s to ensure the template parameters are OK
- `matmul_block2d` operates completely with `half` values
- `As` is now loaded as column major because that helps with filling out `At`

When tested on examples, this change speeds up `prompt eval time` by around 1.5x and slightly improves `eval time` for tinyBLAS (on CUDA).

However, while writing this change I found that there is a small synchronization error with tinyBLAS (on CUDA), possibly two threads writing to the same spot, which causes non-determinism in the output even when `--temp 0` is set and the seed is fixed. Hence I've opened it as a draft PR. If someone can help figure out where this error is happening, that would be nice.
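The register-file idea can be modeled on the host roughly as follows, with `float` standing in for `half`, illustrative tile sizes, and one function call playing the role of one CUDA thread:

```cpp
#include <cassert>

// Illustrative sizes: each "thread" owns a TM x TN register tile.
constexpr int BM = 4, BN = 4, BK = 3, TM = 2, TN = 2;

// As is BM x BK stored column major (element (row, k) at k * BM + row), so
// loading a thread's TM-long column fragment walks contiguous memory;
// Bs is BK x BN with rows contiguous.
void compute_tile(const float* As, const float* Bs,
                  int rowBase, int colBase, float* Ctile) {
    float regA[TM], regB[TN];   // thread-local "register file"
    float acc[TM][TN] = {};
    for (int k = 0; k < BK; ++k) {
        for (int i = 0; i < TM; ++i) regA[i] = As[k * BM + rowBase + i];
        for (int j = 0; j < TN; ++j) regB[j] = Bs[k * BN + colBase + j];
        // rank-1 update: TM * TN multiply-adds per pair of fragment loads
        for (int i = 0; i < TM; ++i)
            for (int j = 0; j < TN; ++j)
                acc[i][j] += regA[i] * regB[j];
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            Ctile[i * TN + j] = acc[i][j];
}
```

Each value fetched from shared memory is reused `TN` (or `TM`) times out of registers, which is where the gain over the sub-row scheme comes from.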