
use thread-local register file for matmul speedups #205

Merged
merged 12 commits on Jan 18, 2024

Conversation

ahgamut
Contributor

@ahgamut ahgamut commented Jan 16, 2024

With this change we now more closely follow the 2D Blocktiling kernel from https://siboehm.com/articles/22/CUDA-MMM, but with additional bounds checking and parameter tuning for our use case. Other details:

  • template parameters TM and TN for allocating thread-local storage
  • each thread now calculates values for a sub-matrix of size TM x TN rather than a sub-row of size BN
  • some static_asserts to ensure the template parameters are valid
  • matmul_block2d operates entirely on half values
  • As is now loaded column-major, which makes filling out At easier
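To illustrate the register-file idea, here is a CPU sketch (not the actual CUDA kernel; the function name, tile sizes, and float types are made up for the example): each "thread" owns a TM x TN accumulator tile kept in local storage, with bounds checks on the tile edges, and writes it back to C only once at the end.

```cpp
#include <array>
#include <cassert>
#include <vector>

// CPU sketch of 2D blocktiling: each "thread" accumulates a TM x TN
// sub-tile of C in a small local register file before writing it back.
// TM/TN here are illustrative, not the values chosen in the PR.
template <int TM, int TN>
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int M, int N, int K) {
  static_assert(TM > 0 && TN > 0, "tile sizes must be positive");
  for (int i0 = 0; i0 < M; i0 += TM) {
    for (int j0 = 0; j0 < N; j0 += TN) {
      std::array<float, TM * TN> acc{};  // thread-local register file
      for (int k = 0; k < K; ++k) {
        for (int i = 0; i < TM && i0 + i < M; ++i) {  // bounds-checked
          float a = A[(i0 + i) * K + k];
          for (int j = 0; j < TN && j0 + j < N; ++j)
            acc[i * TN + j] += a * B[k * N + (j0 + j)];
        }
      }
      for (int i = 0; i < TM && i0 + i < M; ++i)   // single write-back
        for (int j = 0; j < TN && j0 + j < N; ++j)
          C[(i0 + i) * N + (j0 + j)] = acc[i * TN + j];
    }
  }
}
```

The point of the TM x TN tile is arithmetic intensity: each loaded element of A is reused TN times (and each element of B, TM times) before the next load.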

When tested on examples, this change speeds up prompt eval time by around 1.5x and slightly improves eval time for tinyBLAS on CUDA.

However, while writing this change I found a small synchronization error in tinyBLAS (on CUDA), possibly two threads writing to the same spot, which causes non-determinism in the output even when --temp 0 is set and the seed is fixed. Hence I've opened this as a draft PR. If someone can help figure out where this error is happening, that would be nice.

@jart
Collaborator

jart commented Jan 16, 2024

The check is now outside the kernel, so it runs only once instead of however many times. However, this bloats binary size.

Other changes:

  • also changed the template parameters a bit; they're not used in the __global__ functions anyway
  • since we know that As and Bs are laid out one after another in memory (i.e. they are basically svals), and the overall dimension is (BM + BN) * BK, we just write one "nested" loop that does the zeroing
  • it's now a specialization of matmul_block2d
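The combined zeroing trick described above can be sketched in plain C++ (a host-side mock of the shared-memory layout; the struct name, tile sizes, and thread count are illustrative): because As and Bs sit back to back, one strided loop over (BM + BN) * BK elements clears both.

```cpp
#include <cassert>

// Illustrative tile sizes; the real BM/BN/BK are tuned per-GPU in the PR.
constexpr int BM = 4, BN = 8, BK = 2;

// Mock of the shared-memory tiles: As and Bs occupy one contiguous
// buffer (the "svals" view), with Bs starting right after As.
struct SharedTiles {
  float svals[(BM + BN) * BK];
  float* As() { return svals; }            // BM x BK tile
  float* Bs() { return svals + BM * BK; }  // BN x BK tile
};

// One strided loop zeroes both tiles, mimicking how each CUDA thread
// would clear every nthreads-th element of the combined region.
void zero_tiles(SharedTiles& s, int tid, int nthreads) {
  for (int i = tid; i < (BM + BN) * BK; i += nthreads)
    s.svals[i] = 0.0f;
}
```

In the real kernel the stride would be the thread-block size, so the zeroing costs one loop with no per-tile branching.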
@ahgamut
Contributor Author

ahgamut commented Jan 18, 2024

ok, this PR does improve performance, but there is a synchronization error causing some non-determinism in the output (my guess is it's due to one of f6ee33c, c2bc6e6, c0589f0 -- the commits from before 2D blocking don't seem to have this issue).

@ahgamut
Contributor Author

ahgamut commented Jan 18, 2024

here's the performance of this patch compared to main. most of the speed improvements are due to picking better BM/BN/BK values -- perhaps the benefits can be confirmed on other CUDA/AMD targets.

[screenshot: benchmark comparison of this patch against main]

@ahgamut ahgamut marked this pull request as ready for review January 18, 2024 17:41
@ahgamut
Contributor Author

ahgamut commented Jan 18, 2024

summary: there appears to be a nice performance improvement, but the synchronization error needs to be found (and, if it's within tinyBLAS, fixed).

Collaborator

@jart jart left a comment


With this change, tinyBLAS now outperforms rocBLAS on the graphics card for which we chose the tuning parameters earlier in d7cbaf7. See https://justine.lol/tinyblas-testing.txt, which shows some quick testing across platforms, versions, and hardware.

I understand that a determinism issue slipped through in a previous change, but that text file should show that the output still appears coherent, that we're crushing CPU at inference, and that determinism issues are nothing out of the ordinary for llama.cpp.

It's hard to find an objective yardstick with LLMs, especially given our resources, but we're doing the best we can. If there's a subtle issue with undefined behavior in the tinyBLAS code, then we'll nail it soon enough, and I think our users will be happy to have the performance in the meantime while we figure out the subtleties.

@jart jart merged commit df0b3ff into Mozilla-Ocho:main Jan 18, 2024
1 check passed