use thread-local register file for matmul speedups #205
Commits on Jan 16, 2024
- 6eb9303
- 1413c3a
- d33de16
- 8b0228c
- 184d203
- 61946c3
- e05b434 use half everywhere for matmul_block2d
  and some template param tuning
- a314754 read A into As as column major
  A is laid out in column major in global memory
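The column-major read described in a314754 can be illustrated with a small host-side C++ sketch. This is not the PR's CUDA code: the helper name `load_tile_col_major`, the tile sizes, and the use of `std::vector` in place of global and shared memory are all assumptions made for illustration. It shows only the index mapping: when both A and the tile As are column-major, consecutive rows of one column are contiguous in the source and in the destination.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): copy a BM x BK tile of a
// column-major M x K matrix A into a column-major tile buffer As.
// Assumes the tile lies fully inside A; no bounds checks are done.
template <int BM, int BK>
void load_tile_col_major(const std::vector<float>& A, std::size_t M,
                         std::size_t row0, std::size_t col0,
                         std::vector<float>& As) {
    As.resize(static_cast<std::size_t>(BM) * BK);
    for (std::size_t c = 0; c < static_cast<std::size_t>(BK); ++c) {
        for (std::size_t r = 0; r < static_cast<std::size_t>(BM); ++r) {
            // Element (row, col) of column-major A sits at col * M + row.
            // The inner loop walks contiguous addresses on both sides.
            As[c * BM + r] = A[(col0 + c) * M + (row0 + r)];
        }
    }
}
```

In a real kernel the inner index would typically be mapped to the thread index so that neighbouring threads touch neighbouring addresses; the sketch keeps it as a plain loop.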
Commits on Jan 18, 2024
- 44cc475 move Ctype check to compile time
  the check is now outside the kernel, so it runs once instead of repeatedly inside it. The trade-off is a larger binary. also changed the template parameters a bit
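As a rough illustration of the idea in 44cc475, the following host-side C++ sketch checks a runtime type tag once in a launcher and then calls a template instantiation that contains no type check of its own. The names `CType`, `write_result`, and `launch` are hypothetical, and the PR's actual Ctype and kernel signatures are not shown in this page, so treat this as a sketch of the general pattern only.

```cpp
#include <cstddef>

// Hypothetical runtime tag for the output element type.
enum class CType { F32, F64 };

// Stand-in for the kernel body: the output type C is a template
// parameter, so no type check happens inside the loop.
template <typename C>
void write_result(const float* acc, void* out, std::size_t n) {
    C* typed = static_cast<C*>(out);
    for (std::size_t i = 0; i < n; ++i) {
        typed[i] = static_cast<C>(acc[i]);
    }
}

// Launcher: the tag is inspected exactly once, outside the "kernel".
// Each case instantiates a separate copy of write_result.
void launch(CType ctype, const float* acc, void* out, std::size_t n) {
    switch (ctype) {
        case CType::F32: write_result<float>(acc, out, n); break;
        case CType::F64: write_result<double>(acc, out, n); break;
    }
}
```

Every supported type adds another instantiation of the templated body, which is consistent with the binary-size cost the commit message mentions.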
- 13bab2c separate zeroing out and reading from gmem
  since we know that As and Bs are laid out one after another in memory (i.e. they are basically svals), and the overall dimension is (BM + BN) * BK, we just write one "nested" loop that does the zeroing
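The zeroing step in 13bab2c can be sketched on the host as follows. This is an assumption-laden model rather than the PR's kernel: `Tiles`, `zero_tiles`, and `num_threads` are hypothetical names, a `std::vector` stands in for shared memory, and the outer loop over `tid` simulates threads that would run concurrently on the GPU. The point is that because As and Bs are adjacent, one strided loop over (BM + BN) * BK elements clears both.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the shared-memory layout: As occupies the
// first BM * BK floats and Bs the following BN * BK floats.
template <int BM, int BN, int BK>
struct Tiles {
    std::vector<float> smem;
    // Filled with 1.0f so that the effect of zeroing is observable.
    Tiles() : smem(static_cast<std::size_t>((BM + BN) * BK), 1.0f) {}
    float* As() { return smem.data(); }
    float* Bs() { return smem.data() + BM * BK; }
};

// Zero As and Bs together. The outer loop models the threads of a block;
// the inner loop is each thread's strided walk over the combined buffer.
template <int BM, int BN, int BK>
void zero_tiles(float* smem, int num_threads) {
    constexpr int total = (BM + BN) * BK;
    for (int tid = 0; tid < num_threads; ++tid) {
        for (int i = tid; i < total; i += num_threads) {
            smem[i] = 0.0f;
        }
    }
}
```

In device code the outer loop would disappear (each thread uses its own index), and a block-level synchronization would normally follow before the tiles are filled from global memory.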
- 494fb8e move sharedmem pointers into __device__ matmul
  they're not used in the __global__ functions anyway
- fc8245a