use thread-local register file for matmul speedups #205

Merged 12 commits on Jan 18, 2024

Commits on Jan 16, 2024

  1. 6eb9303
  2. remove the BK check

    ahgamut committed Jan 16, 2024
    1413c3a
  3. fix memory error with sgemm

    adding back the BK check
    ahgamut committed Jan 16, 2024
    d33de16
  4. separate reading A and B

    ahgamut committed Jan 16, 2024
    8b0228c
  5. writeback to C in one go

    ahgamut committed Jan 16, 2024
    184d203
  6. 61946c3
  7. use half everywhere for matmul_block2d

    and some template param tuning
    ahgamut committed Jan 16, 2024
    e05b434
  8. read A into As as column major

    A is laid out in column major in global mem
    ahgamut committed Jan 16, 2024
    a314754
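
The Jan 16 commits describe reading A and B separately, treating A as column major, and writing back to C in one go. The PR's kernel code is not shown on this page, so the following is only a rough host-side C++ sketch of what one thread's work might look like under those descriptions. The function name `thread_tile`, the tile sizes `TM` and `TN`, and the row-major layout of B and C are all assumptions made for illustration; CUDA qualifiers and shared memory are left out so the sketch runs on a CPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-thread tile sizes; the PR's real template parameters
// are not visible in the commit list.
constexpr std::size_t TM = 4;
constexpr std::size_t TN = 4;

// Computes the TM x TN tile of C whose top-left corner is (row0, col0).
//   A: m x k, column major, so A(i, p) lives at A[p * m + i] (per the commit message)
//   B: k x n, row major (assumption)
//   C: m x n, row major (assumption)
void thread_tile(const float* A, const float* B, float* C,
                 std::size_t m, std::size_t n, std::size_t k,
                 std::size_t row0, std::size_t col0) {
    assert(row0 + TM <= m && col0 + TN <= n);

    // Local arrays stand in for the thread-local register file.
    float acc[TM][TN] = {};
    float a_reg[TM];
    float b_reg[TN];

    for (std::size_t p = 0; p < k; ++p) {
        // Read the A values and the B values in two separate loops.
        for (std::size_t i = 0; i < TM; ++i) {
            a_reg[i] = A[p * m + (row0 + i)];
        }
        for (std::size_t j = 0; j < TN; ++j) {
            b_reg[j] = B[p * n + (col0 + j)];
        }
        // Outer product of the two register vectors, accumulated locally.
        for (std::size_t i = 0; i < TM; ++i) {
            for (std::size_t j = 0; j < TN; ++j) {
                acc[i][j] += a_reg[i] * b_reg[j];
            }
        }
    }

    // Write the finished tile back to C in a single pass.
    for (std::size_t i = 0; i < TM; ++i) {
        for (std::size_t j = 0; j < TN; ++j) {
            C[(row0 + i) * n + (col0 + j)] = acc[i][j];
        }
    }
}
```

Calling `thread_tile(A, B, C, 4, 4, 2, 0, 0)` fills the whole 4 x 4 output from a 4 x 2 A and a 2 x 4 B. Keeping the accumulators local means C is touched only once per tile, which is the usual motivation for this pattern.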

Commits on Jan 18, 2024

  1. moving Ctype check into compile-time

    it's now outside the kernel, so we check only once instead of however
    many times. However, this bloats binary size.
    
    also changed template parameters a bit
    ahgamut committed Jan 18, 2024
    44cc475
  2. separate zeroing out and reading from gmem

    since we know that As and Bs are laid out one after another in memory
    (i.e. they are basically svals), and the overall dimension is (BM + BN) *
    BK, we just write one "nested" loop that does the zeroing
    ahgamut committed Jan 18, 2024
    13bab2c
  3. move sharedmem pointers into __device__ matmul

    they're not used in the __global__ functions anyway
    ahgamut committed Jan 18, 2024
    494fb8e
  4. remove matmul32_block2d

    it's now a specialization of matmul_block2d
    ahgamut committed Jan 18, 2024
    fc8245a
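
Two of the Jan 18 commits move the Ctype check to compile time and fold matmul32_block2d into matmul_block2d. A minimal host-side C++ sketch of that general idea follows, assuming a templated routine and a one-time dispatch outside it. The `CType` enum, the `matmul_dispatch` function, and the signature given to `matmul_block2d` are hypothetical. `float` and `double` stand in for the half and float types of the real kernel, because plain C++ has no half type, and the naive triple loop is a placeholder for the blocked kernel body.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical runtime tag for the element type of the output matrix C.
enum class CType { F32, F64 };

// One templated routine. The output type is fixed when the template is
// instantiated, so the loops contain no per-element type branch.
// In a CUDA build this would carry __global__ or __device__.
template <typename T>
void matmul_block2d(const float* A, const float* B, T* C,
                    std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            T acc = T(0);
            for (std::size_t p = 0; p < k; ++p) {
                acc += static_cast<T>(A[i * k + p]) *
                       static_cast<T>(B[p * n + j]);
            }
            C[i * n + j] = acc;
        }
    }
}

// Host-side dispatch: the type tag is inspected once per call, outside the
// routine. Each case instantiates its own copy of matmul_block2d, which is
// where the extra binary size comes from.
void matmul_dispatch(CType ctype, const float* A, const float* B, void* C,
                     std::size_t m, std::size_t n, std::size_t k) {
    assert(A != nullptr && B != nullptr && C != nullptr);
    switch (ctype) {
        case CType::F32:
            // Takes the place of a separately named 32-bit function.
            matmul_block2d<float>(A, B, static_cast<float*>(C), m, n, k);
            return;
        case CType::F64:
            matmul_block2d<double>(A, B, static_cast<double*>(C), m, n, k);
            return;
    }
    throw std::invalid_argument("unsupported CType");
}
```

With this shape, a call such as `matmul_dispatch(CType::F32, A, B, C, m, n, k)` pays for the type check once, and adding another output type costs one more case plus one more instantiation. Whether the PR uses an explicit template specialization or a plain instantiation is not stated in the commit messages; the sketch uses the latter.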