
[QST] Question On Reusing MMA Operands in Shared Memory #744

Closed
jfc4050 opened this issue Dec 23, 2022 · 5 comments
Labels: question

Comments

jfc4050 commented Dec 23, 2022

Hello!

I'm looking to optimize an algorithm that does the following operations:

  1. Out1 = Aij.T @ Bi
  2. Out2 = Bi @ Cij.T

Implementing this as back-to-back threadblock-level GEMMs has the problem of loading Bi from global memory twice. Bi usually has <= 128 columns, so it should be possible to store the entire (block_rows, total_cols) Bi matrix in shared memory.

I have taken a look at the B2B GEMM examples that do the following

  1. Out1 = A @ B
  2. Out2 = Out1 @ C

I do see how I could extend that idea by loading a slice of B into shared memory first, then performing Aij.T @ Bi and Bi @ Cij.T, but it seems like it would be more efficient to let Bi be loaded into shmem in a pipelined/multistage manner during the computation of Out1, then reuse the tile of Bi in the next matmul Bi @ Cij.T. Is that a reasonable guess?

I'm trying to prototype this now, but I'm not quite sure what shared-memory layout to use for Bi, since it would need to be the same for both operand A of the first MMA and operand B of the second MMA. Would there be any issues if I just made them both row-major (i.e., smem_iterator_B1 stores in row-major and warp_smem_iterator_B1/warp_smem_iterator_A2 read from row-major)? It looks like the default MMA implementations typically specify smem layouts like RowMajorTensorOpMultiplicandCongruous, with different Crosswise values depending on whether it is operand A or B.

@jfc4050 jfc4050 changed the title [QST] Restrictions on Smem Tile Layout [QST] Question On Reusing MMA Operands in Shared Memory Dec 23, 2022

hwu36 commented Dec 24, 2022

Is it possible to just use the dual GEMM from example 45 (https://github.com/NVIDIA/cutlass/tree/master/examples/45_dual_gemm), which was written by @danthe3rd?

@danthe3rd (Contributor) commented:

The DualGEMM only works for sharing the same operand A, though...

> smem_iterator_B1 stores in row-major

My understanding is that this will cause many shared-memory bank conflicts when staging the tile from global memory.

My idea to do this for the memory-efficient attention kernel (I think it's what you are trying to do) was to:
(1) Load in the same shared-memory format as before in Out1 = Aij.T @ Bi
(2) Keep everything in shared-memory (so with kNumStages=4)
(3) Have a custom WarpIterator to load Bi for Out2 = Bi @ Cij.T that can read from that same shared memory. I think there are already iterators supporting this in CUTLASS, but that is to be confirmed. Otherwise it is just a matter of iterating in a different direction and transposing the matrix as it is loaded with ldmatrix (this would only work on Sm75+, but I believe that's good enough).

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.


mnicely commented Feb 3, 2023

@jfc4050 is your issue resolved?


jfc4050 commented Feb 3, 2023

Ah, sorry, I forgot to close this. We figured it out in a separate discussion with Haicheng. Thanks!

@jfc4050 jfc4050 closed this as completed Feb 3, 2023