[QST] Question On Reusing MMA Operands in Shared Memory #744
Hello!
I'm looking to optimize an algorithm that does the following operations, where both matmuls share the operand Bi:

Out1 = Aij.T @ Bi
Out2 = Bi @ Cij.T

Implementing this as back-to-back threadblock-level GEMMs has the problem of loading Bi from global memory twice. Bi usually has <= 128 columns, so it should be possible to store the entire (block_rows, total_cols) Bi matrix in shared memory.
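To make the shapes concrete, here is a naive single-threaded reference of what I mean (just a sketch; the flat row-major buffers and the size names m, k, n, p are made up for illustration):

```cpp
// Naive reference of the two products sharing B.
// Shapes (all row-major): A is (m, k), B is (m, n), C is (p, n).
//   Out1 = A^T @ B  -> (k, n)
//   Out2 = B @ C^T  -> (m, p)
#include <vector>

void reference(int m, int k, int n, int p,
               const std::vector<float>& A,   // m x k
               const std::vector<float>& B,   // m x n  (the shared operand)
               const std::vector<float>& C,   // p x n
               std::vector<float>& Out1,      // k x n
               std::vector<float>& Out2) {    // m x p
  for (int i = 0; i < k; ++i)
    for (int j = 0; j < n; ++j) {
      float acc = 0.f;
      for (int r = 0; r < m; ++r) acc += A[r * k + i] * B[r * n + j];
      Out1[i * n + j] = acc;
    }
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < p; ++j) {
      float acc = 0.f;
      for (int r = 0; r < n; ++r) acc += B[i * n + r] * C[j * n + r];
      Out2[i * p + j] = acc;
    }
}
```

Both loops read all of B, which is the operand I want to fetch from global memory only once.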
I have taken a look at the B2B GEMM examples (13_two_tensor_op_fusion). I do see how I could extend that idea by loading a slice of Bi into shared memory first and then performing Aij.T @ Bi and Bi @ Cij.T, but it seems like it would be more efficient to let Bi be loaded into shared memory in a pipelined/multistage manner during the computation of Out1, then reuse that tile of Bi in the second matmul Bi @ Cij.T (see the sketch below). Is that a reasonable guess?
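Roughly what I have in mind, as a toy single-threadblock CUDA sketch. There is no multistage pipelining or tensor-core MMA here, and the compile-time sizes are made up so that B fits entirely in shared memory; it only shows the control flow of staging B once and reusing it:

```cpp
// Toy CUDA sketch: stage B into shared memory while computing
// Out1 = A^T @ B, then reuse the resident B tile for Out2 = B @ C^T
// without touching global memory for B a second time.
#include <cuda_runtime.h>

constexpr int M = 64, N = 64, KA = 64, P = 64;

__global__ void shared_b_kernel(const float* A,   // M x KA, row-major
                                const float* B,   // M x N,  row-major
                                const float* C,   // P x N,  row-major
                                float* Out1,      // KA x N
                                float* Out2) {    // M x P
  __shared__ float Bs[M][N];  // 64*64*4B = 16 KB

  // Stage B into shared memory. In the real kernel this copy would be a
  // cp.async multistage pipeline overlapped with the math for Out1.
  for (int t = threadIdx.x; t < M * N; t += blockDim.x)
    Bs[t / N][t % N] = B[t];
  __syncthreads();

  // First matmul: Out1 = A^T @ B, reading B from shared memory.
  for (int t = threadIdx.x; t < KA * N; t += blockDim.x) {
    int i = t / N, j = t % N;
    float acc = 0.f;
    for (int r = 0; r < M; ++r) acc += A[r * KA + i] * Bs[r][j];
    Out1[t] = acc;
  }

  // Second matmul: B is still resident in shared memory, so
  // Out2 = B @ C^T reuses the same tile.
  for (int t = threadIdx.x; t < M * P; t += blockDim.x) {
    int i = t / P, j = t % P;
    float acc = 0.f;
    for (int r = 0; r < N; ++r) acc += Bs[i][r] * C[j * N + r];
    Out2[t] = acc;
  }
}
```

Launched with something like `shared_b_kernel<<<1, 256>>>(dA, dB, dC, dOut1, dOut2)`.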
I'm trying to prototype this now, but I'm not quite sure what shared memory layout to use for Bi, since it would need to be the same for both operand A of the first MMA and operand B of the second MMA. Would there be any issues if I just made them both row-major (i.e., `smem_iterator_B1` stores in row-major and `warp_smem_iterator_B1`/`warp_smem_iterator_A2` read from row-major)? It looks like the default MMA implementations typically specify smem layouts like `RowMajorTensorOpMultiplicandCongruous`, with different Crosswise values depending on whether it's operand A or operand B.
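For reference, this is the kind of typedef I mean, as a sketch assuming fp16 operands and a 64-element crosswise dimension (in the default kernels the Crosswise value depends on the threadblock K shape and the element size, so the real values may differ):

```cpp
// Sketch of the swizzled shared-memory layouts CUTLASS uses for tensor-op
// operands (ElementSize = 16 bits, Crosswise = 64 assumed for illustration).
// The swizzles exist to keep ldmatrix fragment loads free of bank conflicts;
// a plain RowMajor smem tile is simpler but conflict-prone.
#include "cutlass/layout/tensor_op_multiplicand_sm75.h"

// K-contiguous operand (e.g., row-major A of the first MMA): crosswise variant.
using SmemLayoutA =
    cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 64>;

// N-contiguous operand (e.g., row-major B of the second MMA): congruous variant.
using SmemLayoutB =
    cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>;
```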

Comments

Is it possible to just use the dual GEMM from example 45 (https://github.com/NVIDIA/cutlass/tree/master/examples/45_dual_gemm), which was written by @danthe3rd?

The DualGemm only works for sharing the same operand A, though...

My understanding is that this will cause many bank conflicts when loading from global memory. My idea to do this for the memory-efficient attention kernel (I think it's what you are trying to do) was to:

This issue has been labeled

@jfc4050 is your issue resolved?

Ah sorry, forgot to close. We figured this out in a separate discussion with Haicheng. Thanks!