[QST] Question On Reusing MMA Operands in Shared Memory #744
Hello!
I'm looking to optimize an algorithm that does the following operations, where both matmuls share the operand Bi:

Out1 = Aij.T @ Bi
Out2 = Bi @ Cij.T

Implementing this as back-to-back threadblock-level GEMMs has the problem of loading Bi from global memory twice. Bi usually has <= 128 columns, so it should be possible to store the entire (block_rows, total_cols) Bi matrix in shared memory.
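To make the shapes concrete, here is a naive single-threaded reference of what I mean (just a sketch; the flat row-major buffers and the size names m, k, n, p are made up for illustration):

```cpp
// Naive reference of the two products sharing B.
// Shapes (all row-major): A is (m, k), B is (m, n), C is (p, n).
//   Out1 = A^T @ B  -> (k, n)
//   Out2 = B @ C^T  -> (m, p)
#include <vector>

void reference(int m, int k, int n, int p,
               const std::vector<float>& A,   // m x k
               const std::vector<float>& B,   // m x n  (the shared operand)
               const std::vector<float>& C,   // p x n
               std::vector<float>& Out1,      // k x n
               std::vector<float>& Out2) {    // m x p
  for (int i = 0; i < k; ++i)
    for (int j = 0; j < n; ++j) {
      float acc = 0.f;
      for (int r = 0; r < m; ++r) acc += A[r * k + i] * B[r * n + j];
      Out1[i * n + j] = acc;
    }
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < p; ++j) {
      float acc = 0.f;
      for (int r = 0; r < n; ++r) acc += B[i * n + r] * C[j * n + r];
      Out2[i * p + j] = acc;
    }
}
```

Both loops read all of B, which is the operand I want to fetch from global memory only once.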
I have taken a look at the B2B GEMM examples (13_two_tensor_op_fusion). I do see how I could extend that idea by loading a slice of Bi into shared memory first and then performing Aij.T @ Bi and Bi @ Cij.T, but it seems like it would be more efficient to let Bi be loaded into shared memory in a pipelined/multistage manner during the computation of Out1, then reuse that tile of Bi in the second matmul Bi @ Cij.T (see the sketch below). Is that a reasonable guess?
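Roughly what I have in mind, as a toy single-threadblock CUDA sketch. There is no multistage pipelining or tensor-core MMA here, and the compile-time sizes are made up so that B fits entirely in shared memory; it only shows the control flow of staging B once and reusing it:

```cpp
// Toy CUDA sketch: stage B into shared memory while computing
// Out1 = A^T @ B, then reuse the resident B tile for Out2 = B @ C^T
// without touching global memory for B a second time.
#include <cuda_runtime.h>

constexpr int M = 64, N = 64, KA = 64, P = 64;

__global__ void shared_b_kernel(const float* A,   // M x KA, row-major
                                const float* B,   // M x N,  row-major
                                const float* C,   // P x N,  row-major
                                float* Out1,      // KA x N
                                float* Out2) {    // M x P
  __shared__ float Bs[M][N];  // 64*64*4B = 16 KB

  // Stage B into shared memory. In the real kernel this copy would be a
  // cp.async multistage pipeline overlapped with the math for Out1.
  for (int t = threadIdx.x; t < M * N; t += blockDim.x)
    Bs[t / N][t % N] = B[t];
  __syncthreads();

  // First matmul: Out1 = A^T @ B, reading B from shared memory.
  for (int t = threadIdx.x; t < KA * N; t += blockDim.x) {
    int i = t / N, j = t % N;
    float acc = 0.f;
    for (int r = 0; r < M; ++r) acc += A[r * KA + i] * Bs[r][j];
    Out1[t] = acc;
  }

  // Second matmul: B is still resident in shared memory, so
  // Out2 = B @ C^T reuses the same tile.
  for (int t = threadIdx.x; t < M * P; t += blockDim.x) {
    int i = t / P, j = t % P;
    float acc = 0.f;
    for (int r = 0; r < N; ++r) acc += Bs[i][r] * C[j * N + r];
    Out2[t] = acc;
  }
}
```

Launched with something like `shared_b_kernel<<<1, 256>>>(dA, dB, dC, dOut1, dOut2)`.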
I'm trying to prototype this now, but I'm not quite sure what shared memory layout to use for Bi, since it would need to be the same for both operand A of the first MMA and operand B of the second MMA. Would there be any issues if I just made them both row-major (i.e., `smem_iterator_B1` stores in row-major and `warp_smem_iterator_B1`/`warp_smem_iterator_A2` read from row-major)? It looks like the default MMA implementations typically specify smem layouts like `RowMajorTensorOpMultiplicandCongruous`, with different Crosswise values depending on whether it's operand A or operand B.
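For reference, this is the kind of typedef I mean, as a sketch assuming fp16 operands and a 64-element crosswise dimension (in the default kernels the Crosswise value depends on the threadblock K shape and the element size, so the real values may differ):

```cpp
// Sketch of the swizzled shared-memory layouts CUTLASS uses for tensor-op
// operands (ElementSize = 16 bits, Crosswise = 64 assumed for illustration).
// The swizzles exist to keep ldmatrix fragment loads free of bank conflicts;
// a plain RowMajor smem tile is simpler but conflict-prone.
#include "cutlass/layout/tensor_op_multiplicand_sm75.h"

// K-contiguous operand (e.g., row-major A of the first MMA): crosswise variant.
using SmemLayoutA =
    cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 64>;

// N-contiguous operand (e.g., row-major B of the second MMA): congruous variant.
using SmemLayoutB =
    cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>;
```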

Comments

Is it possible to just use the dual GEMM from example 45 (https://github.com/NVIDIA/cutlass/tree/master/examples/45_dual_gemm), which was written by @danthe3rd?

The DualGemm only works for sharing the same operand A, though...

My understanding is that this will cause many bank conflicts when loading from global memory. My idea to do this for the memory-efficient attention kernel (I think it's what you are trying to do) was to:

This issue has been labeled

@jfc4050 is your issue resolved?

Ah sorry, forgot to close. We figured this out in a separate discussion with Haicheng. Thanks!