[Kernel] Refactor Cutlass c3x #10049

varun-sundar-rabindranath · 2024-11-05T20:22:28Z

Refactor Cutlass c3x kernels for better maintainability and easier experimentation.

Break scaled_mm_c3x.cu into:
- scaled_mm_c3x.cuh : All the base cutlass c3x code (cutlass_3x_gemm and cutlass_gemm_caller).
- scaled_mm_c3x_sm90_fp8_dispatch.cuh : All fp8 kernels along with the gemm shape based dispatch function.
- scaled_mm_c3x_sm90_int8_dispatch.cuh : All int8 kernels along with the gemm shape based dispatch function.
- scaled_mm_c3x.cu : interfaces expected by scaled_mm_entry.cu
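
The split above can be illustrated with a minimal, self-contained sketch of a shape-based dispatch layer sitting on top of a shared templated caller. Everything here is an assumption made for illustration: the config structs, `gemm_caller`, `dispatch_by_shape`, and the M thresholds (64 and 128) are hypothetical stand-ins, not the PR's actual kernels or tuning values, and no CUTLASS code is involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical kernel configurations. Real configs would carry CUTLASS tile
// shapes, cluster shapes and schedules; here they only carry a label.
struct ConfigM64 { static const char* name() { return "M64"; } };
struct ConfigM128 { static const char* name() { return "M128"; } };
struct ConfigDefault { static const char* name() { return "default"; } };

// Stand-in for the shared caller that would live in the base header.
// A real caller would build and launch the GEMM; this one only reports
// which configuration was selected.
template <typename Config>
std::string gemm_caller(uint32_t /*m*/, uint32_t /*n*/, uint32_t /*k*/) {
  return Config::name();
}

// Smallest power of two that is >= x (64-bit accumulator avoids overflow).
inline uint64_t next_pow_2(uint32_t x) {
  uint64_t p = 1;
  while (p < x) p <<= 1;
  return p;
}

// Stand-in for a per-dtype dispatch header: pick a config from the GEMM
// shape. The thresholds are assumed values chosen for this sketch.
inline std::string dispatch_by_shape(uint32_t m, uint32_t n, uint32_t k) {
  assert(m > 0 && n > 0 && k > 0);
  const uint64_t mp2 = std::max<uint64_t>(64, next_pow_2(m));
  if (mp2 <= 64) return gemm_caller<ConfigM64>(m, n, k);
  if (mp2 <= 128) return gemm_caller<ConfigM128>(m, n, k);
  return gemm_caller<ConfigDefault>(m, n, k);
}
```

With this layout, trying a new configuration only means adding a config struct and one branch in the dispatch function, while the shared caller stays untouched.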

github-actions · 2024-11-05T20:22:41Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: by default, PRs do not trigger a full CI run. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

- Add ready label to the PR
- Enable auto-merge.

🚀

varun-sundar-rabindranath · 2024-11-12T03:29:34Z

@tlrmchlsmth @ProExpertProg @LucasWilkinson PTAL. Thanks!

LucasWilkinson · 2024-11-12T21:20:30Z

LGTM (just FYI may conflict with #9855)

tlrmchlsmth

LGTM

csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cuh

mergify · 2024-11-18T20:01:01Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

varun-sundar-rabindranath · 2024-12-16T18:22:55Z

@tlrmchlsmth re-requesting review as the PR is now rebased.

csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu

tlrmchlsmth

Looks good! (Likely wait for #10995)

tlrmchlsmth · 2024-12-16T19:08:48Z

csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu

@@ -1,384 +1,22 @@
-// clang-format will break include orders
+// switch off clang format as the include statement indentation is inconsistent.


Actually, do you need to turn off clang-format here? The reason for turning it off is that CUTLASS headers need to be included in a specific order, but it looks like that's no longer the case following the refactor.

Running clang-format on this block turns

#include <cudaTypedefs.h>
#if defined CUDA_VERSION && CUDA_VERSION >= 12000
#include "scaled_mm_c3x_sm90_fp8_dispatch.cuh"
#include "scaled_mm_c3x_sm90_int8_dispatch.cuh"
#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp"
using namespace vllm;

into

#include <cudaTypedefs.h>
#if defined CUDA_VERSION && CUDA_VERSION >= 12000
  #include "scaled_mm_c3x_sm90_fp8_dispatch.cuh"
  #include "scaled_mm_c3x_sm90_int8_dispatch.cuh"
  #include "cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp"
using namespace vllm;

The #if seems to trigger inconsistent indenting. I switched off clang-format in this block to avoid that.

I moved the original clang-format toggle to scaled_mm_c3x.cuh.

vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cuh

Line 3 in dad8e47

// clang-format will break include orders

Had second thoughts about this and removed the clang-format block. It is probably better to stick to the convention.
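
For context, the toggle discussed in this thread is clang-format's `// clang-format off` / `// clang-format on` comment pair. The following is a minimal sketch of how such a guarded include region looks, not the code from the PR: `DEMO_CUDA_VERSION` is a hypothetical stand-in for `CUDA_VERSION`, standard headers stand in for the project headers, and `guarded_includes_enabled` is an illustrative helper.

```cpp
// Hypothetical stand-in for CUDA_VERSION, which the CUDA toolkit headers
// would normally define.
#define DEMO_CUDA_VERSION 12040

// clang-format off
// Between the off/on markers clang-format leaves the text as written,
// so include order and indentation are preserved.
#if defined DEMO_CUDA_VERSION && DEMO_CUDA_VERSION >= 12000
#include <cassert>  // stand-in for a project header
#include <cstdint>  // stand-in for a project header
#endif
// clang-format on

// Reports whether the guarded branch above was compiled in.
constexpr bool guarded_includes_enabled() {
#if defined DEMO_CUDA_VERSION && DEMO_CUDA_VERSION >= 12000
  return true;
#else
  return false;
#endif
}
```

The trade-off raised in the review applies here too: the markers keep the layout stable, but they also exempt the region from the project's formatting convention.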

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

varun-sundar-rabindranath marked this pull request as draft November 5, 2024 20:35

varun-sundar-rabindranath marked this pull request as ready for review November 12, 2024 03:27

tlrmchlsmth approved these changes Nov 12, 2024

tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 12, 2024

ProExpertProg reviewed Nov 12, 2024

csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cuh (resolved)

varun-sundar-rabindranath force-pushed the varun/cutlass-c3x-refactor branch 2 times, most recently from 7dbe3b3 to 4f44aac Compare November 15, 2024 14:35

mergify bot added the needs-rebase label Nov 18, 2024

varun-sundar-rabindranath force-pushed the varun/cutlass-c3x-refactor branch from 4f44aac to 16879db Compare December 16, 2024 18:03

mergify bot removed the needs-rebase label Dec 16, 2024

varun-sundar-rabindranath requested a review from tlrmchlsmth December 16, 2024 18:22

tlrmchlsmth reviewed Dec 16, 2024

csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu (outdated, resolved)

tlrmchlsmth reviewed Dec 16, 2024

varun-sundar-rabindranath force-pushed the varun/cutlass-c3x-refactor branch 2 times, most recently from 55d9927 to e5f324b Compare December 18, 2024 16:13

Varun Sundar Rabindranath added 2 commits December 19, 2024 03:28

Refactor cutlass-c3x

4068c8e

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

fix common header

e033b41

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

varun-sundar-rabindranath force-pushed the varun/cutlass-c3x-refactor branch from e5f324b to e033b41 Compare December 19, 2024 03:29

tlrmchlsmth enabled auto-merge (squash) December 19, 2024 03:31

tlrmchlsmth merged commit 8936316 into vllm-project:main Dec 19, 2024
75 checks passed

BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024

[Kernel] Refactor Cutlass c3x (vllm-project#10049)

3c57a07

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>


tlrmchlsmth Dec 16, 2024

varun-sundar-rabindranath Dec 16, 2024

varun-sundar-rabindranath Dec 17, 2024
