[BYOC] CUTLASS integration #9261
Merged: 56 commits into apache:main, Oct 29, 2021

Conversation

@Laurawly (Contributor)

As discussed in the RFC (https://discuss.tvm.apache.org/t/rfc-byoc-nvidia-cutlass-integration/9147), this PR integrates CUTLASS GEMM kernels into TVM via BYOC. It also includes a profiler that searches for the best kernel parameters in CUTLASS.
@masahi Please take over this PR, Thanks!
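
For context, a minimal sketch of how the compile flow from this PR is meant to be used from Python. The names and signatures below are assumptions based on this review (tune_cutlass_kernels follows the rename agreed later in the thread; partition_for_cutlass and build_cutlass_kernels are assumed helper names), not verbatim library API:

# Hedged sketch of the end-to-end CUTLASS BYOC flow; exact names and
# signatures are assumptions, not verbatim from this PR.
import tvm
from tvm import relay
from tvm.relay.op.contrib.cutlass import partition_for_cutlass  # assumed name
from tvm.contrib.cutlass import tune_cutlass_kernels, build_cutlass_kernels

def compile_with_cutlass(mod, params, sm=80):
    # Offload supported patterns (e.g. dense, dense+bias, dense+bias+relu/gelu)
    # to the CUTLASS codegen.
    mod = partition_for_cutlass(mod)
    # Search CUTLASS template parameters for each offloaded function
    # (return shape assumed).
    mod, _ = tune_cutlass_kernels(mod, sm, profile_all=True, tmp_dir="./tmp")
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="cuda", params=params)
    # Compile and link the generated CUDA sources into the module.
    return build_cutlass_kernels(lib, sm)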

@masahi (Member) commented Oct 12, 2021

Thanks @Laurawly, I'll work on this as my top priority!

@junrushao (Member)

Really cool work @Laurawly!

@masahi (Member) commented Oct 13, 2021

One question for @comaniac: currently, this integration is implemented with the C source backend. But one big use case for CUTLASS BYOC is dynamic workloads, which we don't handle well at the moment. I think the C source backend is not compatible with dynamic inputs, is that right? On the other hand, with the JSON codegen/runtime, I can see a way to support dynamic inputs.

However, since CUTLASS is a template library, the only way it could work with the JSON codegen/runtime is to instantiate all the templates we care about ahead of time and build them together with libtvm.so. I wonder if that is feasible. cc @Laurawly
One benefit of that approach is that we wouldn't need to commit the CUTLASS code generator into our repo, which is a big part of this PR.

@comaniac (Contributor)

IMHO, CUTLASS doesn't naturally benefit dynamic workloads, for exactly the reason you mentioned. We use CUTLASS internally for training, and it works well because we JIT kernels with known shapes at runtime.

In the case of CUTLASS with BYOC in TVM for inference, my impression is that we can leverage its high-performance kernel templates while 1) keeping the binary self-contained, 2) fusing ops, and 3) having lightweight tuning (e.g., ~10 trials, similar to cuDNN). Dynamic workloads, on the other hand, are still challenging; hopefully our ongoing efforts on dynamic kernel tuning and generation will land soon to make that happen.

@masahi (Member) commented Oct 13, 2021

Hmm, are you concerned about slow performance due to the lack of tuning, or about the integration problem I brought up? I wouldn't worry about the former, because cuBLAS is fast without tuning, so I expect CUTLASS to perform equally well.

Since our cuBLAS offload supports dynamic inputs, not supporting them for CUTLASS would be a big bummer imo. So I want to discuss the integration problem first and investigate performance issues later.

@zhiics (Member) commented Oct 13, 2021

I think even the C source codegen should be able to handle dynamic shapes, since it only expects tensors at runtime, or I might be forgetting something here. But in general, the JSON format is recommended as it is easier to debug and maintain. It is also friendlier for handling constants.

@masahi (Member) commented Oct 13, 2021

> I think even the C source codegen should be able to handle dynamic shapes, since it only expects tensors at runtime

Exactly. I've looked at the code a bit, and it seems we currently generate an API that takes DLTensors as inputs but passes only raw data pointers to the generated backend calls. It would be easy to change things a bit to pass shape information as well.

code_stream_ << func_name << "_(";
// Inputs: only the raw data pointer of each DLTensor argument is emitted.
for (size_t i = 0; i < args.size(); i++) {
  const auto& dtype_str = GetDtypeString(args[i]);
  code_stream_ << "(" << dtype_str << "*)(arg" << i << "->data),\n";
  PrintIndents();
}
// Outputs: likewise only data pointers; the final output is emitted outside
// this excerpt (hence outs.size() - 1), since it takes no trailing comma.
for (size_t i = 0; i < outs.size() - 1; i++) {
  code_stream_ << "(" << outs[i].dtype << "*)(out" << i << "->data),\n";
  PrintIndents();
}

I'm glad to have found at least one path that supports all use cases. For the JSON codegen/runtime path, I'm not sure how to integrate CUTLASS's JIT codegen + compile approach with it.

@comaniac (Contributor)

Ah, sorry, I didn't make that clear. The interface of the C source codegen does deal with dynamic workloads, because it takes raw pointers whose sizes can vary at runtime.

What I meant was how to generate CUTLASS kernels that perform well for all shapes. In @Laurawly's post, they generate lots of kernels to cover the possible shapes, which results in a 7 GB binary. I assume they also generate runtime dispatch logic (also in the generated C source code) to determine which kernel to use given the shapes known at runtime. Obviously, binary size will be an issue for this solution.

For the JSON codegen/runtime, it would be similar to TensorRT: we simply dump a JSON graph in codegen without doing anything else. Meanwhile, we have a custom runtime that JITs/caches CUTLASS kernels based on the shapes known at runtime. This results in a much smaller binary, but the first execution (or an execution with new shapes) may take several seconds or even a minute to JIT all the kernels.
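
To make the trade-off concrete, here is a minimal hypothetical sketch (not part of this PR) of the shape-keyed cache such a custom runtime would maintain; the jit_compile callable stands in for whatever JIT path would actually be used:

# Hypothetical shape-keyed kernel cache for a JSON-runtime integration;
# the jit_compile callable is a placeholder, not a real TVM/CUTLASS API.
from typing import Callable, Dict, Tuple

class CutlassKernelCache:
    def __init__(self, jit_compile: Callable):
        self._jit_compile = jit_compile
        self._cache: Dict[Tuple[int, int, int], Callable] = {}

    def get(self, m: int, n: int, k: int) -> Callable:
        # The first call for a new (M, N, K) pays the JIT cost (seconds up
        # to a minute, as noted above); later calls with the same shapes
        # hit the cache.
        key = (m, n, k)
        if key not in self._cache:
            self._cache[key] = self._jit_compile(m, n, k)
        return self._cache[key]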

@hwu36 commented Oct 13, 2021

> I assume they also generate runtime dispatch logic (also in the generated C source code) to determine which kernel to use given the shapes known at runtime.

At the moment, CUTLASS does not provide heuristics for choosing which kernel to use based on runtime information. Users of CUTLASS need to pick kernels themselves.
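
Given that, the profiler in this PR effectively does the selection by brute force: each candidate template instantiation is compiled and timed on the target shape, and the fastest one wins. A hedged sketch of that idea, where time_candidate is a hypothetical compile-and-benchmark helper, not an API from this PR:

# Illustrative brute-force kernel selection; all names here are hypothetical.
def pick_best_kernel(candidates, time_candidate, M, N, K):
    best_kernel, best_time = None, float("inf")
    for kernel in candidates:
        t = time_candidate(kernel, M, N, K)  # measured runtime, e.g. in ms
        if t < best_time:
            best_kernel, best_time = kernel, t
    return best_kernel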

@comaniac (Contributor) left a review:
Otherwise LGTM.

self.signature["ret_dtype"] = op.ret_type.dtype


def profile_cutlass_kernels(mod, sm, profile_all=True, use_multiprocessing=False, tmp_dir="./tmp"):
@comaniac (Contributor)
I'm wondering whether we should use "profile" here. My impression is that "profile" suggests analysis rather than transformation; for example, profile_executor sounds like it produces a report of detailed execution latencies. Maybe "tune_cutlass_kernels" (similar to AutoTVM/Ansor) or "search_cutlass_kernels" (similar to cuDNN) might be better? I'd like to hear from others as well.

@masahi (Member)

OK, I'll change it to tune_cutlass_kernels. "profile" is what CUTLASS uses in its repo (e.g., cutlass_profiler), but yeah, it might be confusing to TVM users.

My only concern with "tune" is that people might get the wrong impression that tune_cutlass_kernels is optional, since tuning is optional in Ansor/AutoTVM. We can add "default kernels" later to avoid making tuning a hard requirement; we need default kernels for dynamic workloads anyway.

@comaniac (Contributor)

That's a fair concern and I agree with your plan.

@masahi merged commit 541f9f2 into apache:main on Oct 29, 2021
@masahi (Member) commented Oct 29, 2021

Thanks @Laurawly @comaniac @junrushao1994 @zhiics @hwu36 this is merged!

@hwu36 commented Oct 29, 2021

This is so cool. Thank you everyone.

ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* byoc cutlass

* add cmake and fix build

* test worked but accuracy is bad

* fixed argument printing properly

* moving files

* moving contents of cutlass_profiler into python/tvm/contrib/cutlass

* run black

* remove irrelevant codegen code

* clang format

* tried replacing sm 75 with 80, didn't help improve accuracy

* remove irrelevant code from generator

* tried dense + bias fusion but generated cu file does not compile

* dense + bias worked after adding Leyuan's patch, bias + relu worked too

* tried adding sm80 generator but accuracy is still off

* remove GemmUniversal generator

* cleanup partition and build

* moved partition, profile and build function out of test

* turned out the result matches TVM's non-cutlass result. Numpy fp16 matmul is busted?

* clean up test

* LinearCombination can be reused for bias only epilogue

* remove unsupported epilogues like gelu

* removing deadcode

* unify gemm templates for with or without beta scaling

* supported gelu but accuracy is slightly off

* gelu test passed with relaxed rtol

* cleanup

* remove unused stuff from library.py

* move profiler template into its own file

* removed gemm_profiler.py

* move contents of compile_engine.py into gen_gemm.py

* rename to profiler_template.cu to avoid CI issue

* cleaning up trying to pass pylint

* add missing asf header

* run black

* fixing many pylint issues except wildcard import

* fixed wildcard warning

* add missing CUTLASS.cmake file, restore gemm_profiler.py

* pylint

* minor fix

* add license

* start filling in TODO doc

* rename GemmProfiler to GemmProfilerEmitter

* more renaming and doc

* add doc to the main compile API

* refactored generator

* run black

* black fix

* finish doc TODO

* add test for 32 bit accum

* fixed kernel generator to correctly handle fp32 accum

* revise build-related API

* add option to profile only one kernel

* add option to enable parallel compilation

* clean up gen_gemm

* doc update

* profile_cutlass_kernels -> tune_cutlass_kernels

Co-authored-by: leyuan.wang <leyuan.wang@bytedance.com>
Co-authored-by: Masahiro Masuda <masahi129@gmail.com>
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022 (same commit message as above)