
[VM] Per-input, data dependence specification for shape func #7210

Merged (13 commits) on Jan 15, 2021

Conversation

masahi (Member) commented Jan 5, 2021

Motivation

Currently, shape functions are executed on the CPU, even when the model is running on the GPU. Each shape function is declared with a data_dependent flag, specifying whether the shape function needs the input tensors themselves or only their shapes to compute the output shape:

def register_shape_func(op_name, data_dependant, shape_func=None, level=10):
    """Register operator shape function for an op.

    Parameters
    ----------
    op_name : str
        The name of the op.

    data_dependant : bool
        Whether the shape function depends on input data.

When an op's data_dependent flag is true, the VM does a device-to-host memcpy of the entire input tensors before running that op's shape func on the CPU. In particular, since dyn.strided_slice has data_dependent set to True, the VM does a device-to-host memcpy of the to-be-sliced tensor for every dyn.strided_slice invocation, which can be highly expensive if the tensor is big.

@_reg.register_shape_func("dyn.strided_slice", True)
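To make the distinction concrete, below is a minimal, self-contained sketch in plain Python (not TVM's actual shape function) of how a strided slice's output shape is computed: only the shape of data is needed, while the values of begin, end, and strides must be read.

# Illustrative sketch only, not TVM's shape function. Assumes positive strides
# and non-negative begin/end for brevity.
def strided_slice_out_shape(data_shape, begin, end, strides):
    # data_shape: only the *shape* of the sliced tensor is required.
    # begin, end, strides: the *values* of these (small) tensors are required.
    out = []
    for dim, b, e, s in zip(data_shape, begin, end, strides):
        b = min(b, dim)
        e = min(e, dim)
        out.append(max(0, (e - b + s - 1) // s))  # ceil((e - b) / s)
    return out

# Example: slicing a (100, 200) tensor with [0:50:1, 10:110:2] gives shape [50, 50].
print(strided_slice_out_shape([100, 200], [0, 10], [50, 110], [1, 2]))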

In fact, one of the bottlenecks of running PyTorch MaskRCNN on GPU is this repeated device-to-host memcpy, as shown in the profiler output below; most of it is for the dyn.strided_slice shape func. CUDA memcpy HtoD is also very slow, but that copy is necessary to send large parameters to the GPU once.

==2531== Profiling application: python maskrcnn_test.py                                                                                                                      
==2531== Profiling result:                                                                                                                                                   
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name                                                                                            
 GPU activities:   16.59%  32.814ms        58  565.76us  30.112us  1.0854ms  fused_expand_dims_concatenate_1_kernel0                                                         
                    9.37%  18.524ms       475  38.998us     544ns  5.3433ms  [CUDA memcpy HtoD]                                                                              
                    9.09%  17.983ms         4  4.4957ms  4.4597ms  4.5401ms  fused_nn_conv2d_add_nn_relu_20_kernel0                                                          
                    8.77%  17.350ms         1  17.350ms  17.350ms  17.350ms  fused_nn_conv2d_transpose_add_nn_relu_kernel0                                                   
                    7.54%  14.912ms      2283  6.5310us     672ns  7.8311ms  [CUDA memcpy DtoH]                                                                              
                    4.33%  8.5701ms         1  8.5701ms  8.5701ms  8.5701ms  fused_nn_conv2d_add_5_kernel0                                                                   
                    4.27%  8.4386ms         1  8.4386ms  8.4386ms  8.4386ms  fused_nn_conv2d_add_nn_relu_16_kernel0 
                    2.68%  5.3057ms         2  2.6528ms  399.20us  4.9065ms  sgemm_128x128x8_NT_vec                      

But if we think about it, this expensive copying is completely unnecessary: the shape func of dyn.strided_slice only needs the shape of data. However, since the other arguments begin, end, and strides require their tensor values, dyn.strided_slice has to declare its data_dependant flag as True. This issue can be resolved by letting each op declare its data dependence per input.

Proposed solution

Luckily, the decision to insert a device-to-host memcpy before a shape func is already made separately for each input, as shown below:

for i, (arg, state) in enumerate(zip(new_args, input_states)):
    state = int(state)
    # Pass Shapes
    if state == 2:
        for j, subexp in enumerate(from_tuple_type(arg.type_annotation, arg)):
            sh_of = self.visit(self.shape_of(subexp))
            shape_func_ins.append(scope.let("in_shape_{0}".format(input_pos + j), sh_of))
            input_pos += 1
        is_inputs.append(0)
    # Pass Inputs
    elif state == 1:
        new_arg = self.visit(arg)
        ctx = self.get_context(arg)
        if ctx.device_type != cpu_ctx.device_type:
            new_arg = self.device_copy(new_arg, ctx, cpu_ctx)

So all we need is a way to specify per-input data dependence for each op and to send this information to the ManifestAllocPass above. This PR enables such a specification, like this:

@_reg.register_shape_func("dyn.strided_slice", [False, True, True, True])

I've also updated compile_engine.cc accordingly to send the per-input data dependence information to ManifestAllocPass. The change I made is the minimum necessary to achieve my goal, so it can likely be improved. With this PR, I was able to remove all of the expensive device-to-host memcpys, which cuts GPU MaskRCNN runtime by 14 milliseconds. More importantly, the purpose of this PR is to make people aware of this problem and to decide on the best solution.
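For illustration, here is a hedged, self-contained sketch (simplified Python, not the actual ManifestAlloc or compile_engine code) of how a per-input dependence list like the one above could drive shape-func input preparation: inputs flagged False contribute only their shape, while inputs flagged True are copied to the host and passed as data. Tensors are modeled as plain tuples, and shape_of/device_copy are stand-ins for the VM's primitives.

# Self-contained sketch (not TVM code): tensors are modeled as
# (name, shape, device) tuples; shape_of/device_copy stand in for the
# VM primitives that serve the same purpose.
def shape_of(arg):
    name, shape, _ = arg
    # Only the shape is materialized on the CPU; no tensor data is copied.
    return ("shape_of(" + name + ")", (len(shape),), "cpu")

def device_copy(arg, dst_device):
    name, shape, _ = arg
    # Models an expensive device-to-host memcpy of the whole tensor.
    return ("copy(" + name + ")", shape, dst_device)

def prepare_shape_func_inputs(args, data_dependent):
    """Pass only shapes for inputs flagged False; copy data to CPU for True."""
    shape_func_ins = []
    for arg, needs_data in zip(args, data_dependent):
        if not needs_data:
            # Shape-only input: no tensor values leave the device.
            shape_func_ins.append(shape_of(arg))
        else:
            # Data-dependent input: the tensor itself must be on the host.
            if arg[2] != "cpu":
                arg = device_copy(arg, "cpu")
            shape_func_ins.append(arg)
    return shape_func_ins

# With [False, True, True, True] (as registered for dyn.strided_slice above),
# only the small begin/end/strides tensors are copied to the host; the large
# data tensor contributes just its shape.
args = [("data", (3, 800, 1333), "gpu"), ("begin", (3,), "gpu"),
        ("end", (3,), "gpu"), ("strides", (3,), "gpu")]
print(prepare_shape_func_inputs(args, [False, True, True, True]))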

Please review @icemelon9 @zhiics @kevinthesun @mbrookhart

masahi changed the title from "[VM] Per-input, data dependence speficiation for shape func" to "[VM] Per-input, data dependence specification for shape func" on Jan 5, 2021
mbrookhart (Contributor)

I like this, I think it makes a lot of sense, but I'll defer mainly to @zhiics and @icemelon9 since they mainly implemented the infrastructure for heterogeneous shape functions.

masahi (Member, Author) commented Jan 14, 2021

cc @icemelon9 @zhiics Any thoughts on this issue? I think this is important to discuss.

zhiics (Member) left a comment

I am okay with the change but I am not sure if there is a better solution. @icemelon9 can you take a look?

masahi force-pushed the shape-func-per-input-data-dependent branch 2 times, most recently from 1ffa75a to f658556 on January 14, 2021 19:52
masahi force-pushed the shape-func-per-input-data-dependent branch from f658556 to 11bd3f5 on January 15, 2021 01:43
masahi force-pushed the shape-func-per-input-data-dependent branch from 11bd3f5 to 6c1b318 on January 15, 2021 09:14
masahi (Member, Author) commented Jan 15, 2021

@icemelon9 it should be ready

icemelon (Member) left a comment

LGTM

icemelon merged commit 2992e9b into apache:main on Jan 15, 2021
icemelon (Member)

Thanks @masahi @zhiics @mbrookhart

masahi added a commit to masahi/tvm that referenced this pull request Jan 18, 2021
…7210)

* made TShapeDataDependant array

* add stub

* dyn strided slice working

* reshape also working

* remove log

* works on maskrcnn

* lint fix

* fix cpp test

* remove stale pop back

* add more doc

* dependant -> dependent

* remove redundant check

* remove data_dependent_
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021