Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CUDA]batch_matmul tensorcore schedule #7146

Merged
merged 12 commits into from
Jan 11, 2021
Merged

Conversation

Meteorix
Copy link
Contributor

Add batch_matmul tensorcore schedule for bert inference. It shows better performance than cublas batch_matmul kernel.

@jcf94 @merrymercy could you help review this pr?

Comment on lines 665 to 667
if ((M % 8 == 0 and K % 16 == 0 and N % 32 == 0) or \
(M % 16 == 0 and K % 16 == 0 and N % 16 == 0) or \
(M % 32 == 0 and K % 16 == 0 and N % 8 == 0)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be better to also add data type check here or use some other user defined options?
TensorCore needs to be computed in float16, but I'm not sure if this will bring any loss in precision if we just try to transform all float32 batch_matmul ops to compute in lower precision.
Besides, TensorCore can also support datatype like int8 in some higher cuda versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just kept the same with code for dense_tensorcore https://github.com/apache/tvm/blob/main/python/tvm/relay/op/strategy/cuda.py#L679

Copy link
Member

@merrymercy merrymercy Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a bug in dense_tensorcore. We should not follow that.

shared_shedule(BS, BS_align)

shape = (wmma_m, wmma_n, wmma_k)
in_dtype = 'float16'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concerns about the data type as above.
It's fine for this PR, but will be better to add more check or just put some comments saying that the TensorCore needs to use a special data type, then if some one meets any trouble, they can know how to check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The shape of (M, K, N) must be multiple of (16, 16, 16) or (32, 16, 8) or (8, 16, 32) for now"

x_16 = te.compute((batch, M, K), lambda b, i, k: x[b, i, k].astype('float16'))
y_16 = te.compute((batch, N, K), lambda b, j, k: y[b, j, k].astype('float16'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@jcf94
Copy link
Contributor

jcf94 commented Dec 22, 2020

@Meteorix, great thanks for your PR! The code looks good to me.

jcf94
jcf94 previously approved these changes Dec 23, 2020
Copy link
Contributor

@jcf94 jcf94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is fine. Please fix the CI problem.
cc @merrymercy

@jwfromm
Copy link
Contributor

jwfromm commented Dec 23, 2020

@Meteorix out of curiosity can you share some of your benchmarking results? I'd love to know how much faster this performs than cublas.

def verify_batch_matmul(x_batch, y_batch, M, N, K):
x = te.placeholder((x_batch, M, K), name="x")
y = te.placeholder((y_batch, N, K), name="y")
dtype = x.dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth testing other datatypes as well, especially float16.

Copy link
Member

@merrymercy merrymercy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should fix the type issue mentioned by @jcf94.
The existing dense_tensorcore is buggy in my view. We should fix it instead of following it.
This small bug can lead to potential accuracy loss that is very hard to debug.

@merrymercy
Copy link
Member

cc tensor core maintainers @vinx13 @Laurawly @Hzfengsy

@jcf94 jcf94 dismissed their stale review December 24, 2020 01:49

Will re-check this PR later.

@Meteorix
Copy link
Contributor Author

@Meteorix out of curiosity can you share some of your benchmarking results? I'd love to know how much faster this performs than cublas.

@jwfromm following are some of the benchmark(tuning 1000 times). This schedule beat cublas on some shapes. That is also why I made batch_matmul_cublas autotunable in this pr.

Shape: [1, 64, 1024] [1, 4096, 1024]
batch_matmul_tensorcore.cuda   2.9238894640234948e-05
batch_matmul_cublas.cuda       2.7487557097865394e-05 
batch_matmul.cuda              0.00014189747117647058

Shape: [1, 64, 1024] [1, 1024, 1024]
batch_matmul_tensorcore.cuda   1.5578384301061096e-05 
batch_matmul_cublas.cuda       2.041829239101948e-05
batch_matmul.cuda              6.108717968157696e-05

Shape: [1, 128, 1024] [1, 4096, 1024]
batch_matmul_tensorcore.cuda   0.00011345079327976625
batch_matmul_cublas.cuda       0.00011074180193236715 
batch_matmul.cuda              0.00024510443407707913

Shape: [1, 128, 4096] [1, 1024, 4096]
batch_matmul_tensorcore.cuda   0.00017083510384959715
batch_matmul_cublas.cuda       0.00010608833085714285 
batch_matmul.cuda              0.00035638234315169367

Shape: [16, 128, 64] [16, 128, 64]
batch_matmul_cublas.cuda       6.046038943091678e-06
batch_matmul_tensorcore.cuda   4.134768131265665e-06 
batch_matmul.cuda              1.2430305571941866e-05

Shape: [16, 128, 128] [16, 64, 128]
batch_matmul_tensorcore.cuda   4.74178964860194e-06 
batch_matmul_cublas.cuda       9.463372359711623e-06
batch_matmul.cuda              1.4179731404708587e-05

Shape: [1, 128, 1024] [1, 1024, 1024]
batch_matmul_tensorcore.cuda   3.857668104222821e-05
batch_matmul_cublas.cuda       2.3704257450575394e-05 
batch_matmul.cuda              0.0002515613367983368

@Meteorix
Copy link
Contributor Author

We should fix the type issue mentioned by @jcf94.
The existing dense_tensorcore is buggy in my view. We should fix it instead of following it.
This small bug can lead to potential accuracy loss that is very hard to debug.

@merrymercy I see your point. Maybe we can discuss it with other tensor core maintainers and file another pr to resolve this issue?

@Laurawly
Copy link
Contributor

We should fix the type issue mentioned by @jcf94.
The existing dense_tensorcore is buggy in my view. We should fix it instead of following it.
This small bug can lead to potential accuracy loss that is very hard to debug.

@merrymercy I see your point. Maybe we can discuss it with other tensor core maintainers and file another pr to resolve this issue?

I agree with @merrymercy and think we should fix the type issue that we overlooked before. We can either fix it in this PR or in a separate parallel PR. I'd like to help with that.

@jcf94
Copy link
Contributor

jcf94 commented Dec 24, 2020

Thanks! @Laurawly @merrymercy
I think it's fine to fix them in a new PR.

@Meteorix If we're not going to finish these here, you can add some TODO comments in the code and create a new issue for tracking. Please fix the CI problem and we can merge this.
#7147 is also fine, just need to add some unit tests for these modifications.

@@ -657,6 +657,23 @@ def batch_matmul_strategy_cuda(attrs, inputs, out_type, target):
name="batch_matmul_cublas.cuda",
plevel=15,
)
if target.kind.name == "cuda" and nvcc.have_tensorcore(tvm.gpu(0).compute_version):
Copy link
Contributor

@Laurawly Laurawly Dec 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use nvcc.have_tensorcore(target=target) here since tvm.gpu(0) might not exist.

Copy link
Contributor

@Laurawly Laurawly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Meteorix
Copy link
Contributor Author

Meteorix commented Jan 8, 2021

@jcf94 @merrymercy @Laurawly finally the ci passed. Also I have fixed the dtype check for batch_matmul. Please review this mr again.

Copy link
Contributor

@jcf94 jcf94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Meteorix Thanks! LGTM.

@jcf94 jcf94 dismissed merrymercy’s stale review January 11, 2021 01:50

Let's dismiss this temporary and we can continue on #7147

@jcf94 jcf94 merged commit 89e3688 into apache:main Jan 11, 2021
tkonolige pushed a commit to tkonolige/incubator-tvm that referenced this pull request Jan 11, 2021
* add batch_matmul_tensorcore

* add bmm cublas autotune

* add bmm tests

* out_shape for bmm_tensorcore

* fix comments

* code format

* add todos for tensorcore datatype checking

* fix lint

* fix have_tensorcore

* add dtype check for batch_matmul_tensorcore
@tqchen
Copy link
Member

tqchen commented Jan 12, 2021

@jcf94 @Meteorix @jwfromm because our TOPI test stage does not gaurantee uses the tensorcore GPU(we had two pascal GPUs), it would be useful to optionally skip it, to avoid flaky CI error on the main.

@tqchen
Copy link
Member

tqchen commented Jan 13, 2021

created #7277 to track the issue

masahi pushed a commit to masahi/tvm that referenced this pull request Jan 14, 2021
* add batch_matmul_tensorcore

* add bmm cublas autotune

* add bmm tests

* out_shape for bmm_tensorcore

* fix comments

* code format

* add todos for tensorcore datatype checking

* fix lint

* fix have_tensorcore

* add dtype check for batch_matmul_tensorcore
masahi pushed a commit to masahi/tvm that referenced this pull request Jan 18, 2021
* add batch_matmul_tensorcore

* add bmm cublas autotune

* add bmm tests

* out_shape for bmm_tensorcore

* fix comments

* code format

* add todos for tensorcore datatype checking

* fix lint

* fix have_tensorcore

* add dtype check for batch_matmul_tensorcore
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
* add batch_matmul_tensorcore

* add bmm cublas autotune

* add bmm tests

* out_shape for bmm_tensorcore

* fix comments

* code format

* add todos for tensorcore datatype checking

* fix lint

* fix have_tensorcore

* add dtype check for batch_matmul_tensorcore
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
* add batch_matmul_tensorcore

* add bmm cublas autotune

* add bmm tests

* out_shape for bmm_tensorcore

* fix comments

* code format

* add todos for tensorcore datatype checking

* fix lint

* fix have_tensorcore

* add dtype check for batch_matmul_tensorcore
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021
* add batch_matmul_tensorcore

* add bmm cublas autotune

* add bmm tests

* out_shape for bmm_tensorcore

* fix comments

* code format

* add todos for tensorcore datatype checking

* fix lint

* fix have_tensorcore

* add dtype check for batch_matmul_tensorcore
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants