
Update the CUDA API and enable tensor core for GEMM #9622

Merged: 2 commits merged into PaddlePaddle:develop on Apr 6, 2018

Conversation

kexinzhao (Contributor) commented on Apr 4, 2018

fix #9625
fix #9626

cublasHgemm performs true FP16 computation, which is slow on non-Volta GPUs. We therefore use cublasGemmEx instead, which performs pseudo-FP16 computation: inputs and outputs are in FP16, while the computation itself runs in FP32. On Volta GPUs this path can additionally be accelerated by tensor cores.
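For reference, a minimal sketch (not this PR's actual diff) of the cuBLAS calls involved: cublasGemmEx with FP16 inputs/outputs and an FP32 compute type, after opting in to tensor-op math. The helper name `HalfGemm` and the column-major, no-transpose layout are assumptions for illustration; `CUBLAS_GEMM_DFALT_TENSOR_OP` is the CUDA 9 spelling of the default tensor-op algorithm.

```cpp
// Sketch: pseudo-FP16 GEMM via cublasGemmEx.
// Computes C = alpha * A * B + beta * C with A (m x k), B (k x n),
// C (m x n) stored in half precision; accumulation is done in FP32.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void HalfGemm(cublasHandle_t handle, int m, int n, int k,
              const half* A, const half* B, half* C) {
  // With compute type CUDA_R_32F, alpha/beta are passed as float.
  const float alpha = 1.0f;
  const float beta = 0.0f;

  // Allow cuBLAS to dispatch to tensor cores; a no-op on pre-Volta GPUs.
  cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

  // FP16 in/out, FP32 compute. cuBLAS is column-major; a row-major
  // caller would swap the A/B operands and the m/n dimensions.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
               &alpha,
               A, CUDA_R_16F, /*lda=*/m,
               B, CUDA_R_16F, /*ldb=*/k,
               &beta,
               C, CUDA_R_16F, /*ldc=*/m,
               CUDA_R_32F,                    // compute in FP32
               CUBLAS_GEMM_DFALT_TENSOR_OP);  // tensor-op algorithm
}
```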

In my tests, using GemmEx instead of Hgemm provides a significant speedup on both a Titan Xp and a V100 GPU.

VGG16 on ImageNet, batch size = 1; total time spent in the float16 mul op over 1000 iterations:

| GPU      | Hgemm   | GemmEx |
| -------- | ------- | ------ |
| V100     | 1501 ms | 451 ms |
| Titan Xp | 3259 ms | 703 ms |

Tensor core example:
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

kexinzhao changed the title from "Update the CUDA API for GEMM" to "Update the CUDA API and enable tensor core for GEMM" on Apr 4, 2018
kexinzhao added the 预测 (Inference) label (formerly named "Inference"; covers C-API inference issues, etc.) on Apr 5, 2018
wangkuiyi merged commit d00bd9e into PaddlePaddle:develop on Apr 6, 2018
kexinzhao deleted the update_fp16_gemm branch on April 27, 2018

Successfully merging this pull request may close these issues:

- Need to enable tensor core for cublas gemm
- Need to replace Hgemm with faster fp16 GEMM kernel