Update the cuda API and enable tensor core for GEMM #9622

kexinzhao · 2018-04-04T00:32:51Z

cublasHgemm performs true FP16 computation, which is slow on non-Volta GPUs. We therefore use cublasGemmEx instead, which performs pseudo-FP16 computation: inputs and outputs are in FP16 while the computation is done in FP32. On Volta GPUs it can also be accelerated by tensor cores.
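To make the "FP16 in/out, FP32 compute" data flow concrete, here is a minimal CPU-side sketch. It is not the cuBLAS code path and not code from this PR: `round_to_half`, `dot_fp16_accumulate`, and `dot_fp32_accumulate` are hypothetical helpers, and the FP16 rounding is a simplification that only covers the normal FP16 range (no subnormals, overflow, or NaN). It compares a dot product whose accumulator is rounded to FP16 after every step with one that accumulates in FP32 and rounds only the final result.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: round a float to the nearest value representable in
// IEEE binary16. Simplified: normal range only, default round-to-nearest-even.
float round_to_half(float x) {
    if (x == 0.0f) return x;
    int exp = 0;
    float mant = std::frexp(x, &exp);               // x = mant * 2^exp, |mant| in [0.5, 1)
    float scaled = std::nearbyint(mant * 2048.0f);  // keep 11 significant bits
    return std::ldexp(scaled, exp - 11);
}

// "True FP16" style: every product and every partial sum is rounded to FP16.
float dot_fp16_accumulate(int k, float a, float b) {
    assert(k >= 0);
    float acc = 0.0f;
    for (int i = 0; i < k; ++i) {
        float prod = round_to_half(round_to_half(a) * round_to_half(b));
        acc = round_to_half(acc + prod);
    }
    return acc;
}

// "Pseudo FP16" style: FP16 inputs, FP32 accumulation, FP16 output.
float dot_fp32_accumulate(int k, float a, float b) {
    assert(k >= 0);
    float acc = 0.0f;
    for (int i = 0; i < k; ++i) {
        acc += round_to_half(a) * round_to_half(b);
    }
    return round_to_half(acc);
}
```

Under these assumptions, `dot_fp16_accumulate(4096, 1.0f, 1.0f)` stalls at 2048 because 2049 is not representable in FP16, while `dot_fp32_accumulate(4096, 1.0f, 1.0f)` returns 4096. This only illustrates the numeric side of the data flow; the timings reported in this PR concern speed, which this sketch does not model.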

In my tests, using GemmEx instead of Hgemm provides a significant speedup on both the Titan Xp and V100 GPUs.

VGG16 on ImageNet, batch size = 1: total time spent in the float16 mul op over 1000 iterations:

V100 GPU:
Hgemm vs GemmEx
1501 ms vs 451 ms

Titan Xp GPU:
Hgemm vs GemmEx
3259 ms vs 703 ms
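The speedup implied by these timings is simply the ratio of the two totals. A small sketch of that arithmetic follows; the `speedup` function is a hypothetical helper introduced for illustration, and the inputs are the numbers reported above.

```cpp
#include <cassert>

// Hypothetical helper: how many times faster the new kernel is,
// given total times in milliseconds for the same workload.
double speedup(double old_ms, double new_ms) {
    assert(old_ms > 0.0 && new_ms > 0.0);
    return old_ms / new_ms;
}
```

With the reported totals, `speedup(1501.0, 451.0)` is roughly 3.3 on the V100 and `speedup(3259.0, 703.0)` is roughly 4.6 on the Titan Xp.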

Tensor core example:
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

change from hgemm to gemmEx

42b6f75

kexinzhao changed the title ~~Update the cuda API for GEMM~~ Update the cuda API and enable tensor core for GEMM Apr 4, 2018

fix cpplint

96ea7b7

kexinzhao requested a review from qingqing01 April 4, 2018 01:48

kexinzhao added the 预测 (Inference; formerly named "Inference", covers C API inference issues, etc.) label Apr 5, 2018

wangkuiyi approved these changes Apr 6, 2018

wangkuiyi merged commit d00bd9e into PaddlePaddle:develop Apr 6, 2018

kexinzhao deleted the update_fp16_gemm branch April 27, 2018 17:20
