[Paddle-Inference] Matmul_int8_convert: tensor*tensor #37285
Conversation
Thanks for your contribution!
LGTM
LGTM
LGTM for PADDLE_ENFORCE
```cpp
    int32_t pos, nvinfer1::PluginTensorDesc const* inOut, int32_t nbInputs,
    int32_t nbOutputs) const TRT_NOEXCEPT {
  PADDLE_ENFORCE_EQ(nbInputs, 2,
                    platform::errors::InvalidArgument("Must have 2 inputs, "
```
I'd suggest the error message carry some context. As written, users may not know in which scenario, or where exactly, the 2 inputs are required. This can be filled in in a follow-up.
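For illustration, a more contextual message could look roughly like this (a hedged sketch; the wording is illustrative, not the final code):

```cpp
// Illustrative only: name the plugin and echo the received value so users
// can tell where the check fired and what was actually passed in.
PADDLE_ENFORCE_EQ(
    nbInputs, 2,
    platform::errors::InvalidArgument(
        "The matmul int8 TensorRT plugin expects exactly 2 inputs "
        "(the X and Y operands of the matmul), but received %d.",
        nbInputs));
```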
Sure, I'll add that in my next PR ~ thanks~
…7285) — squashed commits:

* matmul_convert_int8
* matmul_convert_int8
* matmulconvert_int8
* Matmul_int8_convert: tensor*tensor
* Matmul_int8_convert: tensor*tensor
* Matmul_int8_convert: tensor*tensor
PR types
Others
PR changes
Others
Describe
Add an inference op_convert and plugin for int8 quantized matmul:

* Speeds up the matrix multiply by using the Tensor Cores of NVIDIA GPUs; the plugin implements int8, fp16, and fp32 paths.
* Passes alpha into the plugin so the scale is computed together with the matmul, fusing matmul + scale and accelerating inference.
* Adds a dynload implementation for dynamically loading libcublasLt.so.
* Adds unit tests for the corresponding quantization.
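As a rough illustration of the core idea (an int8 matmul on Tensor Cores with the scale folded into alpha), here is a minimal standalone cuBLASLt sketch. It is not the PR's plugin code: it assumes CUDA 11+, column-major device buffers, and dimensions/leading dimensions that satisfy cuBLASLt's int8 alignment rules, and it uses an int32 alpha (the scale type cuBLASLt pairs with int32 output), while the PR's plugin works with the scale op's floating-point alpha.

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>

// Minimal sketch: C = alpha * (A * B), with A/B int8 and C int32.
// Folding the scale into alpha is what fuses matmul + scale in one kernel.
// Status checks are elided for brevity; real code must check every call.
void Int8MatmulWithScale(cublasLtHandle_t handle, cudaStream_t stream,
                         const int8_t* A, const int8_t* B, int32_t* C,
                         int m, int n, int k, int32_t alpha) {
  cublasLtMatmulDesc_t op_desc;
  // int8 inputs accumulate in int32; alpha/beta are int32 in this mode.
  cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32I, CUDA_R_32I);

  cublasLtMatrixLayout_t a_desc, b_desc, c_desc;  // column-major layouts
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_8I, m, k, m);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_8I, k, n, k);
  cublasLtMatrixLayoutCreate(&c_desc, CUDA_R_32I, m, n, m);

  const int32_t beta = 0;
  cublasLtMatmul(handle, op_desc, &alpha, A, a_desc, B, b_desc, &beta,
                 C, c_desc, C, c_desc, /*algo=*/nullptr,
                 /*workspace=*/nullptr, /*workspaceSizeInBytes=*/0, stream);

  cublasLtMatrixLayoutDestroy(c_desc);
  cublasLtMatrixLayoutDestroy(b_desc);
  cublasLtMatrixLayoutDestroy(a_desc);
  cublasLtMatmulDescDestroy(op_desc);
}
```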
Performance test: A(1, 28, 256, 1024) * B(1, 28, 1024, 256)
Kernel execution time (matmul fused with scale):
Execution time of a single-op network (matmul fused with scale): (int8 matmul has to rearrange its input data to support Tensor Cores, which actually adds overhead; the speedup from the matrix computation only shows once the matrices are very large. This op's implementation pre-analyzes the tensors and automatically picks whichever of the int8, fp16, and fp32 plugins performs best; see the dispatch sketch after the summary below.)
Kernel execution time:
Execution time of a single-op network:
Summary: when the matrices are large, the speedup from the matmul int8 op is clear; when a scale op is fused in, the speedup is also noticeable.
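The automatic precision selection mentioned above could be sketched roughly as follows; every name and threshold here is hypothetical, standing in for the plugin's actual pre-analysis:

```cpp
#include <cstdint>

// Hypothetical sketch of the precision-dispatch idea: pick the fastest
// plugin datatype based on a pre-analysis of the input tensors.
enum class MatmulPrecision { kInt8, kFp16, kFp32 };

MatmulPrecision ChooseMatmulPrecision(bool has_int8_scales, bool fp16_allowed,
                                      int64_t m, int64_t n, int64_t k) {
  // int8 pays an input-rearrangement cost (Tensor Core layouts), so it
  // only wins once the matrices are large enough to amortize it.
  const int64_t kInt8MinWorkload = 1LL << 24;  // illustrative threshold
  if (has_int8_scales && m * n * k >= kInt8MinWorkload)
    return MatmulPrecision::kInt8;
  return fp16_allowed ? MatmulPrecision::kFp16 : MatmulPrecision::kFp32;
}
```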
Also: matmul int8 slightly reduces GPU memory usage, by roughly 5%.
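On the dynload point from the description: loading libcublasLt.so lazily at runtime usually boils down to the dlopen/dlsym pattern below (a minimal sketch with hypothetical names, not Paddle's actual dynload wrappers):

```cpp
#include <dlfcn.h>
#include <cstdio>

// Minimal sketch of lazy dynamic loading: resolve a cuBLASLt symbol from
// libcublasLt.so at first use instead of linking against it at build time.
// Build with -ldl on Linux.
template <typename FuncPtr>
FuncPtr LoadCublasLtSymbol(const char* name) {
  static void* handle = dlopen("libcublasLt.so", RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "failed to load libcublasLt.so: %s\n", dlerror());
    return nullptr;
  }
  return reinterpret_cast<FuncPtr>(dlsym(handle, name));
}

// Usage: resolve cublasLtCreate by name, then call through the pointer.
// using CublasLtCreateFn = cublasStatus_t (*)(cublasLtHandle_t*);
// auto create = LoadCublasLtSymbol<CublasLtCreateFn>("cublasLtCreate");
```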