onnxruntime inference is around 5 times slower than pytorch when using GPU #10303
Comments
I think the problem is that you are measuring CPU time instead of GPU time in the PyTorch case. CUDA is asynchronous, so CPU timers like time.time() won't measure the actual inference time. You can try the solution provided in this PyTorch discussion and see if it's more consistent with the ONNX time. |
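To illustrate the point about asynchronous execution, here is a minimal sketch of a timing helper that synchronizes the device before starting and before stopping the clock. The helper name `mean_latency_ms` and its parameters are hypothetical and not taken from the thread; `sync` is meant to be something like `torch.cuda.synchronize` when timing a CUDA model.

```python
import time


def mean_latency_ms(fn, sync=None, warmup=3, iters=20):
    """Average wall-clock latency of fn() in milliseconds.

    `sync` should block until the device has finished all queued work
    (for example torch.cuda.synchronize); it defaults to a no-op for CPU.
    """
    if sync is None:
        sync = lambda: None  # nothing to wait for on CPU
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    sync()  # drain queued work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for all launched kernels before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters


# Hypothetical usage, assuming `resnet` and `x` already live on the GPU:
#   with torch.no_grad():
#       ms = mean_latency_ms(lambda: resnet(x), sync=torch.cuda.synchronize)
```

Without the final synchronize, the timer may stop as soon as the kernels have been launched rather than when they have finished, which can make the PyTorch number look smaller than it really is.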
Thanks for the reply
Unfortunately, I still don't see much of a difference in the PyTorch inference time
The onnxruntime inference time is still the same (around 48 ms), so it's still around 5 times slower |
I think you need to put the record and the synchronize inside your loop. Something like this:
with torch.no_grad():
    resnet.eval()
    for i in range(total_samples):
        start.record()
        out = resnet(x)
        end.record()
        torch.cuda.synchronize()
        total_time += start.elapsed_time(end) |
Changed it the way you have suggested. |
Ok so it's probably not that |
Is the input actually on GPU? In the code you provided, I see: Some other things to try: You can also use CUDA profiling tools like nvprof to get more detailed GPU usage info. |
@nssrivathsa, it is not a fair comparison, since the latency for ORT includes copying the input tensor from CPU to GPU, while the PyTorch measurement does not. Could you bind a GPU input and measure the latency again? |
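A rough sketch of what binding a GPU input could look like with the onnxruntime Python IOBinding API follows. The helper name `bind_gpu_io` is hypothetical, and the model path, input shape, and tensor names in the usage comments are assumptions rather than values from the original script.

```python
def bind_gpu_io(session, input_name, output_name, gpu_value, device="cuda", device_id=0):
    """Bind an input that already lives on the GPU and let ORT allocate the output there.

    `session` is expected to be an onnxruntime.InferenceSession and `gpu_value`
    an onnxruntime.OrtValue that was created on the target device.
    """
    binding = session.io_binding()
    # The input is already on the device, so no host-to-device copy happens per run.
    binding.bind_ortvalue_input(input_name, gpu_value)
    # Ask ORT to allocate the output on the same device instead of copying it back.
    binding.bind_output(output_name, device_type=device, device_id=device_id)
    return binding


# Hypothetical usage (assumed file name and input shape):
#   import numpy as np
#   import onnxruntime as ort
#   sess = ort.InferenceSession(
#       "resnet18.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
#   x = np.random.rand(1, 3, 224, 224).astype(np.float32)
#   x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)  # one-time copy, outside the timed loop
#   binding = bind_gpu_io(sess, sess.get_inputs()[0].name, sess.get_outputs()[0].name, x_gpu)
#   sess.run_with_iobinding(binding)  # this call is the part to time
```

The idea is that the host-to-device copy happens once when the `OrtValue` is created, so the timed region covers only the model execution, which is closer to what the PyTorch loop measures.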
and I still see no difference in the timing
|
As you can see from my reply above, io_binding.bind_cpu_input should also do what you are suggesting. Nevertheless, once I modify my code as -
still the timings on GPU for 1000 samples are -
|
The official docs are here: https://onnxruntime.ai/docs/api/python/api_summary.html https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.IOBinding.bind_cpu_input
@RandySheriffH is there some documentation about CUDA kernel profiling in ORT? You can look for host to device memcpy's with nvprof to see whether the input is actually getting copied from CPU to GPU. |
I could not find the issue with profiling, as everything seems to take more time |
@nssrivathsa, any update on this? Are you able to find the cause? |
@tianleiwu No, I am still not able to find the cause. But, like I mentioned, it does not happen when I run it on some more advanced GPUs |
@nssrivathsa Even on Tesla T4, I am seeing 2x slow down using onnx versus pytorch + fp16. |
@yunjiangster I haven't used any quantization yet, the script is same as what I have shared above and with that I see improvements when i use it with Tesla T4 on AWS |
We're having the same issues with our models - seeing a ~2x slow-down between running our models on GPU with PyTorch vs with ONNX-runtime. This is very problematic, and forces us to search for another solution for putting our models in production... Any help / update on this issue would be greatly appreciated! I'm happy to assist in the debugging if it can help, thanks! FYI, using the above example seems to work for us though, we are seeing similar speeds between the ONNX and PyTorch models. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations for example? I can attach the model graph if this can help |
Seeing similar problems as well -- we saw 2.3x slower on Could people from Microsoft step out and confirm whether it's the case? Thanks |
I've made a new issue for our problem as it might not be fully related to this one, since I cannot reproduce the same slow-down with the ResNet model |
@CanyonWind & @thomas-beznik, I recommend the nvprof tool to profile your model. It will tell you which kernels use the most GPU time. For example, in an older version of the huggingface stable diffusion pipeline, the UNet model has ScatterND, which is very slow in ORT. It is possible to modify the modeling code to avoid those slow operators. For a diffusion model with Conv operators, I suggest trying cudnn_conv_use_max_workspace in the CUDA provider options like this. Sometimes, it can make a big difference. In my benchmark of the huggingface stable diffusion pipeline, if you use the original ONNX FP32 model, it could be 2x slower. If you combine fp16 and conv algo search, ORT could be 25% faster than PyTorch:
|
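For readers who want to try the CUDA provider option mentioned above, here is a minimal sketch of how the provider list could be built. The helper name `cuda_providers` and the file name in the usage comment are hypothetical; `cudnn_conv_use_max_workspace` and `cudnn_conv_algo_search` are option names of the onnxruntime CUDA execution provider, but the values shown are only a starting point and not the settings used in the benchmark above.

```python
def cuda_providers(device_id=0, max_workspace=True, algo_search="EXHAUSTIVE"):
    """Build a providers list for onnxruntime.InferenceSession with cuDNN conv tuning."""
    if algo_search not in ("EXHAUSTIVE", "HEURISTIC", "DEFAULT"):
        raise ValueError("unknown cudnn_conv_algo_search value: %r" % (algo_search,))
    cuda_options = {
        "device_id": device_id,
        # Let cuDNN use as much workspace memory as it wants when picking conv algorithms.
        "cudnn_conv_use_max_workspace": "1" if max_workspace else "0",
        # How cuDNN searches for a convolution algorithm.
        "cudnn_conv_algo_search": algo_search,
    }
    # Keep the CPU provider as a fallback for nodes the CUDA provider cannot run.
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


# Hypothetical usage (assumed model file name):
#   import onnxruntime
#   sess = onnxruntime.InferenceSession("model.onnx", providers=cuda_providers())
```

A larger workspace can let cuDNN choose faster convolution algorithms at the cost of more GPU memory, so it is worth measuring both settings on the model in question.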
@nssrivathsa, I ran your script on a V100 GPU with PyTorch 1.12.1+cu116, OnnxRuntime-gpu 1.12.1, the latest CUDA 11.7, and the latest cuDNN 8.5.0.96. Here is the output: It seems that ORT is much faster than PyTorch on V100. Here is the script:
|
Similar issue described here - #12880 (comment) |
@tianleiwu Hi, thanks for the script. However, when batch size is large, like 128, ONNX is still much slower than PyTorch. |
@ma-xu, in what environment (PyTorch, CUDA, cuDNN version and GPU) did you find that ONNX is much slower than PyTorch? ONNX Runtime and PyTorch both use cuDNN as the backend for convolution. I would expect the latency to be close for large batch sizes. If you find a significant difference for a large batch size, that usually indicates some integration issue. I did a test with batch size 128 on V100 with CUDA 11.7, PyTorch 1.13.1+cu117, ORT 1.14.0. ORT is still faster:
|
@tianleiwu Thanks for your reply. It may be caused by some ONNX errors; I uploaded the output here. Hope this can help. My configuration is: |
I have an exported model from Tensorflow to onnx. The onnx runtime using DirectML is also very slow on RDNA2 platform. It is about 3 times faster than CPU, but that is really slow. You can even see that the GPU is not utilised properly as the power consumption of the card won't go up. |
I also encountered this problem when deploying the model. When running inference with onnxruntime-gpu, the running time is longer than with the torch model. Has there been any progress on this? I really need your help, thank you very much! |
Describe the bug
Inference time of onnxruntime is 5x slower than the PyTorch model on GPU, but 2.5x faster on CPU
System information
To Reproduce
Current behavior
If run on CPU,
but, if run on GPU, I see
If I change graph optimizations to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvements in inference time on GPU, but it's still slower than PyTorch.
I had read about similar issues here and ensured that I do the IO binding so that the inputs are on GPU.
When converting the resnet to onnx, I see traces like
%193 : Float(64, 3, 7, 7, strides=[147, 49, 7, 1], requires_grad=0, device=cuda:0),
so, the nodes of the model are on GPU.
Further, during the processing for onnxruntime, I print device usage stats and I see this -
So, GPU device is being used.
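One extra check that may help here, sketched under the assumption that `session` is an onnxruntime.InferenceSession: ask the session which execution providers it actually registered. The helper name `uses_cuda` is hypothetical.

```python
def uses_cuda(session):
    """Return True if the session registered the CUDA execution provider."""
    # get_providers() lists the providers registered for this session.
    return "CUDAExecutionProvider" in session.get_providers()


# Hypothetical usage (assumed model file name):
#   import onnxruntime
#   sess = onnxruntime.InferenceSession(
#       "resnet18.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
#   print(uses_cuda(sess))
```

A True result only shows that the CUDA provider was loaded; individual nodes can still be assigned to the CPU provider, so profiling remains the way to see where time is actually spent.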
Further, I have used the resnet18.onnx model from the ModelZoo to see if it is a converted-model issue, but I get the same results.
So, I cannot seem to figure this out any further, and I have been stuck here for quite a few days.
Could somebody please point out what the issue could be here?
Thanks