
onnxruntime inference is around 5 times slower than pytorch when using GPU #10303

Open
nssrivathsa opened this issue Jan 17, 2022 · 26 comments
Labels
core runtime (issues related to core runtime), ep:CUDA (issues related to the CUDA execution provider)

Comments

@nssrivathsa

nssrivathsa commented Jan 17, 2022

Describe the bug
Inference time of onnxruntime is about 5x slower than the PyTorch model on GPU, BUT 2.5x faster on CPU

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • ONNX Runtime installed from (source or binary): Source
  • ONNX Runtime version: 1.11.0 (onnx version 1.10.1)
  • Python version: 3.8.12
  • CUDA/cuDNN version: cuda version 11.5, cudnn version 8.2
  • GPU model and memory: Quadro M2000M, 4 GB

To Reproduce

import os
import time

import onnx
import onnxruntime
import torch
import torchvision.models as models

batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
def convert_to_onnx(resnet):
   resnet.eval()
   dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
   input_names = [ 'input' ]
   output_names = [ 'output' ]
   torch.onnx.export(resnet, 
               dummy_input,
               "resnet18.onnx",
               verbose=True,
               opset_version=13,
               input_names=input_names,
               output_names=output_names,
               export_params=True,
               do_constant_folding=True,
               dynamic_axes={
                  'input': {0: 'batch_size'},  # variable length axes
                  'output': {0: 'batch_size'}}        
               )
                  
def infer_pytorch(resnet):
   print('Pytorch Inference')
   print('==========================')
   print()

   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)

   latency = []
   for i in range(total_samples):
      t0 = time.time()
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
      latency.append(time.time() - t0)

   print('Number of runs:', len(latency))
   print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))  

def to_numpy(tensor):
   return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
   print('Onnxruntime Inference')
   print('==========================')
   print()

   onnx_model = onnx.load("resnet18.onnx")
   onnx.checker.check_model(onnx_model)

   # Input
   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)
   x = to_numpy(x)

   so = onnxruntime.SessionOptions()
   so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
   so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
   
   exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']

   model_onnx_path = os.path.join(".", "resnet18.onnx")
   ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

   options = ort_session.get_provider_options()
   cuda_options = options['CUDAExecutionProvider']
   cuda_options['cudnn_conv_use_max_workspace'] = '1'
   ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])

   #IOBinding
   input_names = ort_session.get_inputs()[0].name
   output_names = ort_session.get_outputs()[0].name
   io_binding = ort_session.io_binding()

   io_binding.bind_cpu_input(input_names, x)
   io_binding.bind_output(output_names, device)
   
   #warm up run
   ort_session.run_with_iobinding(io_binding)
   ort_outs = io_binding.copy_outputs_to_cpu()

   latency = []

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
      ort_outs = io_binding.copy_outputs_to_cpu()
   print('Number of runs:', len(latency))
   print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))   

if __name__ == '__main__':
   torch.cuda.empty_cache()
   resnet = (models.resnet18(pretrained=True)).to(device=device)
   convert_to_onnx(resnet)
   infer_onnxruntime()
   infer_pytorch(resnet)

Current behavior

If run on CPU,

Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms

but, if run on GPU, I see

Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms

If I change graph optimizations to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in inference time on GPU, but it's still slower than PyTorch.

I had read about similar issues here and ensured that I do the IO binding so that the inputs are on the GPU.
When converting the resnet to ONNX, I see traces like
%193 : Float(64, 3, 7, 7, strides=[147, 49, 7, 1], requires_grad=0, device=cuda:0),
so the nodes of the model are on the GPU.

Further, during the onnxruntime processing, I print device usage stats and I see this:

Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached:    0.1 GB

So, the GPU device is being used.
Further, I have used the resnet18.onnx model from the Model Zoo to check whether it is a converted-model issue, but I get the same results.

So, I cannot seem to figure this out any further, and I have been stuck here for quite a few days.
Could somebody please point out what the issue could be here?

Thanks

@adepierre

adepierre commented Jan 17, 2022

I think the problem is that you measure CPU time instead of GPU time in the PyTorch case. CUDA is asynchronous, so CPU timers like time.time() won't measure the right inference time. You can try the solution provided in this PyTorch discussion and see if it's more consistent with the ONNX time.

@nssrivathsa
Author

I think the problem is that you measure CPU time instead of GPU time in the PyTorch case. CUDA is asynchronous, so CPU timers like time.time() won't measure the right inference time. You can try the solution provided in this PyTorch discussion and see if it's more consistent with the ONNX time.

Thanks for the reply.
I tried what was mentioned in the link you provided by modifying my infer_pytorch(resnet) function as:

   start = torch.cuda.Event(enable_timing=True)
   end = torch.cuda.Event(enable_timing=True)
   start.record()
   for i in range(total_samples):
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
   end.record()
   torch.cuda.synchronize()
   print('Number of runs:', total_samples)
   print("Average PyTorch {} Inference time = {} ms".format(device.type, start.elapsed_time(end)/total_samples))  

Unfortunately, I still don't see much of a difference in the PyTorch inference time:

Average PyTorch cuda Inference time = 9.928244140625 ms

The onnxruntime inference is still the same (around 48 ms), so it's still around 5 times slower.

@adepierre

adepierre commented Jan 17, 2022

I think you need to put the record and the synchronize inside your loop. Something like that

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
total_time = 0.0
with torch.no_grad():
    resnet.eval()
    for i in range(total_samples):
        start.record()
        out = resnet(x)
        end.record()
        torch.cuda.synchronize()  # wait for the GPU to finish before reading the timer
        total_time += start.elapsed_time(end)  # elapsed_time returns milliseconds

@nssrivathsa
Author

I think you need to put the record and the synchronize inside your loop. Something like that

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
total_time = 0.0
with torch.no_grad():
    resnet.eval()
    for i in range(total_samples):
        start.record()
        out = resnet(x)
        end.record()
        torch.cuda.synchronize()  # wait for the GPU to finish before reading the timer
        total_time += start.elapsed_time(end)  # elapsed_time returns milliseconds

Changed it the way you suggested.
I still get:
Average PyTorch cuda Inference time = 9.747972169876098 ms

@adepierre

OK, so it's probably not that.

edgchen1 added the ep:CUDA (issues related to the CUDA execution provider) and type:performance labels on Jan 18, 2022
@edgchen1
Contributor

Is the input actually on GPU? In the code you provided, I see:
io_binding.bind_cpu_input(input_names, x)

Some other things to try:
ORT has some support for profiling: https://onnxruntime.ai/docs/performance/tune-performance.html#profiling-and-performance-report

You can also use CUDA profiling tools like nvprof to get more detailed GPU usage info.
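
To elaborate on the built-in profiler: it is enabled through SessionOptions, and the session writes a Chrome-trace JSON file when you call end_profiling(). A minimal sketch against the script above (file names are illustrative):

import onnxruntime

so = onnxruntime.SessionOptions()
so.enable_profiling = True  # write a JSON trace viewable in chrome://tracing

ort_session = onnxruntime.InferenceSession(
   "resnet18.onnx", so, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# ... run_with_iobinding / run calls as in the repro script ...

profile_file = ort_session.end_profiling()  # returns the path of the emitted JSON file
print("Profile written to:", profile_file)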

@tianleiwu
Contributor

@nssrivathsa, it is not a fair comparison since the ORT latency includes copying the input tensor from CPU to GPU, while PyTorch's does not. Could you bind a GPU input and measure the latency again?
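
For reference, a minimal sketch of binding an input that already resides on the GPU, using an OrtValue and the IOBinding API from the script above (shapes follow the original report):

import numpy as np
import onnxruntime

ort_session = onnxruntime.InferenceSession(
   "resnet18.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

x = np.random.randn(1, 3, 224, 224).astype(np.float32)
x_gpu = onnxruntime.OrtValue.ortvalue_from_numpy(x, "cuda", 0)  # one-time copy to GPU, outside the timed loop

io_binding = ort_session.io_binding()
io_binding.bind_ortvalue_input(ort_session.get_inputs()[0].name, x_gpu)
io_binding.bind_output(ort_session.get_outputs()[0].name, "cuda")  # keep the output on the GPU

ort_session.run_with_iobinding(io_binding)  # only this call goes inside the timed loop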

@nssrivathsa
Author

Is the input actually on GPU? In the code you provided, I see: io_binding.bind_cpu_input(input_names, x)

Some other things to try: ORT has some support for profiling: https://onnxruntime.ai/docs/performance/tune-performance.html#profiling-and-performance-report

You can also use CUDA profiling tools like nvprof to get more detailed GPU usage info.

@edgchen1

  1. Yes, the input is on the GPU. io_binding.bind_cpu_input puts the input from the CPU to GPU as per 'Scenario 1' in this - http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html under IOBinding. Nevertheless, I have modified it as
data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0)
io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr())

and I still see no difference in the timing

  2. I did go through the profiling stuff from ONNX and generated the runtime JSON file, but could not infer much from it once I opened it in Google Chrome. Is there any help available on how to interpret the results?

@nssrivathsa
Author

@nssrivathsa, it is not a fair comparison since the ORT latency includes copying the input tensor from CPU to GPU, while PyTorch's does not. Could you bind a GPU input and measure the latency again?

@tianleiwu

As you can see from my reply above, io_binding.bind_cpu_input should also do what you are suggesting. Nevertheless, once I modify my code as follows:

data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0)
io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr())

the timings on GPU for 1000 samples are still:

Average onnxruntime cuda Inference time = 44.55 ms
Average PyTorch cuda Inference time = 9.780553787231446 ms

@edgchen1
Contributor

Yes, the input is on the GPU. io_binding.bind_cpu_input puts the input from the CPU to GPU as per 'Scenario 1' in this - http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html under IOBinding.

The official docs are here: https://onnxruntime.ai/docs/api/python/api_summary.html

https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.IOBinding.bind_cpu_input
The input is on CPU so it will need to be copied to the GPU.

I did go through the profiling stuff from onnx and generated the runtime json file but could not infer much from it once I opened it from Google Chrome. Is there any help available on how to interpret the results?

@RandySheriffH is there some documentation about CUDA kernel profiling in ORT?

You can look for host to device memcpy's with nvprof to see whether the input is actually getting copied from CPU to GPU.
https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview
Visual profiler can be used to view the nvprof results.
https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual

@nssrivathsa
Author

I could not find the issue with profiling, as everything seems to take more time.
But I observed that running it on a machine with a more advanced GPU makes onnxruntime faster than PyTorch.
For example, on a Tesla T4 or Tesla V100 GPU, onnxruntime is around 4 times faster than PyTorch.

@tianleiwu
Contributor

@nssrivathsa, any update on this? Were you able to find the cause?
If you have nvprof results for ORT and PyTorch, you can compare the profiling results to see which CUDA kernel or memory copy ORT spends more time on.

@nssrivathsa
Author

@tianleiwu No, I am still not able to find the cause. But, like I mentioned, it does not happen when I run it on more advanced GPUs.

@yunjiangster

@nssrivathsa Even on a Tesla T4, I am seeing a 2x slowdown using ONNX versus PyTorch + fp16.

@nssrivathsa
Author

@yunjiangster I haven't used any quantization yet; the script is the same as what I have shared above, and with that I see improvements when I use it with a Tesla T4 on AWS.

sophies927 added the core runtime (issues related to core runtime) label and removed the type:performance label on Aug 12, 2022
@thomas-beznik

thomas-beznik commented Sep 7, 2022

We're having the same issue with our models - seeing a ~2x slow-down between running our models on GPU with PyTorch vs. with ONNX Runtime. This is very problematic and forces us to search for another solution for putting our models in production... Any help / update on this issue would be greatly appreciated! I'm happy to assist in the debugging if it can help, thanks!

FYI, using the above example seems to work for us though; we are seeing similar speeds between the ONNX and PyTorch models. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations, for example? I can attach the model graph if this can help.

@CanyonWind

Seeing similar problems as well -- we saw inference 2.3x slower with ORT + the CUDA EP compared to native PyTorch. In our case, the model is a large, complicated diffusion model. I have a hunch that it's caused by the exported ONNX graph having too many glue operators.

Could people from Microsoft step in and confirm whether this is the case? Thanks

@thomas-beznik

We're having the same issue with our models - seeing a ~2x slow-down between running our models on GPU with PyTorch vs. with ONNX Runtime. This is very problematic and forces us to search for another solution for putting our models in production... Any help / update on this issue would be greatly appreciated! I'm happy to assist in the debugging if it can help, thanks!

FYI, using the above example seems to work for us though; we are seeing similar speeds between the ONNX and PyTorch models. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations, for example? I can attach the model graph if this can help.

I've made a new issue for our problem as it might not be fully related to this one, since I cannot reproduce the same slow-down with the ResNet model

@tianleiwu
Contributor

tianleiwu commented Sep 27, 2022

@CanyonWind & @thomas-beznik, I recommend the nvprof tool to profile your model. It will tell you which kernels use the most time. For example, in an older version of the Hugging Face stable diffusion pipeline, the UNet model has ScatterND, which is very slow in ORT. It is possible to modify the modeling code to avoid those slow operators.

For a diffusion model with Conv operators, I suggest trying cudnn_conv_use_max_workspace in the CUDA provider options like this. Sometimes it can make a big difference.
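
For example, CUDA execution provider options can be passed as a dict when the session is created; a minimal sketch (option values are illustrative):

import onnxruntime

providers = [
   ("CUDAExecutionProvider", {
      "device_id": 0,
      "cudnn_conv_use_max_workspace": "1",     # allow cuDNN a larger workspace for conv algorithms
      "cudnn_conv_algo_search": "EXHAUSTIVE",  # spend warm-up time searching for the fastest conv algo
   }),
   "CPUExecutionProvider",
]

session = onnxruntime.InferenceSession("model.onnx", providers=providers)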

In my benchmark of the Hugging Face stable diffusion pipeline, the original ONNX FP32 model could be 2x slower. If you combine FP16 and conv algo search, ORT could be 25% faster than PyTorch:

Latency (seconds per query)    T4     V100   A100
PyTorch FP16                   12.8    5.1    3.1
ORT FP32                       28.4   10.1    7.2
ORT FP16                       13.1    6.4    4.3
ORT FP16 + Conv Algo Search     9.6    3.8    2.4
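
As a rough illustration of the FP16 path, a hedged sketch using the onnxconverter-common package (assumes it is installed and that the model's operators tolerate FP16; file names are illustrative):

import onnx
from onnxconverter_common import float16

model = onnx.load("resnet18.onnx")
# keep_io_types keeps graph inputs/outputs in FP32 while internal tensors become FP16
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "resnet18_fp16.onnx")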

@tianleiwu
Contributor

tianleiwu commented Sep 27, 2022

@nssrivathsa, I ran your script on a V100 GPU with PyTorch 1.12.1+cu116, onnxruntime-gpu 1.12.1, the latest CUDA 11.7, and the latest cuDNN 8.5.0.96. Here is the output:
[screenshot: benchmark output]

It seems that ORT is much faster than PyTorch on V100.

Here is the script:

import os
import torch
import onnx
import torchvision.models as models
import onnxruntime
import time
import numpy

batch_size = 128
total_samples = 1000
device = torch.device('cuda:0')

def convert_to_onnx(resnet):
   resnet.eval()
   dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
   input_names = [ 'input' ]
   output_names = [ 'output' ]
   torch.onnx.export(resnet, 
               dummy_input,
               "resnet18.onnx",
               verbose=True,
               opset_version=13,
               input_names=input_names,
               output_names=output_names,
               export_params=True,
               do_constant_folding=True,
               dynamic_axes={
                  'input': {0: 'batch_size'},  # variable length axes
                  'output': {0: 'batch_size'}}
               )

def infer_pytorch(resnet):
   print('Pytorch Inference')
   print('==========================')
   print()

   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)

   latency = []
   for i in range(total_samples):
      t0 = time.time()
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
      latency.append(time.time() - t0)

   print('Number of runs:', len(latency))
   print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))  

def to_numpy(tensor):
   return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
   print('Onnxruntime Inference')
   print('==========================')
   print()

   onnx_model = onnx.load("resnet18.onnx")
   onnx.checker.check_model(onnx_model)

   # Input
   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device).contiguous()

   so = onnxruntime.SessionOptions()
   exproviders = ["CUDAExecutionProvider", "CPUExecutionProvider"]

   model_onnx_path = os.path.join(".", "resnet18.onnx")
   ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

   #IOBinding see https://onnxruntime.ai/docs/api/python/api_summary.html for more info
   input_name = ort_session.get_inputs()[0].name
   output_name = ort_session.get_outputs()[0].name
   io_binding = ort_session.io_binding()

   io_binding.bind_input(
    name=input_name,
    device_type='cuda',
    device_id=0,
    element_type=numpy.float32,
    shape=tuple(x.shape),
    buffer_ptr=x.data_ptr(),
   )

   io_binding.bind_output(output_name, 'cuda')

   #warm up run
   ort_session.run_with_iobinding(io_binding)
   ort_outs = io_binding.copy_outputs_to_cpu()

   latency = []

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
      ort_outs = io_binding.copy_outputs_to_cpu()
   print('Number of runs:', len(latency))
   print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))   

if __name__ == '__main__':
   torch.cuda.empty_cache()
   resnet = (models.resnet18(pretrained=True)).to(device=device)
   convert_to_onnx(resnet)
   infer_onnxruntime()
   infer_pytorch(resnet)

@davidmezzetti

Similar issue described here - #12880 (comment)

@ma-xu

ma-xu commented Jan 27, 2023

@tianleiwu Hi, thanks for the script. However, when the batch size is large, like 128, ONNX is still much slower than PyTorch.

@tianleiwu
Contributor

tianleiwu commented Feb 22, 2023

@ma-xu, in what environment (PyTorch, CUDA, cuDNN version, and GPU) did you find ONNX much slower than PyTorch?

ONNX Runtime and PyTorch both use cuDNN as the backend for convolution, so I would expect the latency to be close for large batch sizes. If you see a significant difference at a large batch size, that usually indicates some integration issue.

I did a test with batch size 128 on V100 with CUDA 11.7, PyTorch 1.13.1+cu117, ORT 1.14.0. ORT is still faster:

Onnxruntime Inference
==========================

Number of runs: 1000
Average onnxruntime cuda Inference time = 24.17 ms
Pytorch Inference
==========================

Number of runs: 1000
Average PyTorch cuda Inference time = 30.39 ms

@ma-xu

ma-xu commented Feb 23, 2023

@tianleiwu Thanks for your reply. It may be caused by some ONNX errors; I uploaded the output here. Hope this can help.

My configuration is:
system_verision_python 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0]
system_verision_pytorch 1.12.1
system_verision_timm 0.6.12
system_verision_cuda 11.3
system_verision_cudnn 8302
system_verision_onnx 1.13.0
system_verision_onnxruntime 1.13.1
system_verision_tensorrt 8.5.2.2
system_gpu_name NVIDIA A100-SXM4-40GB
system_cpu_arch X86_64
system_cpu_brand_raw AMD EPYC 7742 64-Core Processor

benchmark_out.txt

@cminnoy

cminnoy commented Oct 9, 2023

I have a model exported from TensorFlow to ONNX. The ONNX runtime using DirectML is also very slow on the RDNA2 platform. It is about 3 times faster than CPU, but that is still really slow. You can even see that the GPU is not being utilized properly, as the power consumption of the card won't go up.

@HarperIcey

I also encountered this problem when deploying a model. When running inference with onnxruntime-gpu, it is slower than the torch model. Do you have any research progress now? I really need your help, thank you very much!
