
onnxruntime inference is around 5 times slower than pytorch when using GPU #10303

Open
nssrivathsa opened this issue Jan 17, 2022 · 26 comments
Labels
core runtime (issues related to core runtime), ep:CUDA (issues related to the CUDA execution provider)

Comments

@nssrivathsa

nssrivathsa commented Jan 17, 2022

Describe the bug
Inference time of onnxruntime is about 5x slower than the PyTorch model on GPU, BUT 2.5x faster on CPU

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • ONNX Runtime installed from (source or binary): Source
  • ONNX Runtime version: 1.11.0 (onnx version 1.10.1)
  • Python version: 3.8.12
  • CUDA/cuDNN version: cuda version 11.5, cudnn version 8.2
  • GPU model and memory: Quadro M2000M, 4 GB

To Reproduce

import os
import time

import onnx
import onnxruntime
import torch
import torchvision.models as models

batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
def convert_to_onnx(resnet):
   resnet.eval()
   dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
   input_names = [ 'input' ]
   output_names = [ 'output' ]
   torch.onnx.export(resnet, 
               dummy_input,
               "resnet18.onnx",
               verbose=True,
               opset_version=13,
               input_names=input_names,
               output_names=output_names,
               export_params=True,
               do_constant_folding=True,
               dynamic_axes={
                  'input': {0: 'batch_size'},  # variable length axes
                  'output': {0: 'batch_size'}}        
               )
                  
def infer_pytorch(resnet):
   print('Pytorch Inference')
   print('==========================')
   print()

   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)

   latency = []
   for i in range(total_samples):
      t0 = time.time()
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
      latency.append(time.time() - t0)

   print('Number of runs:', len(latency))
   print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))  

def to_numpy(tensor):
   return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
   print('Onnxruntime Inference')
   print('==========================')
   print()

   onnx_model = onnx.load("resnet18.onnx")
   onnx.checker.check_model(onnx_model)

   # Input
   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)
   x = to_numpy(x)

   so = onnxruntime.SessionOptions()
   so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
   so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
   
   exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']

   model_onnx_path = os.path.join(".", "resnet18.onnx")
   ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

   options = ort_session.get_provider_options()
   cuda_options = options['CUDAExecutionProvider']
   cuda_options['cudnn_conv_use_max_workspace'] = '1'
   ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])

   #IOBinding
   input_names = ort_session.get_inputs()[0].name
   output_names = ort_session.get_outputs()[0].name
   io_binding = ort_session.io_binding()

   io_binding.bind_cpu_input(input_names, x)
   io_binding.bind_output(output_names, device)
   
   #warm up run
   ort_session.run_with_iobinding(io_binding)
   ort_outs = io_binding.copy_outputs_to_cpu()

   latency = []

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
      ort_outs = io_binding.copy_outputs_to_cpu()
   print('Number of runs:', len(latency))
   print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))   

if __name__ == '__main__':
   torch.cuda.empty_cache()
   resnet = (models.resnet18(pretrained=True)).to(device=device)
   convert_to_onnx(resnet)
   infer_onnxruntime()
   infer_pytorch(resnet)

Current behavior

If run on CPU,

Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms

but, if run on GPU, I see

Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms

If I change graph optimizations to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in inference time on GPU, but it's still slower than PyTorch.

I had read about similar issues here and ensured that I do the IO binding so that the inputs are on the GPU.
When converting the resnet to ONNX, I see traces like
%193 : Float(64, 3, 7, 7, strides=[147, 49, 7, 1], requires_grad=0, device=cuda:0),
so the nodes of the model are on the GPU.

Further, during the onnxruntime processing, I print device usage stats and I see this:

Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached:    0.1 GB

So, the GPU device is being used.
Further, I have used the resnet18.onnx model from the Model Zoo to check whether it is a converted-model issue, but I get the same results.

So, I cannot seem to figure this out any further, and I have been stuck here for quite a few days.
Could somebody please point out what the issue could be here?

Thanks

@adepierre

adepierre commented Jan 17, 2022

I think the problem is that you measure CPU time instead of GPU time in the PyTorch case. CUDA is asynchronous, so CPU timers like time.time() won't measure the right inference time. You can try the solution provided in this PyTorch discussion and see if it's more consistent with the ONNX time.

@nssrivathsa
Author

I think the problem is that you measure CPU time instead of GPU time in the PyTorch case. CUDA is asynchronous, so CPU timers like time.time() won't measure the right inference time. You can try the solution provided in this PyTorch discussion and see if it's more consistent with the ONNX time.

Thanks for the reply.
I tried what was mentioned in the link you provided by modifying my infer_pytorch(resnet) function as:

   start = torch.cuda.Event(enable_timing=True)
   end = torch.cuda.Event(enable_timing=True)
   start.record()
   for i in range(total_samples):
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
   end.record()
   torch.cuda.synchronize()
   print('Number of runs:', total_samples)
   print("Average PyTorch {} Inference time = {} ms".format(device.type, start.elapsed_time(end)/total_samples))  

Unfortunately, I still don't see much of a difference in the PyTorch inference time:

Average PyTorch cuda Inference time = 9.928244140625 ms

The onnxruntime inference is still the same (around 48 ms), so it's still around 5 times slower.

@adepierre

adepierre commented Jan 17, 2022

I think you need to put the record and the synchronize inside your loop. Something like that

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
total_time = 0.0
with torch.no_grad():
    resnet.eval()
    for i in range(total_samples):
        start.record()
        out = resnet(x)
        end.record()
        torch.cuda.synchronize()  # wait for the GPU to finish before reading the timer
        total_time += start.elapsed_time(end)  # elapsed_time returns milliseconds

@nssrivathsa
Author

I think you need to put the record and the synchronize inside your loop. Something like that

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
total_time = 0.0
with torch.no_grad():
    resnet.eval()
    for i in range(total_samples):
        start.record()
        out = resnet(x)
        end.record()
        torch.cuda.synchronize()  # wait for the GPU to finish before reading the timer
        total_time += start.elapsed_time(end)  # elapsed_time returns milliseconds

Changed it the way you suggested.
I still get:
Average PyTorch cuda Inference time = 9.747972169876098 ms

@adepierre

OK, so it's probably not that.

edgchen1 added the ep:CUDA (issues related to the CUDA execution provider) and type:performance labels on Jan 18, 2022
@edgchen1
Contributor

Is the input actually on GPU? In the code you provided, I see:
io_binding.bind_cpu_input(input_names, x)

Some other things to try:
ORT has some support for profiling: https://onnxruntime.ai/docs/performance/tune-performance.html#profiling-and-performance-report

You can also use CUDA profiling tools like nvprof to get more detailed GPU usage info.
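
To elaborate on the built-in profiler: it is enabled through SessionOptions, and the session writes a Chrome-trace JSON file when you call end_profiling(). A minimal sketch against the script above (file names are illustrative):

import onnxruntime

so = onnxruntime.SessionOptions()
so.enable_profiling = True  # write a JSON trace viewable in chrome://tracing

ort_session = onnxruntime.InferenceSession(
   "resnet18.onnx", so, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# ... run_with_iobinding / run calls as in the repro script ...

profile_file = ort_session.end_profiling()  # returns the path of the emitted JSON file
print("Profile written to:", profile_file)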

@tianleiwu
Contributor

@nssrivathsa, it is not a fair comparison since the ORT latency includes copying the input tensor from CPU to GPU, while PyTorch's does not. Could you bind a GPU input and measure the latency again?
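
For reference, a minimal sketch of binding an input that already resides on the GPU, using an OrtValue and the IOBinding API from the script above (shapes follow the original report):

import numpy as np
import onnxruntime

ort_session = onnxruntime.InferenceSession(
   "resnet18.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

x = np.random.randn(1, 3, 224, 224).astype(np.float32)
x_gpu = onnxruntime.OrtValue.ortvalue_from_numpy(x, "cuda", 0)  # one-time copy to GPU, outside the timed loop

io_binding = ort_session.io_binding()
io_binding.bind_ortvalue_input(ort_session.get_inputs()[0].name, x_gpu)
io_binding.bind_output(ort_session.get_outputs()[0].name, "cuda")  # keep the output on the GPU

ort_session.run_with_iobinding(io_binding)  # only this call goes inside the timed loop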

@nssrivathsa
Author

Is the input actually on GPU? In the code you provided, I see: io_binding.bind_cpu_input(input_names, x)

Some other things to try: ORT has some support for profiling: https://onnxruntime.ai/docs/performance/tune-performance.html#profiling-and-performance-report

You can also use CUDA profiling tools like nvprof to get more detailed GPU usage info.

@edgchen1

  1. Yes, the input is on the GPU. io_binding.bind_cpu_input puts the input from the CPU to GPU as per 'Scenario 1' in this - http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html under IOBinding. Nevertheless, I have modified it as
data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0)
io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr())

and I still see no difference in the timing

  2. I did go through the profiling stuff from ONNX and generated the runtime JSON file, but could not infer much from it once I opened it in Google Chrome. Is there any help available on how to interpret the results?

@nssrivathsa
Author

@nssrivathsa, it is not a fair comparison since the ORT latency includes copying the input tensor from CPU to GPU, while PyTorch's does not. Could you bind a GPU input and measure the latency again?

@tianleiwu

As you can see from my reply above, io_binding.bind_cpu_input should also do what you are suggesting. Nevertheless, once I modify my code as follows:

data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0)
io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr())

the timings on GPU for 1000 samples are still:

Average onnxruntime cuda Inference time = 44.55 ms
Average PyTorch cuda Inference time = 9.780553787231446 ms

@edgchen1
Contributor

Yes, the input is on the GPU. io_binding.bind_cpu_input puts the input from the CPU to GPU as per 'Scenario 1' in this - http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html under IOBinding.

The official docs are here: https://onnxruntime.ai/docs/api/python/api_summary.html

https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.IOBinding.bind_cpu_input
The input is on CPU so it will need to be copied to the GPU.

I did go through the profiling stuff from onnx and generated the runtime json file but could not infer much from it once I opened it from Google Chrome. Is there any help available on how to interpret the results?

@RandySheriffH is there some documentation about CUDA kernel profiling in ORT?

You can look for host to device memcpy's with nvprof to see whether the input is actually getting copied from CPU to GPU.
https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview
Visual profiler can be used to view the nvprof results.
https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual

@nssrivathsa
Author

I could not find the issue with profiling, as everything seems to take more time.
But I observed that running it on a machine with a more advanced GPU makes onnxruntime faster than PyTorch.
For example, on a Tesla T4 or Tesla V100 GPU, onnxruntime is around 4 times faster than PyTorch.

@tianleiwu
Contributor

@nssrivathsa, any update on this? Were you able to find the cause?
If you have nvprof results for ORT and PyTorch, you can compare the profiling results to see which CUDA kernel or memory copy ORT spends more time on.

@nssrivathsa
Author

@tianleiwu No, I am still not able to find the cause. But, like I mentioned, it does not happen when I run it on more advanced GPUs.

@yunjiangster

@nssrivathsa Even on a Tesla T4, I am seeing a 2x slowdown using ONNX versus PyTorch + fp16.

@nssrivathsa
Author

@yunjiangster I haven't used any quantization yet; the script is the same as what I have shared above, and with that I see improvements when I use it with a Tesla T4 on AWS.

sophies927 added the core runtime (issues related to core runtime) label and removed the type:performance label on Aug 12, 2022
@thomas-beznik

thomas-beznik commented Sep 7, 2022

We're having the same issue with our models - seeing a ~2x slow-down between running our models on GPU with PyTorch vs. with ONNX Runtime. This is very problematic and forces us to search for another solution for putting our models in production... Any help / update on this issue would be greatly appreciated! I'm happy to assist in the debugging if it can help, thanks!

FYI, using the above example seems to work for us though; we are seeing similar speeds between the ONNX and PyTorch models. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations, for example? I can attach the model graph if this can help.

@CanyonWind

Seeing similar problems as well -- we saw inference 2.3x slower with ORT + the CUDA EP compared to native PyTorch. In our case, the model is a large, complicated diffusion model. I have a hunch that it's caused by the exported ONNX graph having too many glue operators.

Could people from Microsoft step in and confirm whether this is the case? Thanks

@thomas-beznik

We're having the same issue with our models - seeing a ~2x slow-down between running our models on GPU with PyTorch vs. with ONNX Runtime. This is very problematic and forces us to search for another solution for putting our models in production... Any help / update on this issue would be greatly appreciated! I'm happy to assist in the debugging if it can help, thanks!

FYI, using the above example seems to work for us though; we are seeing similar speeds between the ONNX and PyTorch models. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations, for example? I can attach the model graph if this can help.

I've made a new issue for our problem as it might not be fully related to this one, since I cannot reproduce the same slow-down with the ResNet model

@tianleiwu
Contributor

tianleiwu commented Sep 27, 2022

@CanyonWind & @thomas-beznik, I recommend the nvprof tool to profile your model. It will tell you which kernels use the most time. For example, in an older version of the Hugging Face stable diffusion pipeline, the UNet model has ScatterND, which is very slow in ORT. It is possible to modify the modeling code to avoid those slow operators.

For a diffusion model with Conv operators, I suggest trying cudnn_conv_use_max_workspace in the CUDA provider options like this. Sometimes it can make a big difference.
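
For example, CUDA execution provider options can be passed as a dict when the session is created; a minimal sketch (option values are illustrative):

import onnxruntime

providers = [
   ("CUDAExecutionProvider", {
      "device_id": 0,
      "cudnn_conv_use_max_workspace": "1",     # allow cuDNN a larger workspace for conv algorithms
      "cudnn_conv_algo_search": "EXHAUSTIVE",  # spend warm-up time searching for the fastest conv algo
   }),
   "CPUExecutionProvider",
]

session = onnxruntime.InferenceSession("model.onnx", providers=providers)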

In my benchmark of the Hugging Face stable diffusion pipeline, the original ONNX FP32 model could be 2x slower. If you combine FP16 and conv algo search, ORT could be 25% faster than PyTorch:

Latency (seconds per query)    T4     V100   A100
PyTorch FP16                   12.8    5.1    3.1
ORT FP32                       28.4   10.1    7.2
ORT FP16                       13.1    6.4    4.3
ORT FP16 + Conv Algo Search     9.6    3.8    2.4
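
As a rough illustration of the FP16 path, a hedged sketch using the onnxconverter-common package (assumes it is installed and that the model's operators tolerate FP16; file names are illustrative):

import onnx
from onnxconverter_common import float16

model = onnx.load("resnet18.onnx")
# keep_io_types keeps graph inputs/outputs in FP32 while internal tensors become FP16
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "resnet18_fp16.onnx")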

@tianleiwu
Contributor

tianleiwu commented Sep 27, 2022

@nssrivathsa, I ran your script on a V100 GPU with PyTorch 1.12.1+cu116, onnxruntime-gpu 1.12.1, the latest CUDA 11.7, and the latest cuDNN 8.5.0.96. Here is the output:
[screenshot: benchmark output]

It seems that ORT is much faster than PyTorch on V100.

Here is the script:

import os
import torch
import onnx
import torchvision.models as models
import onnxruntime
import time
import numpy

batch_size = 128
total_samples = 1000
device = torch.device('cuda:0')

def convert_to_onnx(resnet):
   resnet.eval()
   dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
   input_names = [ 'input' ]
   output_names = [ 'output' ]
   torch.onnx.export(resnet, 
               dummy_input,
               "resnet18.onnx",
               verbose=True,
               opset_version=13,
               input_names=input_names,
               output_names=output_names,
               export_params=True,
               do_constant_folding=True,
               dynamic_axes={
                  'input': {0: 'batch_size'},  # variable length axes
                  'output': {0: 'batch_size'}}
               )

def infer_pytorch(resnet):
   print('Pytorch Inference')
   print('==========================')
   print()

   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)

   latency = []
   for i in range(total_samples):
      t0 = time.time()
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
      latency.append(time.time() - t0)

   print('Number of runs:', len(latency))
   print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))  

def to_numpy(tensor):
   return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
   print('Onnxruntime Inference')
   print('==========================')
   print()

   onnx_model = onnx.load("resnet18.onnx")
   onnx.checker.check_model(onnx_model)

   # Input
   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device).contiguous()

   so = onnxruntime.SessionOptions()
   exproviders = ["CUDAExecutionProvider", "CPUExecutionProvider"]

   model_onnx_path = os.path.join(".", "resnet18.onnx")
   ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

   #IOBinding see https://onnxruntime.ai/docs/api/python/api_summary.html for more info
   input_name = ort_session.get_inputs()[0].name
   output_name = ort_session.get_outputs()[0].name
   io_binding = ort_session.io_binding()

   io_binding.bind_input(
    name=input_name,
    device_type='cuda',
    device_id=0,
    element_type=numpy.float32,
    shape=tuple(x.shape),
    buffer_ptr=x.data_ptr(),
   )

   io_binding.bind_output(output_name, 'cuda')

   #warm up run
   ort_session.run_with_iobinding(io_binding)
   ort_outs = io_binding.copy_outputs_to_cpu()

   latency = []

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
      ort_outs = io_binding.copy_outputs_to_cpu()
   print('Number of runs:', len(latency))
   print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))   

if __name__ == '__main__':
   torch.cuda.empty_cache()
   resnet = (models.resnet18(pretrained=True)).to(device=device)
   convert_to_onnx(resnet)
   infer_onnxruntime()
   infer_pytorch(resnet)

@davidmezzetti

Similar issue described here - #12880 (comment)

@ma-xu

ma-xu commented Jan 27, 2023

@tianleiwu Hi, thanks for the script. However, when the batch size is large, like 128, ONNX is still much slower than PyTorch.

@tianleiwu
Contributor

tianleiwu commented Feb 22, 2023

@ma-xu, in what environment (PyTorch, CUDA, cuDNN version, and GPU) did you find ONNX much slower than PyTorch?

ONNX Runtime and PyTorch both use cuDNN as the backend for convolution, so I would expect the latency to be close for large batch sizes. If you see a significant difference at a large batch size, that usually indicates some integration issue.

I did a test with batch size 128 on V100 with CUDA 11.7, PyTorch 1.13.1+cu117, ORT 1.14.0. ORT is still faster:

Onnxruntime Inference
==========================

Number of runs: 1000
Average onnxruntime cuda Inference time = 24.17 ms
Pytorch Inference
==========================

Number of runs: 1000
Average PyTorch cuda Inference time = 30.39 ms

@ma-xu

ma-xu commented Feb 23, 2023

@tianleiwu Thanks for your reply. It may be caused by some ONNX errors; I uploaded the output here. Hope this can help.

My configuration is:
system_verision_python 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0]
system_verision_pytorch 1.12.1
system_verision_timm 0.6.12
system_verision_cuda 11.3
system_verision_cudnn 8302
system_verision_onnx 1.13.0
system_verision_onnxruntime 1.13.1
system_verision_tensorrt 8.5.2.2
system_gpu_name NVIDIA A100-SXM4-40GB
system_cpu_arch X86_64
system_cpu_brand_raw AMD EPYC 7742 64-Core Processor

benchmark_out.txt

@cminnoy

cminnoy commented Oct 9, 2023

I have a model exported from TensorFlow to ONNX. The ONNX runtime using DirectML is also very slow on the RDNA2 platform. It is about 3 times faster than CPU, but that is still really slow. You can even see that the GPU is not being utilized properly, as the power consumption of the card won't go up.

@HarperIcey

I also encountered this problem when deploying a model. When running inference with onnxruntime-gpu, it is slower than the torch model. Do you have any research progress now? I really need your help, thank you very much!
