Multiple inference sessions slow down inference speed #5363
It looks like having two models running in one process doubles the load on onnxruntime, which only has a fixed number of threads available for parallelism...
Yes, I tried a complete uninstallation and reinstalled with the --no-cache flag. Here the two models are started in different processes. This is the output when the two models are instantiated.
Also, in Activity Monitor we could see two processes utilizing 350%+ CPU, but I am not sure why the memory taken up by that process is 550 MB, because the model size is only 265 MB. Also, this timing issue does not happen while using the plain torch model.
onnxruntime 1.5.1 on macOS needs OpenMP to be installed. That's a new dependency, and installing it should resolve the import issue (similar to issue #5344).
@Arjunsankarlal
Hi @RandySheriff, I understand your idea of instantiating the model at the function level, but I can't do the instantiation for every call; that would be costly, as the model takes some 3-4 seconds to load. Also, I think I wasn't clear in explaining what my problem was. While using a single instance, I was calling it with the same input only after I got the output for the previous request. Likewise, with two instances, I call predict_onnx once for both models, and only after both complete do I send the next requests. So at a given time a single inference session has to process only one input, and each of them has its own resources to work on. I will share some detailed stats, and maybe an example in Colab for you to try, in some time.
@Arjunsankarlal: now I have some statistics; the Ray actors with ORT sessions do exhibit some non-deterministic parallelism. Round 1 and Round 2 timing results:
From this test, at least we know that ORT sessions can go with Ray for better distribution; it is just that the extent to which different processes are parallelized varies with many factors.
@RandySheriff Well, I guess the time difference between predictions could be because of model loading time: predictions start earlier and wait for the model to finish loading, so try running a larger number of predictions and averaging over them. Please find the example code here and the ONNX conversion for the model in the example here. Also, I would suggest you run it on your own device, since in Colab you get only a two-core CPU by default. Regarding the stats: Torch stats: ONNX stats: Let me know if you need any clarification! Thanks!
So does it show parallelized inference for ONNX models with Ray?
@Arjunsankarlal: |
Closing the issue, but you're welcome to come back with more details.
@Arjunsankarlal, were you able to identify the issue? I'm facing the same problem. I need two ONNX Runtime sessions, one for each type of inference (say, one is a PyTorch model that predicts text sentiment and the other is a PyTorch model that predicts a text class from a multi-class model). If I load only one session at a time and run inference, my inference times for each are, on average, x and y. But if I load both sessions and run inference while both are loaded, my inference times for each are, on average, 2x and 2y. Why should loading the second runtime session double the inference time of the first? Any clarity is appreciated.
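For reference, a minimal sketch of that two-session scenario; the model file names, input name, and shape below are placeholders, not the actual models in question:

```python
import time
import numpy as np
import onnxruntime as ort

# Placeholder model paths; substitute your own exported models.
sess_a = ort.InferenceSession("sentiment.onnx")
sess_b = ort.InferenceSession("classifier.onnx")

# Placeholder feed: adjust the input name and shape to match your models.
feed = {"input_ids": np.random.randint(0, 1000, (1, 128), dtype=np.int64)}

def avg_latency(sess, feed, n=50):
    """Average per-call latency over n runs."""
    start = time.perf_counter()
    for _ in range(n):
        sess.run(None, feed)
    return (time.perf_counter() - start) / n

# Compare these numbers against the single-session case to see the slowdown.
print("session A:", avg_latency(sess_a, feed))
print("session B:", avg_latency(sess_b, feed))
```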
@chetan-bhat @Arjunsankarlal I'm having the exact same problem. Did you figure it out?
It's very strange; it seems that in ONNX Runtime, the sessions are not independent of one another.
@hy846130226, two sessions compete for the resources (CPU cores, memory, etc.) of the same OS. You could set CPU affinity for each process (session) to pin them to different CPU cores (and configure the number of threads in the session options appropriately), and plan memory usage as well.
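A rough sketch of that suggestion, assuming Linux or Windows (psutil's cpu_affinity is not available on macOS) and a placeholder model path:

```python
import psutil
import onnxruntime as ort

# Pin this process (and hence this session) to two specific cores.
# Core indices are illustrative; pick a disjoint set for each process.
psutil.Process().cpu_affinity([0, 1])

opts = ort.SessionOptions()
opts.intra_op_num_threads = 2   # threads used to parallelize within an operator
opts.inter_op_num_threads = 1   # threads used to run independent operators

sess = ort.InferenceSession("model.onnx", sess_options=opts)
```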
I am using Ray and ONNX Runtime for serving my deep learning model, to process more inputs by effectively utilizing the CPU cores. I have converted my model to the ONNX format.
A Ray actor runs in a separate process and can be assigned resources; when no resources are specified, it takes one CPU core by default. I created a single instance and ran predictions on the same input for testing purposes, roughly as in the sketch below.
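A minimal sketch of this kind of single-actor setup (the model path, input shape, and class name below are placeholders, not the exact code from this issue):

```python
import numpy as np
import onnxruntime as ort
import ray

ray.init()

@ray.remote(num_cpus=1)
class OnnxPredictor:
    def __init__(self, model_path):
        # Each actor process holds its own inference session.
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def predict_onnx(self, batch):
        # Runs one forward pass and returns all model outputs.
        return self.session.run(None, {self.input_name: batch})

actor = OnnxPredictor.remote("model.onnx")
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = ray.get(actor.predict_onnx.remote(batch))
```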
Here the time taken for a single inference is around 230-300 ms and the CPU usage is around 350%, which is quite odd because Ray should have assigned a single CPU to it. It seems like ONNX Runtime decides based on the available CPU resources, but that is not the main problem here. The issue I face is when I create two instances of the actor and run the same script with small modifications, roughly as sketched below.
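A sketch of the two-instance modification, continuing the hypothetical OnnxPredictor class from the sketch above:

```python
# Two actors, each given one CPU by Ray's scheduler; the next pair of
# requests is sent only after both results come back.
actor_1 = OnnxPredictor.remote("model.onnx")
actor_2 = OnnxPredictor.remote("model.onnx")

futures = [actor_1.predict_onnx.remote(batch),
           actor_2.predict_onnx.remote(batch)]
results = ray.get(futures)  # block until both predictions finish
```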
Now the inference time for every prediction is roughly double the single-instance time, around 500-600 ms. I have attached two images of the processes running for the single-instance and double-instance cases.
For the single-actor model:
For the double-actor model:
What I don't understand is this: since I have created two instances with enough resources, why does it take more time when two instances are created? Also, by the way, my system has just 4 cores, and I assume that with hyperthreading it is using around 7 cores' worth of CPU computation when running two instances.
Kindly correct me if I am wrong in my approach to, or understanding of, using multiple runtime sessions. Also, if you could share some examples, that would be great!
I also tried with the plain PyTorch model before converting it to the ONNX format, and there I got the expected inference timings: the inference time for a single prediction is almost the same (around 450 ms) when using both single and multiple actor instances. So I guess the problem is not because of Ray but because of ONNX Runtime, or I might be doing something wrong. But as far as I have searched, it seems like multiple inference sessions are possible. Kindly let me know if there is a better way of achieving this.
System information
And I tried with the latest ONNX Runtime version 1.5.1, but I got the following error,
Any help would be much appreciated! TIA!