Multiple Inference sessions slows down inference speed #5363

Closed
Arjunsankarlal opened this issue Oct 2, 2020 · 14 comments

Arjunsankarlal commented Oct 2, 2020

I am using Ray and ONNX Runtime to serve my deep learning model, so I can process more inputs by effectively utilizing the CPU cores. I have converted my model to ONNX format.

import logging
import ray
import onnxruntime as ort
from time import perf_counter

logger = logging.getLogger(__name__)

@ray.remote
class NNModel:
    def __init__(self):
        try:
            self.ort_session = ort.InferenceSession("path/of/my/model")
            logger.warning('model instantiated!')
            self.warmup_model()
        except Exception as e:
            logger.warning(f"Exception while instantiating model: {e}")

    @ray.method(num_return_vals=1)
    def predict_onnx(self, data):
        input1, input2 = data
        ort_inputs = {
            'input1': input1.cpu().numpy(),
            'input2': input2.cpu().numpy(),
        }
        try:
            st = perf_counter()
            prediction = self.ort_session.run(None, ort_inputs)
            print(f'Time taken for inferencing is {perf_counter()-st}')
            return prediction
        except Exception as e:
            logger.warning(f"Exception while prediction : {e}")
            return None

Each Ray actor runs in a separate process and can be assigned resources; when no resource is specified, an actor takes one CPU core by default. I created a single instance and ran predictions on the same input for testing purposes:

model_instance = NNModel.remote()
print('Enter to start')
x = input()
while True:
    st = perf_counter()
    data = [input1, input2]
    out = model_instance.predict_onnx.remote(data)
    out = ray.get(out)

Here the time taken for a single inference is around 230-300 ms, and the CPU usage is around 350%, which is quite strange because Ray should have assigned a single CPU core to it. It seems ONNX Runtime is sizing its thread pool based on the available CPU resources, but that is not the main problem here. The issue I face is when I create two instances of the actor and run the same script with the following small modifications:

model_instances = [NNModel.remote() for _ in range(0, 2)]
print('Enter to start')
x = input()
while True:
    st = perf_counter()
    data = [input1, input2]
    out = []
    for ins in model_instances:
        out.append(ins.predict_onnx.remote(data))
    for o in out:
        ray.get(o)

Now the inference time for every prediction is roughly double the single-instance time, around 500-600 ms. I have attached two images of the running processes for the single-instance and double-instance cases.

For the single actor model:
[screenshot: Single Actor Model]

For the double actor model:
[screenshot: Double Actor Model]

What I expected is that, since I have created two instances with enough resources, each would be about as fast as before; so why does it take more time when two instances are created? Also, by the way, my system has just 4 cores, and I assume with hyperthreading it is using around 7 cores' worth of CPU computation when running two instances.
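For reference, the CPU request per actor can also be made explicit instead of relying on Ray's default of one CPU; below is a minimal sketch with an illustrative num_cpus value and a dummy actor (not my actual model):

import ray

ray.init()

# Illustrative: explicitly request two CPU cores for this actor,
# instead of relying on Ray's default of one CPU per actor.
@ray.remote(num_cpus=2)
class Dummy:
    def ping(self):
        return "ok"

actor = Dummy.remote()
print(ray.get(actor.ping.remote()))

Depending on the Ray version, the same request can also be overridden per instance, e.g. Dummy.options(num_cpus=1).remote().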

Kindly correct me if my approach or understanding of using multiple runtime sessions is wrong. Also, if you could share some examples, that would be great!

I also tried the plain pytorch model (before converting it to ONNX format), and there I get the expected inference timings: the time for a single prediction is almost the same (around 450 ms) with both single and multiple actor instances. So I guess the slowdown is not caused by Ray but by ONNX Runtime, or I might be doing something wrong. As far as I have searched, multiple inference sessions do seem to be possible. Kindly let me know if there is a better way of achieving this.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS
  • ONNX Runtime installed from (source or binary): Pip installed
  • ONNX Runtime version: 1.4.0
  • Python version: 3.7.6

I also tried with the latest ONNX Runtime version 1.5.1, but I got the following error:

/Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/capi/_pybind_state.py:14: UserWarning: Cannot load onnxruntime.capi. Error: 'dlopen(/Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so, 2): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so
  Reason: image not found'.
  warnings.warn("Cannot load onnxruntime.capi. Error: '{0}'.".format(str(e)))
Traceback (most recent call last):
  File "/Users/arjun/bot/ray_onnx_test.py", line 4, in <module>
    import onnxruntime as ort
  File "/Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/__init__.py", line 13, in <module>
    from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed,
ImportError: cannot import name 'get_all_providers' from 'onnxruntime.capi._pybind_state' (/Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/capi/_pybind_state.py)

Any help would be much appreciated! TIA!

@Arjunsankarlal Arjunsankarlal changed the title Multiple Inference Session slows down prediction speed Multiple Inference sessions slows down inference speed Oct 2, 2020
@RandySheriffH
Contributor

Looks like having two models running in one process doubled the load for onnxruntime, which has only a fixed number of threads available for parallelism...
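If thread over-subscription is the concern, the per-session thread pool can be bounded explicitly through SessionOptions; a minimal sketch, with illustrative thread counts and a placeholder model path:

import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 2   # threads used within a single operator
so.inter_op_num_threads = 1   # threads used across operators (parallel execution mode only)
sess = ort.InferenceSession("path/of/my/model", sess_options=so)

Note that in OpenMP-enabled builds the intra-op pool may instead be governed by the OMP_NUM_THREADS environment variable.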
For the errors regarding 1.5.1, did you do a complete uninstallation?

@Arjunsankarlal
Author

Yes, I tried a complete uninstallation and reinstalled with the --no-cache option. Here the two models are started in different processes. This is the output when two models are instantiated:

(pid=31693) Time taken for inferencing is 0.5958569639999993
(pid=31687) Time taken for inferencing is 0.9855443850000007
(pid=31693) Time taken for inferencing is 1.0075930720000006
(pid=31687) Time taken for inferencing is 0.6420220040000011
(pid=31693) Time taken for inferencing is 0.6305713330000007
(pid=31687) Time taken for inferencing is 0.6966333460000005
(pid=31693) Time taken for inferencing is 0.6890254020000004

Also, in Activity Monitor we can see the two processes utilizing 350%+ CPU, but I am not sure why the memory taken up by each process is 550 MB when the model size is only 265 MB. Also, this timing issue does not happen while using the plain torch model.

@tianleiwu
Contributor

tianleiwu commented Oct 2, 2020

onnxruntime 1.5.1 on macOS requires installing OpenMP. That's a new dependency; installing it should resolve the import issue (similar to issue #5344).
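If Homebrew is available, installing that dependency is typically just a matter of running brew install libomp, which matches the /usr/local/opt/libomp path referenced in the error above.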

@RandySheriffH RandySheriffH self-assigned this Oct 5, 2020
@RandySheriffH
Contributor

RandySheriffH commented Oct 8, 2020

@Arjunsankarlal
I am playing with your case - having a class host an ORT session and exposing a member function as the outlet for Ray to parallelize. It seems the two Ray actors are somehow synchronized. Would you mind sharing your script for the pytorch actors?

@Arjunsankarlal
Author

Hi @RandySheriff, I get your idea of instantiating the model at the function level, but I can't do the instantiation on every call; that would be too costly, since the model takes some 3-4 seconds to load. Also, I think I wasn't clear in explaining what my problem was. While using a single instance, I was calling it with the same input only after I got the output for the previous request. Likewise for two instances, I call predict_onnx once on both models and send the next requests only after both complete. So at any given time each inference session has to process only one input, and each of them has its own resources to work with. I will share some detailed stats and maybe an example in Colab for you to try in some time.

@RandySheriffH
Contributor

RandySheriffH commented Oct 8, 2020

@Arjunsankarlal: now I have some statistics - the Ray actors with an ORT session do exhibit some non-deterministic parallelism:

Round 1:
done with 1 models in 0.109375 secs
done with 2 models in 0.015625 secs
done with 3 models in 0.015625 secs

Round 2:
done with 1 models in 0.046875 secs
done with 2 models in 0.03125 secs
done with 3 models in 0.015625 secs

import time
import cv2
import numpy as np
import ray
import onnxruntime as ort

input_tensor = np.asarray(cv2.resize(cv2.imread('image1.jpg'), (416,416))).reshape((1,416,416,3)).astype(np.float32)

def get_input_tensor():
    input_tensor[0][0][0][0] += 1.23e-3
    return input_tensor

@ray.remote
class YoloV4:
    def __init__(self):
        self.sess = ort.InferenceSession('./yolov4.onnx')
    @ray.method(num_returns=1)
    def infer(self, tensor):
        return self.sess.run(None, {'input_1:0': tensor})

def test(num_of_models):
    models = [YoloV4.remote() for _ in range(num_of_models)]
    start_at = time.process_time()
    futures = [model.infer.remote(get_input_tensor()) for model in models]
    ray.get(futures)
    print ('done with', len(futures), 'models in', time.process_time() - start_at, 'secs')

ray.init()
test(1)
test(2)
test(3)
input('done, enter to exit...')

From this test, at least we know that an ORT session can go with Ray for better distribution; it is just that the extent to which the different processes are parallelized varies with many factors.

@Arjunsankarlal
Author

Arjunsankarlal commented Oct 9, 2020

@RandySheriff Well, I guess the time difference between predictions could be because of model loading time: predictions start early and wait for the model to finish loading. Try running a larger number of predictions to get an average range.
Anyway, you can try the example I have shared below; please start the predictions once the models are instantiated and warmed up, so that you see stable numbers. The configuration for the torch/onnx switch, CPU cores, and model instances can be found at the top of the file. Sorry if the example code is bad! 😅

Please find the example code here and the ONNX conversion for the model here. Also, I would suggest running on your own device, since Colab by default gives you a two-core CPU.

And regarding the stats,
MacBook Pro (2.9 GHz Intel Core i7, 16 GB RAM):

Torch Stats (times are ranges across predictions):
One Model, One CPU: 0.79 - 0.90 secs
One Model, Four CPU: 0.38 - 0.42 secs
Two Models, One CPU each: 0.82 - 1.4 secs
Two Models, Two CPU each: 0.58 - 0.70 secs
Two Models, Four CPU each: 0.62 - 0.72 secs

ONNX Stats:
One Model, One CPU: 0.64 - 0.80 secs
One Model, Four CPU: 0.25 - 0.28 secs
Two Models, One CPU each: 0.68 - 0.88 secs
Two Models, Four CPU each: 0.47 - 0.60 secs

Let me know if you need any clarifications! Thanks!

@RandySheriffH
Contributor

RandySheriffH commented Oct 9, 2020

@Arjunsankarlal

ONNX Stats:
One Model, One CPU: 0.64 - 0.80 secs
One Model, Four CPU: 0.25 - 0.28 secs
Two Models, One CPU each: 0.68 - 0.88 secs
Two Models, Four CPU each: 0.47 - 0.60 secs

So does this show parallelized inferencing for ONNX models with Ray?

@RandySheriffH
Contributor

@Arjunsankarlal:
Please correct me if I misunderstood your statistics - indeed, ONNX sessions can run inference with parallelism in Ray, and there is no notable difference from the pytorch models.

@RandySheriffH
Contributor

Closing the issue, but feel free to come back with more details.

@chetan-bhat

@Arjunsankarlal , were you able to identify the issue? I'm facing the same problem.

I need to have two onnx runtime sessions, one for each type of inference (let's say one is a pytorch model that predicts text sentiment and the other is a pytorch model that predicts text class from a multi-class model). If I load only one session at a time and run inference, my inference times for each are, on average, x and y. But if I load both sessions and run inferences when both are loaded, my inference times for each are, on average, 2x and 2y.

Why should loading the 2nd runtime session double the inference time of the first? Any clarity is appreciated.

@thomas-happify

@chetan-bhat @Arjunsankarlal I'm having the exact same problem. Did you figure it out?
I appreciate your help.

@hy846130226

It's very strange; it seems like in ONNX Runtime the sessions are not independent of each other.

@tianleiwu
Contributor

@hy846130226, two sessions compete for resources (CPU cores, memory, etc.) of the same OS. You might set CPU affinity for each process (session) to pin them to different CPU cores (and configure the number of threads in the session options appropriately), and plan memory usage as well; see the sketch below.
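A minimal sketch of that idea, assuming a Linux host (os.sched_setaffinity is not available on macOS); the core IDs, thread count, and model path are illustrative:

import os
import onnxruntime as ort

# Pin this process (and the session threads it spawns) to two specific cores.
# sched_setaffinity is Linux-only; other platforms would need taskset or psutil.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0, 1})

# Keep the session's thread pool within the pinned cores.
so = ort.SessionOptions()
so.intra_op_num_threads = 2
sess = ort.InferenceSession("path/of/my/model", sess_options=so)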
