Multiple Inference sessions slows down inference speed #5363

Closed
Arjunsankarlal opened this issue Oct 2, 2020 · 14 comments

Arjunsankarlal commented Oct 2, 2020

I am using Ray and ONNX Runtime to serve my deep learning model, so I can process more inputs by effectively utilizing the CPU cores. I have converted my model to ONNX format.

import logging
import ray
import onnxruntime as ort
from time import perf_counter

logger = logging.getLogger(__name__)

@ray.remote
class NNModel:
    def __init__(self):
        try:
            self.ort_session = ort.InferenceSession("path/of/my/model")
            logger.warning('model instantiated!')
            self.warmup_model()
        except Exception as e:
            logger.warning(f"Exception while instantiating model: {e}")

    @ray.method(num_return_vals=1)
    def predict_onnx(self, data):
        input1, input2 = data
        ort_inputs = {
            'input1': input1.cpu().numpy(),
            'input2': input2.cpu().numpy(),
        }
        try:
            st = perf_counter()
            prediction = self.ort_session.run(None, ort_inputs)
            print(f'Time taken for inferencing is {perf_counter()-st}')
            return prediction
        except Exception as e:
            logger.warning(f"Exception while prediction : {e}")
            return None

Each Ray actor runs in a separate process and can be assigned resources; when no resource is specified, an actor takes one CPU core by default. I created a single instance and ran predictions on the same input for testing purposes:

model_instance = NNModel.remote()
print('Enter to start')
x = input()
while True:
    st = perf_counter()
    data = [input1, input2]
    out = model_instance.predict_onnx.remote(data)
    out = ray.get(out)

Here the time taken for a single inference is around 230-300 ms, and the CPU usage is around 350%, which is quite strange because Ray should have assigned a single CPU core to it. It seems ONNX Runtime is sizing its thread pool based on the available CPU resources, but that is not the main problem here. The issue I face is when I create two instances of the actor and run the same script with the following small modifications:

model_instances = [NNModel.remote() for _ in range(0, 2)]
print('Enter to start')
x = input()
while True:
    st = perf_counter()
    data = [input1, input2]
    out = []
    for ins in model_instances:
        out.append(ins.predict_onnx.remote(data))
    for o in out:
        ray.get(o)

Now the inference time for every prediction is roughly double the single-instance time, around 500-600 ms. I have attached two images of the running processes for the single-instance and double-instance cases.

For the single actor model:
[screenshot: Single Actor Model]

For the double actor model:
[screenshot: Double Actor Model]

What I expected is that, since I have created two instances with enough resources, each would be about as fast as before; so why does it take more time when two instances are created? Also, by the way, my system has just 4 cores, and I assume with hyperthreading it is using around 7 cores' worth of CPU computation when running two instances.
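For reference, the CPU request per actor can also be made explicit instead of relying on Ray's default of one CPU; below is a minimal sketch with an illustrative num_cpus value and a dummy actor (not my actual model):

import ray

ray.init()

# Illustrative: explicitly request two CPU cores for this actor,
# instead of relying on Ray's default of one CPU per actor.
@ray.remote(num_cpus=2)
class Dummy:
    def ping(self):
        return "ok"

actor = Dummy.remote()
print(ray.get(actor.ping.remote()))

Depending on the Ray version, the same request can also be overridden per instance, e.g. Dummy.options(num_cpus=1).remote().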

Kindly correct me if my approach or understanding of using multiple runtime sessions is wrong. Also, if you could share some examples, that would be great!

I also tried the plain pytorch model (before converting it to ONNX format), and there I get the expected inference timings: the time for a single prediction is almost the same (around 450 ms) with both single and multiple actor instances. So I guess the slowdown is not caused by Ray but by ONNX Runtime, or I might be doing something wrong. As far as I have searched, multiple inference sessions do seem to be possible. Kindly let me know if there is a better way of achieving this.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS
  • ONNX Runtime installed from (source or binary): Pip installed
  • ONNX Runtime version: 1.4.0
  • Python version: 3.7.6

I also tried with the latest ONNX Runtime version 1.5.1, but I got the following error:

/Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/capi/_pybind_state.py:14: UserWarning: Cannot load onnxruntime.capi. Error: 'dlopen(/Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so, 2): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so
  Reason: image not found'.
  warnings.warn("Cannot load onnxruntime.capi. Error: '{0}'.".format(str(e)))
Traceback (most recent call last):
  File "/Users/arjun/bot/ray_onnx_test.py", line 4, in <module>
    import onnxruntime as ort
  File "/Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/__init__.py", line 13, in <module>
    from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed,
ImportError: cannot import name 'get_all_providers' from 'onnxruntime.capi._pybind_state' (/Users/arjun/testenv/lib/python3.7/site-packages/onnxruntime/capi/_pybind_state.py)

Any help would be much appreciated! TIA!

@Arjunsankarlal Arjunsankarlal changed the title Multiple Inference Session slows down prediction speed Multiple Inference sessions slows down inference speed Oct 2, 2020
@RandySheriffH
Contributor

Looks like having two models running in one process doubled the load for onnxruntime, which has only a fixed number of threads available for parallelism...
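If thread over-subscription is the concern, the per-session thread pool can be bounded explicitly through SessionOptions; a minimal sketch, with illustrative thread counts and a placeholder model path:

import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 2   # threads used within a single operator
so.inter_op_num_threads = 1   # threads used across operators (parallel execution mode only)
sess = ort.InferenceSession("path/of/my/model", sess_options=so)

Note that in OpenMP-enabled builds the intra-op pool may instead be governed by the OMP_NUM_THREADS environment variable.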
For the errors regarding 1.5.1, did you do a complete uninstallation?

@Arjunsankarlal
Author

Yes, I tried a complete uninstallation and reinstalled with the --no-cache option. Here the two models are started in different processes. This is the output when two models are instantiated:

(pid=31693) Time taken for inferencing is 0.5958569639999993
(pid=31687) Time taken for inferencing is 0.9855443850000007
(pid=31693) Time taken for inferencing is 1.0075930720000006
(pid=31687) Time taken for inferencing is 0.6420220040000011
(pid=31693) Time taken for inferencing is 0.6305713330000007
(pid=31687) Time taken for inferencing is 0.6966333460000005
(pid=31693) Time taken for inferencing is 0.6890254020000004

Also, in Activity Monitor we can see the two processes utilizing 350%+ CPU, but I am not sure why the memory taken up by each process is 550 MB when the model size is only 265 MB. Also, this timing issue does not happen while using the plain torch model.

@tianleiwu
Contributor

tianleiwu commented Oct 2, 2020

onnxruntime 1.5.1 on macOS requires installing OpenMP. That's a new dependency; installing it should resolve the import issue (similar to issue #5344).
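If Homebrew is available, installing that dependency is typically just a matter of running brew install libomp, which matches the /usr/local/opt/libomp path referenced in the error above.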

@RandySheriffH RandySheriffH self-assigned this Oct 5, 2020
@RandySheriffH
Contributor

RandySheriffH commented Oct 8, 2020

@Arjunsankarlal
I am playing with your case - having a class host an ORT session and exposing a member function as the outlet for Ray to parallelize. It seems the two Ray actors are somehow synchronized. Would you mind sharing your script for the pytorch actors?

@Arjunsankarlal
Author

Hi @RandySheriff, I get your idea of instantiating the model at the function level, but I can't do the instantiation on every call; that would be too costly, since the model takes some 3-4 seconds to load. Also, I think I wasn't clear in explaining what my problem was. While using a single instance, I was calling it with the same input only after I got the output for the previous request. Likewise for two instances, I call predict_onnx once on both models and send the next requests only after both complete. So at any given time each inference session has to process only one input, and each of them has its own resources to work with. I will share some detailed stats and maybe an example in Colab for you to try in some time.

@RandySheriffH
Contributor

RandySheriffH commented Oct 8, 2020

@Arjunsankarlal: now I have some statistics - the Ray actors with an ORT session do exhibit some non-deterministic parallelism:

Round 1:
done with 1 models in 0.109375 secs
done with 2 models in 0.015625 secs
done with 3 models in 0.015625 secs

Round 2:
done with 1 models in 0.046875 secs
done with 2 models in 0.03125 secs
done with 3 models in 0.015625 secs

import time
import cv2
import numpy as np
import ray
import onnxruntime as ort

input_tensor = np.asarray(cv2.resize(cv2.imread('image1.jpg'), (416,416))).reshape((1,416,416,3)).astype(np.float32)

def get_input_tensor():
    input_tensor[0][0][0][0] += 1.23e-3
    return input_tensor

@ray.remote
class YoloV4:
    def __init__(self):
        self.sess = ort.InferenceSession('./yolov4.onnx')
    @ray.method(num_returns=1)
    def infer(self, tensor):
        return self.sess.run(None, {'input_1:0': tensor})

def test(num_of_models):
    models = [YoloV4.remote() for _ in range(num_of_models)]
    start_at = time.process_time()
    futures = [model.infer.remote(get_input_tensor()) for model in models]
    ray.get(futures)
    print ('done with', len(futures), 'models in', time.process_time() - start_at, 'secs')

ray.init()
test(1)
test(2)
test(3)
input('done, enter to exit...')

From this test, at least we know that an ORT session can go with Ray for better distribution; it is just that the extent to which the different processes are parallelized varies with many factors.

@Arjunsankarlal
Author

Arjunsankarlal commented Oct 9, 2020

@RandySheriff Well, I guess the time difference between predictions could be because of model loading time: predictions start early and wait for the model to finish loading. Try running a larger number of predictions to get an average range.
Anyway, you can try the example I have shared below; please start the predictions once the models are instantiated and warmed up, so that you see stable numbers. The configuration for the torch/onnx switch, CPU cores, and model instances can be found at the top of the file. Sorry if the example code is bad! 😅

Please find the example code here and the ONNX conversion for the model here. Also, I would suggest running on your own device, since Colab by default gives you a two-core CPU.

And regarding the stats,
MacBook Pro (2.9 GHz Intel Core i7, 16 GB RAM):

Torch Stats (times are ranges across predictions):
One Model, One CPU: 0.79 - 0.90 secs
One Model, Four CPU: 0.38 - 0.42 secs
Two Models, One CPU each: 0.82 - 1.4 secs
Two Models, Two CPU each: 0.58 - 0.70 secs
Two Models, Four CPU each: 0.62 - 0.72 secs

ONNX Stats:
One Model, One CPU: 0.64 - 0.80 secs
One Model, Four CPU: 0.25 - 0.28 secs
Two Models, One CPU each: 0.68 - 0.88 secs
Two Models, Four CPU each: 0.47 - 0.60 secs

Let me know if you need any clarifications! Thanks!

@RandySheriffH
Contributor

RandySheriffH commented Oct 9, 2020

@Arjunsankarlal

ONNX Stats:
One Model, One CPU: 0.64 - 0.80 secs
One Model, Four CPU: 0.25 - 0.28 secs
Two Models, One CPU each: 0.68 - 0.88 secs
Two Models, Four CPU each: 0.47 - 0.60 secs

So does this show parallelized inferencing for ONNX models with Ray?

@RandySheriffH
Contributor

@Arjunsankarlal:
Please correct me if I misunderstood your statistics - indeed, ONNX sessions can run inference with parallelism in Ray, and there is no notable difference from the pytorch models.

@RandySheriffH
Contributor

Closing the issue, but feel free to come back with more details.

@chetan-bhat

@Arjunsankarlal , were you able to identify the issue? I'm facing the same problem.

I need to have two onnx runtime sessions, one for each type of inference (let's say one is a pytorch model that predicts text sentiment and the other is a pytorch model that predicts text class from a multi-class model). If I load only one session at a time and run inference, my inference times for each are, on average, x and y. But if I load both sessions and run inferences when both are loaded, my inference times for each are, on average, 2x and 2y.

Why should loading the 2nd runtime session double the inference time of the first? Any clarity is appreciated.

@thomas-happify

@chetan-bhat @Arjunsankarlal I'm having the exact same problem. Did you figure it out?
I appreciate your help.

@hy846130226

It's very strange; it seems like in ONNX Runtime the sessions are not independent of each other.

@tianleiwu
Contributor

@hy846130226, two sessions compete for resources (CPU cores, memory, etc.) of the same OS. You might set CPU affinity for each process (session) to pin them to different CPU cores (and configure the number of threads in the session options appropriately), and plan memory usage as well; see the sketch below.
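A minimal sketch of that idea, assuming a Linux host (os.sched_setaffinity is not available on macOS); the core IDs, thread count, and model path are illustrative:

import os
import onnxruntime as ort

# Pin this process (and the session threads it spawns) to two specific cores.
# sched_setaffinity is Linux-only; other platforms would need taskset or psutil.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0, 1})

# Keep the session's thread pool within the pinned cores.
so = ort.SessionOptions()
so.intra_op_num_threads = 2
sess = ort.InferenceSession("path/of/my/model", sess_options=so)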
