ML server takes too much time to predict: what should be done in production and in the API? #2338
-
My ML prediction server receives about 10,000 data points in one request and, after prediction, creates a plot. Producing the final result takes about 10 minutes. How can I handle this long response time between the server and the client without causing problems? I want to implement my API in a good way. Note: I know I can use a timeout to handle this, but if you have another solution, please let me know.
Replies: 2 comments
-
Hi @aliebrahiiimi - it sounds like the prediction task itself is taking a significant amount of time, and if you also require low latency for this use case, the only solution is to scale your inference workload vertically or horizontally.

Vertically: consider using a more powerful host server with more CPU cores or a GPU.

Horizontally: BentoML & Yatai deployment allows users to create an auto-scaling deployment on Kubernetes that runs many replicas of the model server workers to handle large-scale workloads. However, it is mainly designed for handling a large number of requests, not one prediction request containing a large number of prediction tasks. I think this is a very interesting scenario where ideally BentoML should split the large batch input into smaller batches and run them on multiple workers. This is not currently on our roadmap, but it seems like something we could easily support in Yatai's Kubernetes deployment.

For now, a workaround I'd suggest is splitting your 10,000 data points across multiple prediction requests. For example, you can send 10 prediction requests, each with 1,000 data points. If you set up the client to send all 10 requests at the same time and run 10 model server replicas on the server side, the latency could potentially drop to about 1 minute. You would, however, need to aggregate the results on the client side for plotting; a rough sketch of that client-side pattern is below.
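A minimal client-side sketch of that workaround, assuming a single `/predict` HTTP endpoint that accepts a JSON body like `{"data": [...]}` and returns `{"predictions": [...]}` (the URL, payload shape, and response shape are assumptions; adjust them to your actual BentoML service):

```python
# Split the 10,000 data points into chunks of 1,000, send the prediction
# requests concurrently, and flatten the per-chunk results back into one
# list for plotting on the client side.
import concurrent.futures

import requests

PREDICT_URL = "http://localhost:3000/predict"  # assumed endpoint URL
CHUNK_SIZE = 1000


def chunked(points, size):
    """Yield consecutive slices of `points` with at most `size` items each."""
    for i in range(0, len(points), size):
        yield points[i:i + size]


def predict_chunk(chunk):
    """Send one chunk to the model server and return its predictions."""
    resp = requests.post(PREDICT_URL, json={"data": chunk}, timeout=600)
    resp.raise_for_status()
    return resp.json()["predictions"]  # assumed response shape


def predict_all(points):
    """Fan the chunks out over a thread pool, preserving input order."""
    chunks = list(chunked(points, CHUNK_SIZE))
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        per_chunk_results = list(pool.map(predict_chunk, chunks))
    # Flatten per-chunk results back into a single list for plotting.
    return [pred for chunk_result in per_chunk_results for pred in chunk_result]
```

With 10 model server replicas behind the endpoint, the 10 concurrent requests can be served in parallel, which is where the potential latency reduction comes from.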
-
Thank you @parano, I handled it with a 2-step API.
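For readers landing here later: the thread does not describe the actual implementation, but a "2-step API" typically means the client submits the data and immediately gets back a job id, then polls a second endpoint for the finished result. A minimal sketch of that pattern, using FastAPI purely as an illustration (the endpoint paths, the `model_predict_and_plot` stub, and the in-memory job store are all placeholders):

```python
# Step 1: POST /predict accepts the data and returns a job id right away.
# Step 2: GET /predict/{job_id} lets the client poll until the work is done.
import uuid

from fastapi import BackgroundTasks, FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}  # job_id -> {"status": ..., "result": ...}


class PredictRequest(BaseModel):
    data: list[list[float]]  # assumed input shape for the 10,000 data points


def model_predict_and_plot(data):
    """Placeholder for the real model prediction + plotting step."""
    return {"n_points": len(data)}


def run_prediction(job_id: str, data: list[list[float]]) -> None:
    """Long-running work: run the model, build the plot, store the result."""
    try:
        result = model_predict_and_plot(data)
        jobs[job_id] = {"status": "done", "result": result}
    except Exception as exc:  # keep failures visible to the polling client
        jobs[job_id] = {"status": "failed", "result": str(exc)}


@app.post("/predict")
def submit(req: PredictRequest, background_tasks: BackgroundTasks):
    """Step 1: accept the data, kick off the work, return a job id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "pending", "result": None}
    background_tasks.add_task(run_prediction, job_id, req.data)
    return {"job_id": job_id}


@app.get("/predict/{job_id}")
def poll(job_id: str):
    """Step 2: the client polls until status becomes 'done' or 'failed'."""
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail="unknown job id")
    return jobs[job_id]
```

In a real deployment the in-memory dict and in-process background task would usually be replaced by a task queue and a persistent job store, so results survive restarts and multiple server replicas can share state.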