ML server takes too much time to predict: what should be done in production and in the API? #2338
-
My ML prediction server receives about 10,000 data points in one request and, after prediction, creates a plot. Producing the final result takes about 10 minutes. How can I handle this long response time between the server and the client without causing problems? I want to implement my API in a good way. Note: I know I can use a timeout to handle this, but if you have another solution, please let me know.
Replies: 2 comments
-
Hi @aliebrahiiimi - it sounds like the prediction task itself is taking a significant amount of time, and if you also require low latency for this use case, the only solution is to scale your inference workload vertically or horizontally.

Vertically: consider using a more powerful host server with more CPU cores or a GPU.

Horizontally: BentoML & Yatai deployment allows users to create an auto-scaling deployment on Kubernetes that runs many replicas of the model server workers to handle large-scale workloads. However, it is mainly designed for handling a large number of requests, not one prediction request containing a large number of prediction tasks. I think this is a very interesting scenario where ideally BentoML should split the large batch input into smaller batches and run them on multiple workers. This is not currently on our roadmap, but it seems like something we could easily support in Yatai's Kubernetes deployment.

For now, a workaround I'd suggest is splitting your 10,000 data points across multiple prediction requests. For example, you can send 10 prediction requests, each with 1,000 data points. If you set up the client to send all 10 requests at the same time and run 10 model server replicas on the server side, the latency could potentially drop to about 1 minute. You would, however, need to aggregate the results on the client side for plotting; a rough sketch of that client-side pattern is below.
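A minimal client-side sketch of that workaround, assuming a single `/predict` HTTP endpoint that accepts a JSON body like `{"data": [...]}` and returns `{"predictions": [...]}` (the URL, payload shape, and response shape are assumptions; adjust them to your actual BentoML service):

```python
# Split the 10,000 data points into chunks of 1,000, send the prediction
# requests concurrently, and flatten the per-chunk results back into one
# list for plotting on the client side.
import concurrent.futures

import requests

PREDICT_URL = "http://localhost:3000/predict"  # assumed endpoint URL
CHUNK_SIZE = 1000


def chunked(points, size):
    """Yield consecutive slices of `points` with at most `size` items each."""
    for i in range(0, len(points), size):
        yield points[i:i + size]


def predict_chunk(chunk):
    """Send one chunk to the model server and return its predictions."""
    resp = requests.post(PREDICT_URL, json={"data": chunk}, timeout=600)
    resp.raise_for_status()
    return resp.json()["predictions"]  # assumed response shape


def predict_all(points):
    """Fan the chunks out over a thread pool, preserving input order."""
    chunks = list(chunked(points, CHUNK_SIZE))
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        per_chunk_results = list(pool.map(predict_chunk, chunks))
    # Flatten per-chunk results back into a single list for plotting.
    return [pred for chunk_result in per_chunk_results for pred in chunk_result]
```

With 10 model server replicas behind the endpoint, the 10 concurrent requests can be served in parallel, which is where the potential latency reduction comes from.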
-
Thank you @parano, I handled it with a 2-step API.
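For readers landing here later: the thread does not describe the actual implementation, but a "2-step API" typically means the client submits the data and immediately gets back a job id, then polls a second endpoint for the finished result. A minimal sketch of that pattern, using FastAPI purely as an illustration (the endpoint paths, the `model_predict_and_plot` stub, and the in-memory job store are all placeholders):

```python
# Step 1: POST /predict accepts the data and returns a job id right away.
# Step 2: GET /predict/{job_id} lets the client poll until the work is done.
import uuid

from fastapi import BackgroundTasks, FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}  # job_id -> {"status": ..., "result": ...}


class PredictRequest(BaseModel):
    data: list[list[float]]  # assumed input shape for the 10,000 data points


def model_predict_and_plot(data):
    """Placeholder for the real model prediction + plotting step."""
    return {"n_points": len(data)}


def run_prediction(job_id: str, data: list[list[float]]) -> None:
    """Long-running work: run the model, build the plot, store the result."""
    try:
        result = model_predict_and_plot(data)
        jobs[job_id] = {"status": "done", "result": result}
    except Exception as exc:  # keep failures visible to the polling client
        jobs[job_id] = {"status": "failed", "result": str(exc)}


@app.post("/predict")
def submit(req: PredictRequest, background_tasks: BackgroundTasks):
    """Step 1: accept the data, kick off the work, return a job id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "pending", "result": None}
    background_tasks.add_task(run_prediction, job_id, req.data)
    return {"job_id": job_id}


@app.get("/predict/{job_id}")
def poll(job_id: str):
    """Step 2: the client polls until status becomes 'done' or 'failed'."""
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail="unknown job id")
    return jobs[job_id]
```

In a real deployment the in-memory dict and in-process background task would usually be replaced by a task queue and a persistent job store, so results survive restarts and multiple server replicas can share state.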