ML server takes too much time to predict, what should be done in production and API #2338

Answered by parano
aliebrahiiimi asked this question in Q&A

Hi @aliebrahiiimi - it sounds like the prediction task itself is taking a significant amount of time. If you also require low latency for this use case, the only solution would be to scale your inference workload vertically or horizontally.
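
Before scaling in either direction, it may be worth confirming that model inference itself dominates the latency rather than serialization or network overhead. A minimal sketch of that check, assuming a hypothetical `predict()` stand-in for your model and a locally running model server exposing a `/predict` endpoint (both are placeholders, not something from this discussion):

```python
import time
import requests

def predict(batch):
    # Stand-in for your real model call; replace with the actual inference code.
    return [x["feature"] * 2 for x in batch]

payload = [{"feature": 0.5}] * 1000  # one request carrying many prediction tasks

# Time the raw model call, with no serving overhead.
start = time.perf_counter()
predict(payload)
print(f"direct inference: {time.perf_counter() - start:.3f}s")

# Time the same payload through the model server (endpoint name is an assumption).
start = time.perf_counter()
requests.post("http://127.0.0.1:3000/predict", json=payload, timeout=300)
print(f"via API:          {time.perf_counter() - start:.3f}s")
```

If the two numbers are close, the API layer is not the bottleneck and scaling the inference workload is the right lever.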

Vertically: consider using a more powerful host server with more CPU cores or a GPU.

Horizontally: BentoML & Yatai deployment allows users to create an auto-scaling deployment on Kubernetes that runs many replicas of model server workers to handle large-scale workloads. However, it is mainly designed for handling a large number of requests, not one prediction request containing a large number of prediction tasks. I think this is a very interesting scenario where i…
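
For the single-large-request case, one client-side workaround (not a built-in BentoML feature) is to split the batch into chunks and send them concurrently, so an auto-scaled deployment can spread the work across replicas behind the load balancer. A rough sketch, where the service URL, `/predict` endpoint, chunk size, and worker count are all assumptions to adapt to your deployment:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

SERVICE_URL = "http://my-bento-service.example.com/predict"  # placeholder URL
CHUNK_SIZE = 100

def predict_chunk(chunk):
    # Each chunk is an independent request, so the load balancer can route
    # chunks to different model server replicas.
    resp = requests.post(SERVICE_URL, json=chunk, timeout=60)
    resp.raise_for_status()
    return resp.json()

def predict_large_batch(tasks):
    chunks = [tasks[i:i + CHUNK_SIZE] for i in range(0, len(tasks), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(predict_chunk, chunks)
    # Flatten chunk results back into one list, preserving input order.
    return [pred for chunk_result in results for pred in chunk_result]
```

The trade-off is that ordering and error handling now live in the client, but it lets a replicated deployment help even when the workload arrives as one oversized request.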
