Consider optimizing the FastAPI/OpenAI API server in vLLM, as the server is widely used and appears to have significant overhead. On 1×A100 with Llama 13B, the `LLM` class reaches 90–100% GPU utilization, while the API server only reaches about 50%.
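To make the utilization gap concrete, one way to measure it is to poll `nvidia-smi` while each workload runs and average the readings. A minimal sketch (the helper names here are hypothetical, and the sampling loop assumes an NVIDIA GPU is present):

```python
import statistics
import subprocess
import time


def parse_utilization(output: str) -> list[int]:
    """Parse per-GPU utilization percentages from the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`."""
    return [int(line.strip()) for line in output.strip().splitlines() if line.strip()]


def sample_gpu_utilization(samples: int = 10, interval: float = 1.0) -> float:
    """Average utilization of GPU 0 over `samples` polls (requires an NVIDIA GPU)."""
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        readings.append(parse_utilization(out)[0])
        time.sleep(interval)
    return statistics.mean(readings)
```

Running this while driving the `LLM` class directly, and again while sending the same requests through the API server, would give comparable average-utilization numbers for the two paths.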
Hi @imoneoi, thanks for letting us know about this issue. We've observed some slowdown when using the API server, but didn't realize it was this significant. We'll investigate it.
Related: #459