Consider optimizing the API server #580

Open · imoneoi opened this issue Jul 26, 2023 · 3 comments
Labels: performance (Performance-related issues)

Comments

imoneoi (Contributor) commented Jul 26, 2023

Consider optimizing the FastAPI/OpenAI API server in vLLM, as the server is widely used and seems to have a lot of overhead. On a single A100 with Llama 13B, the LLM class reaches 90~100% GPU utilization, while the API server only reaches about 50%.

Related: #459
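
A minimal sketch of the offline half of this comparison, using the public `LLM` class (the model name, prompt, and batch size here are illustrative, not taken from the report above):

```python
import time

from vllm import LLM, SamplingParams

# Model name, prompt, and batch size are illustrative assumptions.
llm = LLM(model="meta-llama/Llama-2-13b-hf")
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Hello, my name is"] * 256  # batch large enough to saturate the GPU

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Each RequestOutput carries the generated token ids for its completions.
generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")
```

Sending the same prompt set through the API server and comparing tokens/s (and `nvidia-smi` utilization) should reproduce the gap described above.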

imoneoi (Contributor, Author) commented Jul 26, 2023

Also, it would be nice to add a throughput benchmark for the API server.
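
One possible shape for such a benchmark: fire concurrent requests at the OpenAI-compatible `/v1/completions` endpoint with bounded concurrency and report tokens/s and requests/s. The URL, model name, and request counts below are assumptions for illustration:

```python
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"  # assumed server address
PAYLOAD = {
    "model": "meta-llama/Llama-2-13b-hf",  # assumed model name
    "prompt": "Hello, my name is",
    "max_tokens": 128,
    "temperature": 0.0,
}
CONCURRENCY = 64
NUM_REQUESTS = 256


async def one_request(session: aiohttp.ClientSession, sem: asyncio.Semaphore) -> int:
    async with sem:
        async with session.post(URL, json=PAYLOAD) as resp:
            body = await resp.json()
            # usage.completion_tokens follows the OpenAI response schema.
            return body["usage"]["completion_tokens"]


async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        counts = await asyncio.gather(
            *(one_request(session, sem) for _ in range(NUM_REQUESTS))
        )
        elapsed = time.perf_counter() - start
    print(f"{sum(counts) / elapsed:.1f} generated tokens/s, "
          f"{NUM_REQUESTS / elapsed:.2f} req/s")


if __name__ == "__main__":
    asyncio.run(main())
```

Sweeping `CONCURRENCY` would also show how the server's overhead scales with request parallelism.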

WoosukKwon (Collaborator) commented

Hi @imoneoi, thanks for letting us know about the issue. We've observed some slowdown when using the API server, but didn't realize it was this significant. We will investigate.

zhuohan123 added the bug (Something isn't working) label on Aug 7, 2023
hmellor (Collaborator) commented Mar 8, 2024

@WoosukKwon has this investigation happened yet? If yes, is there any discussion/issue/PR you can link to?

DarkLight1337 added the performance (Performance-related issues) label and removed the bug (Something isn't working) label on May 31, 2024