Consider optimizing the FastAPI/OpenAI API server in vLLM, as the server is widely used and appears to have significant overhead. On 1×A100 with Llama 13B, the `LLM` class reaches 90–100% GPU utilization, while the API server only reaches about 50%.
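To make the utilization gap concrete, one way to measure it is to poll `nvidia-smi` while each workload runs and average the readings. A minimal sketch (the helper names here are hypothetical, and the sampling loop assumes an NVIDIA GPU is present):

```python
import statistics
import subprocess
import time


def parse_utilization(output: str) -> list[int]:
    """Parse per-GPU utilization percentages from the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`."""
    return [int(line.strip()) for line in output.strip().splitlines() if line.strip()]


def sample_gpu_utilization(samples: int = 10, interval: float = 1.0) -> float:
    """Average utilization of GPU 0 over `samples` polls (requires an NVIDIA GPU)."""
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        readings.append(parse_utilization(out)[0])
        time.sleep(interval)
    return statistics.mean(readings)
```

Running this while driving the `LLM` class directly, and again while sending the same requests through the API server, would give comparable average-utilization numbers for the two paths.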
Hi @imoneoi, thanks for letting us know about this issue. We've observed some slowdown when using the API server, but didn't realize it was this significant. We'll investigate it.
Related: #459