IMPORTANT Bug: Model returns empty response (output len = 0) when receiving multiple concurrent requests #3209
Comments
I don't think the model is returning empty responses; rather, these requests were just failing due to some error. Regarding the rate limiter: I think that's a good point, and it's also mentioned in #3127. I'm curious, is there any reason why you're not setting --request-rate?
Hi, thanks for your quick response.
And it is INSIDE the if statement:
(https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L248) This means that even when response.status == 200, the model outputs empty text. I'm setting the request rate to "inf" in benchmark_serving.py to simulate concurrent users. But what I suggest is putting the rate limiter directly into the API endpoint code in openai/api_server.py.
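To illustrate: the request only enters the parsing branch when the status is 200, yet the text accumulated inside that branch can still end up empty. A simplified sketch of the pattern (not the exact backend_request_func.py code; the chunk-parsing details here are assumed):

import json
import aiohttp

async def send_request(session: aiohttp.ClientSession, url: str, payload: dict) -> str:
    """Return the generated text; it may legitimately be "" even on HTTP 200."""
    generated_text = ""
    async with session.post(url, json=payload) as response:
        if response.status == 200:
            # Accumulate streamed "data: {...}" chunks into the final text.
            async for chunk in response.content:
                line = chunk.decode("utf-8").strip()
                if not line or line == "data: [DONE]":
                    continue
                if line.startswith("data: "):
                    data = json.loads(line[len("data: "):])
                    generated_text += data["choices"][0].get("text", "")
    # Nothing here flags generated_text == "" as a failure; that is the gap
    # discussed in this thread.
    return generated_text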
We actually do when we calculate the metrics, see vllm/benchmarks/benchmark_serving.py, line 133 (at commit 05af6da).
Yes, I also saw that, but the problem is that when calculating the metrics, if you just add zeros to the output lens it only reduces the final average throughput (tokens/s); it doesn't show any warning or error.
That's true, but there are two separate issues we should be tracking here: why the server returns empty responses in the first place, and the fact that the benchmark script currently doesn't surface these failures.
Error logging makes sense to me, and I can add it to my current PR in progress, #3194.
Thanks, it would be great if you could add what I identified above to your benchmark_serving.py PR. But the main point here is to see if we can figure out why the model returns empty responses and fix it. As far as I know, we already have a request queue handled by the AsyncLLMEngine here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L130 Or I can just explicitly put another rate-limit queue decorator above the POST API at https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py#L178
That's correct, but for an empty output, how do you currently differentiate between a connection failure and the model actually generating nothing? |
So could it happen that response.status == 200 and the connection still failed somehow? I thought the response status is used to determine that, no?
What I'm saying is shown at vllm/benchmarks/backend_request_func.py, lines 270 to 271 (at commit 05af6da).
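In other words, the pattern there is roughly the following (paraphrased, not the exact lines at that commit): a client-side exception is caught and recorded as an unsuccessful request with empty text, which in the aggregated results looks the same as a successful request that generated nothing:

from dataclasses import dataclass
import aiohttp

@dataclass
class RequestFuncOutput:
    """Simplified stand-in for the benchmark's per-request result record."""
    generated_text: str = ""
    success: bool = False

async def request_with_error_handling(
    session: aiohttp.ClientSession, url: str, payload: dict
) -> RequestFuncOutput:
    output = RequestFuncOutput()
    try:
        async with session.post(url, json=payload) as response:
            if response.status == 200:
                output.generated_text = await response.text()
                output.success = True
    except (aiohttp.ClientOSError, aiohttp.ServerDisconnectedError):
        # A dropped connection leaves generated_text == "" and success == False,
        # which in the aggregated metrics looks just like a successful request
        # that produced no tokens, unless the error itself is recorded.
        output.success = False
    return output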
Good point, let's check how we can set these numbers explicitly. But I even replaced the aiohttp client currently used in benchmark_serving.py: I tried the OpenAI Python package, and I tried HTTPX (which is actually what the openai package uses under the hood), and the empty responses persist with every client package. And it happens with a small number of concurrent requests as well: even with 5 or 10 concurrent requests, 2 or 3 of them come back empty.
Consistently facing the same issue, even without concurrent requests.
That's more alarming indeed. I'll incorporate some changes in my PR to try to rule out the possibility that these errors are caused by the benchmark script itself rather than the actual model server.
I've added changes in #3194 to capture the request errors and actual output lengths, so you can examine them afterwards in the result JSON by specifying the corresponding flag. Command:
Results:
@ywang96 I see that you used another dataset named sonnet instead of ShareGPT, and the requests per second is 7.63 instead of 1.72 as in my result. If possible, could you run on ShareGPT with my configuration of 500 input and about 150 max output tokens in each request? Also, what are your prompt input & max output?
@tattrongvu 550, 150 (you can also tell by dividing the total input & output tokens by the number of requests in the benchmark result printout). If you take a look at #3194 and use that version of the benchmark script...
@ywang96 OK, let me run your modified benchmark on the sonnet dataset and also again on ShareGPT.
@ywang96 So, I just copied the "sample_sonnet_requests" function from your repo at https://github.com/ywang96/vllm/blob/add-prefix/benchmarks/benchmark_serving.py#L103, put it into my modified benchmark as above, and used it there.
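The exact call isn't preserved in this thread; the following is only a sketch of how such a sampling function might be invoked, with the signature assumed from the linked branch and placeholder values based on the numbers mentioned in this thread (400 requests, roughly 500 input / 150 output tokens; the prefix length and model name are purely illustrative):

from transformers import AutoTokenizer

# sample_sonnet_requests is assumed to be defined in this file
# (copied from the linked add-prefix branch).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

input_requests = sample_sonnet_requests(
    dataset_path="sonnet.txt",
    num_requests=400,
    input_len=500,
    output_len=150,
    prefix_len=200,
    tokenizer=tokenizer,
)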
So from the result, the good news is that it completed all 400 requests without empty responses. Could you use ShareGPT on your setup to confirm? Otherwise, I see that on my setup I get 2.9 req/s instead of 7.2; maybe I should update to vLLM 0.3.3 :)
@tattrongvu @ywang96 Guys, any help?
I identified the problem. The raw ShareGPT prompts give the model no clear task, so I added a prompt template and then used that template to format each sampled prompt from the ShareGPT dataset.
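The original template from this comment was not preserved; the snippet below is only an illustrative stand-in consistent with the summarization behaviour described next:

# Illustrative stand-in template (the original from the comment is not shown
# in this thread); it turns each raw ShareGPT prompt into a summarization task.
PROMPT_TEMPLATE = (
    "Below is a document. Write a short summary of it.\n\n"
    "Document:\n{document}\n\nSummary:"
)

def format_prompt(raw_prompt: str) -> str:
    """Wrap a sampled ShareGPT prompt so the model always has a clear task."""
    return PROMPT_TEMPLATE.format(document=raw_prompt)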
The model will then try to summarize the document, and hence no longer outputs empty responses.
@tattrongvu But I already pass my prompt through a prompt template before feeding it to my model.
When I did a bunch of load tests against the vLLM endpoint with the OpenAI API, I saw that the server returns 20% to 50% empty responses when it receives multiple concurrent requests.
Configuration:
Number of concurrent requests: 100, with the request rate left at the default of "inf", meaning all 100 requests are sent concurrently.
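For reference, the pacing logic in benchmark_serving.py behaves roughly like the sketch below (paraphrased, not the exact code): with a finite request rate it sleeps between requests, while "inf" dispatches everything back to back.

import asyncio
import numpy as np

async def get_request(input_requests, request_rate: float):
    """Yield requests, optionally pacing them with exponentially distributed gaps."""
    for request in input_requests:
        yield request
        if request_rate == float("inf"):
            # No pacing: the caller fires off every request immediately.
            continue
        await asyncio.sleep(np.random.exponential(1.0 / request_rate))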
How to replicate:
I'm using the latest benchmark_serving.py (as of 5.3.2024) with the following modifications:
I commented out L69-L72: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py#L69
and modified the following:
and set the params like this:
Moreover, since the current code at https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L265
doesn't check the output length, I need to add two more lines of code to flag empty outputs.
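Something along these lines (a sketch rather than the exact lines I added; output refers to the per-request result object in backend_request_func.py):

# Added check: treat an empty completion as a failed request.
if output.success and len(output.generated_text) == 0:
    output.success = False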
With the above configuration, the success rate is 76/100; 24 requests returned an empty response.
NOTE:
I tried replacing aiohttp with httpx and with the openai package as the HTTP client. The problem persists.
Moreover, the empty responses happen even with a small number of concurrent requests. If we send 5 or 10 requests concurrently, we will get 2 or 3 empty responses.
I tested with Mixtral 8x7B and Llama-2-70B unquantized; the number of empty responses is even worse, up to 50%.
My suggestion is to put a rate limiter (maybe just a decorator or an await rate_limit() call) into api_server.py to limit the request rate the server will accept to some predefined number, e.g. 10 requests / 5 seconds. I could provide an MR later if needed.
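To make the suggestion concrete, here is a minimal sketch of what such a limiter could look like for an async endpoint; this is not vLLM's actual api_server.py code, and the names (rate_limit, create_chat_completion) are illustrative:

import asyncio
import time
from functools import wraps

def rate_limit(max_requests: int, per_seconds: float):
    """Allow at most `max_requests` calls per `per_seconds` sliding window."""
    lock = asyncio.Lock()
    timestamps: list[float] = []

    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            while True:
                async with lock:
                    now = time.monotonic()
                    # Drop timestamps that fell out of the window.
                    timestamps[:] = [t for t in timestamps if now - t < per_seconds]
                    if len(timestamps) < max_requests:
                        timestamps.append(now)
                        break
                    wait_time = per_seconds - (now - timestamps[0])
                await asyncio.sleep(wait_time)
            return await func(*args, **kwargs)
        return wrapper
    return decorator

# Example: cap the endpoint at 10 requests per 5 seconds.
@rate_limit(max_requests=10, per_seconds=5.0)
async def create_chat_completion(request):
    ...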
Please take a look and let me know if something is wrong here.
Thanks in advance.