[Bug] sglang stops returning valid responses after running for a few hours #1270
Comments
Just a bit curious: why is there a `vllm` prefix on each line of the log?
@zhyncs The script is `vllm.sh`. I start it with `pm2 start vllm.sh`, hence the `vllm` prefix, but it was actually running the sglang server.
Okay, this is an issue worth paying attention to; we place great importance on stability. Can you provide more detailed steps to reproduce it? We haven't encountered a similar problem online before.
The same issue happened on Docker Compose.
You can try running this script with `--tp 8` for a whole day to reproduce it.
I think this is similar to what we've seen previously. Last time I mentioned a possible workaround; @liho00, could you try using it?
Hi, I'm using 8xH100 and need maximum performance, so I don't think I'm going to use that option.
I am facing the same issue: my requests are timing out randomly. There are two errors. The first is a timeout error, which persists even after I increased the timeout following the hyperparameter tuning guide (https://sglang.readthedocs.io/en/latest/hyperparameter_tuning.html); the traceback shows several chained exceptions (truncated here).

While playing with `--max-running-requests`, I found that reducing it below the default value somehow reduced the occurrence of the timeouts. I will upload the second error once my execution completes and I can reproduce it again; I am in the middle of some tests.
Check your vllm version:
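A quick way to check which versions are actually installed locally; this is a minimal standard-library sketch, and the package list is simply the ones mentioned in this thread:

```python
# Minimal sketch: print the installed versions of the packages discussed
# in this thread, using only the standard library.
import importlib.metadata as md

for pkg in ("sglang", "vllm", "flashinfer"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```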
The latest sglang 0.3.0 ships with vLLM 0.6.0 by default, and I can confirm that the hanging issue is gone so far, without using that workaround. Let me know if you have the same problem.
We use sglang in production and have not run into these problems. A few tips for increasing stability:
Same problem on 8xA800, with cuda_graph OFF and custom_all_reduce ON. After hanging for a long time (maybe 10-20 minutes), NCCL errors occurred with this log: [rank5]:[E925 09:10:50.117391366 ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28387, OpType=_ALLGATHER_BASE, NumelIn=25088, NumelOut=200704, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
I got the same problem on 8xA800; the error log is almost the same.
I have the same issue: the server runs for a while, then stops responding, causing all requests to hang until they eventually time out. Sometimes it happens within 2 minutes, and other times it works for several hours before the issue arises. I know it occurs due to a specific request, but because I receive so many requests, I haven't been able to pinpoint the one causing the problem.

Here's what I discovered: with the older sglang==0.2.13 and the versions listed below, everything worked fine and the issue didn't occur anymore. I experimented with downgrading sglang and found that the issue appears after version 0.2.13, so I'm currently using 0.2.13.

Yesterday, I came across something else: I installed a new system but forgot to install the specific version flashinfer==0.1.5, instead installing the latest version alongside sglang 0.2.13, and the issue resurfaced. So using sglang==0.2.13 with flashinfer==0.1.6 also causes the same problem. I believe the issue is related to flashinfer==0.1.6.

Update: Yes, the issue is with flashinfer==0.1.6. I tried sglang==0.2.14 (the latest sglang version that supports flashinfer==0.1.5) with flashinfer==0.1.5, and it worked fine, but with flashinfer==0.1.6 the issue occurs.
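One way to sanity-check a deployment against the combinations reported in this comment is a small version gate; this is an illustrative sketch, and the pairings are anecdotal data points from this thread, not an official compatibility matrix:

```python
import importlib.metadata as md

# Version pairings reported as stable earlier in this thread (sglang, flashinfer);
# treat them as anecdotal, not an official compatibility matrix.
KNOWN_GOOD = {("0.2.13", "0.1.5"), ("0.2.14", "0.1.5")}

def installed(pkg: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return md.version(pkg)
    except md.PackageNotFoundError:
        return None

pair = (installed("sglang"), installed("flashinfer"))
status = "known-good" if pair in KNOWN_GOOD else "untested"
print(status, "combination:", pair)
```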
Hi, did you follow my reproduction steps? Could you try them and run the stress test for at least 8 hours to see if it happens?
Why did you say the latest sglang ships with vllm 0.6.0? I find it's still v0.5.5:
We have also encountered this problem before; applying the following PR solves it:
this worked |
GPU: A100 with NVLink. I also encountered this situation. When I used sglang to deploy a qwen2.5-3B LoRA fine-tuned model with the OpenAI-style RESTful API, after a period of high concurrent requests the model responses became repetitive. This is the second time I have encountered it.
Checklist
Describe the bug
sglang runs for a few hours, then stops returning valid responses. Based on the pm2 logs, it does not trigger any error or message.
Expected: it always returns token output, as below.
I had to restart it every 5-8 hours, whenever it stopped working.
Reproduction
Just keep sending 1-8 concurrent requests with large inputs and max_new_token around 1000-7000 for around 7-8 hours or less; you will see it stop generating token outputs.
Above is my running command.
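The reproduction described above can be sketched as a small load generator. Everything here is an assumption for illustration (the URL, port 30000, model name, prompt, and request count are placeholders, not the reporter's original setup); it only fires when `SGLANG_URL` is set:

```python
# Hedged sketch: hammer a local sglang OpenAI-compatible endpoint with up to
# 8 concurrent requests and large max_tokens values, per the reproduction above.
import json
import os
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = os.environ.get("SGLANG_URL", "http://localhost:30000/v1/completions")

def build_payload(max_new_tokens: int) -> dict:
    """Build one completion request with a large generation budget."""
    return {
        "model": "default",  # placeholder model name
        "prompt": "Write a long story about distributed systems.",
        "max_tokens": max_new_tokens,
    }

def send_one(_: int) -> int:
    payload = build_payload(random.randint(1000, 7000))
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return resp.status

if __name__ == "__main__" and os.environ.get("SGLANG_URL"):
    # Loop for hours with 8 in-flight requests; watch for the point
    # where responses stop arriving.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for status in pool.map(send_one, range(10000)):
            print(status)
```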
Environment
main branch