
[Bug] sglang run for few hours, it will stop returning valid response #1270

Closed
liho00 opened this issue Aug 30, 2024 · 20 comments

@liho00

liho00 commented Aug 30, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

After sglang runs for a few hours, it stops returning valid responses. Based on the pm2 logs, it does not print any error or message when this happens.
(screenshot: pm2 logs showing the server idle, with no errors)

I expect it to keep returning token output, as below:
(screenshot: normal decode logs with token output)

I have to restart the server every 5-8 hours, whenever it stops working...

Reproduction

Just keep sending 1-8 concurrent requests with a large max_new_tokens (around 1000-7000) for about 7-8 hours, sometimes less, and you will see the server stop generating token output.

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 5000 --api-key xxxx --tp 8 --mem-fraction-static 0.8

# tried with --mem-fraction-static 0.7 as well; the same thing happened

The above is my launch command.

Environment

main branch

Name: sglang
Version: 0.2.14.post2
Summary: SGLang is yet another fast serving framework for large language models and vision language models.
Home-page: https://github.com/sgl-project/sglang
Author: 
Author-email: 
License: Apache License

@zhyncs
Member

zhyncs commented Aug 30, 2024

Just a bit curious, why is there a vllm prefix on each log line?

@liho00
Author

liho00 commented Aug 30, 2024

Just a bit curious, why is there a vllm prefix on each log line?

@zhyncs
Just ignore that. I named the script vllm.sh because it originally contained a vLLM command before I switched to sglang.

vllm.sh

# vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --api-key xxx --port 5000 --served-model-name Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 8
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 5000 --api-key xxx --tp 8 --mem-fraction-static 0.8
# --schedule-policy fcfs 
# --schedule-conservativeness 0.3

I start it with pm2 start vllm.sh, hence the vllm prefix in the logs, but it is actually running the sglang server.

@zhyncs
Member

zhyncs commented Aug 30, 2024

Okay, this is an issue worth paying attention to. We place great importance on stability. Can you provide more detailed steps to reproduce it? We haven't encountered a similar problem in our online deployments before.

@liho00
Author

liho00 commented Aug 31, 2024

[04:40:55 TP0] Decode batch. #running-req: 2, #token: 4078, token usage: 0.00, gen throughput (token/s): 447.31, #queue-req: 0
[04:40:55 TP0] Decode batch. #running-req: 2, #token: 4158, token usage: 0.00, gen throughput (token/s): 405.38, #queue-req: 0
[04:40:55 TP0] Decode batch. #running-req: 2, #token: 4238, token usage: 0.00, gen throughput (token/s): 446.97, #queue-req: 0
[04:40:56 TP0] Decode batch. #running-req: 1, #token: 3902, token usage: 0.00, gen throughput (token/s): 333.97, #queue-req: 0
[04:40:56 TP0] Decode batch. #running-req: 1, #token: 3942, token usage: 0.00, gen throughput (token/s): 238.27, #queue-req: 0
[04:40:56 TP0] Decode batch. #running-req: 1, #token: 3982, token usage: 0.00, gen throughput (token/s): 238.71, #queue-req: 0
[04:40:56 TP0] Decode batch. #running-req: 1, #token: 4022, token usage: 0.00, gen throughput (token/s): 212.82, #queue-req: 0
[04:40:56 TP0] Decode batch. #running-req: 1, #token: 4062, token usage: 0.00, gen throughput (token/s): 238.82, #queue-req: 0
[04:40:57 TP0] Decode batch. #running-req: 1, #token: 4102, token usage: 0.00, gen throughput (token/s): 237.51, #queue-req: 0
[04:40:57 TP0] Decode batch. #running-req: 1, #token: 4142, token usage: 0.00, gen throughput (token/s): 239.00, #queue-req: 0
[04:40:57 TP0] Decode batch. #running-req: 1, #token: 4182, token usage: 0.00, gen throughput (token/s): 195.34, #queue-req: 0
[04:40:57 TP0] Decode batch. #running-req: 1, #token: 4222, token usage: 0.00, gen throughput (token/s): 237.64, #queue-req: 0
[04:40:57 TP0] Decode batch. #running-req: 1, #token: 4262, token usage: 0.00, gen throughput (token/s): 214.66, #queue-req: 0
[04:40:57 TP0] Decode batch. #running-req: 1, #token: 4302, token usage: 0.00, gen throughput (token/s): 237.25, #queue-req: 0
[04:40:58 TP0] Decode batch. #running-req: 1, #token: 4342, token usage: 0.00, gen throughput (token/s): 237.40, #queue-req: 0
[04:40:58 TP0] Decode batch. #running-req: 1, #token: 4382, token usage: 0.00, gen throughput (token/s): 238.04, #queue-req: 0
[04:40:58 TP0] Decode batch. #running-req: 1, #token: 4422, token usage: 0.00, gen throughput (token/s): 234.95, #queue-req: 0
[04:40:58 TP0] Decode batch. #running-req: 1, #token: 4462, token usage: 0.00, gen throughput (token/s): 238.16, #queue-req: 0
[04:40:58 TP0] Decode batch. #running-req: 1, #token: 4502, token usage: 0.00, gen throughput (token/s): 238.47, #queue-req: 0
[04:40:58 TP0] Decode batch. #running-req: 1, #token: 4542, token usage: 0.00, gen throughput (token/s): 238.68, #queue-req: 0
[04:40:59 TP0] Decode batch. #running-req: 1, #token: 4582, token usage: 0.00, gen throughput (token/s): 238.36, #queue-req: 0
INFO:     127.0.0.1:46160 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[04:40:59 TP0] Prefill batch. #new-seq: 1, #new-token: 99, #cached-token: 29, cache hit rate: 49.75%, #running-req: 1, #queue-req: 0
INFO:     127.0.0.1:46158 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[04:40:59 TP0] Prefill batch. #new-seq: 1, #new-token: 88, #cached-token: 29, cache hit rate: 49.75%, #running-req: 2, #queue-req: 0
[04:40:59 TP0] Decode batch. #running-req: 3, #token: 4829, token usage: 0.00, gen throughput (token/s): 228.11, #queue-req: 0
[04:40:59 TP0] Decode batch. #running-req: 3, #token: 4949, token usage: 0.00, gen throughput (token/s): 584.23, #queue-req: 0
[04:40:59 TP0] Decode batch. #running-req: 3, #token: 5069, token usage: 0.00, gen throughput (token/s): 641.59, #queue-req: 0
[04:40:59 TP0] Decode batch. #running-req: 3, #token: 5189, token usage: 0.00, gen throughput (token/s): 587.18, #queue-req: 0
[04:41:00 TP0] Decode batch. #running-req: 3, #token: 5309, token usage: 0.00, gen throughput (token/s): 541.95, #queue-req: 0
INFO:     127.0.0.1:46174 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[04:41:00 TP0] Prefill batch. #new-seq: 1, #new-token: 81, #cached-token: 37, cache hit rate: 49.75%, #running-req: 3, #queue-req: 0
INFO:     127.0.0.1:46184 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:46198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40256 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40272 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40276 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40290 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40298 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:39996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40006 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40014 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40018 - "GET /health HTTP/1.1" 200 OK
INFO:     127.0.0.1:40032 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:40040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:43596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:43612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:43628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:43640 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:43648 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:43656 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:43670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33530 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33546 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33568 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33578 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33590 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33606 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:33626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:53204 - "GET /health HTTP/1.1" 200 OK
INFO:     127.0.0.1:53218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:53226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:53236 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:53248 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:41350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:41356 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:41368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:41380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:41382 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:41384 - "POST /v1/chat/completions HTTP/1.1" 200 OK

The same issue happened with Docker Compose:

services:
  sglang:
    image: lmsysorg/sglang:latest
    container_name: sglang
    volumes:
      - /home/shadeform:/sglang
    restart: always
    network_mode: host
    # Or you can only publish port 30000
    # ports:
    #   - 30000:30000
    environment:
      HF_TOKEN: <secret>
    entrypoint: python3 -m sglang.launch_server
    command:
      --model-path /sglang/Meta-Llama-3.1-8B-Instruct
      --host 0.0.0.0
      --port 5000
      --api-key xxx
      --tp 8 
      --schedule-policy fcfs
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5000/health || exit 1"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1', '2', '3', '4', '5', '6', '7']
              capabilities: [gpu]

f0d77e4d26b0   lmsysorg/sglang:latest                "python3 -m sglang.l…"   6 hours ago   Up 6 hours (healthy)                                                                                                                                         sglang
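(Side note: the healthcheck in the compose file only hits /health, which can keep returning 200 OK even while generation is stuck, which is why the container above still reports healthy after six hours. A deeper probe that requests an actual token catches the hang. Below is a minimal sketch assuming the port and placeholder API key from the files above; the model name, timeout, and the restart wiring in whatever supervisor runs this are assumptions to adapt.)

# deep_healthcheck.py -- minimal sketch of a liveness probe for the
# "server up but not generating" state described in this issue.
# Endpoint, port, and placeholder API key are taken from the commands above;
# the model name and timeout are assumptions to adjust for your setup.
import sys

import requests

BASE_URL = "http://127.0.0.1:5000"
API_KEY = "xxx"  # same placeholder key as in the launch command above


def can_generate(timeout_s: float = 30.0) -> bool:
    """Return True only if the server actually produces a token, not just 200 on /health."""
    try:
        resp = requests.post(
            f"{BASE_URL}/v1/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                # The model name may need to match the served model name.
                "model": "Meta-Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": "ping"}],
                "max_tokens": 1,
            },
            timeout=timeout_s,  # a hung decode loop surfaces as a timeout here
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    # Exit non-zero when generation is stuck so a supervisor
    # (pm2, compose healthcheck, cron) can restart the server.
    sys.exit(0 if can_generate() else 1)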

@liho00
Author

liho00 commented Sep 1, 2024

import asyncio
import random
from openai import AsyncOpenAI
import json
import time

# Plain (synchronous) generator so main() can call next() on it.
def generate_prompts():
    subjects = [
        "artificial intelligence", "climate change", "space exploration", "quantum computing",
        "renewable energy", "virtual reality", "biotechnology", "cybersecurity",
        "autonomous vehicles", "blockchain", "3D printing", "robotics", "nanotechnology",
        "gene editing", "Internet of Things", "augmented reality", "machine learning",
        "sustainable agriculture", "smart cities", "digital privacy"
    ]

    prompt_types = [
        "Explain the concept of",
        "Discuss the potential impact of",
        "Compare and contrast two approaches to",
        "Outline the future prospects of",
        "Describe the ethical implications of",
        "Analyze the current state of",
        "Propose a solution using",
        "Evaluate the pros and cons of",
        "Predict how {} will change in the next decade",
        "Discuss the role of {} in solving global challenges",
        "Explain how {} is transforming industry",
        "Describe a day in the life with advanced {}",
        "Outline the key challenges in developing {}",
        "Discuss the intersection of {} and another field",
        "Explain the historical development of"
    ]

    while True:
        subject = random.choice(subjects)
        prompt_type = random.choice(prompt_types)
        # Templates without a "{}" placeholder just get the subject appended.
        if "{}" in prompt_type:
            prompt = prompt_type.format(subject)
        else:
            prompt = f"{prompt_type} {subject}"
        yield prompt

async def process_prompt(client, prompt, i):
    print(f"\nPrompt {i}: {prompt}\n")

    start_time = time.time()
    response = await client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct",
        stream=False,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ]
    )

    # With stream=False the completion token count comes from the usage field.
    tokens = response.usage.completion_tokens if response.usage else 0

    print(f"Response {i}: {response.choices[0].message.content}")

    end_time = time.time()
    elapsed_time = end_time - start_time
    tokens_per_second = tokens / elapsed_time if elapsed_time > 0 else 0

    print(f"\n\nPrompt {i}: {tokens} tokens in {elapsed_time:.2f}s ({tokens_per_second:.2f} tokens/s)")
    print("-" * 50)

    return tokens_per_second

async def main():
    client = AsyncOpenAI(
        base_url="http://127.0.0.1:5000/v1",
        api_key="xxx"
    )

    semaphore = asyncio.Semaphore(8)  # limit to 8 concurrent requests

    async def bounded(prompt, i):
        # Hold the semaphore for the whole request so at most 8 run at once.
        async with semaphore:
            try:
                await process_prompt(client, prompt, i)
            except Exception as exc:
                print(f"Prompt {i} failed: {exc!r}")

    prompt_generator = generate_prompts()
    tasks = set()

    i = 0
    while True:
        i += 1
        prompt = next(prompt_generator)

        # Keep a reference to each task so exceptions are not silently dropped.
        task = asyncio.create_task(bounded(prompt, i))
        tasks.add(task)
        task.add_done_callback(tasks.discard)

        await asyncio.sleep(1)  # adjust the request rate as needed

# Run the main loop
asyncio.run(main())

You can run this script against a --tp 8 server for a whole day to reproduce the issue.

@m0g1cian

m0g1cian commented Sep 3, 2024

Okay, this is an issue worth paying attention to. We place great importance on stability. Can you provide more detailed steps to reproduce it? We haven't encountered a similar problem in our online deployments before.

I think this is similar to what we've seen previously with the --disable-custom-all-reduce issue. Here is what I've observed so far:

  • SRT randomly hangs when tensor parallelism (and therefore NCCL?) is used. Furthermore, the higher the throughput, the more likely a random TP process is to hang during inference.
  • If only one GPU is used per model (TP_SIZE=1, DP_SIZE=X), SRT runs just fine without hanging.
  • It seems to be an A800-specific issue; I have not seen a similar case on other NVIDIA GPUs. (@liho00 could you kindly share which GPU you are using? Also @zhyncs, you can simulate an A800 on an A100 with NCCL_MAX_NCHANNELS=16 and may see the same symptom.)

Last time I mentioned using NCCL_ALGO=Tree; even though it makes the hang less likely, it unfortunately cannot fully prevent this reliability issue.

@liho00 Could you try --disable-custom-all-reduce to see if it helps reliability (though the performance degradation is significant)? You can also try NCCL_ALGO=Tree separately to see if it helps.
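For reference, a rough sketch of wiring both suggestions into a launcher. The flag and NCCL_ALGO value are the ones named in this thread, the model path, port, key, and memory fraction mirror the original launch command, and none of this is a verified fix:

# launch_with_mitigations.py -- sketch of launching the server with the two
# mitigations suggested above (NCCL_ALGO=Tree and --disable-custom-all-reduce).
# Flag and env-var names come from this thread; the rest mirrors the
# reporter's original launch command.
import os
import subprocess

env = os.environ.copy()
env["NCCL_ALGO"] = "Tree"  # suggested above; reduces but does not eliminate hangs

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--port", "5000",
    "--api-key", "xxxx",
    "--tp", "8",
    "--mem-fraction-static", "0.8",
    "--disable-custom-all-reduce",  # suggested workaround; may cost some throughput
]
subprocess.run(cmd, env=env, check=True)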

@liho00
Author

liho00 commented Sep 4, 2024


Hi, I'm using 8xH100 and need maximum performance, so I don't think I'll run with --disable-custom-all-reduce. But thanks!

@tusg

tusg commented Sep 9, 2024

I am facing the same issue; my requests time out randomly. Here is the log.
To reproduce, I simply took a big prompt, ensuring the input length is at least around 3k+ tokens and the output token length is 768.
The model is a 16-bit gemma2-2b loaded on one L4 (21GB used out of 24GB), and you need to run at least 5k samples to see the issue.

There are two errors. One is the timeout error; I even increased the timeout following https://sglang.readthedocs.io/en/latest/hyperparameter_tuning.html:

Traceback (most recent call last):
  File "/home/jupyter/sglang/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 448, in _wait_for_response
    raise ValueError(f"Abort request {rid}")
ValueError: Abort request 270be79151ec40f18ea05f901d874961
[16:51:29] Task exception was never retrieved
future: <Task finished name='Task-4638090' coro=<<async_generator_asend without name>()> exception=ValueError('Abort request 270be79151ec40f18ea05f901d874961')>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jupyter/sglang/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 443, in _wait_for_response
    await asyncio.wait_for(state.event.wait(), timeout=4)
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

The code I used:

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# %%time  (Jupyter cell magic)
results = []
response = client.completions.create(
    model="default",
    prompt=full_prompt_list[:5000],
    temperature=0,
    max_tokens=768,
    # I added the timeout parameter here and then removed it, as I did not see any difference.
)
results.append(response.choices[0].text)
print("*********")

While playing with --max-running-requests, I found that reducing its default value somehow reduced how often the timeouts occur.
Even so, with around #running-req: 666 I still see the issue.

I will upload the second error once my current run completes and I can reproduce it again; I am in the middle of some tests.
Please do guide me in case you see that I am missing anything.
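(One mitigation to try, not a fix for the underlying hang: submit the 5000 prompts in smaller batches instead of one giant request, so fewer requests sit queued long enough to hit the abort timeout. A sketch reusing the client setup above; CHUNK_SIZE is a hypothetical tuning knob, and full_prompt_list stands in for the same prompt list as in the snippet above.)

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Placeholder for the same ~3k-token prompts used in the snippet above.
full_prompt_list = ["..."] * 5000

CHUNK_SIZE = 256  # hypothetical batch size; tune alongside --max-running-requests
prompts = full_prompt_list[:5000]
results = []
for start in range(0, len(prompts), CHUNK_SIZE):
    chunk = prompts[start:start + CHUNK_SIZE]
    response = client.completions.create(
        model="default",
        prompt=chunk,
        temperature=0,
        max_tokens=768,
    )
    # Each choice carries the index of the prompt it answers; sort to keep prompt order.
    results.extend(choice.text for choice in sorted(response.choices, key=lambda c: c.index))
print(f"collected {len(results)} completions")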

@jjjjohnson
Contributor

jjjjohnson commented Sep 11, 2024

Check your vLLM version:
https://github.com/vllm-project/vllm/pull/6816/files

@m0g1cian

Check your vLLM version: https://github.com/vllm-project/vllm/pull/6816/files

The latest sglang 0.3.0 comes with vLLM 0.6.0 by default, and I can confirm the hanging issue is gone so far without using disable-custom-all-reduce.
However, the reliability issues I am currently hitting are #1333 and a mysterious NCCL timeout problem:

[rank0]:[E911 14:09:50.564227782 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=559608, OpType=_ALLGATHER_BASE, NumelIn=32064, NumelOut=128256, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
[rank0]:[E911 14:09:50.564540235 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 559608, last enqueued NCCL work: 559608, last completed NCCL work: 559607.
[rank2]:[E911 14:09:50.571468924 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=559608, OpType=_ALLGATHER_BASE, NumelIn=32064, NumelOut=128256, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
[rank2]:[E911 14:09:50.571732367 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 559608, last enqueued NCCL work: 559608, last completed NCCL work: 559607.
[rank1]:[E911 14:09:50.640754658 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=559608, OpType=_ALLGATHER_BASE, NumelIn=32064, NumelOut=128256, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
[rank1]:[E911 14:09:50.641079638 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 559608, last enqueued NCCL work: 559608, last completed NCCL work: 559607.
[rank3]:[E911 14:09:50.642179576 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=559608, OpType=_ALLGATHER_BASE, NumelIn=32064, NumelOut=128256, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank3]:[E911 14:09:50.642446175 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 559608, last enqueued NCCL work: 559608, last completed NCCL work: 559607.
<------------OTHER LOGS--------->
[rank0]:[E911 14:27:17.381534232 ProcessGroupNCCL.cpp:1375] [PG 3 Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
[rank0]:[E911 14:27:17.381627495 ProcessGroupNCCL.cpp:1413] [PG 3 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank2]:[E911 14:27:17.381580340 ProcessGroupNCCL.cpp:1375] [PG 3 Rank 2] First PG on this rank that detected no heartbeat of its watchdog.
[rank1]:[E911 14:27:17.381580311 ProcessGroupNCCL.cpp:1375] [PG 3 Rank 1] First PG on this rank that detected no heartbeat of its watchdog.
[rank1]:[E911 14:27:17.381649923 ProcessGroupNCCL.cpp:1413] [PG 3 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank2]:[E911 14:27:17.381653573 ProcessGroupNCCL.cpp:1413] [PG 3 Rank 2] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E911 14:27:17.381702036 ProcessGroupNCCL.cpp:1375] [PG 3 Rank 3] First PG on this rank that detected no heartbeat of its watchdog.
[rank3]:[E911 14:27:17.381767851 ProcessGroupNCCL.cpp:1413] [PG 3 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1

Let me know if you guys have the same problem.

@merrymercy
Contributor

We use sglang in production and have not hit these problems. A few tips for increasing stability:

  1. Try disabling custom all-reduce with --disable-custom-all-reduce. For @liho00: the custom allreduce mostly helps small batch sizes, so it may not affect your throughput much; you can try it on your workload.
  2. Try --disable-cuda-graph-padding. Again, this won't hurt peak throughput.
  3. Try other NCCL configs as suggested by @m0g1cian.
  4. Try to match CUDA driver versions.

@learninmou
Contributor

learninmou commented Sep 25, 2024

Same problem on 8xA800, with CUDA graph OFF and custom all-reduce ON.
sglang version: 0.3.0

After hanging for a long time (maybe 10-20 minutes), NCCL errors occurred with the following log:

[rank5]:[E925 09:10:50.117391366 ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28387, OpType=_ALLGATHER_BASE, NumelIn=25088, NumelOut=200704, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
[rank5]:[E925 09:10:50.117657959 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 28387, last enqueued NCCL work: 28387, last completed NCCL work: 28386.
[rank5]:[E925 09:10:50.117668312 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 5] Timeout at NCCL work: 28387, last enqueued NCCL work: 28387, last completed NCCL work: 28386.
[rank5]:[E925 09:10:50.117672633 ProcessGroupNCCL.cpp:621] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E925 09:10:50.117676943 ProcessGroupNCCL.cpp:627] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E925 09:10:50.119128421 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28387, OpType=_ALLGATHER_BASE, NumelIn=25088, NumelOut=200704, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe4a9b77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe45bd848d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe45bd8b313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe45bd8d6fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fe4b8674253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fe4cca30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fe4ccac2850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28387, OpType=_ALLGATHER_BASE, NumelIn=25088, NumelOut=200704, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe4a9b77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe45bd848d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe45bd8b313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe45bd8d6fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fe4b8674253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fe4cca30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fe4ccac2850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe4a9b77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x7fe45ba16a84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fe4b8674253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fe4cca30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7fe4ccac2850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

@zeng-zc
Contributor

zeng-zc commented Sep 25, 2024


I hit the same problem on 8xA800; my error log is almost identical to @learninmou's above.

@PawanOsman

PawanOsman commented Sep 25, 2024

I have the same issue: the server runs for a while, then stops responding, causing all requests to hang until they eventually time out. Sometimes it happens within 2 minutes, and other times it works for several hours before the issue arises. I know it occurs due to a specific request, but because I receive so many requests, I haven't been able to pinpoint the one causing the problem.

Here's what I discovered:

When I used the older version of sglang alongside the versions listed below:

sglang==0.2.13
flashinfer==0.1.5+cu121torch2.4
vllm==0.5.4
vllm-flash-attn==2.6.1

everything worked fine, and the issue didn't occur anymore.

I experimented with downgrading sglang and found that the issue appears after version 0.2.13, so I'm currently using 0.2.13.

Yesterday, I came across something else: I set up a new machine but forgot to install the specific flashinfer==0.1.5, installing the latest version instead alongside sglang 0.2.13, and the issue resurfaced. So using sglang==0.2.13 with flashinfer==0.1.6 also causes the same problem.

I believe the issue might be related to flashinfer==0.1.6.

Update: Yeah, the issue is with flashinfer==0.1.6. I tried sglang==0.2.14 (the latest sglang version that supports flashinfer==0.1.5) with flashinfer==0.1.5, and it worked fine, but with flashinfer==0.1.6, the issue occurs.
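(To keep a rebuilt environment from drifting off that known-good combination again, a small startup assertion can help. Sketch only; the expected strings are simply the versions reported in this comment, not an officially supported matrix.)

# pin_check.py -- sketch of a startup assertion for the pins listed above, so
# a rebuilt environment does not silently pull in flashinfer 0.1.6 again.
from importlib.metadata import PackageNotFoundError, version

EXPECTED = {
    "sglang": "0.2.13",
    "flashinfer": "0.1.5+cu121torch2.4",
    "vllm": "0.5.4",
    "vllm-flash-attn": "2.6.1",
}

problems = []
for package, wanted in EXPECTED.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        problems.append(f"{package}: not installed (expected {wanted})")
        continue
    if installed != wanted:
        problems.append(f"{package}: installed {installed}, expected {wanted}")

if problems:
    raise SystemExit("version drift detected:\n" + "\n".join(problems))
print("all pins match")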

@liho00
Author

liho00 commented Sep 26, 2024


Hi @merrymercy, did you follow my reproduction steps?

Can you follow them and run the stress test for at least 8 hours to see whether it happens?
I suspect a normal run will look fine...

@zeng-zc
Contributor

zeng-zc commented Sep 29, 2024


@m0g1cian Why do you say the latest sglang ships with vLLM 0.6.0? It still looks pinned to v0.5.5:
https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml#L26

@coolhok
Contributor

coolhok commented Oct 10, 2024


We have also encountered this problem before; applying the following vLLM PR fixes it:
vllm-project/vllm#5866
@zhyncs

@linxkon

linxkon commented Oct 21, 2024


--disable-custom-all-reduce worked for me.

@merrymercy
Contributor

Thanks for reporting this. It has been fixed by #1779, #1782, and #1789

Try 0.3.4.post2

@Alwin4Zhang

GPU: A100 with NVLink
sglang version: 0.3.5
vllm version: 0.6.3.post1
pytorch: 2.4.0
flashinfer: 0.1.6+cu121torch2.4

(screenshots: model responses degenerating into repeated output under high concurrency)

I also ran into this. I deployed a Qwen2.5-3B LoRA fine-tuned model with sglang behind the OpenAI-style REST API; after a period of high-concurrency requests, the model responses start repeating. This is the second time I have encountered it.
