[Bug]: Error when running distributed inference with vLLM + Ray #5779
Comments
I had the same problem.
In addition I had to remove those variables from the environment:
ray==2.24.0
@thies1006 did you try https://docs.vllm.ai/en/latest/getting_started/debugging.html , especially the sanity check script? I assume it should catch your problem. I believe it is caused by
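The sanity check in those docs boils down to a small torch.distributed program. Below is a minimal sketch of the same idea (not vLLM's exact script): each rank all-reduces a tensor over the "gloo" backend, which exercises the same full-mesh TCP setup path that fails inside vLLM's CPU process group. The fallback values and the port number are assumptions for illustration; under torchrun the rank, world size, and master address come from the launcher.

```python
# Minimal Gloo connectivity check (a sketch modeled on the idea in the
# vLLM debugging docs, not the exact script shipped there). If Gloo
# cannot build its full TCP mesh between nodes, init_process_group or
# all_reduce fails with the same connectFullMesh error seen in vLLM.
import os

import torch
import torch.distributed as dist


def gloo_sanity_check() -> float:
    # Under torchrun, RANK/WORLD_SIZE/MASTER_ADDR are set by the
    # launcher; the fallbacks let this also run as a single process.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    master = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = os.environ.get("MASTER_PORT", "29531")  # arbitrary free port

    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{master}:{port}",
        rank=rank,
        world_size=world_size,
    )
    t = torch.ones(1)
    dist.all_reduce(t)  # sums across all ranks; a no-op at world_size=1
    dist.destroy_process_group()
    return t.item()
```

If this fails across two nodes while succeeding on each node alone, the problem is inter-node networking (interface selection or firewall) rather than vLLM itself.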
Thank you very much for the solution. After trying it I still get the same error. These are the two lines of code I added; I'm not sure if there is a problem with them: os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'
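For reference, that interface pinning can be wrapped so the Gloo and NCCL backends agree on the same NIC. A minimal sketch, assuming `eth0` is the interface that actually carries the inter-node traffic (verify with `ip addr`); the helper name is hypothetical, and the variables must be set before any process group is created:

```python
import os


def pin_network_interface(ifname: str = "eth0") -> dict:
    """Force Gloo and NCCL onto one NIC. Must run before torch creates
    any process group; setting it afterwards has no effect. "eth0" is
    an assumption taken from the comment above."""
    env = {
        "GLOO_SOCKET_IFNAME": ifname,  # CPU-side (Gloo) group traffic
        "NCCL_SOCKET_IFNAME": ifname,  # GPU-side (NCCL) group traffic
    }
    os.environ.update(env)
    return env
```

A common pitfall is that the interface name differs between the two nodes (e.g. `eth0` on one, `ens33` on the other), in which case each node needs its own value.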
Thank you very much for the suggestion. I tested GPU communication with the script and it raised an error, so that should be what is preventing the experiment from continuing. Do you have any direction toward a solution?
The error is as follows:
Hi @JKYtydt
@youkaichao Hello, I have now tried every fix I could find and still haven't solved this problem. Do you have any other suggestions, or is there any other information I can provide to help resolve it? Thank you very much. I am running on two machines with Ubuntu under Windows; the two nodes can ping each other. Based on your first reply, it seems the GPUs on the two nodes cannot communicate. I hope this is helpful.
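Worth noting: ping only exercises ICMP, while Gloo's connectFullMesh needs TCP connections between every pair of workers, in both directions. A minimal TCP reachability probe (a sketch; the host and port are placeholders for the peer node and whatever rendezvous port it listens on):

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.
    A firewall that permits ping but blocks TCP will fail here, which
    is one common cause of Gloo's connectFullMesh error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this from each node against the other (and in both directions) narrows the problem down to firewall or routing rules rather than vLLM or Ray.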
@thies1006 Thank you very much for the reply; I still haven't been able to solve this problem.
While testing with this script, I found another issue: if the --rdzv_backend=c10d parameter is not set, running the following command produces the error below
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Your current environment
Python==3.10.14
vllm==0.5.0.post1
ray==2.24.0
Node status
Active:
1 node_37c2b26800cc853721ef351ca107c298ae77efcb5504d8e0c900ed1d
1 node_62d48658974f4114465450f53fd97c10fcfe6d40b4e896a90a383682
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
Usage:
0.0/52.0 CPU
0.0/2.0 GPU
0B/9.01GiB memory
0B/4.14GiB object_store_memory
Demands:
(no resource demands)
🐛 Describe the bug
I ran into a problem when Gloo establishes its full-mesh connection and could not find a solution.
The script is as follows:
from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# LLM construction restored from the traceback below (line 13 of vllm_test.py)
llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True,
          gpu_memory_utilization=0.4, enforce_eager=True,
          tensor_parallel_size=2, swap_space=1)

outputs = llm.generate(prompts)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The error is as follows:
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/vllm_test.py", line 13, in <module>
[rank0]: llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4,enforce_eager=True,tensor_parallel_size=2,swap_space=1)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 223, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]: self._init_executor()
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
[rank0]: self._run_workers("init_device")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]: driver_worker_output = self.driver_worker.execute_method(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]: raise e
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]: return executor(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 115, in init_device
[rank0]: init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 354, in init_worker_distributed_environment
[rank0]: init_distributed_environment(parallel_config.world_size, rank,
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 553, in init_distributed_environment
[rank0]: _WORLD = GroupCoordinator(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 120, in __init__
[rank0]: cpu_group = torch.distributed.new_group(ranks, backend="gloo")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
[rank0]: return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
[rank0]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error