[Bug]: Error when running distributed inference with vLLM + Ray #5779
Comments
I had the same problem.
In addition I had to remove those variables from the environment:
ray==2.24.0
@thies1006 did you try https://docs.vllm.ai/en/latest/getting_started/debugging.html , especially the sanity check script? I assume it should catch your problem. I believe it is caused by
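The sanity check in those docs boils down to a small torch.distributed program. Below is a minimal sketch of the same idea (not vLLM's exact script): each rank all-reduces a tensor over the "gloo" backend, which exercises the same full-mesh TCP setup path that fails inside vLLM's CPU process group. The fallback values and the port number are assumptions for illustration; under torchrun the rank, world size, and master address come from the launcher.

```python
# Minimal Gloo connectivity check (a sketch modeled on the idea in the
# vLLM debugging docs, not the exact script shipped there). If Gloo
# cannot build its full TCP mesh between nodes, init_process_group or
# all_reduce fails with the same connectFullMesh error seen in vLLM.
import os

import torch
import torch.distributed as dist


def gloo_sanity_check() -> float:
    # Under torchrun, RANK/WORLD_SIZE/MASTER_ADDR are set by the
    # launcher; the fallbacks let this also run as a single process.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    master = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = os.environ.get("MASTER_PORT", "29531")  # arbitrary free port

    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{master}:{port}",
        rank=rank,
        world_size=world_size,
    )
    t = torch.ones(1)
    dist.all_reduce(t)  # sums across all ranks; a no-op at world_size=1
    dist.destroy_process_group()
    return t.item()
```

If this fails across two nodes while succeeding on each node alone, the problem is inter-node networking (interface selection or firewall) rather than vLLM itself.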
Thank you very much for the solution. After trying it I still get the same error. These are the two lines of code I added; I'm not sure if there is a problem with them: os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'
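For reference, that interface pinning can be wrapped so the Gloo and NCCL backends agree on the same NIC. A minimal sketch, assuming `eth0` is the interface that actually carries the inter-node traffic (verify with `ip addr`); the helper name is hypothetical, and the variables must be set before any process group is created:

```python
import os


def pin_network_interface(ifname: str = "eth0") -> dict:
    """Force Gloo and NCCL onto one NIC. Must run before torch creates
    any process group; setting it afterwards has no effect. "eth0" is
    an assumption taken from the comment above."""
    env = {
        "GLOO_SOCKET_IFNAME": ifname,  # CPU-side (Gloo) group traffic
        "NCCL_SOCKET_IFNAME": ifname,  # GPU-side (NCCL) group traffic
    }
    os.environ.update(env)
    return env
```

A common pitfall is that the interface name differs between the two nodes (e.g. `eth0` on one, `ens33` on the other), in which case each node needs its own value.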
Thank you very much for the suggestion. I tested GPU communication with the script and it raised an error, so that should be what is preventing the experiment from continuing. Do you have any direction toward a solution?
The error is as follows:
Hi @JKYtydt
@youkaichao Hello, I have now tried every fix I could find and still haven't solved this problem. Do you have any other suggestions, or is there any other information I can provide to help resolve it? Thank you very much. I am running on two machines with Ubuntu under Windows; the two nodes can ping each other. Based on your first reply, it seems the GPUs on the two nodes cannot communicate. I hope this is helpful.
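Worth noting: ping only exercises ICMP, while Gloo's connectFullMesh needs TCP connections between every pair of workers, in both directions. A minimal TCP reachability probe (a sketch; the host and port are placeholders for the peer node and whatever rendezvous port it listens on):

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.
    A firewall that permits ping but blocks TCP will fail here, which
    is one common cause of Gloo's connectFullMesh error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this from each node against the other (and in both directions) narrows the problem down to firewall or routing rules rather than vLLM or Ray.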
@thies1006 Thank you very much for the reply; I still haven't been able to solve this problem.
While testing with this script, I found another issue: if the --rdzv_backend=c10d parameter is not set, running the following command produces the error below
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Your current environment
Python==3.10.14
vllm==0.5.0.post1
ray==2.24.0
Node status
Active:
1 node_37c2b26800cc853721ef351ca107c298ae77efcb5504d8e0c900ed1d
1 node_62d48658974f4114465450f53fd97c10fcfe6d40b4e896a90a383682
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
Usage:
0.0/52.0 CPU
0.0/2.0 GPU
0B/9.01GiB memory
0B/4.14GiB object_store_memory
Demands:
(no resource demands)
🐛 Describe the bug
I ran into a problem when Gloo establishes its full-mesh connection and could not find a solution.
The script is as follows:
from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# LLM construction restored from the traceback below (line 13 of vllm_test.py)
llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True,
          gpu_memory_utilization=0.4, enforce_eager=True,
          tensor_parallel_size=2, swap_space=1)

outputs = llm.generate(prompts)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The error is as follows:
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/vllm_test.py", line 13, in <module>
[rank0]: llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4,enforce_eager=True,tensor_parallel_size=2,swap_space=1)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 223, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]: self._init_executor()
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
[rank0]: self._run_workers("init_device")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]: driver_worker_output = self.driver_worker.execute_method(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]: raise e
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]: return executor(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 115, in init_device
[rank0]: init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 354, in init_worker_distributed_environment
[rank0]: init_distributed_environment(parallel_config.world_size, rank,
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 553, in init_distributed_environment
[rank0]: _WORLD = GroupCoordinator(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 120, in __init__
[rank0]: cpu_group = torch.distributed.new_group(ranks, backend="gloo")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
[rank0]: return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
[rank0]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error