
Commit 85466db

youkaichao authored and prashantgupta24 committed
[doc][distributed] add both gloo and nccl tests (vllm-project#5834)
1 parent 85ca6e3 commit 85466db

File tree

1 file changed: +10 -3 lines changed

docs/source/getting_started/debugging.rst

+10 -3
@@ -28,8 +28,8 @@ If it crashes, and the error trace shows somewhere around ``self.graph.replay()``
 
 Here are some common issues that can cause hangs:
 
-- **Incorrect network setup**: The vLLM instance cannot get the correct IP address. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``.
-- **Incorrect hardware/driver**: GPU communication cannot be established. You can run the following sanity check script to see if the GPU communication is working correctly.
+- **Incorrect network setup**: The vLLM instance cannot get the correct IP address if you have complicated network config. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``. You might also need to set ``export NCCL_SOCKET_IFNAME=your_network_interface`` and ``export GLOO_SOCKET_IFNAME=your_network_interface`` to specify the network interface for the IP address.
+- **Incorrect hardware/driver**: GPU/CPU communication cannot be established. You can run the following sanity check script to see if the GPU/CPU communication is working correctly.
 
 .. code-block:: python
 
@@ -41,7 +41,14 @@ Here are some common issues that can cause hangs:
     dist.all_reduce(data, op=dist.ReduceOp.SUM)
     torch.cuda.synchronize()
     value = data.mean().item()
-    assert value == dist.get_world_size()
+    world_size = dist.get_world_size()
+    assert value == world_size, f"Expected {world_size}, got {value}"
+
+    gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
+    cpu_data = torch.FloatTensor([1,] * 128)
+    dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
+    value = cpu_data.mean().item()
+    assert value == world_size, f"Expected {world_size}, got {value}"
 
 .. tip::
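As an aside (not part of the committed diff), the overrides described in the first hunk can also be applied from Python before vLLM or ``torch.distributed`` initializes; the address and interface name below are placeholder values for illustration, not recommendations.

.. code-block:: python

    # Sketch only: set the overrides programmatically before any distributed
    # initialization happens. "10.0.0.42" and "eth0" are placeholder values.
    import os

    os.environ["VLLM_HOST_IP"] = "10.0.0.42"    # address other ranks should reach
    os.environ["NCCL_SOCKET_IFNAME"] = "eth0"   # network interface for NCCL
    os.environ["GLOO_SOCKET_IFNAME"] = "eth0"   # network interface for Gloo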

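For readers who want to run the extended sanity check end to end, a self-contained version might look like the sketch below. Only the lines that appear in the hunks above come from this commit; the setup (imports, ``init_process_group``, device selection) is an assumption about the surrounding script, reconstructed for illustration.

.. code-block:: python

    # Sketch of the full sanity-check script after this change. The setup lines
    # are assumed; only the all-reduce checks mirror the diff above.
    import torch
    import torch.distributed as dist

    # NCCL check: all-reduce a GPU tensor across all ranks.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    value = data.mean().item()
    world_size = dist.get_world_size()
    assert value == world_size, f"Expected {world_size}, got {value}"

    # Gloo check: all-reduce a CPU tensor over a separate gloo process group,
    # exercising the CPU communication path as well.
    gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
    cpu_data = torch.FloatTensor([1,] * 128)
    dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
    value = cpu_data.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

A multi-process launcher is needed to run it, for example ``torchrun --nproc_per_node=<number of GPUs> test.py`` on a single node.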