[Core][Distributed] refactor pynccl to hold multiple communicators #4591

youkaichao · 2024-05-03T20:28:11Z

Currently, pynccl is bound to a module-level instance, and we use it only for the tensor parallel group.

After this refactor, we can create as many pynccl communicator instances as we want, e.g. a new pynccl communicator for pipeline parallel group.

This is part of the ongoing effort to support pipeline parallelism (#4412).
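The contrast between one module-level instance and per-group communicator objects can be sketched in plain Python. This is only an illustration under stated assumptions: `ToyCommunicator` and its fields are hypothetical stand-ins and do not reflect the actual pynccl classes or signatures in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyCommunicator:
    """Hypothetical stand-in for a communicator bound to one group."""

    name: str
    ranks: List[int]
    enabled: bool = False  # each instance carries its own state

    @property
    def world_size(self) -> int:
        return len(self.ranks)


# With instances instead of module-level state, every parallel group
# can own a separate communicator.
tp_comm = ToyCommunicator(name="tensor_parallel", ranks=[0, 1])
pp_comm = ToyCommunicator(name="pipeline_parallel", ranks=[0, 2])

tp_comm.enabled = True  # does not touch pp_comm
print(tp_comm.enabled, pp_comm.enabled)
```

Because the state lives on the object, enabling one communicator leaves the other untouched, which a single module-level instance cannot offer.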

WoosukKwon

@youkaichao Thanks a lot for the PR! The refactoring makes sense.

Left some comments on code style and possible errors in the PR. Please check my review.

tests/distributed/test_pynccl.py

vllm/distributed/communication_op.py

vllm/distributed/device_communicators/pynccl.py

vllm/distributed/parallel_state.py

WoosukKwon · 2024-05-09T06:02:02Z

vllm/distributed/parallel_state.py

+        # A small all_reduce for warmup.
+        data = torch.zeros(1)
+        if torch.cuda.is_available():
+            data = data.to(device=f"cuda:{local_rank}")
+        torch.distributed.all_reduce(data)


I feel warmup should not be a part of this method?

Which method do you think is better?

WoosukKwon · 2024-05-09T06:02:29Z

vllm/distributed/parallel_state.py

        group = torch.distributed.new_group(ranks, backend=backend)
        cpu_group = torch.distributed.new_group(ranks, backend="gloo")
        if rank in ranks:
            _TP_DEVICE_GROUP = group
            _TP_CPU_GROUP = cpu_group

+    from vllm.distributed.device_communicators.pynccl import NCCLCommunicator


Again, why do we need this lazy import?

The lazy import is required here to avoid a circular import: vllm.distributed.device_communicators.pynccl tries to import vllm/distributed/parallel_state.py.

If that's the case, I think it means it's a bad design tbh. We should use lazy imports only to avoid unnecessary imports, not to work around circular imports. Otherwise, the code will become too complicated.

Yeah, we can have a better design. The reason we have this circular import is that we tried very hard to figure out the default argument for the group (which requires the import from vllm/distributed/parallel_state.py). We can remove this, but it might break some old code. (I can do it if you think it is a good idea.)

youkaichao · 2024-05-09T07:15:34Z

@WoosukKwon one change after our discussion:

c1b1cdb changes `with pynccl_comm.enable()` to `with pynccl_comm.change_state(enable=True)`. I think this makes more sense.
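A context manager of this shape can be sketched with `contextlib.contextmanager`. The class below is a hypothetical toy, not the code from c1b1cdb: the `disabled` attribute and the exact `change_state` signature are assumptions made for illustration.

```python
from contextlib import contextmanager
from typing import Iterator


class ToyCommunicator:
    """Hypothetical communicator with a temporary on/off switch."""

    def __init__(self) -> None:
        self.disabled = True  # assumed default: off until requested

    @contextmanager
    def change_state(self, enable: bool) -> Iterator[None]:
        # Remember the previous state so it can be restored on exit,
        # even if the body of the `with` block raises.
        old_disabled = self.disabled
        self.disabled = not enable
        try:
            yield
        finally:
            self.disabled = old_disabled


comm = ToyCommunicator()
with comm.change_state(enable=True):
    inside = comm.disabled  # state seen inside the block
after = comm.disabled  # state restored after the block
print(inside, after)
```

Taking an explicit `enable` argument lets the same method express both temporary enabling and temporary disabling, which a method named `enable()` cannot.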

WoosukKwon

LGTM. Thanks for addressing my review! Looking forward to the planned refactoring!

[Core][Distributed] refactor pynccl to hold multiple communicators (vllm-project#4591)

youkaichao added 30 commits May 3, 2024 10:57

add cache for loading the same library multiple times

9c6130a

refactor code

1493243

fix import

cadcd02

remove pynccl_utils.init_process_group

7918798

remove pynccl_utils.is_initialized

5924038

remove pynccl_utils.destroy_process_group

813b047

remove pynccl_utils.get_world_size

b244e6c

remove pynccl_utils.get_nccl_backend

7e15c98

remove is_pynccl_enabled_for_all_reduce

e610f64

remove _ENABLE_PYNCCL_FOR_ALL_REDUCE

8480995

remove set_pynccl_stream

5ed6f07

remove pynccl utils

8134287

fix state

e65e9ef

fix test

c8b6fc0

fix import

c7a2f0c

move warmup into pynccl

75a8d11

add device

59c064e

fix device for allreduce warmup

16aeef1

improve ways of discovering default local rank

4710fc3

make sure warmup happens in stream

c8542ec

add disable

b2d2661

do not init when world size is 1

67d1d9a

fix initial state of pynccl allreduce

c86199c

add comments

0030a31

add context manager

49f6d91

refactor logic of available

38b148b

non-intrusive code

d241480

clean up pynccl enable or disable

d7209f1

fix isort

7b55026

fix stream attribute

ee734b1

WoosukKwon requested changes May 9, 2024


WoosukKwon removed their assignment May 9, 2024

WoosukKwon added the action-required label May 9, 2024

youkaichao added 11 commits May 8, 2024 23:23

fix import

0516956

rename to PyNcclCommunicator and pynccl_comm

9f63bf8

rename use_pynccl_allreduce

e9aa766

fix lint

0f64301

fix lint

a64962e

fix lint

d2f83ba

fix dependency on custom_all_reduce

12f309b

fix lint

68e448c

use _PP_DEVICE_GROUP

ad6f840

use _PP_GLOBAL_RANKS

e2153b2

fix lint

80aca94

youkaichao removed the action-required label May 9, 2024

youkaichao requested a review from WoosukKwon May 9, 2024 07:04

use change_state rather than enable

c1b1cdb

youkaichao removed the request for review from zhuohan123 May 9, 2024 08:04

youkaichao and others added 2 commits May 9, 2024 09:18

Merge branch 'main' into bind_pynccl_to_group

c4e3b0f

add get_tp_pynccl_communicator

70a7e26

WoosukKwon approved these changes May 9, 2024


youkaichao merged commit 208b71b into vllm-project:main May 10, 2024
55 checks passed

youkaichao deleted the bind_pynccl_to_group branch May 10, 2024 02:48

robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 19, 2024

[Core][Distributed] refactor pynccl (vllm-project#4591)

ca3311a

[Core][Distributed] refactor pynccl to hold multiple communicators (vllm-project#4591)

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024

[Core][Distributed] refactor pynccl (vllm-project#4591)

ce0f149

[Core][Distributed] refactor pynccl to hold multiple communicators (vllm-project#4591)

kerthcet mentioned this pull request Jun 7, 2024

[Bug]: vllm 0.4.1 crashing after checking P2P status on single GPU #4587

Open

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024

[Core][Distributed] refactor pynccl (vllm-project#4591)

679a191

[Core][Distributed] refactor pynccl to hold multiple communicators (vllm-project#4591)
