
[Distributed] refactor pynccl to support multiple TP groups #4460

Closed
wants to merge 23 commits

Conversation


@youkaichao youkaichao commented Apr 29, 2024

TL;DR:

  1. The diff is large because code was moved from pynccl.py to pynccl_wrapper.py.
  2. Before this PR, vLLM supported only one NCCL communicator per process and only one tensor parallel group. After this PR, a process can have multiple NCCL communicators, e.g. one for tp and one for pp. Multiple tensor parallel groups are also supported.
  3. Before this PR, whether we use pynccl for allreduce was toggled back and forth, but it actually depends only on whether we use cudagraph. So I changed it to a single function, change_model_parallel_pynccl_allreduce.

There are three groups in vLLM:

  1. world group, created by init_distributed_environment
  2. tp group, created by initialize_model_parallel
  3. pp group, created by initialize_model_parallel (should be refactored in the future; at least the name should be something like init_tp_and_pp)

A device communicator's life cycle should align with that of its corresponding group. Currently, however, our device communicator is bound to the world group, which causes many initialization and destruction problems. It kind of works because we only have one tp group, but it will bring trouble for future pp.

This PR fixes the problem, so that each device communicator is aligned with its corresponding group.
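A minimal sketch of that alignment, with a pure-Python stand-in for the real pynccl communicator (class and method names here are illustrative, not vLLM's actual API):

```python
class FakeNCCLCommunicator:
    """Stand-in for a pynccl communicator; the real one wraps an ncclComm_t."""

    def __init__(self, ranks):
        self.ranks = ranks
        self.alive = True

    def destroy(self):
        self.alive = False


class GroupCoordinator:
    """Each process group owns its own device communicator, so the
    communicator is created and destroyed together with the group,
    rather than being bound to the world group."""

    def __init__(self, ranks):
        self.ranks = ranks
        self.comm = FakeNCCLCommunicator(ranks)  # one communicator per group

    def destroy(self):
        self.comm.destroy()


# Multiple groups mean multiple communicators per process, e.g. tp and pp.
tp_group = GroupCoordinator(ranks=[0, 1])
pp_group = GroupCoordinator(ranks=[0, 2])

pp_group.destroy()  # tearing down pp leaves the tp communicator untouched
```

The design point is ownership: when a group is destroyed, its communicator goes with it, and no other group is affected.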

@youkaichao youkaichao marked this pull request as draft April 29, 2024 21:17
@youkaichao youkaichao changed the title [WIP][Core][Distributed] refactor pynccl communicator [Distributed] refactor pynccl to support multiple TP groups Apr 30, 2024
@youkaichao youkaichao marked this pull request as ready for review April 30, 2024 03:03
@youkaichao

I feel this PR is quite large. I will break it into smaller pieces.

@youkaichao youkaichao mentioned this pull request May 1, 2024
@youkaichao

Closed, as this is split into #4512, #4566, and #4591.

@youkaichao youkaichao closed this May 4, 2024
@youkaichao youkaichao deleted the pynccl_refactor branch May 4, 2024 18:13