[Core][2/N] Helpers for PP #5021

Closed · wants to merge 1 commit

Conversation

andoorve (Collaborator) commented May 24, 2024

Helpers for #4412

cc: @youkaichao @simon-mo

@@ -134,3 +138,29 @@ def gpu_p2p_access_check(i: int, j: int) -> bool:
cache = json.load(f)
_gpu_p2p_access_cache = cache
return _gpu_p2p_access_cache[f"{i}->{j}"]


def get_tensor_model_parallel_src_rank_and_group():
andoorve (Collaborator, Author) commented May 26, 2024

Ideally this function isn't even necessary, and we could change the defaults of broadcast_tensor_dict to broadcast over the TP group instead. I'm adding it here to avoid breaking the assumption that broadcast_tensor_dict uses the world group by default.
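For illustration, here is a minimal sketch of what such a helper could return, assuming broadcast_tensor_dict accepts a global src rank and a group argument; the explicit tp_group parameter and the use of torch.distributed.get_global_rank are my assumptions, not the PR's actual code:

    import torch.distributed as dist

    def get_tensor_model_parallel_src_rank_and_group(tp_group: dist.ProcessGroup):
        # Broadcast from the first rank of the TP group. torch.distributed
        # collectives expect a *global* src rank, so translate rank 0 within
        # the group back to its global rank.
        src_rank = dist.get_global_rank(tp_group, 0)
        return src_rank, tp_group

A caller would then unpack the pair and pass both values along, e.g. src, group = get_tensor_model_parallel_src_rank_and_group(tp_group) followed by broadcast_tensor_dict(tensor_dict, src=src, group=group).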

andoorve (Collaborator, Author):

We could also return a dict to use as kwargs for broadcast_tensor_dict, if that is less ugly:

broadcast_tensor_dict(**get_tensor_model_parallel_src_rank_and_group())
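A sketch of that kwargs-returning variant, again with an illustrative name and an explicit group parameter rather than the PR's actual code:

    import torch.distributed as dist

    def get_tensor_model_parallel_broadcast_kwargs(tp_group: dist.ProcessGroup) -> dict:
        # Same idea as the helper above, but packaged so the call site
        # stays a one-liner via ** unpacking.
        return {"src": dist.get_global_rank(tp_group, 0), "group": tp_group}

    # broadcast_tensor_dict(tensor_dict, **get_tensor_model_parallel_broadcast_kwargs(tp_group))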

youkaichao self-requested a review May 29, 2024 00:28
youkaichao (Member) commented:

Sorry for the late review. I feel the modification is somewhat intrusive. I plan to refactor the distributed-related code in a more OOP style, so that every operation only needs to specify a relative rank, following RFC #3587.

For example, broadcasting a tensor dict would become get_tp().broadcast_tensor_dict(src=0), where get_tp() returns a tensor-parallel coordinator that knows what src=0 means.
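A rough sketch of that coordinator idea, simplified from RFC #3587; the names and the use of broadcast_object_list are illustrative only, not the eventual vLLM API (a real implementation would broadcast tensors directly rather than pickling them):

    import torch.distributed as dist

    class GroupCoordinator:
        """Wraps a process group so callers can pass group-relative ranks."""

        def __init__(self, group: dist.ProcessGroup):
            self.group = group

        def broadcast_tensor_dict(self, tensor_dict=None, src: int = 0):
            # `src` is a rank *within* this group; translate it to the global
            # rank that torch.distributed collectives expect.
            global_src = dist.get_global_rank(self.group, src)
            obj = [tensor_dict]
            dist.broadcast_object_list(obj, src=global_src, group=self.group)
            return obj[0]

    _TP_COORDINATOR = None  # set up once during distributed initialization

    def get_tp() -> GroupCoordinator:
        assert _TP_COORDINATOR is not None, "distributed environment is not initialized"
        return _TP_COORDINATOR

With this shape, get_tp().broadcast_tensor_dict(src=0) always means "rank 0 of the tensor-parallel group", independent of how global ranks are laid out.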

andoorve (Collaborator, Author) commented Jun 3, 2024

What would you recommend in this case, @youkaichao? Since this is blocking PP, would it be possible to merge this now, given that it does not affect performance or functionality, and do the refactor you suggest above at a later time?

youkaichao (Member) commented:

I would recommend adding these helpers after the refactor. There are too many "helper functions" in the distributed part, which could be organized in a better way. Let's maintain code quality before things get out of control.

My ETA is 1~2 weeks. I originally planned to do the refactor, but was interrupted by the NCCL work, which has a higher priority.

andoorve (Collaborator, Author) commented Jun 3, 2024

Got it. In that case, to confirm, can we say we are shelving PP (#4412) until the refactor is done?

youkaichao (Member) commented:

> Got it. In that case, to confirm, can we say we are shelving PP (#4412) until the refactor is done?

Yes.

Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
andoorve marked this pull request as draft June 4, 2024 22:59
andoorve mentioned this pull request Jun 7, 2024
andoorve closed this Jun 14, 2024