[Core][2/N] Helpers for PP #5021

Closed · wants to merge 1 commit

Conversation

andoorve (Collaborator) commented May 24, 2024

Helpers for #4412

cc: @youkaichao @simon-mo

@@ -134,3 +138,29 @@ def gpu_p2p_access_check(i: int, j: int) -> bool:
cache = json.load(f)
_gpu_p2p_access_cache = cache
return _gpu_p2p_access_cache[f"{i}->{j}"]


def get_tensor_model_parallel_src_rank_and_group():
andoorve (Collaborator, Author) commented May 26, 2024

Ideally this function isn't even necessary, and we could change the defaults of broadcast_tensor_dict to broadcast over the TP group instead. I'm adding it here to avoid breaking the assumption that broadcast_tensor_dict uses the world group by default.
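For illustration, here is a minimal sketch of what such a helper could return, assuming broadcast_tensor_dict accepts a global src rank and a group argument; the explicit tp_group parameter and the use of torch.distributed.get_global_rank are my assumptions, not the PR's actual code:

    import torch.distributed as dist

    def get_tensor_model_parallel_src_rank_and_group(tp_group: dist.ProcessGroup):
        # Broadcast from the first rank of the TP group. torch.distributed
        # collectives expect a *global* src rank, so translate rank 0 within
        # the group back to its global rank.
        src_rank = dist.get_global_rank(tp_group, 0)
        return src_rank, tp_group

A caller would then unpack the pair and pass both values along, e.g. src, group = get_tensor_model_parallel_src_rank_and_group(tp_group) followed by broadcast_tensor_dict(tensor_dict, src=src, group=group).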

andoorve (Collaborator, Author):

We could also return a dict to use as kwargs for broadcast_tensor_dict, if that is less ugly:

broadcast_tensor_dict(**get_tensor_model_parallel_src_rank_and_group())
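A sketch of that kwargs-returning variant, again with an illustrative name and an explicit group parameter rather than the PR's actual code:

    import torch.distributed as dist

    def get_tensor_model_parallel_broadcast_kwargs(tp_group: dist.ProcessGroup) -> dict:
        # Same idea as the helper above, but packaged so the call site
        # stays a one-liner via ** unpacking.
        return {"src": dist.get_global_rank(tp_group, 0), "group": tp_group}

    # broadcast_tensor_dict(tensor_dict, **get_tensor_model_parallel_broadcast_kwargs(tp_group))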

youkaichao self-requested a review May 29, 2024 00:28
youkaichao (Member) commented:

Sorry for the late review. I feel the modification is somewhat intrusive. I plan to refactor the distributed-related code in a more OOP style, so that every operation only needs to specify a relative rank, following RFC #3587.

For example, broadcasting a tensor dict would become get_tp().broadcast_tensor_dict(src=0), where get_tp() returns a tensor-parallel coordinator that knows what src=0 means.
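A rough sketch of that coordinator idea, simplified from RFC #3587; the names and the use of broadcast_object_list are illustrative only, not the eventual vLLM API (a real implementation would broadcast tensors directly rather than pickling them):

    import torch.distributed as dist

    class GroupCoordinator:
        """Wraps a process group so callers can pass group-relative ranks."""

        def __init__(self, group: dist.ProcessGroup):
            self.group = group

        def broadcast_tensor_dict(self, tensor_dict=None, src: int = 0):
            # `src` is a rank *within* this group; translate it to the global
            # rank that torch.distributed collectives expect.
            global_src = dist.get_global_rank(self.group, src)
            obj = [tensor_dict]
            dist.broadcast_object_list(obj, src=global_src, group=self.group)
            return obj[0]

    _TP_COORDINATOR = None  # set up once during distributed initialization

    def get_tp() -> GroupCoordinator:
        assert _TP_COORDINATOR is not None, "distributed environment is not initialized"
        return _TP_COORDINATOR

With this shape, get_tp().broadcast_tensor_dict(src=0) always means "rank 0 of the tensor-parallel group", independent of how global ranks are laid out.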

andoorve (Collaborator, Author) commented Jun 3, 2024

What would you recommend in this case, @youkaichao? Since this is blocking PP, would it be possible to merge this now, given that it does not affect performance or functionality, and do the refactor you suggest above at a later time?

youkaichao (Member) commented:

I would recommend adding these helpers after the refactor. There are too many "helper functions" in the distributed part, which could be organized in a better way. Let's maintain code quality before things get out of control.

My ETA is 1~2 weeks. I originally planned to do the refactor, but was interrupted by the NCCL work, which has a higher priority.

andoorve (Collaborator, Author) commented Jun 3, 2024

Got it. In that case, to confirm, can we say we are shelving PP (#4412) until the refactor is done?

youkaichao (Member) commented:

> Got it. In that case, to confirm, can we say we are shelving PP (#4412) until the refactor is done?

Yes.

Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
andoorve marked this pull request as draft June 4, 2024 22:59
andoorve mentioned this pull request Jun 7, 2024
andoorve closed this Jun 14, 2024