[Core][2/N] Helpers for PP #5021
Conversation
@@ -134,3 +138,29 @@ def gpu_p2p_access_check(i: int, j: int) -> bool:
         cache = json.load(f)
     _gpu_p2p_access_cache = cache
     return _gpu_p2p_access_cache[f"{i}->{j}"]
+
+
+def get_tensor_model_parallel_src_rank_and_group():
Ideally this function isn't necessary at all: we could change the default for broadcast_tensor_dict to broadcast over the TP group. I'm adding this helper instead so as not to break the existing assumption that broadcast_tensor_dict uses the World group by default.
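As a rough sketch of the rank arithmetic such a helper would encapsulate (this is not the PR's actual code; the function name, the `tp_size` parameter, and the assumption that TP groups are formed from consecutive global ranks are all illustrative):

```python
def get_tensor_model_parallel_src_rank(rank: int, tp_size: int) -> int:
    """Global rank of the first member of this rank's TP group.

    Assumes TP groups are built from consecutive global ranks, e.g.
    with tp_size=4, ranks 4..7 form one group whose source rank is 4.
    (Illustrative only; no process groups are touched here.)
    """
    return (rank // tp_size) * tp_size
```

The returned global rank is what a caller would pass as `src` when broadcasting within the TP group instead of the World group.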
We can also return a dict to use as kwargs for broadcast_tensor_dict, if this is less ugly: broadcast_tensor_dict(**get_tensor_model_parallel_src_rank_and_group())
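A minimal sketch of that kwargs-dict variant (the parameter list and the group placeholder are assumptions for illustration, not the PR's signature):

```python
from typing import Any, Dict


def get_tensor_model_parallel_src_rank_and_group(
        rank: int, tp_size: int, tp_group: Any) -> Dict[str, Any]:
    """Return kwargs suitable for broadcast_tensor_dict(**...).

    Illustrative only: computes the TP-group source rank assuming
    consecutive global ranks, and passes the group handle through.
    """
    return {"src": (rank // tp_size) * tp_size, "group": tp_group}


# usage sketch: broadcast_tensor_dict(tensor_dict, **kwargs)
kwargs = get_tensor_model_parallel_src_rank_and_group(6, 4, "tp-group")
```

This keeps the call site to a single splat rather than threading two return values through.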
Sorry for the late review. I feel the modification is somewhat intrusive. I plan to refactor the distributed-related code in a more OOP style, where every operation only needs to specify a rank relative to its group, following RFC #3587 (e.g., that is how the broadcast tensor dict call would look under that interface).
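To make the relative-rank idea concrete, here is a toy sketch of the OOP style RFC #3587 describes (the class name, member list, and methods are assumptions; no real communication happens here):

```python
class GroupCoordinator:
    """Toy group abstraction: operations take a rank *relative* to the
    group and translate it to a global rank internally."""

    def __init__(self, ranks):
        self.ranks = ranks  # global ranks belonging to this group

    def global_src(self, src: int = 0) -> int:
        # `src` is an index within the group, not a global rank
        return self.ranks[src]

    def broadcast_tensor_dict(self, tensor_dict, src: int = 0):
        # a real implementation would broadcast from self.global_src(src);
        # here we just return the dict unchanged as a stand-in
        return dict(tensor_dict)


tp_group = GroupCoordinator(ranks=[4, 5, 6, 7])
```

Callers would then write `tp_group.broadcast_tensor_dict(d, src=0)` without ever computing global source ranks themselves.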
What would you recommend in this case @youkaichao? Since this is blocking PP, would it be possible to merge this now, given that it does not affect performance or functionality, and do the refactor you suggest above at a later time?
I would recommend adding these helpers after the refactor. There are too many "helper functions" in the distributed part, which could be organized in a better way. Let's keep up code quality before things get out of control. My ETA is 1~2 weeks; I originally planned to do the refactor, but was interrupted by the NCCL work, which has a higher priority.
Got it. In that case, to confirm: can we say we are shelving PP (#4412) until the refactor is done?
Yes.
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Helpers for #4412
cc: @youkaichao @simon-mo