Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dist.all_gather related collective ops #7860

Merged
merged 8 commits into from
Aug 27, 2024
Merged

Support dist.all_gather related collective ops #7860

merged 8 commits into from
Aug 27, 2024

Conversation

zpcore
Copy link
Collaborator

@zpcore zpcore commented Aug 15, 2024

Add dynamo/nondynamo support for torch.distributed.all_reduce and torch.distributed.all_gather_into_tensor.

Motivation: We want to deprecate the collective ops in xla_model.py and be consist with the torch.distributed.

Issue: dist.all_reduct doesn't work with dynamo openxla backend at this time.

@zpcore
Copy link
Collaborator Author

zpcore commented Aug 16, 2024

Hi @JackCaoG , I commented out assert met.metric_data("ExecuteTime")[0] == 1 in the test_traceable_collectives.py since the value changed to 3 for all_gather ops due to upstream changes. Shall we enable the test first so we can run the test?

@zpcore zpcore added the tpuci label Aug 16, 2024
@zpcore zpcore marked this pull request as ready for review August 16, 2024 20:02
@zpcore zpcore requested a review from lsy323 August 16, 2024 20:02
@zpcore zpcore changed the title prototype of all_gather related distributed ops Support dist.all_gather related distributed ops Aug 16, 2024
@zpcore zpcore changed the title Support dist.all_gather related distributed ops Support dist.all_gather related collective ops Aug 16, 2024
@zpcore
Copy link
Collaborator Author

zpcore commented Aug 16, 2024

I have no idea how this works:

xm.all_reduce(reduce_type, tensors, groups=self._mesh, pin_layout=False)
return _ret_work(tensors)

xm.all_reduce should return a tensor instead of modifying inputs argument.

I will probably give up supporting dist.all_reduce in this PR.

Update: turns out for output_tensor=dist.all_reduce(intput_tensor...)both intput_tensor and output_tensor got updated. We have to use the input_tensor as the final results for the nondynamo path.

@lsy323 lsy323 removed their request for review August 20, 2024 16:26
@zpcore zpcore requested a review from will-cromar August 20, 2024 22:51
@zpcore zpcore requested a review from will-cromar August 21, 2024 23:12
@zpcore zpcore merged commit f9a706e into master Aug 27, 2024
23 checks passed
@zpcore zpcore deleted the piz/cop branch August 27, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants