Support `dist.all_gather` related collective ops #7860

zpcore · 2024-08-15T17:15:42Z

Add dynamo/nondynamo support for torch.distributed.all_reduce and torch.distributed.all_gather_into_tensor.

Motivation: We want to deprecate the collective ops in xla_model.py and be consist with the torch.distributed.

Issue: dist.all_reduct doesn't work with dynamo openxla backend at this time.

zpcore · 2024-08-16T19:14:33Z

Hi @JackCaoG , I commented out assert met.metric_data("ExecuteTime")[0] == 1 in the test_traceable_collectives.py since the value changed to 3 for all_gather ops due to upstream changes. Shall we enable the test first so we can run the test?

zpcore · 2024-08-16T22:57:05Z

I have no idea how this works:

xla/torch_xla/distributed/xla_backend.py

Lines 71 to 72 in 37312c1

    
           xm.all_reduce(reduce_type, tensors, groups=self._mesh, pin_layout=False) 
        
           return _ret_work(tensors)

xm.all_reduce should return a tensor instead of modifying inputs argument.

I will probably give up supporting dist.all_reduce in this PR.

Update: turns out for output_tensor=dist.all_reduce(intput_tensor...)both intput_tensor and output_tensor got updated. We have to use the input_tensor as the final results for the nondynamo path.

test/dynamo/test_traceable_collectives.py

torch_xla/distributed/xla_backend.py

torch_xla/csrc/tensor_methods.cpp

torch_xla/distributed/xla_backend.py

test/pjrt/test_collective_ops_tpu.py

zpcore force-pushed the piz/cop branch from 3529d20 to ca0fd5d Compare August 16, 2024 19:10

zpcore requested review from will-cromar and JackCaoG August 16, 2024 19:11

zpcore force-pushed the piz/cop branch from ca0fd5d to 6e23e8a Compare August 16, 2024 19:55

zpcore added the tpuci label Aug 16, 2024

zpcore marked this pull request as ready for review August 16, 2024 20:02

zpcore requested a review from lsy323 August 16, 2024 20:02

zpcore changed the title ~~prototype of all_gather related distributed ops~~ Support dist.all_gather related distributed ops Aug 16, 2024

zpcore changed the title ~~Support dist.all_gather related distributed ops~~ Support dist.all_gather related collective ops Aug 16, 2024

will-cromar reviewed Aug 19, 2024

View reviewed changes

test/dynamo/test_traceable_collectives.py Outdated Show resolved Hide resolved

torch_xla/distributed/xla_backend.py Show resolved Hide resolved

torch_xla/csrc/tensor_methods.cpp Outdated Show resolved Hide resolved

lsy323 removed their request for review August 20, 2024 16:26

zpcore force-pushed the piz/cop branch from 7ec8f12 to b37add2 Compare August 20, 2024 22:50

zpcore requested a review from will-cromar August 20, 2024 22:51

will-cromar reviewed Aug 21, 2024

View reviewed changes

zpcore requested a review from will-cromar August 21, 2024 23:12

will-cromar reviewed Aug 22, 2024

View reviewed changes

test/pjrt/test_collective_ops_tpu.py Outdated Show resolved Hide resolved

test/pjrt/test_collective_ops_tpu.py Outdated Show resolved Hide resolved

zpcore requested a review from will-cromar August 22, 2024 21:46

zpcore force-pushed the piz/cop branch from 09bba56 to 0dd525c Compare August 23, 2024 18:12

will-cromar approved these changes Aug 26, 2024

View reviewed changes

zpcore added 8 commits August 27, 2024 15:47

add test case

4b988c3

nit

ca757f9

nit

15b8059

clean up test and add dist.all_gather

6e97295

remove mock object

65989e3

nit

583d984

update to openxla backend

ecac71b

use toy backend

dfe8a54

zpcore force-pushed the piz/cop branch from 0dd525c to dfe8a54 Compare August 27, 2024 15:48

zpcore merged commit f9a706e into master Aug 27, 2024
23 checks passed

zpcore deleted the piz/cop branch August 27, 2024 16:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support `dist.all_gather` related collective ops #7860

Support `dist.all_gather` related collective ops #7860

zpcore commented Aug 15, 2024 •

edited

Loading

zpcore commented Aug 16, 2024

zpcore commented Aug 16, 2024 •

edited

Loading

Support dist.all_gather related collective ops #7860

Support dist.all_gather related collective ops #7860

Conversation

zpcore commented Aug 15, 2024 • edited Loading

zpcore commented Aug 16, 2024

zpcore commented Aug 16, 2024 • edited Loading

Support `dist.all_gather` related collective ops #7860

Support `dist.all_gather` related collective ops #7860

zpcore commented Aug 15, 2024 •

edited

Loading

zpcore commented Aug 16, 2024 •

edited

Loading