[SPMD] Support manual all-reduce #7576

alanwaketan · 2024-06-26T01:35:12Z

Summary:
This adds manual all-reduce support to SPMD. It currently supports only a single input tensor; array (multi-tensor) support can be handled in the Python layer instead.
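As a rough illustration of what "array support in the Python layer" could look like, the sketch below applies a single-tensor reduction to each element of a list. The names `all_reduce_many`, `reduce_one`, and `fake_sum_all_reduce` are hypothetical and are not part of torch_xla; the fake reduction only mimics a sum across identical replicas so the example runs without any accelerator.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def all_reduce_many(inputs: Sequence[T], reduce_one: Callable[[T], T]) -> List[T]:
    """Apply a single-input all-reduce to every element of `inputs`, in order.

    `reduce_one` is a hypothetical stand-in for the single-tensor op;
    it is an assumption for illustration, not a torch_xla API.
    """
    return [reduce_one(t) for t in inputs]


def fake_sum_all_reduce(values: List[float], world_size: int = 4) -> List[float]:
    # Illustration only: pretend `world_size` devices hold identical data,
    # so a sum all-reduce scales every element by `world_size`.
    return [v * world_size for v in values]


results = all_reduce_many([[1.0, 2.0], [3.0]], fake_sum_all_reduce)
print(results)
```

With this shape, the single-tensor op stays unchanged and the list handling lives entirely in Python.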

Test Plan:
python ./test/spmd/test_xla_sharding.py -v -k test_spmd_all_reduce

JackCaoG

Approving to unblock, but I think we should fix the tensor method name.

JackCaoG · 2024-06-26T02:05:57Z

torch_xla/csrc/tensor_methods.cpp

@@ -392,6 +392,13 @@ void all_reduce(const std::vector<XLATensorPtr>& inputs,
  }
 }

+XLATensorPtr all_reduce(const XLATensorPtr& input, AllReduceType reduce_type,


Can you call it all_reduce_no_token? The only difference in the signature is that it does not take pin_layout, but the main difference in the op is that it does not set the token. It is better to reflect that in the name.

Sure. I can follow up with that.

JackCaoG · 2024-06-26T02:08:53Z

For array support, do you plan to call all_reduce multiple times? In our C++ implementation, I think we group tensors by dtype and call all_reduce once per dtype.
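The grouping idea mentioned here can be sketched in plain Python as follows. This is not the actual C++ implementation: `FakeTensor`, `all_reduce_grouped_by_dtype`, and `fake_group_sum` are hypothetical names, and a tensor is modeled as a `(dtype, data)` tuple so the example runs on its own.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a tensor: (dtype name, flat data).
FakeTensor = Tuple[str, List[float]]
GroupReduce = Callable[[List[FakeTensor]], List[FakeTensor]]


def all_reduce_grouped_by_dtype(
    tensors: Sequence[FakeTensor], reduce_group: GroupReduce
) -> Tuple[List[Optional[FakeTensor]], int]:
    """Bucket tensors by dtype and issue one grouped reduce per dtype.

    Returns the reduced tensors in their original order, plus the number
    of grouped reduce calls that were issued.
    """
    buckets: Dict[str, List[int]] = defaultdict(list)
    for index, (dtype, _) in enumerate(tensors):
        buckets[dtype].append(index)

    results: List[Optional[FakeTensor]] = [None] * len(tensors)
    num_calls = 0
    for indices in buckets.values():
        reduced = reduce_group([tensors[i] for i in indices])
        num_calls += 1
        # Scatter the group results back to their original positions.
        for i, value in zip(indices, reduced):
            results[i] = value
    return results, num_calls


def fake_group_sum(group: List[FakeTensor], world_size: int = 2) -> List[FakeTensor]:
    # Illustration only: a sum across `world_size` identical replicas.
    return [(dtype, [v * world_size for v in data]) for dtype, data in group]


mixed = [("f32", [1.0]), ("bf16", [2.0]), ("f32", [3.0])]
reduced, num_calls = all_reduce_grouped_by_dtype(mixed, fake_group_sum)
print(reduced, num_calls)
```

Three input tensors with two distinct dtypes result in two grouped calls instead of three per-tensor calls.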

alanwaketan · 2024-06-26T04:44:56Z

> For array support, do you plan to call all_reduce multiple times? In our C++ implementation, I think we group tensors by dtype and call all_reduce once per dtype.

I don't think that's necessary. I'm thinking the compiler should be smart enough to fuse all-reduces if the fusion is necessary.

alanwaketan · 2024-06-26T04:46:37Z

Thanks Jack for approving.

alanwaketan added 2 commits June 26, 2024 01:32

initiial commit

d3795b3

Fix linters

7f7739a

alanwaketan added the backport_2.4 label Jun 26, 2024

alanwaketan requested review from jonb377 and JackCaoG June 26, 2024 01:35

alanwaketan self-assigned this Jun 26, 2024

JackCaoG approved these changes Jun 26, 2024

alanwaketan merged commit 0df5c29 into master Jun 26, 2024
23 checks passed

alanwaketan deleted the alanwaketan/spmd_all_reduce branch June 26, 2024 18:27
