
Make layout pinning optional for cross-core communication #3511

Merged · 2 commits merged into master from layout_pin_optional on Apr 19, 2022

Conversation

JackCaoG
Collaborator

This is to fix #3506. I verified that the test passes with either all pin_layout=True (the default) or all pin_layout=False.

PyTorch/XLA currently compiles the graph separately for each TPU core (or GPU core). The graph generated on each core can be slightly different due to:

  1. slightly different input shapes
  2. different embedding table sizes
  3. ...

The XLA compiler can tolerate small differences among cores, but this becomes a problem for communication ops. If the input shape difference ends up producing a layout difference among the tensors that the user wants to pass to a communication op, data corruption can occur.

To overcome this problem we introduced layout pinning, which guarantees that all cores participating in the communication have the same layout for the input tensor. However, in some corner cases pinning every layout does not work. For example, all_gather(pin) + reduce_scatter(pin) might fail in some cases.

This PR aims to provide a workaround when such a failure happens. PyTorch/XLA will pin all communication op layouts by default, but if there is a compilation error with the message "HloModule has a mix of layout constrained", the user can choose to unpin all layouts.
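
A minimal sketch of the workaround, assuming the pin_layout keyword this PR adds to the collective ops in torch_xla.core.xla_model (argument names and ordering here are illustrative and may differ between versions):

```python
import torch_xla.core.xla_model as xm

def collective_step(t):
    # Workaround when compilation fails with
    # "HloModule has a mix of layout constrained instructions":
    # opt out of layout pinning on the collectives involved.
    gathered = xm.all_gather(t, dim=0, pin_layout=False)
    reduced = xm.reduce_scatter(
        xm.REDUCE_SUM,                   # reduction type
        gathered,                        # input tensor
        scale=1.0,
        scatter_dim=0,
        shard_count=xm.xrt_world_size(),
        pin_layout=False,
    )
    return reduced
```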

FYI @ronghanghu @hjm-aws

@JackCaoG JackCaoG requested a review from yeounoh April 19, 2022 03:15
@JackCaoG
Collaborator Author

I will merge this PR once all tests pass to unblock the user, but I want someone to review it too. I can put up a follow-up PR to address the review comments. @yeounoh

@ronghanghu
Collaborator

ronghanghu commented Apr 19, 2022

Thanks! Just to double-check, before this PR, we currently have the following behavior:

  • all_reduce: pinned
  • all_to_all: pinned
  • all_gather: unpinned
  • reduce_scatter: unpinned

Is this right? (Asking as I'm trying to understand what I need to do if I want to re-run the tests under the current behavior.)

@JackCaoG
Collaborator Author

@ronghanghu yea, your statement is correct.
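
For reference, a hedged sketch of how the pre-PR behavior listed above could be spelled out explicitly once this PR lands (run inside an xmp.spawn worker as usual; exact signatures may vary slightly between torch_xla versions):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.randn(8, 128, device=device)
world_size = xm.xrt_world_size()

# Pre-PR behavior, made explicit via pin_layout:
reduced = xm.all_reduce(xm.REDUCE_SUM, t, pin_layout=True)          # was pinned
shuffled = xm.all_to_all(t, split_dimension=0, concat_dimension=0,
                         split_count=world_size, pin_layout=True)   # was pinned
gathered = xm.all_gather(t, dim=0, pin_layout=False)                # was unpinned
scattered = xm.reduce_scatter(xm.REDUCE_SUM, gathered, scale=1.0,
                              scatter_dim=0, shard_count=world_size,
                              pin_layout=False)                     # was unpinned
```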

@JackCaoG
Collaborator Author

I modified the default parameter so that only the all_reduce layout is pinned by default, to make test_mp_distributed_mm.py work. This is also what Blake suggested.
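
Under that revised default, only all_reduce would pin its layout automatically; the other collectives would need pin_layout=True to opt in (illustrative sketch, per the comment above):

```python
import torch
import torch_xla.core.xla_model as xm

t = torch.randn(4, 4, device=xm.xla_device())

reduced = xm.all_reduce(xm.REDUCE_SUM, t)            # all_reduce: pinned by default
gathered = xm.all_gather(t, dim=0, pin_layout=True)  # other collectives: opt in explicitly
```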

@JackCaoG JackCaoG force-pushed the layout_pin_optional branch from fcdbd24 to e4e01e4 Compare April 19, 2022 07:45
@JackCaoG JackCaoG merged commit 5ece4ca into master Apr 19, 2022
@JackCaoG JackCaoG deleted the layout_pin_optional branch April 19, 2022 07:53
ronghanghu added a commit to ronghanghu/xla that referenced this pull request Apr 20, 2022
Successfully merging this pull request may close these issues.

All-reduce together w/ reduce-scatter causes crash on nightly 20220413 build