Support multihost SPMD execution #4573
Conversation
Force-pushed from b86aaec to 9e5fb70.
@@ -88,7 +89,7 @@ def mark_sharding(t: Union[torch.Tensor, XLAShardedTensor], mesh: Mesh,
 Examples
 —------------------------------
 mesh_shape = (4, 2)
-num_devices = len(xm.get_xla_supported_devices())
+num_devices = pjrt.global_device_count()
Great :)
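For context, a fuller version of the docstring example this hunk touches could read as follows. This is a rough sketch rather than the PR's exact text: the module paths (torch_xla.experimental.xla_sharding, torch_xla.experimental.pjrt) reflect the experimental SPMD API of this period, and the tensor shape and partition spec are illustrative.

import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs
from torch_xla.experimental import pjrt
from torch_xla.experimental.xla_sharding import Mesh

# Build a logical mesh over all global devices (across hosts), using the
# PJRT global device count from the updated docstring line above.
mesh_shape = (4, 2)
num_devices = pjrt.global_device_count()
device_ids = np.array(range(num_devices))
mesh = Mesh(device_ids, mesh_shape, ('x', 'y'))

# Annotate a tensor with a sharding over the two mesh axes.
t = torch.randn(8, 32).to(xm.xla_device())
xs.mark_sharding(t, mesh, (0, 1))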
const std::vector<std::string>& devices) {
  std::unordered_map<int, int> device_index;
  for (int i = 0; i < devices.size(); ++i) {
    int global_ordinal = ParseDeviceString(devices[i]).ordinal();
The first global device gets the local index 0, so the order of the input devices list is important. Is this a correct understanding? Can we add some comments on this?
The first device in the list gets local index 0, but the order of the global ordinals within devices
doesn't matter. I'll add some more documentation around this.
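To illustrate the point, here is a minimal Python sketch of the mapping being built in the C++ snippet above (the parsing of the global ordinal is paraphrased; ParseDeviceString does the real work in the C++ code):

# Hypothetical sketch: map each device's global ordinal to its position
# (local index) in the input list. The first device in the list gets local
# index 0; the values of the global ordinals themselves do not matter.
def build_device_index(devices):
    device_index = {}
    for i, device in enumerate(devices):
        global_ordinal = int(device.split(':')[-1])  # e.g. "TPU:5" -> 5
        device_index[global_ordinal] = i
    return device_index

# On the second host of a 2-host, 8-device setup, the local devices carry
# global ordinals 4..7 but still map to local indices 0..3.
assert build_device_index(['TPU:4', 'TPU:5', 'TPU:6', 'TPU:7']) == {4: 0, 5: 1, 6: 2, 7: 3}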
LGTM, a minor comment.
Force-pushed from 9e5fb70 to fe74d84.
Force-pushed from ba5ed74 to e999a95.
@@ -931,30 +931,24 @@ std::vector<torch::lazy::BackendDataPtr> CreateTensorsData(
 std::vector<xla::ComputationClient::DataPtr> new_handles;  // out
 if (shardings[i] != nullptr) {
   xla::OpSharding sharding = shardings[i]->sharding;
   // TODO(yeounoh) PJRT runs a process per host for SPMD and without cross
   // host communications. This means that we may need to manually shard
   // across global devices for multi-host training.
   std::vector<std::string> local_devices =
Does GetLocalDevices() return local devices with global ordinals? If so, let's leave a comment.
LGTM, I have 2 nit comments.
Force-pushed from e999a95 to d1fc7b1.
Force-pushed from d1fc7b1 to bc538a8.
The main change needed to support multihost execution is to restrict the shards generated in ShardTensor to those that belong to addressable devices.
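Conceptually, the restriction amounts to filtering the globally computed shards down to those whose target device is addressable from the current process. The Python sketch below is only an illustration of that idea (the actual change lives in the C++ ShardTensor path; the helper name and arguments here are hypothetical):

# Hypothetical sketch: keep only the shards destined for devices this host
# can address. global_shards[i] is assumed to correspond to global_devices[i].
def filter_addressable_shards(global_shards, global_devices, local_devices):
    addressable = set(local_devices)
    return [(shard, device)
            for shard, device in zip(global_shards, global_devices)
            if device in addressable]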