
[SPMD] Hybrid Device mesh creation #5147

Merged: 14 commits merged into master on Jun 19, 2023

Conversation

khatwanimohit (Collaborator)

No description provided.

@khatwanimohit requested review from yeounoh and jonb377 on June 8, 2023 at 20:33
@khatwanimohit (Collaborator, Author)

cc @alanwaketan

@jonb377 (Collaborator) left a comment

Looking great Mohit! Could we also add some basic unit tests in https://github.com/pytorch/xla/blob/master/test/spmd/test_xla_sharding.py?

@alanwaketan (Collaborator) left a comment

It would be easier for me to read your code if you briefly described what you did in the PR, especially the key complex functionalities such as _create_hybrid_device_mesh.

Or you can leave a comment on the code.

      out[coords[0], coords[1], coords[2]] = d
    return out

  def _create_device_mesh_for_nd_torus(
Collaborator:

Can you explain how this function optimizes performance according to the TPU physical topology? What's the algorithm? Is it that the inner ring has the highest performance, so we should assign the back of the mesh_shape to it?

Collaborator:

Spoke with Mohit offline. The rule is that the TPU topology is always 3D, and the inner 2D slices have faster ICI than the links that connect across them. Therefore, we should map the most communication-demanding rank, i.e., the highest rank of the mesh, to the inner 2D slices.

Collaborator:

Now that I've read more of the code, this algorithm seems quite restricted:

  1. It only works for mapping a 2D or 3D logical mesh onto the 3D physical mesh.
  2. For a 3D mesh, I think the logical mesh needs to be a transpose of the physical mesh.
  3. For a 2D mesh, it tries to map a combination of the physical axes onto each dimension of the logical mesh.

With these simple rules, it then ensures that devices that are physically close to each other are also assigned close to each other in the logical mesh. For example, assuming the logical mesh is 2D, the devices in mesh[0] are always a 2D slice of the 3D physical mesh.

If my understanding is correct, @khatwanimohit, can you polish my comments and turn them into the comment of this helper?
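A purely illustrative numpy sketch of the mapping described above (not the helper's actual implementation): with a 2x2x2 physical mesh and a (2, 4) logical mesh, each row of the logical mesh is a contiguous 2D slice of the physical mesh.

```python
import numpy as np

# Illustrative sketch only (not the helper's code): a 2x2x2 physical mesh
# flattened into a (2, 4) logical mesh so that logical axis 1, the most
# communication-heavy axis, spans the inner (assumed faster-ICI) physical axes.
physical_mesh = np.arange(8).reshape(2, 2, 2)   # [x, y, z] device ordinals
logical_mesh = physical_mesh.reshape(2, 4)      # axis 1 merges the y and z axes
print(logical_mesh)
# [[0 1 2 3]
#  [4 5 6 7]]
# logical_mesh[0] is exactly physical_mesh[0], a 2D slice of the torus, so
# physically adjacent devices stay adjacent in the logical mesh.
```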

Collaborator:

You can add:

This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L64.

    hybrid_mesh = xs.HybridMesh(
        ici_mesh_shape=(1, 4), dcn_mesh_shape=(num_slices, 1))
    print(hybrid_mesh.get_logical_mesh())
    self.assertEqual(hybrid_mesh.get_logical_mesh().tolist(),
Collaborator:

Does this result respect the _create_device_mesh_for_nd_torus algorithm?

Collaborator (Author):

Yes, I have confirmed this against JAX's mesh.

Collaborator:

Can you make ici_mesh_shape=(2, 2)? I think that would better show how the algorithm works.

Collaborator (Author):

Changed ici_mesh_shape

@alanwaketan (Collaborator) left a comment

I just noticed that most of the helpers you introduced, @khatwanimohit, are inspired by https://github.com/google/jax/blob/bfe8acb31e04a540daad3f568239ec0e5c3f0d0f/jax/experimental/mesh_utils.py. In fact, all of those helpers have a very nice docstring explaining what they do.

I recommend that the next time you import JAX utils into PyTorch/XLA, you:

  1. List the source on each util you imported.
  2. Import their docstrings as well. Those are really critical for the readability of the code.

Also, have you checked the licenses to make sure that you can copy code from JAX into PyTorch/XLA? If not, I can do the research for you.

@alanwaketan (Collaborator) left a comment

Mostly looking good to me. Thanks, @khatwanimohit.

Please address the comments on readability.


  def get_logical_mesh(self):
    return self.device_ids.reshape(self.mesh_shape)


# HybridDevice class has been inspired from jax's mesh_utils: https://github.com/google/jax/blob/fc5960f2b8b7a0ef74dbae4e27c5c08ff1564cff/jax/experimental/mesh_utils.py#L4
Collaborator:

Can you make this a per-helper comment, for each helper that you imported?

      out[coords[0], coords[1], coords[2]] = d
    return out

  def _create_device_mesh_for_nd_torus(
Collaborator:

You can add:

This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L64.

    super().__init__(device_ids, mesh_shape, axis_names)

  def _get_physical_tpu_mesh(self, devices: Sequence[Any]) -> np.ndarray:
    r"""Rearrange TPU devices in a slice into a physical mesh."""
Collaborator:

Can you add:

  1. This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L172
  2. The following description of the function:

  r"""Rearrange TPU devices in a slice into a physical mesh.

  Args:
    devices: A list of device logical ordinals in a TPU slice.

  Returns:
    A np.ndarray of device logical ordinals with shape [global_x, global_y, global_z]. On
      v2 and v3, global_z is instead cores_per_chip (i.e., 2).
  """

        physical_mesh, mesh_shape)
    return device_mesh

  def _create_hybrid_device_mesh(self, ici_mesh_shape: Sequence[int],
Collaborator:

Can you add:

  1. This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L288.
  2. And the following function description:

  """Creates a device mesh for hybrid (e.g., ICI and DCN) parallelism.

  Args:
    ici_mesh_shape: shape of the logical mesh for the faster/inner network, ordered
      by increasing network intensity, e.g. [replica, data, mdl] where mdl has
      the most network communication requirements.
    dcn_mesh_shape: shape of the logical mesh for the slower/outer network,
      in the same order as mesh_shape.

  Returns:
    A np.ndarray of device logical ordinals with ici_mesh_shape * dcn_mesh_shape as its shape
    that can be fed into HybridMesh for hybrid parallelism.
  """

    return physical_mesh.transpose(transpose).reshape(mesh_shape), assignment

  # This is imported from JAX: https://github.com/google/jax/blob/main/jax/experimental/mesh_utils.py#L231
  def _create_device_mesh(self,
Collaborator:

I didn't mention this one since your logic is quite different. I suggest you undo it.

Collaborator (Author):

Fixed the comment

@alanwaketan (Collaborator) left a comment

LGTM. Thanks, Mohit.

@khatwanimohit merged commit 60a6d60 into master on Jun 19, 2023
@will-cromar (Collaborator)

The TPU CI broke after this PR merged. Is this related?

Step #4 - "run_e2e_tests": ======================================================================
Step #4 - "run_e2e_tests": ERROR: test_hybrid_mesh_shape (__main__.BasicShardingTest)
Step #4 - "run_e2e_tests": ----------------------------------------------------------------------
Step #4 - "run_e2e_tests": Traceback (most recent call last):
Step #4 - "run_e2e_tests":   File "/src/pytorch/xla/test/spmd/test_xla_sharding.py", line 462, in test_hybrid_mesh_shape
Step #4 - "run_e2e_tests":     hybrid_mesh = self._get_hybrid_mesh((1, self.n_devices))
Step #4 - "run_e2e_tests":   File "/src/pytorch/xla/test/spmd/test_xla_sharding_base.py", line 42, in _get_hybrid_mesh
Step #4 - "run_e2e_tests":     return xs.HybridMesh(ici_mesh_shape=ici_mesh_shape)
Step #4 - "run_e2e_tests":   File "/usr/local/lib/python3.8/site-packages/torch_xla/experimental/xla_sharding.py", line 122, in __init__
Step #4 - "run_e2e_tests":     mesh = self._create_device_mesh(self.ici_mesh_shape)
Step #4 - "run_e2e_tests":   File "/usr/local/lib/python3.8/site-packages/torch_xla/experimental/xla_sharding.py", line 257, in _create_device_mesh
Step #4 - "run_e2e_tests":     device_mesh, assignment = self._create_device_mesh_for_nd_torus(
Step #4 - "run_e2e_tests":   File "/usr/local/lib/python3.8/site-packages/torch_xla/experimental/xla_sharding.py", line 220, in _create_device_mesh_for_nd_torus
Step #4 - "run_e2e_tests":     raise NotImplementedError(
Step #4 - "run_e2e_tests": NotImplementedError: Failed to find assignment for logical_axis_index 1 of size 8 with remaining assignable mesh [2, 2, 1]. The size of each axis in your logical mesh must be equal to the product of some subset of the physical mesh axis sizes. E.g logical mesh (4, 16) is compatible with physical mesh 4x4x4 since 4=4 and 16=4x4.
Step #4 - "run_e2e_tests": 
Step #4 - "run_e2e_tests": ----------------------------------------------------------------------
Step #4 - "run_e2e_tests": Ran 26 tests in 0.968s
Step #4 - "run_e2e_tests": 
Step #4 - "run_e2e_tests": FAILED (errors=1)
Step #4 - "run_e2e_tests": [[0 1]
Step #4 - "run_e2e_tests":  [2 3]
Step #4 - "run_e2e_tests":  [4 5]
Step #4 - "run_e2e_tests":  [6 7]]
Step #4 - "run_e2e_tests": ++ kubectl get pod/xla-test-job-kl46l -o 'jsonpath={.status.containerStatuses[?(@.name=="xla-test")].state.terminated.exitCode}'

@alanwaketan (Collaborator)

Let's have a follow-up to disable the test for TPU.

You can do that by following: https://github.com/pytorch/xla/blob/master/test/test_zero1.py#L13
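A minimal sketch of such a skip guard, assuming the standard unittest.skipIf pattern referenced above (the exact device-type check used in test_zero1.py may differ):

```python
import unittest

import torch_xla.core.xla_model as xm


class BasicShardingTest(unittest.TestCase):

  # Hypothetical sketch only: skip the failing test on TPU, mirroring the skip
  # pattern referenced above. The actual follow-up change may look different.
  @unittest.skipIf(
      xm.xla_device_hw(xm.xla_device()) == 'TPU',
      'Hybrid mesh creation fails on this TPU topology; see the CI log above')
  def test_hybrid_mesh_shape(self):
    pass  # real test body lives in test/spmd/test_xla_sharding.py
```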

ManfeiBai pushed a commit that referenced this pull request Jun 22, 2023