[SPMD] Add API to disable global SPMD config #8717

Open · wants to merge 1 commit into master from lsiyuan/disable-spmd

Conversation

@lsy323 (Collaborator) commented on Feb 18, 2025:

Before this PR, xr.use_spmd() set an irreversible global SPMD state in torch_xla (e.g. the user cannot access individual devices after use_spmd(), which sets a "virtual device" for the SPMD code path). This one-time SPMD setting limits the flexibility of having some code regions run in SPMD mode and others in non-SPMD mode.

This PR relaxes the above constraint by:

  • Introducing a new API, disable_spmd(), to revert the global SPMD setting made by use_spmd() (see the usage sketch below).
  • Adding an option to the current use_spmd() logic to not replicate all live non-SPMD tensors on the virtual device. This keeps normal tensors on their designated devices after use_spmd() is called.
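
A minimal usage sketch of the intended flow. The disable_spmd() name and the force_tensors_on_spmd_device option come from this PR; the exact placement of disable_spmd() under torch_xla.runtime is an assumption:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

# Non-SPMD region: tensors live on their individual physical devices.
t = torch.ones(4, 4, device=xm.xla_device())

# Enter SPMD mode without replicating existing live tensors
# (force_tensors_on_spmd_device is the option added in this PR).
xr.use_spmd(force_tensors_on_spmd_device=False)
# ... code running under SPMD mode ...

# New in this PR: revert the global SPMD setting so that
# subsequent code runs in non-SPMD mode again.
xr.disable_spmd()
```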

Implementation notes:

  • In the current implementation, the device information is stored in static variables whose values depend on the UseVirtualDevice() query. Since the global device state changes as the user switches between SPMD and non-SPMD mode, those values need to change accordingly (example1, example2).
  • In the current implementation, _xla_force_spmd_device does two things: 1) it moves all live non-SPMD tensors onto the virtual device, and 2) it sets the global SPMD config. This PR splits the logic of 2) into a new API, _set_spmd_mode(bool use_spmd), to manage the global SPMD config (a sketch of the resulting wiring follows below).
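
A hedged sketch of how the Python-side wiring might look after the split. The _xla_force_spmd_device and _set_spmd_mode binding names come from this PR's description; the exact call sites and final signatures are assumptions:

```python
from typing import Optional

import torch_xla


def use_spmd(auto: Optional[bool] = False,
             force_tensors_on_spmd_device: Optional[bool] = False):
  if force_tensors_on_spmd_device:
    # 1) Move all live non-SPMD tensors onto the virtual device
    #    (the pre-existing behavior of _xla_force_spmd_device).
    torch_xla._XLAC._xla_force_spmd_device()
  # 2) Flip only the global SPMD config, leaving other live tensors
  #    on their designated devices (the new _set_spmd_mode API).
  torch_xla._XLAC._set_spmd_mode(True)


def disable_spmd():
  # New in this PR: revert the global SPMD config.
  torch_xla._XLAC._set_spmd_mode(False)
```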

Test:

  • Add a test that switches between SPMD and non-SPMD mode, checking that no unexpected data transfers happen in between (the metrics-based check is sketched below).
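
A hedged sketch of the kind of metrics check the test relies on. torch_xla.debug.metrics and the TransferToDeviceTime metric are existing torch_xla facilities (the metric appears in the reviewed excerpt below); the surrounding assertions are assumptions:

```python
import torch_xla.debug.metrics as met

# The first element of metric_data() is the number of recorded samples,
# i.e. how many host-to-device transfers have happened so far.
transfers_before = met.metric_data('TransferToDeviceTime')[0]

# ... switch between SPMD and non-SPMD mode and run a step ...

# Mode switching should not trigger extra transfers for tensors
# that stay on their designated device.
assert met.metric_data('TransferToDeviceTime')[0] == transfers_before
```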

@lsy323 force-pushed the lsiyuan/disable-spmd branch from 359bbab to de27158 on February 19, 2025 00:28
@lsy323 lsy323 marked this pull request as ready for review February 19, 2025 00:54
Reviewed test excerpt:

```python
met.metric_data('TransferToDeviceTime')[0],
expected_transfer_to_device_counter[i])
spmd_output = self._run_spmd(spmd_model, spmd_input_shape, mesh)
spmd_outputs.append(spmd_output)
```
Collaborator:

Is there any interaction between the SPMD and non-SPMD parts?

Collaborator (author):

No, I cannot do that, since we cannot create a global tensor from device shards until #8716 is merged. But I did test them together locally, and it works.

Seems we should land #8716 first.

```diff
 # TODO(yeounoh) introduce SPMD configuration.
-def use_spmd(auto: Optional[bool] = False):
+def use_spmd(auto: Optional[bool] = False,
+             force_tensors_on_spmd_device: Optional[bool] = False):
```
Collaborator:

nit: the word "force" is meaningless here IMO. I think a more descriptive name could be replicate_existing_tensors.

Collaborator (on the same use_spmd() signature change):

If force_tensors_on_spmd_device defaults to False, would this end up being a backward-compatibility-breaking change? IIUC, we used to always replicate existing tensors unconditionally.

On the use_spmd() docstring:

```python
"""API to enable SPMD mode. This is a recommended way to enable SPMD.

This forces SPMD mode if some tensors are already initialized on non-SPMD
```
Collaborator:

This comment is probably out of date. Also, I think whoever wrote it had in mind the idea of forcefully enabling SPMD mode, as if that were an aggressive act. But with your PR there's nothing forceful about this anymore: IIUC, people now have a clear option of replicating the existing tensors, or not replicating them (and keeping them on non-SPMD devices).

```diff
-  if (!g_current_device) {
-    g_current_device = *GetDefaultDevice();
-  }
+  g_current_device = *GetDefaultDevice();
```
Collaborator:

If we're always overwriting g_current_device, I'm wondering whether it's possible to get rid of this global variable?

```diff
@@ -78,19 +78,19 @@ torch::lazy::BackendDevice GetVirtualDevice() {
 }
 
 bool ShouldUseVirtualDevice() {
-  bool use_virtual_device =
+  bool g_use_virtual_device =
```
Collaborator:

This g_use_virtual_device variable shadows a global variable of the same name.

@lsy323 (Collaborator, author) commented on Feb 24, 2025:

test_runtime_spmd_api is failing; I did some investigation and here are the findings:

AtenXlaDeviceMapper has two states: a) under non-SPMD mode, it contains all local devices (e.g. TPU:0, TPU:1, ...); b) under SPMD mode, it contains SPMD:0.

In this PR, we switch between the two states as use_spmd and disable_spmd are called. However, in the current use_spmd logic, existing live tensors are moved onto the SPMD virtual device: the BackendDataHandle is moved to the SPMD virtual device, but the underlying backend device in the LazyTensor cannot be updated due to a const reference.

In a concrete example:

  1. Create some tensors under non-SPMD mode; they have device XLA:0, and AtenXlaDeviceMapper is initialized with devices XLA:0, XLA:1, ...
  2. use_spmd is called and AtenXlaDeviceMapper is switched to state b); the LazyTensor state of the non-SPMD tensors is still XLA:0, while the underlying BackendDataHandle is on SPMD:0.

Currently, this works because physical devices remain in AtenXlaDeviceMapper even if SPMD is turned on after AtenXlaDeviceMapper is initialized. This doesn't seem to be the expected behavior, but it works for now.

My conclusion from the above:

Our current implementation relies on the state of AtenXlaDeviceMapper not changing once it is initialized, which contradicts the scenario of switching between SPMD and non-SPMD mode. To move forward, I think we need to make use_spmd work properly with a stateful AtenXlaDeviceMapper.
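
A minimal repro sketch of the mismatch described above, based on the concrete example; the exact device strings depend on the hardware and are assumptions:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

# A tensor created in non-SPMD mode lives on a physical device, e.g. xla:0,
# and AtenXlaDeviceMapper is initialized with the local physical devices.
t = torch.ones(2, 2, device=xm.xla_device())
print(t.device)  # xla:0

# Entering SPMD mode moves the tensor's BackendDataHandle to the virtual
# SPMD:0 device, but the LazyTensor still reports its original device.
xr.use_spmd()
print(t.device)  # still xla:0, even though the data handle is on SPMD:0
```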

Comment on lines +11 to +12:

```python
import torch_xla.runtime as xr
local_bs = 4096 * 8
```
Collaborator:

NIT: It is clearer if there is a blank line between the imports and the global variables.
