PyTorch DTensor device mesh interface with device type "xla" fails at get_rank() #8528

Closed
awshaichen opened this issue Jan 3, 2025 · 4 comments

awshaichen commented Jan 3, 2025

🐛 Bug

After constructing a PyTorch DTensor device mesh with torch.distributed._tensor.device_mesh.init_device_mesh and device_type="xla", the resulting device mesh does not support querying the current rank via the get_rank() interface.

To Reproduce

# test_device_mesh_get_rank.py
#
# test_driver re-launches this file under torchrun; each torchrun worker then
# runs realtest, which builds a device mesh and asserts get_rank() matches RANK.
import os
import subprocess
import unittest
from torch.distributed._tensor.device_mesh import init_device_mesh


class TestDeviceMeshGetRank(unittest.TestCase):

    def realtest(self):
        # Runs inside each torchrun worker; WORLD_SIZE and RANK are set by torchrun.
        _world_size = int(os.environ["WORLD_SIZE"])
        device_type = os.environ.get("TEST_DEVICE_TYPE", 'xla')
        if device_type == 'xla':
            from torch_xla import runtime as xr
            xr.use_spmd()
        device_mesh = init_device_mesh(device_type=device_type, mesh_shape=(_world_size,))
        _rank = device_mesh.get_rank()  # fails here when device_type == 'xla'
        assert _rank == int(os.environ["RANK"])

    def test_driver(self):
        if 'TEST_INTERNAL_IS_TORCHRUN' in os.environ:
            # We are inside a torchrun worker; run the actual test body.
            return self.realtest()
        device_count = 2
        env = os.environ.copy()
        env['TEST_INTERNAL_IS_TORCHRUN'] = '1'
        cmd = ['torchrun', '--nnodes=1', f'--nproc_per_node={device_count}', __file__]
        subprocess.check_call(cmd, env=env)


if __name__ == '__main__':
    unittest.main()

Steps to reproduce the behavior:

  1. Save the above script as test_device_mesh_get_rank.py.
  2. Running env PJRT_DEVICE=CPU python test_device_mesh_get_rank.py under torch-xla 2.5.1 fails with ValueError: Default process group has not been initialized, please make sure to call init_process_group.
  3. For comparison, running env TEST_DEVICE_TYPE='cuda' python test_device_mesh_get_rank.py with CUDA PyTorch passes the test.

Expected behavior

device_mesh.get_rank() should return the current rank (matching the RANK environment variable set by torchrun), as it does for the CUDA device type.

Environment

  • Reproducible on XLA backend: CPU and the AWS Neuron PJRT plugin
  • torch_xla version: 2.5.1

Additional context

Was trying to adapt https://github.com/pytorch/examples/blob/1bef748/distributed/tensor_parallelism/tensor_parallel_example.py for the XLA device/mesh type.
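
As far as I can tell, the ValueError comes from torch.distributed itself: DeviceMesh.get_rank() delegates to torch.distributed.get_rank(), which requires an initialized default process group. A minimal illustration of what I believe is the same failure path (my reading of the code, not torch-xla-specific):

# Assumption: this mirrors the failure path of DeviceMesh.get_rank(),
# which delegates to torch.distributed.get_rank().
import torch.distributed as dist

# With no default process group initialized, this raises:
# ValueError: Default process group has not been initialized,
# please make sure to call init_process_group.
dist.get_rank()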

miladm commented Jan 10, 2025

@bhavya01 to assist with this issue.

cc @JackCaoG

bhavya01 commented

The XLA backend for distributed tensors works slightly differently from native PyTorch. The XLA backend doesn't require creating a separate process for each device, because the XLA compiler handles sharding the tensors according to the specified sharding spec.

That's why you don't see any process groups with the XLA backend here.

Please feel free to take a look at the RFC for DTensor integration with the XLA backend, pytorch/pytorch#92909, and let us know if you have any further questions.

The distribute_tensor and distribute_module APIs should work as expected.
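
A minimal sketch of that path (untested here; it assumes torch-xla 2.x with SPMD enabled and a single Python process rather than torchrun). Per my reading of the RFC, distribute_tensor on an "xla" mesh returns an XLAShardedTensor rather than a regular DTensor:

# Minimal sketch (untested): shard a tensor over an "xla" device mesh with
# distribute_tensor instead of querying per-process ranks.
# Assumes torch-xla 2.x with SPMD; run as a single process, not under torchrun.
import torch
import torch_xla.core.xla_model as xm
from torch_xla import runtime as xr
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed._tensor.device_mesh import init_device_mesh

xr.use_spmd()  # single-process SPMD; the compiler handles the sharding

num_devices = xr.global_runtime_device_count()
mesh = init_device_mesh("xla", mesh_shape=(num_devices,))

t = torch.randn(8, 4).to(xm.xla_device())
# Shard dim 0 of the tensor across the 1-D mesh; no process group is needed.
dt = distribute_tensor(t, mesh, [Shard(0)])
print(dt)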

bhavya01 commented

Closing this since there have been no new comments since last week. Please feel free to re-open.

jeffhataws commented

@bhavya01 when will pytorch/pytorch#92909 be completed and merged into mainline?
