Add SPMD on GPU instructions #6684

Merged 2 commits on Mar 7, 2024
10 changes: 3 additions & 7 deletions docs/pjrt.md
@@ -28,8 +28,6 @@ _New features in PyTorch/XLA r2.0_:
* New `xm.rendezvous` implementation that scales to thousands of TPU cores
* [experimental] `torch.distributed` support for TPU v2 and v3, including
`pjrt://` `init_method`
* [experimental] Single-host GPU support in PJRT. Multi-host support coming
soon!

## TL;DR

@@ -192,8 +190,6 @@ for more information.

### GPU

*Warning: GPU support is still highly experimental!*

### Single-node GPU training

To use GPUs with PJRT, simply set `PJRT_DEVICE=CUDA` and configure
@@ -226,7 +222,7 @@ PJRT_DEVICE=CUDA torchrun \
- `--nnodes`: the number of GPU machines to use.
- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1.
- `--nproc_per_node`: the number of GPU devices to use on the current machine.
- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form <host>:<port>. The `host` will be the internal IP address. The port can be any available port on the machine.
- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `host:port`. The `host` will be the internal IP address. The `port` can be any available port on the machine. For single-node training/inference, this parameter can be omitted (see the single-node sketch below).
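
For instance, a minimal single-node invocation sketch (not from the original docs; it assumes 4 local GPUs and simply drops `--rdzv_endpoint`):

```
PJRT_DEVICE=CUDA torchrun \
--nnodes=1 \
--nproc_per_node=4 \
pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```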

For example, to train on 2 GPU machines, machine_0 and machine_1, run the following on the first GPU machine (machine_0):

@@ -235,7 +231,7 @@ For example, if you want to train on 2 GPU machines: machine_0 and machine_1, on
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
--rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```

On the second GPU machine, run
@@ -245,7 +241,7 @@ On the second GPU machine, run
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
--rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```

The difference between the two commands above is `--node_rank` and, potentially, `--nproc_per_node` if you want to use a different number of GPU devices on each machine. Everything else is identical. For more information about `torchrun`, please refer to this [page](https://pytorch.org/docs/stable/elastic/run.html).
42 changes: 42 additions & 0 deletions docs/spmd.md
@@ -357,6 +357,48 @@ Unlike existing DDP and FSDP, under the SPMD mode, there is always a single proc
There is no code change required to go from a single TPU host to a TPU Pod if you construct your mesh and partition spec based on the number of devices instead of some hardcoded constant. To run the PyTorch/XLA workload on a TPU Pod, please refer to the [Pods section](https://github.com/pytorch/xla/blob/master/docs/pjrt.md#pods) of our PJRT guide.
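
As an illustration, here is a minimal sketch (not part of this change) of a device-count-driven mesh, assuming the `torch_xla.runtime` and `torch_xla.distributed.spmd` APIs described elsewhere in this guide:

```
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# SPMD mode is enabled via XLA_USE_SPMD=1 in the environment
# (see the torchrun commands below).

# Derive the mesh shape from the runtime device count rather than a
# hardcoded constant, so the same script works on one host or a full pod.
num_devices = xr.global_runtime_device_count()
device_ids = np.arange(num_devices)
mesh = xs.Mesh(device_ids, (num_devices, 1), ('data', 'model'))

t = torch.randn(16, 128).to(xm.xla_device())
# Shard dim 0 across the 'data' axis; the size-1 'model' axis leaves dim 1 replicated.
xs.mark_sharding(t, mesh, ('data', 'model'))
```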


### Running SPMD on GPU

PyTorch/XLA supports SPMD on NVIDIA GPUs (single-node or multi-node). The training/inference script remains the same as the one used for TPU, such as this [ResNet script](https://github.com/pytorch/xla/blob/1dc78948c0c9d018d8d0d2b4cce912552ab27083/test/spmd/test_train_spmd_imagenet.py). To execute the script using SPMD, we leverage `torchrun`:

```
PJRT_DEVICE=CUDA \
torchrun \
--nnodes=${NUM_GPU_MACHINES} \
--node_rank=${RANK_OF_CURRENT_MACHINE} \
--nproc_per_node=1 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:<PORT>" \
training_or_inference_script_using_spmd.py
```
- `--nnodes`: the number of GPU machines to use.
- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1.
- `--nproc_per_node`: the value must be 1, since SPMD uses a single process per machine.
- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `host:port`. The `host` will be the internal IP address. The `port` can be any available port on the machine. For single-node training/inference, this parameter can be omitted (see the single-node sketch below).
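
For instance, a minimal single-node invocation sketch (not from the original docs) that simply drops `--rdzv_endpoint`:

```
XLA_USE_SPMD=1 PJRT_DEVICE=CUDA \
torchrun \
--nnodes=1 \
--nproc_per_node=1 \
pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 128
```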

For example, if you want to train a ResNet model on 2 GPU machines using SPMD, you can run the script below on the first machine:
```
XLA_USE_SPMD=1 PJRT_DEVICE=CUDA \
torchrun \
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=1 \
--rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" \
pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 128
```
and run the following on the second machine:
```
XLA_USE_SPMD=1 PJRT_DEVICE=CUDA \
torchrun \
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=1 \
--rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" \
pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 128
```

For more information, please refer to the [SPMD support on GPU RFC](https://github.com/pytorch/xla/issues/6256).


## Reference Examples

