From 4dd406468b861ff3767a7ab4cb5e108aa23bd31a Mon Sep 17 00:00:00 2001 From: iefgnoix Date: Thu, 7 Mar 2024 00:38:08 +0000 Subject: [PATCH 1/2] add spmd doc --- docs/pjrt.md | 8 ++------ docs/spmd.md | 42 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 44 insertions(+), 6 deletions(-) diff --git a/docs/pjrt.md b/docs/pjrt.md index 1d0b50fff6d8..481bdaa971d6 100644 --- a/docs/pjrt.md +++ b/docs/pjrt.md @@ -28,8 +28,6 @@ _New features in PyTorch/XLA r2.0_: * New `xm.rendezvous` implementation that scales to thousands of TPU cores * [experimental] `torch.distributed` support for TPU v2 and v3, including `pjrt://` `init_method` -* [experimental] Single-host GPU support in PJRT. Multi-host support coming - soon! ## TL;DR @@ -192,8 +190,6 @@ for more information. ### GPU -*Warning: GPU support is still highly experimental!* - ### Single-node GPU training To use GPUs with PJRT, simply set `PJRT_DEVICE=CUDA` and configure @@ -235,7 +231,7 @@ For example, if you want to train on 2 GPU machines: machine_0 and machine_1, on --nnodes=2 \ --node_rank=0 \ --nproc_per_node=4 \ ---rdzv_endpoint=":12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1 +--rdzv_endpoint=":12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1 ``` On the second GPU machine, run @@ -245,7 +241,7 @@ On the second GPU machine, run --nnodes=2 \ --node_rank=1 \ --nproc_per_node=4 \ ---rdzv_endpoint=":12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1 +--rdzv_endpoint=":12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1 ``` The differences between the two commands above are `--node_rank` and, potentially, `--nproc_per_node` if you want to use a different number of GPU devices on each machine. Everything else is identical.
For more information about `torchrun`, please refer to this [page](https://pytorch.org/docs/stable/elastic/run.html). diff --git a/docs/spmd.md b/docs/spmd.md index c61ff0808a9d..ae26e494eaaf 100644 --- a/docs/spmd.md +++ b/docs/spmd.md @@ -357,6 +357,48 @@ Unlike existing DDP and FSDP, under the SPMD mode, there is always a single proc There is no code change required to go from single TPU host to TPU Pod if you construct your mesh and partition spec based on the number of devices instead of some hardcoded constant. To run the PyTorch/XLA workload on TPU Pod, please refer to the [Pods section](https://github.com/pytorch/xla/blob/master/docs/pjrt.md#pods) of our PJRT guide. +### Running SPMD on GPU + +PyTorch/XLA supports running SPMD on NVIDIA GPUs (single-node or multi-node). The training/inference script remains the same as the one used for TPU, such as this [ResNet script](https://github.com/pytorch/xla/blob/1dc78948c0c9d018d8d0d2b4cce912552ab27083/test/spmd/test_train_spmd_imagenet.py). To execute the script using SPMD, we leverage `torchrun`: + +``` +PJRT_DEVICE=CUDA \ +torchrun \ +--nnodes=${NUM_GPU_MACHINES} \ +--node_rank=${RANK_OF_CURRENT_MACHINE} \ +--nproc_per_node=1 \ +--rdzv_endpoint=":12355" \ +training_or_inference_script_using_spmd.py +``` +- `--nnodes`: the number of GPU machines to use. +- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUM_GPU_MACHINES}-1. +- `--nproc_per_node`: the value must be 1, since SPMD runs a single process per machine that drives all local devices. +- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form <host>:<port>. The host will be the internal IP address. The port can be any available port on the machine. For single-node training/inference, this parameter can be omitted.
+ +For example, if you want to train a ResNet model on 2 GPU machines using SPMD, you can run the script below on the first machine: +``` +XLA_USE_SPMD=1 PJRT_DEVICE=CUDA \ +torchrun \ +--nnodes=2 \ +--node_rank=0 \ +--nproc_per_node=1 \ +--rdzv_endpoint=":12355" \ +pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 128 +``` +and run the following on the second machine: +``` +XLA_USE_SPMD=1 PJRT_DEVICE=CUDA \ +torchrun \ +--nnodes=2 \ +--node_rank=1 \ +--nproc_per_node=1 \ +--rdzv_endpoint=":12355" \ +pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 128 +``` + +For more information, please refer to the [SPMD support on GPU RFC](https://github.com/pytorch/xla/issues/6256). + + ## Reference Examples From 300bc8c1720aec38279eb6d72cbfa5e564a76ce3 Mon Sep 17 00:00:00 2001 From: iefgnoix Date: Thu, 7 Mar 2024 00:44:43 +0000 Subject: [PATCH 2/2] revised the doc --- docs/pjrt.md | 2 +- docs/spmd.md | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/pjrt.md b/docs/pjrt.md index 481bdaa971d6..7262339a5b87 100644 --- a/docs/pjrt.md +++ b/docs/pjrt.md @@ -222,7 +222,7 @@ PJRT_DEVICE=CUDA torchrun \ - `--nnodes`: the number of GPU machines to use. - `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1. - `--nproc_per_node`: the number of GPU devices to be used on the current machine. -- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form <host>:<port>. The `host` will be the internal IP address. The port can be any available port on the machine. +- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `host:port`. The `host` will be the internal IP address. The `port` can be any available port on the machine. For single-node training/inference, this parameter can be omitted.
For example, if you want to train on 2 GPU machines: machine_0 and machine_1, on the first GPU machine machine_0, run diff --git a/docs/spmd.md b/docs/spmd.md index ae26e494eaaf..89d206e933f2 100644 --- a/docs/spmd.md +++ b/docs/spmd.md @@ -359,7 +359,7 @@ There is no code change required to go from single TPU host to TPU Pod if you co ### Running SPMD on GPU -PyTorch/XLA supports running SPMD on NVIDIA GPUs (single-node or multi-node). The training/inference script remains the same as the one used for TPU, such as this [ResNet script](https://github.com/pytorch/xla/blob/1dc78948c0c9d018d8d0d2b4cce912552ab27083/test/spmd/test_train_spmd_imagenet.py). To execute the script using SPMD, we leverage `torchrun`: +PyTorch/XLA supports SPMD on NVIDIA GPUs (single-node or multi-node). The training/inference script remains the same as the one used for TPU, such as this [ResNet script](https://github.com/pytorch/xla/blob/1dc78948c0c9d018d8d0d2b4cce912552ab27083/test/spmd/test_train_spmd_imagenet.py). To execute the script using SPMD, we leverage `torchrun`: ``` PJRT_DEVICE=CUDA \ @@ -367,13 +367,13 @@ torchrun \ --nnodes=${NUM_GPU_MACHINES} \ --node_rank=${RANK_OF_CURRENT_MACHINE} \ --nproc_per_node=1 \ ---rdzv_endpoint=":12355" \ +--rdzv_endpoint="<host>:<port>" \ training_or_inference_script_using_spmd.py ``` - `--nnodes`: the number of GPU machines to use. - `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUM_GPU_MACHINES}-1. - `--nproc_per_node`: the value must be 1, since SPMD runs a single process per machine that drives all local devices. -- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form <host>:<port>. The host will be the internal IP address. The port can be any available port on the machine. For single-node training/inference, this parameter can be omitted. +- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `host:port`. The `host` will be the internal IP address. The `port` can be any available port on the machine.
For single-node training/inference, this parameter can be omitted. For example, if you want to train a ResNet model on 2 GPU machines using SPMD, you can run the script below on the first machine: ```
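# Continuation of the fenced block opened above: a commented sketch of the
# machine_0 (node_rank 0) command, matching the example added in patch 1.
# The port 12355 is illustrative; prepend the internal IP of machine_0 to the
# rendezvous endpoint before the colon.
#   XLA_USE_SPMD=1       - turn on the SPMD execution mode
#   PJRT_DEVICE=CUDA     - select the CUDA PJRT plugin
#   --nproc_per_node=1   - SPMD uses a single process per machine
XLA_USE_SPMD=1 PJRT_DEVICE=CUDA \
torchrun \
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=1 \
--rdzv_endpoint=":12355" \
pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 128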