Add multi-host GPU support #5657
Conversation
@wbmc I have granted you write access.
OK, thanks!
  std::string dist_service_addr =
-     runtime::sys_util::GetEnvString("PJRT_DIST_SERVICE_ADDR", "");
+     runtime::sys_util::GetEnvString("MASTER_ADDR", "127.0.0.1") + ":" + port;
For better readability, can we introduce a variable indicating that this IP address is the default value, e.g. LOCAL_HOST_IP_DEFAULT?
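For concreteness, the suggested refactor might look roughly like this (the constant name and placement are illustrative, not the final code):

```cpp
// Illustrative only: give the default coordinator IP a descriptive name.
const std::string kLocalHostIpDefault = "127.0.0.1";
std::string dist_service_addr =
    runtime::sys_util::GetEnvString("MASTER_ADDR", kLocalHostIpDefault) + ":" + port;
```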
sure
done
Looking great Xiongfei!
  auto distributed_client =
-     MaybeInitializeDistributedRuntimeClient(local_rank, dist_service_addr);
+     MaybeInitializeDistributedRuntimeClient(global_rank);
  auto allowed_devices =
      std::make_optional<std::set<int>>(std::set{local_rank});
We need to generalize this to support CUDA_VISIBLE_DEVICES and single-process-multi-device.
Yeah, we initially planned to incorporate CUDA_VISIBLE_DEVICES into this PR, but we encountered some errors such as #5558 (comment) and #5558 (comment). We still plan to do it, but probably in a follow-up PR.
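For reference, a hedged sketch of what that generalization could look like (not code from this PR, and it glosses over the index remapping that CUDA_VISIBLE_DEVICES implies):

```cpp
// Hypothetical sketch: build the allowed device set from CUDA_VISIBLE_DEVICES
// when it is set, falling back to {local_rank} otherwise. Illustrative only.
#include <optional>
#include <set>
#include <sstream>
#include <string>

std::optional<std::set<int>> BuildAllowedDevices(int local_rank,
                                                 const std::string& visible_devices) {
  std::set<int> allowed;
  if (visible_devices.empty()) {
    // No restriction given: keep the current behavior of one device per process.
    allowed.insert(local_rank);
  } else {
    // Parse the comma-separated device ordinals, e.g. "0,2,3".
    std::stringstream ss(visible_devices);
    std::string id;
    while (std::getline(ss, id, ',')) {
      allowed.insert(std::stoi(id));
    }
  }
  return std::make_optional(allowed);
}
```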
torch_xla/_internal/gpu.py
Outdated
@@ -14,7 +15,8 @@ def num_local_processes() -> int:
   """
   assert xenv.GPU_NUM_DEVICES in os.environ, \
     "Must set `GPU_NUM_DEVICES` environment variable to use the PjRt GPU client"
-  return int(os.environ[xenv.GPU_NUM_DEVICES])
+  os.environ[xenv.WORLD_SIZE] = os.environ[xenv.GPU_NUM_DEVICES]
Is this right? We'll clobber the torchrun-set world size. Also wondering if we need to keep GPU_NUM_DEVICES in the first place.
This change is mainly to make the single-host case work with spawn (e.g. the tests in pytorch/xla/test/pjrt/test_runtime_gpu.py, in which we use spawn). To provide a similar UX to PyTorch, we should still support spawn for the single-host case (fwiw, PyTorch supports it: https://screenshot.googleplex.com/7nKD68dXNUUskF7). But I like your idea of replacing GPU_NUM_DEVICES with WORLD_SIZE.
Also, torchrun doesn't invoke spawn, so this function wouldn't be called, and hence it doesn't overwrite the torchrun-set world size.
So how about I replace GPU_NUM_DEVICES with WORLD_SIZE?
WORLD_SIZE isn't quite right here, since this function returns the expected number of local processes. In torchrun, that's set as LOCAL_WORLD_SIZE. In xmp.spawn, we had to set it as something different, like PJRT_LOCAL_WORLD_SIZE, because LOCAL_WORLD_SIZE caused an issue with a third-party package.
So maybe os.environ.get('LOCAL_WORLD_SIZE') or os.environ.get('PJRT_LOCAL_WORLD_SIZE')? It's clunky, but it covers both cases.
To use a subset of local GPUs with xmp.spawn, a user could then set LOCAL_WORLD_SIZE themselves instead of GPU_NUM_DEVICES.
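A minimal sketch of that fallback (the final default of one process is an assumption, not something decided in this thread):

```python
import os

# Prefer the torchrun-provided value, then the xmp.spawn alias; default to 1 process.
local_process_count = int(
    os.environ.get('LOCAL_WORLD_SIZE')
    or os.environ.get('PJRT_LOCAL_WORLD_SIZE')
    or 1)
```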
Agreed that WORLD_SIZE is confusing. OTOH, WORLD_SIZE is what PyTorch single-node training uses (https://screenshot.googleplex.com/BH27HAYTpbNU4KA), if we want to stay closer to PyTorch.
I think clarity is more important here. What do you think about replacing GPU_NUM_DEVICES with PJRT_LOCAL_WORLD_SIZE (or something like PT_XLA_LOCAL_WORLD_SIZE if we don't want to leak the underlying implementation detail), so the single-host-multi-GPU invocation GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py ... turns into PJRT_LOCAL_WORLD_SIZE=4 python3 xla/test/test_train_mp_imagenet.py ...? @will-cromar @jonb377
I think it's okay to rely on the LOCAL_WORLD_SIZE env var here. We know torchrun will set it for sure, and the manual single-host-multi-GPU case can become PJRT_DEVICE=GPU LOCAL_WORLD_SIZE=4 python script.py*. We can get rid of GPU_NUM_DEVICES.
* As a follow-up, I would like to implement more automatic configuration like we have with TPUs, so users don't have to set anything in the default case.
We definitely don't want to override torchrun's settings here.
I forgot that PJRT_LOCAL_WORLD_SIZE is set after this function is called (and is probably just set to the output of this function), so we can ignore that variable here.
Yeah, I used LOCAL_WORLD_SIZE here.
I'll replace GPU_NUM_DEVICES with LOCAL_WORLD_SIZE in a follow-up PR.
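A hedged sketch of how num_local_processes might look after that follow-up; the fallback to GPU_NUM_DEVICES is an assumption for backward compatibility, not the merged code:

```python
import os


def num_local_processes() -> int:
  """Returns the number of local processes to spawn for the PjRt GPU client."""
  # torchrun exports LOCAL_WORLD_SIZE automatically; with xmp.spawn the user sets it.
  local_world_size = os.environ.get('LOCAL_WORLD_SIZE',
                                    os.environ.get('GPU_NUM_DEVICES', '1'))
  return int(local_world_size)
```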
Great work! I'm really excited to see your progress here.
While this PR is in review, can you start updating the documentation for GPUs in this repository? We should have some documentation covering how and why the GPU runtime works, and make it clear that we expect people to use torchrun for multi-host use cases.
Force-pushed from 0cb25b4 to 5bd9939.
I added the documentation in #5704. Feel free to take a look as well. I'll do some testing for this feature and may make some fixes if necessary. Once the feature is more stable, I'll merge the GPU documentation PR.
Just a couple of minor nits still open. Otherwise LGTM!
Force-pushed from 6e836ba to df4e450.
LGTM, thanks Xiongfei!
* add prints
* to be continued.
* made torchrun work on single host
* Add an example of resnet torchrun
* add prints
* use local rank for allowed_devices.
* remove unwanted comments
* remove comments
* Add torchrun test to the CI.
* added an all_reduce test
* fix ci failures
* remove some comments
* provide an alternative way to set the port for coordinator.
* fix test by destroying the process group after the test
* fix the single host test.
* fix single host gpu tests.
* add reduce scatter test
* fix comments
* fix a comment
* fix comments
* fix linter
* fix comments
* Use LOCAL_WORLD_SIZE for spawn case.
* fix more comments
Collaborating with @wbmc
To start the multi-host (or multi-node) training, do:
on each host.
The documentation will be updated in a follow-up PR.
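For illustration only, a torchrun-based launch on each host might look like the sketch below; the node count, process count, rendezvous endpoint, and script name are placeholders rather than values taken from this PR.

```sh
# Run on every host, with --node_rank set per host (angle-bracket values are placeholders).
PJRT_DEVICE=GPU torchrun \
  --nnodes=<NUM_HOSTS> \
  --node_rank=<HOST_INDEX> \
  --nproc_per_node=<GPUS_PER_HOST> \
  --rdzv_endpoint=<MASTER_ADDR>:<PORT> \
  script.py
```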