Hardware Auto-Setup Example/Tutorial for Distributed Launch (#1227)
* add self hosted hardware example

add multi gpu launch script

add auto setup hardware docs

remove an example

tiny fixes

* add colab link

* style

* update readme, remove docs page
Caroline Chen authored Mar 24, 2023
1 parent c1a6c20 commit 1fe27e7
Showing 3 changed files with 72 additions and 2 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -67,7 +67,7 @@ doc-builder preview {package_name} {path_to_docs}
For example:

```bash
-doc-builder preview transformers docs/source/en/
+doc-builder preview accelerate docs/source/
```

The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR. You will see a bot add a comment to a link where the documentation with your changes lives.
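
If the `doc-builder` tool is not installed yet, it should be available from PyPI (published as `hf-doc-builder`):

```bash
pip install hf-doc-builder
```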
17 changes: 16 additions & 1 deletion examples/README.md
@@ -190,7 +190,22 @@ To run it in each of these various modes, use the following commands:

### Using AWS SageMaker integration
- [Examples showcasing AWS SageMaker integration of 🤗 Accelerate.](https://github.com/pacman100/accelerate-aws-sagemaker)



## Simple Multi-GPU Hardware Launcher

[multigpu_remote_launcher.py](./multigpu_remote_launcher.py) is a minimal script that demonstrates launching accelerate
on multiple remote GPUs, with automatic hardware environment and dependency setup for reproducibility. You can
easily customize the training function, training arguments, hyperparameters, and type of compute hardware, then
run the script to automatically launch multi-GPU training on remote hardware.

This script uses [Runhouse](https://github.com/run-house/runhouse) to launch on self-hosted hardware (e.g. in your own
cloud account or on an on-premise cluster), but there are other options for running remotely as well. Runhouse can be
installed with `pip install runhouse`. Refer to
[hardware setup](https://runhouse-docs.readthedocs-hosted.com/en/main/rh_primitives/cluster.html#hardware-setup)
for instructions, or to this
[Colab tutorial](https://colab.research.google.com/drive/1qVwYyLTCPYPSdz9ZX7BZl9Qm0A3j7RJe) for a more in-depth walkthrough.
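
For reference, a minimal end-to-end invocation might look like this (assuming the script is run from this `examples/`
directory, so that `nlp_example.py` is importable):

```bash
pip install runhouse
python multigpu_remote_launcher.py
```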

## Finer Examples

While the first two scripts are extremely barebones when it comes to what you can do with accelerate, more advanced features are documented in two other locations.
55 changes: 55 additions & 0 deletions examples/multigpu_remote_launcher.py
@@ -0,0 +1,55 @@
import argparse

import runhouse as rh
import torch
from nlp_example import training_function

from accelerate.utils import PrepareForLaunch, patch_environment


def launch_train(*args):
    num_processes = torch.cuda.device_count()
    print(f"Device count: {num_processes}")
    # Patch the distributed environment variables (world size, master address/port)
    # for the duration of the launch, then spawn one training process per GPU.
    with patch_environment(
        world_size=num_processes, master_addr="127.0.0.1", master_port="29500", mixed_precision=args[1].mixed_precision
    ):
        launcher = PrepareForLaunch(training_function, distributed_type="MULTI_GPU")
        torch.multiprocessing.start_processes(launcher, args=args, nprocs=num_processes, start_method="spawn")


if __name__ == "__main__":
    # Refer to https://runhouse-docs.readthedocs-hosted.com/en/main/rh_primitives/cluster.html#hardware-setup
    # for cloud access setup instructions (if using on-demand hardware), and for API specifications.

    # on-demand GPU
    # gpu = rh.cluster(name="rh-cluster", instance_type="V100:1", provider="cheapest", use_spot=False)  # single GPU
    gpu = rh.cluster(name="rh-cluster", instance_type="V100:4", provider="cheapest", use_spot=False)  # multi GPU
    gpu.up_if_not()

    # on-prem GPU
    # gpu = rh.cluster(
    #     ips=["<ip_addr>"], ssh_creds={"ssh_user": "<username>", "ssh_private_key": "<key_path>"}, name="rh-cluster"
    # )

    # Set up remote function
    reqs = [
        "pip:./",
        "transformers",
        "datasets",
        "evaluate",
        "tqdm",
        "scipy",
        "scikit-learn",
        "tensorboard",
        "torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu117",
    ]
    launch_train_gpu = rh.function(fn=launch_train, system=gpu, reqs=reqs, name="train_bert_glue")
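    # Note: Runhouse installs `reqs` on the cluster (including the local
    # accelerate repo, via "pip:./") before the function executes there.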

    # Define train args/config, run train function
    train_args = argparse.Namespace(cpu=False, mixed_precision="fp16")
    config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
    launch_train_gpu(config, train_args, stream_logs=True)

    # Alternatively, we can run the command from the README directly on the cluster
    # (but only because accelerate already provides a launcher CLI):
    # gpu.install_packages(reqs)
    # gpu.run(["accelerate launch --multi_gpu accelerate/examples/nlp_example.py"])
