torchrun example with cpu version pytorch (kubeflow#1965)
kuizhiqing authored Dec 18, 2023
1 parent 1400f1f commit b938905
Showing 4 changed files with 55 additions and 0 deletions.
7 changes: 7 additions & 0 deletions examples/pytorch/cpu-demo/Dockerfile
@@ -0,0 +1,7 @@
FROM python:3.8

RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

WORKDIR /

COPY demo.py .
7 changes: 7 additions & 0 deletions examples/pytorch/cpu-demo/README.MD
@@ -0,0 +1,7 @@
## Demo

This demo shows how to use `torchrun` with the training-operator.

> When running on GPUs, keep `nprocPerNode` consistent with the GPU resource declaration.

The image used in demo.yaml is built from the Dockerfile provided alongside.
10 changes: 10 additions & 0 deletions examples/pytorch/cpu-demo/demo.py
@@ -0,0 +1,10 @@
import torch
torch.distributed.init_process_group(init_method="env://")
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
print(f"rank {rank} world_size {world_size}")
a = torch.tensor([1])
torch.distributed.all_reduce(a)
print(f"rank {rank} world_size {world_size} result {a}")
torch.distributed.barrier()
print(f"rank {rank} world_size {world_size}")
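A note on the script above: `init_method="env://"` makes `init_process_group` read its rendezvous settings from the environment variables `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE`, which `torchrun` sets for each process it launches. The sketch below is not part of the commit. It only mirrors that lookup in plain Python so the contract is visible without a cluster. The helper name `read_rendezvous_env` and the sample values are hypothetical.

```python
import os

# The four variables that the env:// init method expects to find.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def read_rendezvous_env(environ=os.environ):
    """Hypothetical helper: collect the settings env:// relies on."""
    missing = [name for name in REQUIRED_VARS if name not in environ]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "rank": int(environ["RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
    }


# Sample values, standing in for what a launcher would export.
sample_env = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "0",
    "WORLD_SIZE": "1",
}
settings = read_rendezvous_env(sample_env)
print(settings)
```

Running `demo.py` with plain `python` and none of these variables set fails inside `init_process_group`, which is why the job's command is `torchrun`.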
31 changes: 31 additions & 0 deletions examples/pytorch/cpu-demo/demo.yaml
@@ -0,0 +1,31 @@
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: torchrun-cpu
spec:
  nprocPerNode: "2"
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch-cpu:py3.8
              imagePullPolicy: Always
              command:
                - "torchrun"
                - "demo.py"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch-cpu:py3.8
              imagePullPolicy: Always
              command:
                - "torchrun"
                - "demo.py"
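
For a sense of what the job should print: `torchrun` starts `nprocPerNode` processes in each pod, so with one Master and one Worker replica the world size should come to 4. Since `all_reduce` sums by default, every rank should then report `tensor([4])`. The following is a rough sketch of that arithmetic in plain Python and is not part of the commit. `expected_world_size` and `simulated_all_reduce_sum` are hypothetical helpers, and no real communication takes place.

```python
def expected_world_size(replicas_per_role, nproc_per_node):
    """Hypothetical helper: total processes = pods * processes per pod."""
    nodes = sum(replicas_per_role.values())
    return nodes * nproc_per_node


def simulated_all_reduce_sum(per_rank_values):
    """Hypothetical stand-in for a SUM all_reduce: each rank gets the total."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)


# Values taken from demo.yaml: one Master, one Worker, nprocPerNode "2".
replicas = {"Master": 1, "Worker": 1}
world_size = expected_world_size(replicas, nproc_per_node=2)

# Each rank in demo.py contributes a tensor holding 1.
results = simulated_all_reduce_sum([1] * world_size)

print(f"world_size {world_size}")
for rank, value in enumerate(results):
    print(f"rank {rank} result {value}")
```

Changing `replicas` or `nprocPerNode` in the manifest changes the expected sum accordingly, which makes the printed result a quick check that all processes joined the group.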
