
paddle-openmpi

Run PaddlePaddle distributed training on OpenMPI clusters

Requirements

This demo project requires a Kubernetes cluster to run the steps below.

Start an MPI cluster on Kubernetes

kubectl create -f head.yaml
kubectl create -f mpi-nodes.yaml
# check the pods
kubectl get po -o wide
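Before collecting IPs, it can help to wait until every pod reports Running; a minimal check, assuming the worker pods' names contain "nodes" (the same pattern the grep below relies on):

# Re-run until every pod shows STATUS "Running"
kubectl get po -o wide | grep nodes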

Find the MPI node IPs

kubectl get po -o wide | grep nodes | awk '{print $6}' > machines

Then copy the machines file to the head node, using scp the same way you ssh into the head node; a sketch follows.
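A minimal sketch, assuming the head pod's name contains "head" in the kubectl output and that the same key used for ssh also works for scp (both are assumptions about this particular setup):

# Hypothetical: discover the head pod's IP by grepping for "head"; adjust to your cluster
HEAD_IP=$(kubectl get po -o wide | grep head | awk '{print $6}')
scp -i <identity_file> machines <user>@${HEAD_IP}:/home/tutorial/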

Run

You need to ssh into the head node in order to submit a job:

# Placeholders only; supply your own key file and the head node's address
ssh -i <identity_file> <user>@<head-node-ip>

Copy the program files to each node:

cat machines | xargs -i scp start_mpi_train.sh trainer_config.lr.py dataprovider_bow.py {}:/home/tutorial
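An optional check (a sketch) that the files landed on every node; ssh -n keeps ssh from consuming the list xargs is reading:

# Verify the scripts are present on each node
cat machines | xargs -i ssh -n {} ls /home/tutorial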

Prepare the training data:

cd data
OUT_DIR=$PWD/input SPLIT_COUNT=3 sh get_data.sh
# copy the split data to each node, one split per trainer:
scp -r input/0/data [node1]:~
scp -r input/1/data [node2]:~
scp -r input/2/data [node3]:~
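Equivalently, the copies can be driven from the machines file itself; a minimal sketch, assuming machines lists the three trainer IPs one per line, in split order:

# Copy split i to the i-th node listed in machines
i=0
while read node; do
  scp -r input/${i}/data ${node}:~
  i=$((i+1))
done < machines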

Submit the job to the MPI cluster:

mpirun -x PYTHONHOME=/usr/local -hostfile machines -n 3  /home/tutorial/start_mpi_train.sh
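Note that -n 3 matches the SPLIT_COUNT=3 used when splitting the data, so each trainer gets one shard. If the launch fails, a quick smoke test (a sketch) confirms mpirun can reach every node before debugging the training script itself:

# Should print three hostnames, one per node listed in machines
mpirun -hostfile machines -n 3 hostname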
