This is the repository for the paper "How Useful is Communication Scheduling for Distributed Training?".
The code is forked from BytePS.
For PS and all-reduce, please use the code in the bytescheduler branch. Please refer to the README in the bytescheduler directory of that branch for detailed usage.
For BytePS, please use the code in the master branch. Please refer to xxx for detailed usage.
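For example, a minimal sketch of checking out the two branches (the repository URL below is a placeholder, not the actual address):

```bash
# Clone this repository (placeholder URL -- replace with the real one).
git clone https://github.com/<user>/<this-repo>.git
cd <this-repo>

# PS and all-reduce experiments: use the bytescheduler branch.
git checkout bytescheduler

# BytePS experiments: use the master branch.
git checkout master
```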
Testing scripts to reproduce the results in our paper.
We used the EC2 image Deep Learning Base AMI (Ubuntu 18.04) Version 32.0 (ami-0404ddec9491a5a31) with CUDA 10.0.
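As a rough sketch, an instance with this AMI can be launched via the AWS CLI as below; the instance type, key pair, and security group are placeholders for your own setup, not the ones used in the paper.

```bash
# Launch an instance from the AMI listed above (assumes a configured AWS CLI;
# instance type, key name, and security group are placeholders).
aws ec2 run-instances \
    --image-id ami-0404ddec9491a5a31 \
    --instance-type p3.16xlarge \
    --key-name <your-key-pair> \
    --security-group-ids <your-security-group> \
    --count 1
```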
Below are environment setup scripts based on Docker images with BytePS/ByteScheduler and MXNet/PyTorch/TensorFlow (the TensorFlow environment additionally requires adding some operators).
- BytePS MXNet
- BytePS PyTorch
- BytePS TensorFlow
- PS MXNet
- All-Reduce MXNet
(The Docker images have been published under zycccc; see the example pull command below.)
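A minimal sketch of pulling and starting one of these images; the exact image names and tags under zycccc are not listed here, so the ones below are placeholders.

```bash
# Pull one of the environment images (image name and tag are placeholders --
# check the zycccc account for the exact names).
docker pull zycccc/<byteps-mxnet-image>:<tag>

# Start a container with GPU access (assumes nvidia-docker / the NVIDIA container runtime).
nvidia-docker run -it --net=host zycccc/<byteps-mxnet-image>:<tag> bash
```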
Make sure that each machine is reachable from every other machine. You can edit the environment variables to change the experiment configuration; we have provided sample settings in each script file (a sketch of typical settings is shown below).
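For reference, a sketch of the kind of environment variables a BytePS worker script sets; the values are illustrative placeholders, so use the sample settings in the script files for the ones we actually used.

```bash
# Example BytePS worker environment (illustrative values -- adjust to your cluster).
export DMLC_ROLE=worker                 # this process acts as a worker
export DMLC_WORKER_ID=0                 # index of this worker
export DMLC_NUM_WORKER=2                # total number of worker machines
export DMLC_NUM_SERVER=2                # total number of server processes
export DMLC_PS_ROOT_URI=10.0.0.1        # IP of the scheduler machine
export DMLC_PS_ROOT_PORT=1234           # port of the scheduler
export NVIDIA_VISIBLE_DEVICES=0,1,2,3   # GPUs used by this worker
```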
For Horovod 0.16.1, the default cycle time of 5 ms is too long and leads to long pauses between all-reduce calls; 1-2 ms tends to be more suitable (the exact value depends on your machines). Similarly, the default fusion buffer threshold of 64 MB is often too small for models such as ResNet-50 with fp32 gradients; raising it to 128 MB can significantly improve throughput. An example of applying both settings is shown below.
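These settings can be applied through Horovod's environment variables before launching training; the values below follow the suggestions above, but should be tuned for your machines.

```bash
# Reduce the tensor-fusion cycle time from the 5 ms default to 1 ms (value in milliseconds).
export HOROVOD_CYCLE_TIME=1
# Raise the fusion buffer threshold from 64 MB to 128 MB (value in bytes).
export HOROVOD_FUSION_THRESHOLD=134217728
```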
For any questions, please contact zhaoyh98 at pku dot edu dot cn