This is the repository for the paper "How Useful is Communication Scheduling for Distributed Training?".
The code is forked from BytePS.
For PS and all-reduce, please use the code in the bytescheduler branch. Please refer to the README in the bytescheduler directory of that branch for detailed usage.
For BytePS, please use the code in the master branch. Please refer to xxx for detailed usage.
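For example, a minimal sketch of checking out the two branches (the repository URL below is a placeholder, not the actual address):

```bash
# Clone this repository (placeholder URL -- replace with the real one).
git clone https://github.com/<user>/<this-repo>.git
cd <this-repo>

# PS and all-reduce experiments: use the bytescheduler branch.
git checkout bytescheduler

# BytePS experiments: use the master branch.
git checkout master
```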
Testing scripts to reproduce the results in our paper.
We used the EC2 image Deep Learning Base AMI (Ubuntu 18.04) Version 32.0 (ami-0404ddec9491a5a31) with CUDA 10.0.
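As a rough sketch, an instance with this AMI can be launched via the AWS CLI as below; the instance type, key pair, and security group are placeholders for your own setup, not the ones used in the paper.

```bash
# Launch an instance from the AMI listed above (assumes a configured AWS CLI;
# instance type, key name, and security group are placeholders).
aws ec2 run-instances \
    --image-id ami-0404ddec9491a5a31 \
    --instance-type p3.16xlarge \
    --key-name <your-key-pair> \
    --security-group-ids <your-security-group> \
    --count 1
```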
Below are environment setup scripts based on Docker images with BytePS/ByteScheduler and MXNet/PyTorch/TensorFlow (the TensorFlow environment additionally requires adding some operators).
- BytePS MXNet
- BytePS PyTorch
- BytePS TensorFlow
- PS MXNet
- All-Reduce MXNet
(The Docker images have been published under zycccc; see the example pull command below.)
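A minimal sketch of pulling and starting one of these images; the exact image names and tags under zycccc are not listed here, so the ones below are placeholders.

```bash
# Pull one of the environment images (image name and tag are placeholders --
# check the zycccc account for the exact names).
docker pull zycccc/<byteps-mxnet-image>:<tag>

# Start a container with GPU access (assumes nvidia-docker / the NVIDIA container runtime).
nvidia-docker run -it --net=host zycccc/<byteps-mxnet-image>:<tag> bash
```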
Make sure that each machine is reachable from every other machine. You can edit the environment variables to change the experiment configuration; we have provided sample settings in each script file (a sketch of typical settings is shown below).
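For reference, a sketch of the kind of environment variables a BytePS worker script sets; the values are illustrative placeholders, so use the sample settings in the script files for the ones we actually used.

```bash
# Example BytePS worker environment (illustrative values -- adjust to your cluster).
export DMLC_ROLE=worker                 # this process acts as a worker
export DMLC_WORKER_ID=0                 # index of this worker
export DMLC_NUM_WORKER=2                # total number of worker machines
export DMLC_NUM_SERVER=2                # total number of server processes
export DMLC_PS_ROOT_URI=10.0.0.1        # IP of the scheduler machine
export DMLC_PS_ROOT_PORT=1234           # port of the scheduler
export NVIDIA_VISIBLE_DEVICES=0,1,2,3   # GPUs used by this worker
```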
For Horovod 0.16.1, the default cycle time of 5 ms is too long and leads to long pauses between all-reduce calls; 1-2 ms tends to be more suitable (the exact value depends on your machines). Similarly, the default fusion buffer threshold of 64 MB is often too small for models such as ResNet-50 with fp32 gradients; raising it to 128 MB can significantly improve throughput. An example of applying both settings is shown below.
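These settings can be applied through Horovod's environment variables before launching training; the values below follow the suggestions above, but should be tuned for your machines.

```bash
# Reduce the tensor-fusion cycle time from the 5 ms default to 1 ms (value in milliseconds).
export HOROVOD_CYCLE_TIME=1
# Raise the fusion buffer threshold from 64 MB to 128 MB (value in bytes).
export HOROVOD_FUSION_THRESHOLD=134217728
```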
For any questions, please contact zhaoyh98 at pku dot edu dot cn