This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[DOC] Add a few tips for running horovod #17235

Merged 4 commits on Jan 9, 2020.

`docs/static_site/src/pages/api/faq/perf.md` (13 additions, 10 deletions)

@@ -268,6 +268,8 @@
To reduce the communication cost, we can consider:
- Exploring different `--kv-store` options (a launch sketch follows below).
- Increasing the batch size to improve the computation to communication ratio.

Finally, MXNet is integrated with other distributed training frameworks, including [horovod](https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training-horovod) and [BytePS](https://github.com/bytedance/byteps#use-byteps-in-your-code).
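
For instance, assuming MXNet's `tools/launch.py` helper, a `hosts` file listing the worker machines, and a `train.py` that accepts a `--kv-store` flag (all placeholders here, not part of this PR), comparing the options might look like:

```bash
# Launch 4 workers over ssh; try dist_sync, dist_device_sync or dist_async
# to compare how each kv-store option trades off communication cost.
$ python tools/launch.py -n 4 --launcher ssh -H hosts \
    python train.py --kv-store dist_device_sync
```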

## Input Data

To make sure you're handling input data in a reasonable way, consider the following:
@@ -284,30 +286,31 @@
For example, the safe batch size for CIFAR 10 is approximately 200, while for ImageNet 1K, the batch size can exceed 1K.

## Profiler

_MXNet_ has a built-in profiler
that gives detailed information about execution time at the operator level.
This feature complements general profiling tools like _nvprof_ and _gprof_
by summarizing at the operator level, instead of a function, kernel, or instruction level.

The profiler can be turned on with an [environment variable]({{'/api/faq/env_var#control-the-profiler' | relative_url}})
for an entire program run, or programmatically for just part of a run. Note that by default the profiler hides the details of each individual operator; to reveal them, set the environment variables `MXNET_EXEC_BULK_EXEC_INFERENCE`, `MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN`, and `MXNET_EXEC_BULK_EXEC_TRAIN` to 0.
See [example/profiler](https://github.com/dmlc/mxnet/tree/master/example/profiler)
for complete examples of how to use the profiler in code, or [this tutorial](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html) on how to profile MXNet performance.
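
For instance, to expose per-operator details in the profile output, the bulk-execution variables mentioned above can be disabled before launching the program (a minimal shell sketch; `train.py` is a placeholder for your own script):

```bash
# Disable operator bulking so the profiler reports each operator individually.
export MXNET_EXEC_BULK_EXEC_INFERENCE=0
export MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=0
export MXNET_EXEC_BULK_EXEC_TRAIN=0
python train.py
```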

Briefly, the Python code looks like:

```python
# wait for previous operations to complete
mx.nd.waitall()
mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json')
mx.profiler.set_state('run')

# Code to be profiled goes here...

# wait for previous operations to complete
mx.nd.waitall()
mx.profiler.set_state('stop')
```

After the program finishes, navigate to your browser's tracing page (for example, chrome://tracing in Chrome) and load the `profile_output.json` file written by the profiler to inspect the results.

![MLP Profile](https://cloud.githubusercontent.com/assets/17693755/18035938/0a43484a-6d93-11e6-80d4-241c6ca552ea.png)

`example/distributed_training-horovod/README.md` (8 additions, 0 deletions)

@@ -199,3 +199,11 @@
```bash
$ mpirun -np 8 \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```

## Tuning Horovod Performance

1. To analyze Horovod performance, [Horovod Timeline](https://github.com/horovod/horovod/blob/master/docs/timeline.rst) is a handy tool for tracing and visualizing the time spent in Horovod operations.

2. A few tuning knobs affect Horovod runtime performance (explained [here](https://github.com/horovod/horovod/blob/master/docs/tensor-fusion.rst)). Apart from `HOROVOD_FUSION_THRESHOLD`, we sometimes find that increasing `HOROVOD_CYCLE_TIME` (up to 100 ms) or changing [`NCCL_ALGO`](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html#nccl-algo) and [`NCCL_MIN_NCHANNELS`](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html#nccl-min-nchannels) improves performance; see the first sketch after this list.

3. If you are running Horovod on AWS, you can potentially leverage [EFA](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) if your instance type supports 100 Gbps networking. To use EFA, refer to the [official documentation](https://docs.aws.amazon.com/eu_us/AWSEC2/latest/UserGuide/efa-start-nccl-dlami.html) for setup instructions and for the environment variables to pass (`-x FI_PROVIDER`, `-x FI_EFA_TX_MIN_CREDITS`). In addition, make sure the EFA library is included in the shared library path (`-x LD_LIBRARY_PATH`); see the second sketch after this list.
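
As an illustration of items 1 and 2, the sketch below records a Horovod timeline and passes the tuning knobs through `mpirun`. The specific values (100 ms cycle time, the default 64 MB fusion threshold, the `Ring` algorithm, 8 channels) are illustrative starting points rather than recommendations, and `train.py` stands in for your own script:

```bash
# Record a Horovod timeline (item 1) and adjust fusion/NCCL knobs (item 2).
$ mpirun -np 8 \
    -x HOROVOD_TIMELINE=/tmp/timeline.json \
    -x HOROVOD_FUSION_THRESHOLD=67108864 \
    -x HOROVOD_CYCLE_TIME=100 \
    -x NCCL_ALGO=Ring \
    -x NCCL_MIN_NCHANNELS=8 \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```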
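
For item 3, a corresponding sketch, assuming the EFA libraries are installed under `/opt/amazon/efa` (both the install path and the credit value are assumptions; follow the official documentation above for the exact settings for your instance):

```bash
# Route communication over EFA; the EFA libraries must be on LD_LIBRARY_PATH.
$ mpirun -np 8 \
    -x FI_PROVIDER=efa \
    -x FI_EFA_TX_MIN_CREDITS=64 \
    -x LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```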