[DOC] Add a few tips for running horovod (#17235)
* Update perf.md

* Update README.md

* Update README.md

* Update perf.md
eric-haibin-lin authored Jan 9, 2020
1 parent 6ba9aad commit ac88f1e
Showing 2 changed files with 21 additions and 10 deletions.
23 changes: 13 additions & 10 deletions docs/static_site/src/pages/api/faq/perf.md
@@ -268,6 +268,8 @@ To reduce the communication cost, we can consider:
- Exploring different `--kv-store` options.
- Increasing the batch size to improve the computation to communication ratio.

Finally, MXNet is integrated with other distributed training frameworks, including [horovod](https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training-horovod) and [BytePS](https://github.com/bytedance/byteps#use-byteps-in-your-code).
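
For reference, a minimal sketch of the Horovod + Gluon setup, assuming one training process per GPU (the toy model, optimizer, and learning-rate scaling below are placeholders; see the linked example for a full training loop):

```python
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())  # one process per GPU

# placeholder model; any Gluon block works the same way
net = mx.gluon.nn.Dense(10)
net.initialize(mx.init.Xavier(), ctx=ctx)

# scale the learning rate by the number of workers and wrap the trainer
opt = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())
trainer = hvd.DistributedTrainer(net.collect_params(), opt)

# start all workers from identical weights
hvd.broadcast_parameters(net.collect_params(), root_rank=0)
```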

## Input Data

To make sure you're handling input data in a reasonable way consider the following:
@@ -284,30 +286,31 @@ For example, the safe batch size for CIFAR 10 is approximately 200, while for Im…

## Profiler

_MXNet_ has a built-in profiler
that gives detailed information about execution time at the operator level.
This feature complements general profiling tools like _nvprof_ and _gprof_
by summarizing at the operator level, instead of a function, kernel, or instruction level.

The profiler can be turned on with an [environment variable]({{'/api/faq/env_var#control-the-profiler' | relative_url}})
for an entire program run, or programmatically for just part of a run. Note that by default the profiler hides the details of each individual operator; you can reveal them by setting the environment variables `MXNET_EXEC_BULK_EXEC_INFERENCE`, `MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN`, and `MXNET_EXEC_BULK_EXEC_TRAIN` to 0.
See [example/profiler](https://github.com/dmlc/mxnet/tree/master/example/profiler)
for complete examples of how to use the profiler in code, or [this tutorial](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html) on how to profile MXNet performance.

Briefly, the Python code looks like:

```python
# wait for previous operations to complete
mx.nd.waitall()
mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json')
mx.profiler.set_state('run')

# Code to be profiled goes here...

# wait for previous operations to complete
mx.nd.waitall()
mx.profiler.set_state('stop')
```
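
As noted above, seeing per-operator detail requires turning off bulk execution. A minimal sketch of doing that from Python, assuming the variables are set before `mxnet` is imported so that they take effect:

```python
import os

# disable operator bulking so the profiler records each operator separately
os.environ['MXNET_EXEC_BULK_EXEC_INFERENCE'] = '0'
os.environ['MXNET_EXEC_BULK_EXEC_TRAIN'] = '0'
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN'] = '0'

import mxnet as mx  # imported only after the variables are set
```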

Instead of `profile_all=True`, you can restrict profiling to particular categories of operations with flags such as

* `profile_symbolic=True` to include only symbolic operations
* `profile_imperative=True` to include only imperative operations

After the program finishes, open your browser's tracing view (for example, `chrome://tracing` in Chrome) and load the `profile_output.json` file written by the profiler to inspect the results.

![MLP Profile](https://cloud.githubusercontent.com/assets/17693755/18035938/0a43484a-6d93-11e6-80d4-241c6ca552ea.png)
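
Because the call above passes `aggregate_stats=True`, you can also print an aggregated per-operator summary directly from Python instead of loading the JSON trace; a minimal sketch:

```python
# print the statistics aggregated while the profiler was in the 'run' state
print(mx.profiler.dumps())
```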
8 changes: 8 additions & 0 deletions example/distributed_training-horovod/README.md
@@ -199,3 +199,11 @@

```bash
$ mpirun -np 8 \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```

## Tuning Horovod Performance

1. To analyze Horovod performance, [Horovod timeline](https://github.com/horovod/horovod/blob/master/docs/timeline.rst) is a handy tool for tracing and visualizing the time spent on Horovod operations.

2. A few tuning knobs affect Horovod runtime performance (explained [here](https://github.com/horovod/horovod/blob/master/docs/tensor-fusion.rst)). Apart from `HOROVOD_FUSION_THRESHOLD`, we sometimes find that increasing `HOROVOD_CYCLE_TIME` (up to 100 ms) and changing [`NCCL_ALGO`](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html#nccl-algo) and [`NCCL_MIN_NCHANNELS`](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html#nccl-min-nchannels) improve performance (see the sketch after this list).

3. If you are running Horovod on AWS, you can potentially leverage [EFA](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) if your instance supports 100 Gb/s networking. To use EFA, refer to the [official documentation](https://docs.aws.amazon.com/eu_us/AWSEC2/latest/UserGuide/efa-start-nccl-dlami.html) for setup instructions and for the environment variables to pass (`-x FI_PROVIDER`, `-x FI_EFA_TX_MIN_CREDITS`). In addition, make sure the EFA library is included in the shared library path (`-x LD_LIBRARY_PATH`).
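
As an illustration of tips 1 and 2, these knobs can be passed straight through `mpirun`; the values below are illustrative starting points rather than recommendations:

```bash
# HOROVOD_TIMELINE enables the timeline trace (tip 1); the rest are tuning knobs (tip 2)
$ mpirun -np 8 \
    -x HOROVOD_TIMELINE=/tmp/timeline.json \
    -x HOROVOD_FUSION_THRESHOLD=67108864 \
    -x HOROVOD_CYCLE_TIME=10 \
    -x NCCL_ALGO=Ring \
    -x NCCL_MIN_NCHANNELS=8 \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```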
