[DOC] Add a few tips for running horovod (#17235)
* Update perf.md

* Update README.md

* Update README.md

* Update perf.md
eric-haibin-lin authored Jan 9, 2020
1 parent 6ba9aad commit ac88f1e
Showing 2 changed files with 21 additions and 10 deletions.
23 changes: 13 additions & 10 deletions docs/static_site/src/pages/api/faq/perf.md
@@ -268,6 +268,8 @@ To reduce the communication cost, we can consider:
- Exploring different `--kv-store` options.
- Increasing the batch size to improve the computation to communication ratio.

Finally, MXNet is integrated with other distributed training frameworks, including [horovod](https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training-horovod) and [BytePS](https://github.com/bytedance/byteps#use-byteps-in-your-code).
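
For reference, a minimal sketch of the Horovod + Gluon setup, assuming one training process per GPU (the toy model, optimizer, and learning-rate scaling below are placeholders; see the linked example for a full training loop):

```python
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())  # one process per GPU

# placeholder model; any Gluon block works the same way
net = mx.gluon.nn.Dense(10)
net.initialize(mx.init.Xavier(), ctx=ctx)

# scale the learning rate by the number of workers and wrap the trainer
opt = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())
trainer = hvd.DistributedTrainer(net.collect_params(), opt)

# start all workers from identical weights
hvd.broadcast_parameters(net.collect_params(), root_rank=0)
```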

## Input Data

To make sure you're handling input data in a reasonable way consider the following:
@@ -284,30 +286,31 @@ For example, the safe batch size for CIFAR 10 is approximately 200, while for Im…

## Profiler

_MXNet_ has a built-in profiler
that gives detailed information about execution time at the operator level.
This feature complements general profiling tools like _nvprof_ and _gprof_
by summarizing at the operator level, instead of a function, kernel, or instruction level.

The profiler can be turned on with an [environment variable]({{'/api/faq/env_var#control-the-profiler' | relative_url}})
for an entire program run, or programmatically for just part of a run. Note that by default the profiler hides the details of each individual operator; you can reveal them by setting the environment variables `MXNET_EXEC_BULK_EXEC_INFERENCE`, `MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN`, and `MXNET_EXEC_BULK_EXEC_TRAIN` to 0.
See [example/profiler](https://github.com/dmlc/mxnet/tree/master/example/profiler)
for complete examples of how to use the profiler in code, or [this tutorial](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html) on how to profile MXNet performance.

Briefly, the Python code looks like:

```python
# wait for previous operations to complete
mx.nd.waitall()
mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json')
mx.profiler.set_state('run')

# Code to be profiled goes here...

# wait for previous operations to complete
mx.nd.waitall()
mx.profiler.set_state('stop')
```
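
As noted above, seeing per-operator detail requires turning off bulk execution. A minimal sketch of doing that from Python, assuming the variables are set before `mxnet` is imported so that they take effect:

```python
import os

# disable operator bulking so the profiler records each operator separately
os.environ['MXNET_EXEC_BULK_EXEC_INFERENCE'] = '0'
os.environ['MXNET_EXEC_BULK_EXEC_TRAIN'] = '0'
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN'] = '0'

import mxnet as mx  # imported only after the variables are set
```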

Instead of `profile_all=True`, you can restrict profiling to particular categories of operations with flags such as

* `profile_symbolic=True` to include only symbolic operations
* `profile_imperative=True` to include only imperative operations

After the program finishes, open your browser's tracing view (for example, `chrome://tracing` in Chrome) and load the `profile_output.json` file written by the profiler to inspect the results.

![MLP Profile](https://cloud.githubusercontent.com/assets/17693755/18035938/0a43484a-6d93-11e6-80d4-241c6ca552ea.png)
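
Because the call above passes `aggregate_stats=True`, you can also print an aggregated per-operator summary directly from Python instead of loading the JSON trace; a minimal sketch:

```python
# print the statistics aggregated while the profiler was in the 'run' state
print(mx.profiler.dumps())
```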
8 changes: 8 additions & 0 deletions example/distributed_training-horovod/README.md
@@ -199,3 +199,11 @@

```bash
$ mpirun -np 8 \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```

## Tuning Horovod Performance

1. To analyze Horovod performance, [Horovod timeline](https://github.com/horovod/horovod/blob/master/docs/timeline.rst) is a handy tool for tracing and visualizing the time spent on Horovod operations.

2. A few tuning knobs affect Horovod runtime performance (explained [here](https://github.com/horovod/horovod/blob/master/docs/tensor-fusion.rst)). Apart from `HOROVOD_FUSION_THRESHOLD`, we sometimes find that increasing `HOROVOD_CYCLE_TIME` (up to 100 ms) and changing [`NCCL_ALGO`](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html#nccl-algo) and [`NCCL_MIN_NCHANNELS`](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html#nccl-min-nchannels) improve performance (see the sketch after this list).

3. If you are running Horovod on AWS, you can potentially leverage [EFA](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) if your instance supports 100 Gb/s networking. To use EFA, refer to the [official documentation](https://docs.aws.amazon.com/eu_us/AWSEC2/latest/UserGuide/efa-start-nccl-dlami.html) for setup instructions and for the environment variables to pass (`-x FI_PROVIDER`, `-x FI_EFA_TX_MIN_CREDITS`). In addition, make sure the EFA library is included in the shared library path (`-x LD_LIBRARY_PATH`).
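
As an illustration of tips 1 and 2, these knobs can be passed straight through `mpirun`; the values below are illustrative starting points rather than recommendations:

```bash
# HOROVOD_TIMELINE enables the timeline trace (tip 1); the rest are tuning knobs (tip 2)
$ mpirun -np 8 \
    -x HOROVOD_TIMELINE=/tmp/timeline.json \
    -x HOROVOD_FUSION_THRESHOLD=67108864 \
    -x HOROVOD_CYCLE_TIME=10 \
    -x NCCL_ALGO=Ring \
    -x NCCL_MIN_NCHANNELS=8 \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```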
