
HPO Benchmark doc and setup script fix #3689

Merged
merged 1 commit on May 28, 2021
9 changes: 6 additions & 3 deletions docs/en_US/hpo_benchmark.rst
@@ -14,6 +14,8 @@ Terminology
* **tuner**\ : a tuner or advisor defined in the hpo folder, or a custom tuner provided by the user.
* **architecture**\ : an architecture is a specific method for solving the tasks, along with a set of hyperparameters to optimize (i.e., the search space). In our implementation, the architecture calls the tuner multiple times to obtain possible hyperparameter configurations and produces the final prediction for a task. See ``./nni/extensions/NNI/architectures`` for examples.

Note: currently, the only supported architecture is random forest. The architecture implementation and search space definition can be found in ``./nni/extensions/NNI/architectures/run_random_forest.py``. The tasks in the benchmarks "nnivalid" and "nnismall" are suitable for solving with random forests.
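
To make the interaction concrete, the following is a hypothetical, heavily simplified sketch of an architecture's tuner loop; the actual logic in ``run_random_forest.py`` differs. The tuner proposes configurations, the architecture trains a random forest for each one, reports the score back, and finally predicts with the best configuration found.

.. code-block:: python

   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import cross_val_score

   def run_architecture(tuner, X_train, y_train, X_test, n_trials=40):
       """Hypothetical sketch of an architecture's tuner loop (not the actual NNI code)."""
       best_score, best_params = float('-inf'), None
       for trial_id in range(n_trials):
           # Ask the tuner for a hyperparameter configuration from the search space.
           params = tuner.generate_parameters(trial_id)
           model = RandomForestClassifier(**params)
           score = cross_val_score(model, X_train, y_train, cv=3).mean()
           # Report the result back so the tuner can refine its next suggestions.
           tuner.receive_trial_result(trial_id, params, score)
           if score > best_score:
               best_score, best_params = score, params
       # Produce the final prediction for the task with the best configuration found.
       final_model = RandomForestClassifier(**best_params).fit(X_train, y_train)
       return final_model.predict(X_test)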

Setup
^^^^^

@@ -60,15 +62,15 @@ To use custom tuners, first make sure that the tuner inherits from ``nni.tuner.T
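
For reference, a minimal custom tuner might look like the following toy sketch, assuming the standard ``nni.tuner.Tuner`` interface (``update_search_space``, ``generate_parameters``, ``receive_trial_result``); it simply samples uniformly from choice-type search space entries.

.. code-block:: python

   import random
   from nni.tuner import Tuner

   class MyRandomTuner(Tuner):
       """Toy tuner sketch: samples uniformly from choice-type search spaces."""

       def __init__(self):
           self.search_space = {}

       def update_search_space(self, search_space):
           # Called with the search space defined by the architecture.
           self.search_space = search_space

       def generate_parameters(self, parameter_id, **kwargs):
           # Return one hyperparameter configuration; here, a uniform random choice.
           return {name: random.choice(spec['_value'])
                   for name, spec in self.search_space.items()
                   if spec['_type'] == 'choice'}

       def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
           # A real tuner would use the reported value to guide future suggestions.
           pass
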
A Benchmark Example
^^^^^^^^^^^^^^^^^^^

As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DngoTuner". (The DngoTuner is not available as a built-in tuner at the time this article is written.) As some of the tasks contains a considerable amount of training data, it took about 2 days to run the whole benchmark on one tuner using a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.
As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DngoTuner". (The DngoTuner was not available as a built-in tuner at the time of writing.) As some of the tasks contain a considerable amount of training data, running the whole benchmark for one tuner on a single CPU core took about 2 days. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``. For binary and multi-class classification tasks, the metrics "auc" and "logloss" were used for evaluation, while for regression, "r2" and "rmse" were used.
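
Purely as an illustration of these metrics (this is not the benchmark's evaluation code), they correspond to the following scikit-learn calls for hypothetical prediction arrays.

.. code-block:: python

   import numpy as np
   from sklearn.metrics import roc_auc_score, log_loss, r2_score, mean_squared_error

   # Binary classification: "auc" and "logloss" over predicted positive-class probabilities.
   y_true_cls = np.array([0, 1, 1, 0])
   y_proba = np.array([0.2, 0.8, 0.6, 0.3])
   auc = roc_auc_score(y_true_cls, y_proba)
   logloss = log_loss(y_true_cls, y_proba)

   # Regression: "r2" and "rmse" over real-valued predictions.
   y_true_reg = np.array([3.0, 5.0, 2.5])
   y_pred_reg = np.array([2.8, 5.1, 2.7])
   r2 = r2_score(y_true_reg, y_pred_reg)
   rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))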

After the script finishes, the final scores of each tuner is summarized in the file ``results[time]/reports/performances.txt``. Since the file is large, we only show the following screenshot and summarize other important statistics instead.
After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``. Since the file is large, we show only the following screenshot and summarize the other important statistics.

.. image:: ../img/hpo_benchmark/performances.png
:target: ../img/hpo_benchmark/performances.png
:alt:

When the results are parsed, the tuners are ranked based on their final performance. ``results[time]/reports/rankings.txt`` presents a ranking of the tuners for each metric (logloss, rmse, auc), and the rankings of tuners for each metric (another view of the same data).
In addition, when the results are parsed, the tuners are ranked based on their final performance. ``results[time]/reports/rankings.txt`` presents the average ranking of the tuners for each metric (logloss, rmse, auc); we show this data in the first three tables below. Also, each tuner's performance on each type of metric is summarized (another view of the same data); we present these statistics in the fourth table.
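
Conceptually, the average ranking in the tables below is obtained by ranking the tuners on every task fold and then averaging the ranks per metric; a minimal pandas sketch with hypothetical column names (not the actual report-generation code) is:

.. code-block:: python

   import pandas as pd

   # Hypothetical long-format results: one row per (task fold, tuner, metric).
   results = pd.DataFrame({
       'fold':   ['car_0', 'car_0', 'car_1', 'car_1'],
       'tuner':  ['TPE', 'Random', 'TPE', 'Random'],
       'metric': ['logloss', 'logloss', 'logloss', 'logloss'],
       'score':  [0.35, 0.42, 0.31, 0.39],
   })

   # Lower logloss is better, so rank ascending within each fold, then average per tuner.
   results['rank'] = results.groupby(['metric', 'fold'])['score'].rank(ascending=True)
   avg_rank = results.groupby(['metric', 'tuner'])['rank'].mean()
   print(avg_rank)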

Average rankings for metric rmse:

@@ -195,6 +197,7 @@ Besides these reports, our script also generates two graphs for each fold of eac
:target: ../img/hpo_benchmark/car_fold1_2.jpg
:alt:

For example, the previous two graphs were generated for fold 1 of the task "car". In the first graph, we can observe that most tuners find a relatively good solution within 40 trials. In this experiment, among all tuners, the DNGOTuner converges fastest to the best solution (within 10 trials); its score improved three times over the entire experiment. In the second graph, we observe that most tuners' scores fluctuate between 0.8 and 1 throughout the experiment. However, the Anneal tuner (green line) appears more unstable (with more fluctuations), while the GPTuner shows a more stable pattern. Although this pattern can to some extent be interpreted as a tuner's position on the exploration-exploitation tradeoff, it cannot be used for a comprehensive evaluation of a tuner's effectiveness.
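
Curves like these can be produced from a list of per-trial scores with a few lines of matplotlib; the following is only a sketch (not the benchmark's plotting code), where the first panel shows the best score seen so far and the second the raw score of each trial.

.. code-block:: python

   import numpy as np
   import matplotlib.pyplot as plt

   scores = np.random.rand(100)              # placeholder per-trial scores for one fold
   best_so_far = np.maximum.accumulate(scores)

   fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
   ax1.plot(best_so_far)                     # first graph: convergence of the best score
   ax1.set(xlabel='trial', ylabel='best score so far')
   ax2.plot(scores)                          # second graph: per-trial scores (stability view)
   ax2.set(xlabel='trial', ylabel='trial score')
   plt.show()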

.. image:: ../img/hpo_benchmark/christine_fold0_1.jpg
:target: ../img/hpo_benchmark/christine_fold0_1.jpg
3 changes: 0 additions & 3 deletions examples/trials/benchmarking/automlbenchmark/setup.sh
@@ -1,6 +1,3 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

#!/bin/bash

# download automlbenchmark repository