pull code #49

Merged
merged 7 commits into from
Nov 25, 2019
2 changes: 1 addition & 1 deletion docs/en_US/Compressor/SlimPruner.md
@@ -34,6 +34,6 @@ We implemented one of the experiments in ['Learning Efficient Convolutional Netw
| Model | Error(paper/ours) | Parameters | Pruned |
| ------------- | ----------------- | ---------- | --------- |
| VGGNet | 6.34/6.40 | 20.04M | |
| Pruned-VGGNet | 6.20/6.39 | 2.03M | 88.5% |
| Pruned-VGGNet | 6.20/6.26 | 2.03M | 88.5% |

The code for these experiments can be found at [examples/model_compress](https://github.com/microsoft/nni/tree/master/examples/model_compress/).
61 changes: 61 additions & 0 deletions docs/en_US/FeatureEngineering/GBDTSelector.md
@@ -0,0 +1,61 @@
## GBDTSelector

GBDTSelector is based on [LightGBM](https://github.com/microsoft/LightGBM), which is a gradient boosting framework that uses tree-based learning algorithms.

When the data is passed to the GBDT model, the model constructs the boosted trees. The feature importance comes from the scores computed during this construction, which indicate how useful or valuable each feature was in building the boosted decision trees within the model.

This method can serve as a strong baseline for feature selection, especially when a GBDT model is used as the classifier or regressor.

For now, the supported `importance_type` values are `split` and `gain`. Customized `importance_type` will be supported in the future, which means users will be able to define how the feature score is calculated themselves.

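As a rough illustration of where these scores come from, here is a minimal sketch using plain LightGBM directly. This is a sketch under assumptions, not GBDTSelector's actual code; `X_train` and `y_train` are assumed to be loaded already, and the training parameters are illustrative only.

```python
import lightgbm as lgb
import numpy as np

# Train a small booster; the parameters are illustrative only.
train_set = lgb.Dataset(X_train, label=y_train)
booster = lgb.train({'objective': 'regression', 'verbose': -1},
                    train_set, num_boost_round=100)

# 'split': how often a feature is used; 'gain': total gain of its splits.
scores = booster.feature_importance(importance_type='gain')

# Indices of the 10 highest-scoring features.
print(np.argsort(scores)[::-1][:10])
```
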
### Usage

First, install the dependency:

```bash
pip install lightgbm
```

Then:

```python
from sklearn.model_selection import train_test_split

from nni.feature_engineering.gbdt_selector import GBDTSelector

# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# initialize a selector
fgs = GBDTSelector()
# fit data (also requires lgb_params, eval_ratio, early_stopping_rounds,
# importance_type and num_boost_round; see below)
fgs.fit(X_train, y_train, ...)
# get important features
# returns the indices of the top-k important features
print(fgs.get_selected_features(10))

...
```

You can also refer to the example in `/examples/feature_engineering/gbdt_selector/`.


**Requirements of `fit` function arguments** (a minimal call assembling them is sketched after this list)

* **X** (array-like, required) - The training input samples, with shape [n_samples, n_features].

* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), with shape [n_samples].

* **lgb_params** (dict, required) - The parameters for the LightGBM model. See the details [here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

* **eval_ratio** (float, required) - The ratio used to split the evaluation data from the training data in `X`.

* **early_stopping_rounds** (int, required) - The early stopping setting of LightGBM. See the details [here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

* **importance_type** (str, required) - Can be 'split' or 'gain'. 'split' means the result contains the number of times the feature is used in a model, and 'gain' means the result contains the total gain of the splits that use the feature. See the details [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance).

* **num_boost_round** (int, required) - The number of boosting rounds. See the details [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train).

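A minimal sketch of a complete `fit` call, mirroring the example script at the bottom of this PR (the parameter values are illustrative, not recommendations):

```python
# Illustrative LightGBM parameters; tune them for your own data.
lgb_params = {
    'objective': 'regression',
    'num_leaves': 20,
    'learning_rate': 0.05,
    'verbose': 0}

fgs = GBDTSelector()
fgs.fit(X_train, y_train,
        lgb_params=lgb_params,
        eval_ratio=0.1,
        early_stopping_rounds=10,
        importance_type='gain',
        num_boost_round=1000)

# indices of the 10 most important features
print(fgs.get_selected_features(topk=10))
```
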
**Requirements of `get_selected_features` function arguments**

* **topk** (int, required) - The number of top important features to select.

86 changes: 86 additions & 0 deletions docs/en_US/FeatureEngineering/GradientFeatureSelector.md
@@ -0,0 +1,86 @@
## GradientFeatureSelector

The algorithm in GradientFeatureSelector comes from ["Feature Gradients: Scalable Feature Selection via Discrete Relaxation"](https://arxiv.org/pdf/1908.10382.pdf).

GradientFeatureSelector is a gradient-based search algorithm for feature selection.

1) This approach extends a recent result on the estimation of
learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N.

2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.

3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.

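To make the discrete-to-continuous relaxation concrete, here is a conceptual sketch: each feature gets a continuous score, a sigmoid gate scales the inputs, and the scores are learned by mini-batch gradient descent with a sparsity penalty. This is an illustrative toy (a linear model, dense NumPy inputs, PyTorch assumed), not the paper's exact estimator or NNI's implementation.

```python
import torch

def gated_feature_scores(X, y, n_epochs=1, batch_size=1000, lr=0.1, penalty=1e-3):
    """Toy sketch of gradient-based feature scoring; NOT NNI's implementation."""
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y, dtype=torch.float32)
    n_samples, n_features = X.shape
    scores = torch.zeros(n_features, requires_grad=True)         # relaxed feature subset
    weights = (0.01 * torch.randn(n_features)).requires_grad_()  # simple linear scorer
    opt = torch.optim.SGD([scores, weights], lr=lr)
    for _ in range(n_epochs):
        for i in range(0, n_samples, batch_size):    # mini-batches: linear in N and D
            xb, yb = X[i:i + batch_size], y[i:i + batch_size]
            gate = torch.sigmoid(scores)             # soft gates in (0, 1)
            pred = (xb * gate) @ weights             # gated linear prediction
            loss = torch.mean((pred - yb) ** 2) + penalty * gate.sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.sigmoid(scores).detach()            # higher score = more useful feature
```

Ranking features by the returned scores and keeping the top k is the role `get_selected_features` plays in the real selector.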

### Usage

```python
from sklearn.model_selection import train_test_split

from nni.feature_engineering.gradient_selector import FeatureGradientSelector

# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# initialize a selector
fgs = FeatureGradientSelector(n_features=10)
# fit data
fgs.fit(X_train, y_train)
# get important features
# returns the indices of the selected features
print(fgs.get_selected_features())

...
```

You can also refer to the example in `/examples/feature_engineering/gradient_feature_selector/`.

**Parameters of the `FeatureGradientSelector` constructor** (a construction sketch follows this list)

* **order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.

* **penalty** (int, optional, default = 1) - Constant that multiplies the regularization term.

* **n_features** (int, optional, default = None) - If None, will automatically choose number of features based on search. Otherwise, the number of top features to select.

* **max_features** (int, optional, default = None) - If not None, will use the 'elbow method' to determine the number of features with max_features as the upper limit.

* **learning_rate** (float, optional, default = 1e-1) - The learning rate.

* **init** (*zero, on, off, onhigh, offhigh, or sklearn, optional, default = zero*) - How to initialize the vector of scores. 'zero' is the default.

* **n_epochs** (int, optional, default = 1) - Number of epochs to run.

* **shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.

* **batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.

* **target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.

* **classification** (bool, optional, default = True) - If True, problem is classification, else regression.

* **ordinal** (bool, optional, default = True) - If True, problem is ordinal classification. Requires classification to be True.

* **balanced** (bool, optional, default = True) - If True, each class is weighted equally in optimization; otherwise weighting is done via the support of each class. Requires classification to be True.

* **preprocess** (str, optional, default = 'zscore') - 'zscore' centers the data and normalizes it to unit variance; 'center' only centers the data to zero mean.

* **soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and features within groups.

* **verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing; 1 or higher prints every `verbose` number of gradient steps.

* **device** (str, optional, default = 'cpu') - 'cpu' to run on CPU and 'cuda' to run on GPU. Runs much faster on GPU.

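A minimal construction sketch using several of the documented parameters (all values are illustrative assumptions, not recommendations):

```python
# All values are illustrative; the defaults are usually a reasonable start.
fgs = FeatureGradientSelector(
    order=2,              # pairwise feature interactions
    penalty=1,            # regularization strength
    n_features=20,        # select the top 20 features
    learning_rate=1e-1,
    n_epochs=5,
    batch_size=1000,
    classification=True,
    device='cpu')         # 'cuda' to run on a GPU
```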

**Requirements of `fit` function arguments** (an illustration of `groups` follows this list)

* **X** (array-like, required) - The training input samples, with shape [n_samples, n_features].

* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), with shape [n_samples].

* **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit. For example, [0, 0, 1, 2] specifies that the first two columns are part of the same group. Shape is [n_features].

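A hypothetical illustration of `groups` on a four-feature dataset (`X_train` and `y_train` are assumed to be loaded):

```python
# Columns 0 and 1 form one group, so they are kept or dropped together.
fgs = FeatureGradientSelector(n_features=2)
fgs.fit(X_train, y_train, groups=[0, 0, 1, 2])
print(fgs.get_selected_features())
```
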
**Requirement of `get_selected_features` FuncArgs**

For now, the `get_selected_features` function has no parameters.

3 changes: 3 additions & 0 deletions docs/en_US/FeatureEngineering/Overview.md
@@ -0,0 +1,3 @@
# FeatureEngineering

We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still in an experimental phase and may evolve based on usage feedback. We invite you to use it, give feedback, and contribute.
5 changes: 4 additions & 1 deletion docs/en_US/TrainingService/LocalMode.md
@@ -96,7 +96,7 @@ This command will be filled into the YAML configuration file below. Please refer to [h

**Prepare the configuration file**: Since you already know which trial code you are going to run and which tuner you are going to use, it is time to prepare the YAML configuration file. NNI provides a demo configuration file for each trial example; run `cat ~/nni/examples/trials/mnist-annotation/config.yml` to see it. Its content is basically as shown below:

```
```yaml
authorName: your_name
experimentName: auto_mnist

@@ -112,6 +112,9 @@ maxTrialNum: 100
# choice: local, remote
trainingServicePlatform: local

# search space file
searchSpacePath: search_space.json

# choice: true, false
useAnnotation: true
tuner:
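The configuration snippets in this PR add `searchSpacePath: search_space.json`. As a minimal sketch, a search space file in NNI's JSON format might look like the following; the hyperparameter names and ranges are illustrative assumptions in the style of the mnist example, not part of this PR.

```json
{
    "dropout_rate": {"_type": "uniform", "_value": [0.5, 0.9]},
    "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
    "learning_rate": {"_type": "uniform", "_value": [0.0001, 0.1]}
}
```
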
2 changes: 2 additions & 0 deletions docs/en_US/TrainingService/PaiMode.md
@@ -19,6 +19,8 @@ maxExecDuration: 3h
maxTrialNum: 100
# choice: local, remote, pai
trainingServicePlatform: pai
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: true
tuner:
2 changes: 2 additions & 0 deletions docs/en_US/TrainingService/RemoteMachineMode.md
@@ -28,6 +28,8 @@ maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
# search space file
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: true
tuner:
57 changes: 30 additions & 27 deletions docs/zh_CN/TrainingService/LocalMode.md
@@ -98,33 +98,36 @@
*builtinTunerName* specifies the tuner in NNI, *classArgs* holds the parameters passed to the tuner (the built-in tuners are listed [here](../Tuner/BuiltinTuner.md)), and *optimization_mode* indicates whether the trial result should be maximized or minimized.

**Prepare the configuration file**: After implementing the trial code and selecting or implementing a custom tuner, it is time to prepare the YAML configuration file. NNI provides a demo configuration file for each trial example; run `cat ~/nni/examples/trials/mnist-annotation/config.yml` to see it. Its content is basically as follows:

authorName: your_name
experimentName: auto_mnist

# number of concurrent trials
trialConcurrency: 2

# maximum experiment running duration
maxExecDuration: 3h

# can be empty, i.e., unlimited
maxTrialNum: 100

# choice: local, remote
trainingServicePlatform: local

# choice: true, false
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 0

```yaml
authorName: your_name
experimentName: auto_mnist

# number of concurrent trials
trialConcurrency: 2

# maximum experiment running duration
maxExecDuration: 3h

# can be empty, i.e., unlimited
maxTrialNum: 100

# choice: local, remote
trainingServicePlatform: local

# search space file
searchSpacePath: search_space.json

# choice: true, false
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 0
```

Because this trial code uses NNI annotation (see [here](../Tutorial/AnnotationSpec.md)), *useAnnotation* is set to true. *command* is the command needed to run the trial code, and *codeDir* is the relative location of the trial code; the command is executed in this directory. The number of GPUs required by each trial process also needs to be specified.

2 changes: 2 additions & 0 deletions docs/zh_CN/TrainingService/PaiMode.md
@@ -21,6 +21,8 @@ maxExecDuration: 3h
maxTrialNum: 100
# choice: local, remote, pai
trainingServicePlatform: pai
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: true
tuner:
2 changes: 2 additions & 0 deletions docs/zh_CN/TrainingService/RemoteMachineMode.md
@@ -28,6 +28,8 @@ maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
# search space file
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: true
tuner:
65 changes: 65 additions & 0 deletions examples/feature_engineering/gbdt_selector/gbdt_selector_test.py
@@ -0,0 +1,65 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

import bz2
import urllib.request
import numpy as np

from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split

from nni.feature_engineering.gbdt_selector import GBDTSelector

url_zip_train = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2'
urllib.request.urlretrieve(url_zip_train, filename='train.bz2')

with bz2.open('train.bz2', 'rb') as f_zip:
    data = f_zip.read()
with open('train.svm', 'wt') as f_svm:
    f_svm.write(data.decode('utf-8'))

X, y = load_svmlight_file('train.svm')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 20,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0}

eval_ratio = 0.1
early_stopping_rounds = 10
importance_type = 'gain'
num_boost_round = 1000
topk = 10

selector = GBDTSelector()
selector.fit(X_train, y_train,
             lgb_params=lgb_params,
             eval_ratio=eval_ratio,
             early_stopping_rounds=early_stopping_rounds,
             importance_type=importance_type,
             num_boost_round=num_boost_round)

print("selected features\t", selector.get_selected_features(topk=topk))
