allow only default axis 0 for union(#4668) #5052

Merged 1 commit on Aug 18, 2023
3 changes: 2 additions & 1 deletion doc/2.0/README.md
@@ -2,4 +2,5 @@
- [FATE V2.0 Quick Start](./quick_start.md)
- [FATE Flow V2.0 Quick Start](https://github.com/FederatedAI/FATE-Flow/blob/v2.0.0-alpha/doc/quick_start.md)
- [FATE Flow V2.0 Design](https://github.com/FederatedAI/FATE-Flow/blob/v2.0.0-alpha/doc/2.0.0-alpha.md)
- [OSX Design](./osx/osx.md)
- [FATE Components](./components)
44 changes: 44 additions & 0 deletions doc/2.0/components/README.md
@@ -0,0 +1,44 @@
# Federated Machine Learning

[[中文](README.zh.md)]

FATE-ML includes federated implementations of many common machine
learning algorithms. All modules are developed in a decoupled,
modular approach to enhance scalability. Specifically, we provide:

1. Federated Statistic: PSI, Union, Pearson Correlation, etc.
2. Federated Feature Engineering: Feature Sampling, Feature Binning,
Feature Selection, etc.
3. Federated Machine Learning Algorithms: LR, GBDT, DNN
4. Model Evaluation: Binary | Multiclass | Regression | Clustering
Evaluation
5. Secure Protocol: Provides multiple security protocols for secure
multi-party computing and interaction between participants.

## Algorithm List

| Algorithm | Module Name | Description | Data Input | Data Output | Model Input | Model Output |
|--------------------------------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------------------|-------------------------------|--------------|
| [PSI](psi.md) | PSI | Compute the intersection of multiple parties' data sets without leaking difference-set information. Mainly used in hetero scenario tasks. | input_data | output_data | | |
| [Sampling](sample.md) | Sample | Federated sampling so that data distribution becomes balanced across parties. This module supports local and federated scenarios. | input_data | output_data | | |
| [Data Split](data_split.md) | DataSplit | Split one data table into 3 tables by given ratio or count. This module supports local and federated scenarios. | input_data | train_output_data, validate_output_data, test_output_data | | |
| [Feature Scale](feature_scale.md) | FeatureScale | Module for feature scaling and standardization. | train_data, test_data | train_output_data, test_output_data | input_model | output_model |
| [Data Statistics](statistics.md) | Statistics | Compute statistics on the data, including mean, maximum, minimum, median, etc. | input_data | output_data | | output_model |
| [Hetero Feature Binning](feature_binning.md) | HeteroFeatureBinning | Bin input data, calculate each column's IV and WOE, and transform data according to the binning information. | train_data, test_data | train_output_data, test_output_data | input_model | output_model |
| [Hetero Feature Selection](feature_selection.md) | HeteroFeatureSelection | Provides 3 types of filters. Each filter selects columns according to user config. | train_data, test_data | train_output_data, test_output_data | input_models, input_model | output_model |
| [Coordinated-LR](logistic_regression.md) | CoordinatedLR | Build hetero logistic regression model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Coordinated-LinR](linear_regression.md) | CoordinatedLinR | Build hetero linear regression model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Homo-LR](logistic_regression.md) | HomoLR | Build homo logistic regression model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Homo-NN](homo_nn.md) | HomoNN | Build homo neural network model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Hetero Secure Boosting](ensemble.md) | HeteroSecureBoost | Build hetero secure boosting model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Evaluation](evaluation.md) | Evaluation | Output model evaluation metrics for users. | input_data | | | |
| [Union](union.md) | Union | Combine multiple data tables into one. | input_data_list | output_data | | |

## Secure Protocol

- [Encrypt](secureprotol.md#encrypt)
- [Paillier encryption](secureprotol.md#paillier-encryption)
- [RSA encryption](secureprotol.md#rsa-encryption)
- [Hash](secureprotol.md#hash-factory)
- [Diffie Hellman Key Exchange](secureprotol.md#diffie-hellman-key-exchange)
32 changes: 32 additions & 0 deletions doc/2.0/components/data_split.md
@@ -0,0 +1,32 @@
# Data Split

The Data Split module splits data into train, test, and/or validate
sets of arbitrary sizes. The module is based on sampling methods.

# Use

Data Split supports local (same as homogeneous) and heterogeneous
(only Guest has y) modes.

Below are the supported split modes and scenarios.

| Split Mode | Federated Heterogeneous | Federated Homogeneous(Local) |
|--------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------------|
| Random | [✓](../../../examples/pipeline/data_split/test_data_split.py) | [✓](../../../examples/pipeline/data_split/test_data_split_multi_host.py) |
| Stratified | [✓](../../../examples/pipeline/data_split/test_data_split_stratified.py) | [✓](../../../examples/pipeline/data_split/test_data_split_stratified.py) |

The Data Split module takes a single data input as specified in the job
config file and always outputs three tables (train, test, and validate
data sets). Each data output may be used as input of another module.
Below are the rules regarding set sizes:

1. if all three set sizes are None, the original data input will be
   split in the following ratio: 80% to train set, 20% to validate
   set, and an empty test set;

2. if only test size or validate size is given, train size is set to
   the complement of the given size;

3. only one of the three sizes is needed to split input data, but all
   three may be specified. The module takes either int (instance
   count) or float (fraction) values for set sizes, but mixed-type
   inputs are not accepted.
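For fraction-type inputs, the size rules above can be sketched as
follows. This is a minimal illustration only; the helper name
`resolve_split_sizes` is hypothetical, not FATE's actual API:

```python
def resolve_split_sizes(train_size=None, validate_size=None, test_size=None):
    """Resolve train/validate/test sizes per the rules above (fractions only)."""
    given = [s for s in (train_size, validate_size, test_size) if s is not None]
    # Mixed int (count) and float (fraction) inputs are not accepted.
    if len({type(s) for s in given}) > 1:
        raise ValueError("mixed-type set sizes are not accepted")
    # Rule 1: all sizes None -> 80% train, 20% validate, empty test.
    if not given:
        return 0.8, 0.2, 0.0
    # Rule 2: train size missing -> complement of the given sizes.
    if train_size is None:
        train_size = 1.0 - sum(given)
    return train_size, validate_size or 0.0, test_size or 0.0
```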
48 changes: 48 additions & 0 deletions doc/2.0/components/feature_binning.md
@@ -0,0 +1,48 @@
# Hetero Feature Binning

Feature binning or data binning is a data pre-processing technique. It
can be used to reduce the effects of minor observation errors, calculate
information values and so on.

Currently, we provide quantile binning and bucket binning methods. The
quantile binning approach uses a special data structure described in
this
[paper](https://www.researchgate.net/profile/Michael_Greenwald/publication/2854033_Space-Efficient_Online_Computation_of_Quantile_Summaries/links/0f317533ee009cd3f3000000/Space-Efficient-Online-Computation-of-Quantile-Summaries.pdf).
Feel free to check out the detailed algorithm in the paper.
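The two methods differ in how bin edges are chosen. The sketch below
contrasts them on a plain list and is illustrative only; FATE
approximates the quantiles with the Greenwald-Khanna summary rather
than sorting the full column, and the helper names are not FATE's API:

```python
def bucket_bin_edges(values, n_bins):
    # Bucket binning: equal-width bins spanning [min, max].
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def quantile_bin_edges(values, n_bins):
    # Quantile binning: equal-frequency bins from exact quantiles.
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]
```

Note that on skewed data the two strategies produce very different
edges, which is why both are offered.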

The following figure describes the principle of calculating the
federated IV and WOE values.

![Figure 1 (Federated Feature Binning
Principle)](../images/binning_principle.png)

As the figure shows, Party B, which holds the data labels, encrypts
its labels with additive homomorphic encryption and sends them to
Party A. Party A computes each bin's encrypted label sum and sends the
results back. Party B can then calculate WOE and IV based on the given
information.

The multiple-host case is similar to the single-host case: Guest sends
its encrypted label information to all hosts, and each host calculates
and sends back its bin statistics.

![Figure 2: Multi-Host Binning
Principle](../images/multiple_host_binning.png)
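Once the per-bin label sums are decrypted, WOE and IV reduce to a
plain computation on event/non-event counts. A minimal local sketch
(no encryption; the `woe_iv` helper is illustrative and assumes every
bin has non-zero counts):

```python
import math

def woe_iv(bin_event_counts, bin_nonevent_counts):
    """Compute per-bin WOE and total IV from event/non-event counts."""
    total_event = sum(bin_event_counts)
    total_nonevent = sum(bin_nonevent_counts)
    woe, iv = [], 0.0
    for e, ne in zip(bin_event_counts, bin_nonevent_counts):
        event_rate = e / total_event
        nonevent_rate = ne / total_nonevent
        w = math.log(event_rate / nonevent_rate)
        woe.append(w)
        iv += (event_rate - nonevent_rate) * w
    return woe, iv
```

A bin whose label distribution matches the overall distribution gets
WOE 0 and contributes nothing to IV; the more predictive the feature,
the larger the IV.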

## Features

1. Support Quantile Binning based on quantile summary algorithm.
2. Support Bucket Binning.
3. Support calculating WOE and IV values.
4. Support transforming data into bin indexes or WOE values (guest only).
5. Support multiple-host binning.
6. Support asymmetric binning methods on Host & Guest sides.

Below lists supported features with links to examples:

| Cases | Scenario |
|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Input Data with Categorical Features | [bucket binning](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_bucket.py) <br> [quantile binning](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_quantile.py) |
| Output Data Transformed | [bin index](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) <br> [woe value(guest-only)](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
| Skip Metrics Calculation | [multi_host](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_multi_host.py) |


17 changes: 17 additions & 0 deletions doc/2.0/components/feature_scale.md
@@ -0,0 +1,17 @@
# Feature Scale

Feature scale is a process that scales each feature along its column.
The Feature Scale module supports min-max scale and standard scale.

1. min-max scale: scales and translates each feature individually
   such that it falls in the given range on the training set, e.g.
   between the min and max values of each feature.
2. standard scale: standardizes features by removing the mean and
   scaling to unit variance.
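Both transforms are simple column-wise arithmetic. A minimal sketch of
the two methods on a plain list (illustrative only; the helper names
are not FATE's API):

```python
def min_max_scale(column):
    # Scale values to [0, 1] using the column's min and max.
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def standard_scale(column):
    # Remove the mean and scale to unit variance (population std).
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]
```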

# Use

| Scale Method | Federated Heterogeneous |
|--------------|------------------------------------------------------------------------|
| Min-Max | [&check;](../../../examples/pipeline/sample/test_sample_unilateral.py) |
| Standard | [&check;](../../../examples/pipeline/sample/test_sample_unilateral.py) |
57 changes: 57 additions & 0 deletions doc/2.0/components/feature_selection.md
@@ -0,0 +1,57 @@
# Hetero Feature Selection

Feature selection is a process that selects a subset of features for
model construction. Taking advantage of feature selection can improve
model performance.

In this version, we provide several filter methods for feature
selection. Note that the module works in a cascade manner: the
selected result of filter A becomes the input of the next filter B.
Users should pay attention to the order of listing when supplying
multiple filters to the `filter_methods` param in job configuration.

## Features

Below lists available input models and their corresponding filter methods with links to examples:

| Input Models | Filter Method |
|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| None | [manual](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_manual.py) |
| Binning | [iv_filter(threshold)](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_binning.py) <br> [iv_filter(top_k)](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_multi_model.py) <br> [iv_filter(top_percentile)](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_multi_host.py) |
| Statistic | [statistic_filter](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_statistics.py) |

Most of the filter methods above share the same set of configurable parameters.
Below lists their acceptable parameter values.

| Filter Method | Parameter Name | metrics | filter_type | take_high |
|-------------------------------------|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------|--------------|
| IV Filter | filter_param | "iv" | "threshold", "top_k", "top_percentile" | True |
| Statistic Filter | statistic_param | "max", "min", "mean", "median", "std", "var", "coefficient_of_variance", "skewness", "kurtosis", "missing_count", "missing_ratio", quantile(e.g."95%") | "threshold", "top_k", "top_percentile" | True/False |

1. iv\_filter: use IV as the criterion to select features. Supports
   three modes: threshold value, top-k, and top-percentile.

   - threshold value: filter out columns whose IV is smaller than
     the threshold. You can also set a different threshold for each
     party.
   - top-k: sort features from larger IV to smaller and take the top
     k features in the sorted result.
   - top-percentile: sort features from larger IV to smaller and
     take the top percentile.

2. statistic\_filter: use statistic values calculated by the
   DataStatistic component. Supports coefficient of variance, missing
   value, percentile value, etc. You can pick columns with higher or
   with smaller statistic values as needed.

3. manual: indicate features that need to be filtered out or kept.
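The threshold and top-k modes amount to simple selections over
per-feature IV (or statistic) values. A hypothetical local sketch
(`iv_threshold_filter` and `top_k_filter` are illustrative names, not
FATE's API):

```python
def iv_threshold_filter(feature_ivs, threshold):
    """Keep features whose IV meets the threshold (take_high=True)."""
    return [name for name, iv in feature_ivs.items() if iv >= threshold]

def top_k_filter(feature_ivs, k):
    # Sort features from larger IV to smaller and take the top k.
    ranked = sorted(feature_ivs, key=feature_ivs.get, reverse=True)
    return ranked[:k]
```

In a cascade, the output of one filter would be fed as the candidate
feature set of the next.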

Besides, we support multi-host federated feature selection for IV
filters. Starting in version 2.0.0-beta, all data sets obtain an
anonymous header during transformation from a local file. Guest uses
the IV filters' logic to judge whether a feature is kept, then sends
the filter result back to the hosts. During this selection process,
Guest never learns the real names of the hosts' features.

![Figure 4: Multi-Host Selection Principle](../images/multi_host_selection.png)