allow only default axis 0 for union(#4668) #5052

Merged 1 commit on Aug 18, 2023
3 changes: 2 additions & 1 deletion doc/2.0/README.md
@@ -2,4 +2,5 @@
- [FATE V2.0 Quick Start](./quick_start.md)
- [FATE Flow V2.0 Quick Start](https://github.com/FederatedAI/FATE-Flow/blob/v2.0.0-alpha/doc/quick_start.md)
- [FATE Flow V2.0 Design](https://github.com/FederatedAI/FATE-Flow/blob/v2.0.0-alpha/doc/2.0.0-alpha.md)
- [OSX Design](./osx/osx.md)
- [FATE Components](./components)
44 changes: 44 additions & 0 deletions doc/2.0/components/README.md
@@ -0,0 +1,44 @@
# Federated Machine Learning

[[中文](README.zh.md)]

FATE-ML includes federated implementations of many common machine
learning algorithms. All modules are developed in a decoupled,
modular approach to enhance scalability. Specifically, we provide:

1. Federated Statistic: PSI, Union, Pearson Correlation, etc.
2. Federated Feature Engineering: Feature Sampling, Feature Binning,
Feature Selection, etc.
3. Federated Machine Learning Algorithms: LR, GBDT, DNN
4. Model Evaluation: Binary | Multiclass | Regression | Clustering
Evaluation
5. Secure Protocol: Provides multiple security protocols for secure
multi-party computing and interaction between participants.

## Algorithm List

| Algorithm | Module Name | Description | Data Input | Data Output | Model Input | Model Output |
|--------------------------------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------------------|-------------------------------|--------------|
| [PSI](psi.md) | PSI | Compute the intersection of multiple parties' data sets without leaking difference-set information. Mainly used in hetero scenario tasks. | input_data | output_data | | |
| [Sampling](sample.md) | Sample | Federated sampling so that data distribution becomes balanced across parties. This module supports local and federated scenarios. | input_data | output_data | | |
| [Data Split](data_split.md) | DataSplit | Split one data table into 3 tables by given ratio or count. This module supports local and federated scenarios. | input_data | train_output_data, validate_output_data, test_output_data | | |
| [Feature Scale](feature_scale.md) | FeatureScale | Module for feature scaling and standardization. | train_data, test_data | train_output_data, test_output_data | input_model | output_model |
| [Data Statistics](statistics.md) | Statistics | Compute statistics on the data, including mean, maximum, minimum, median, etc. | input_data | output_data | | output_model |
| [Hetero Feature Binning](feature_binning.md) | HeteroFeatureBinning | Bin input data, calculate each column's IV and WOE, and transform data according to the binning information. | train_data, test_data | train_output_data, test_output_data | input_model | output_model |
| [Hetero Feature Selection](feature_selection.md) | HeteroFeatureSelection | Provides 3 types of filters. Each filter selects columns according to user config. | train_data, test_data | train_output_data, test_output_data | input_models, input_model | output_model |
| [Coordinated-LR](logistic_regression.md) | CoordinatedLR | Build hetero logistic regression model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Coordinated-LinR](linear_regression.md) | CoordinatedLinR | Build hetero linear regression model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Homo-LR](logistic_regression.md) | HomoLR | Build homo logistic regression model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Homo-NN](homo_nn.md) | HomoNN | Build homo neural network model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Hetero Secure Boosting](ensemble.md) | HeteroSecureBoost | Build hetero secure boosting model through multiple parties. | train_data, validate_data, test_data, cv_data | train_output_data, validate_output_data, test_output_data, cv_output_datas | input_model, warm_start_model | output_model |
| [Evaluation](evaluation.md) | Evaluation | Output model evaluation metrics for users. | input_data | | | |
| [Union](union.md) | Union | Combine multiple data tables into one. | input_data_list | output_data | | |

## Secure Protocol

- [Encrypt](secureprotol.md#encrypt)
- [Paillier encryption](secureprotol.md#paillier-encryption)
- [RSA encryption](secureprotol.md#rsa-encryption)
- [Hash](secureprotol.md#hash-factory)
- [Diffie Hellman Key Exchange](secureprotol.md#diffie-hellman-key-exchange)
32 changes: 32 additions & 0 deletions doc/2.0/components/data_split.md
@@ -0,0 +1,32 @@
# Data Split

The Data Split module splits data into train, test, and/or validate
sets of arbitrary sizes. The module is based on sampling methods.

# Use

Data Split supports local (same as homogeneous) and heterogeneous
(only Guest has y) modes.

Below are the supported split modes and scenarios.

| Split Mode | Federated Heterogeneous | Federated Homogeneous(Local) |
|--------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------------|
| Random | [✓](../../../examples/pipeline/data_split/test_data_split.py) | [✓](../../../examples/pipeline/data_split/test_data_split_multi_host.py) |
| Stratified | [✓](../../../examples/pipeline/data_split/test_data_split_stratified.py) | [✓](../../../examples/pipeline/data_split/test_data_split_stratified.py) |

The Data Split module takes a single data input as specified in the job
config file and always outputs three tables (train, test, and validate
data sets). Each data output may be used as input of another module.
Below are the rules regarding set sizes:

1. if all three set sizes are None, the original data input will be
   split in the following ratio: 80% to train set, 20% to validate
   set, and an empty test set;

2. if only test size or validate size is given, train size is set to
   the complement of the given size;

3. only one of the three sizes is needed to split input data, but all
   three may be specified. The module takes either int (instance
   count) or float (fraction) values for set sizes, but mixed-type
   inputs are not accepted.
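For fraction-type inputs, the size rules above can be sketched as
follows. This is a minimal illustration only; the helper name
`resolve_split_sizes` is hypothetical, not FATE's actual API:

```python
def resolve_split_sizes(train_size=None, validate_size=None, test_size=None):
    """Resolve train/validate/test sizes per the rules above (fractions only)."""
    given = [s for s in (train_size, validate_size, test_size) if s is not None]
    # Mixed int (count) and float (fraction) inputs are not accepted.
    if len({type(s) for s in given}) > 1:
        raise ValueError("mixed-type set sizes are not accepted")
    # Rule 1: all sizes None -> 80% train, 20% validate, empty test.
    if not given:
        return 0.8, 0.2, 0.0
    # Rule 2: train size missing -> complement of the given sizes.
    if train_size is None:
        train_size = 1.0 - sum(given)
    return train_size, validate_size or 0.0, test_size or 0.0
```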
48 changes: 48 additions & 0 deletions doc/2.0/components/feature_binning.md
@@ -0,0 +1,48 @@
# Hetero Feature Binning

Feature binning or data binning is a data pre-processing technique. It
can be used to reduce the effects of minor observation errors, calculate
information values and so on.

Currently, we provide quantile binning and bucket binning methods. The
quantile binning approach uses a special data structure described in
this
[paper](https://www.researchgate.net/profile/Michael_Greenwald/publication/2854033_Space-Efficient_Online_Computation_of_Quantile_Summaries/links/0f317533ee009cd3f3000000/Space-Efficient-Online-Computation-of-Quantile-Summaries.pdf).
Feel free to check out the detailed algorithm in the paper.
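The two methods differ in how bin edges are chosen. The sketch below
contrasts them on a plain list and is illustrative only; FATE
approximates the quantiles with the Greenwald-Khanna summary rather
than sorting the full column, and the helper names are not FATE's API:

```python
def bucket_bin_edges(values, n_bins):
    # Bucket binning: equal-width bins spanning [min, max].
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def quantile_bin_edges(values, n_bins):
    # Quantile binning: equal-frequency bins from exact quantiles.
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]
```

Note that on skewed data the two strategies produce very different
edges, which is why both are offered.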

The following figure describes the principle of calculating the
federated IV and WOE values.

![Figure 1 (Federated Feature Binning
Principle)](../images/binning_principle.png)

As the figure shows, Party B, which holds the data labels, encrypts
its labels with additive homomorphic encryption and sends them to
Party A. Party A computes each bin's encrypted label sum and sends the
results back. Party B can then calculate WOE and IV based on the given
information.

The multiple-host case is similar to the single-host case: Guest sends
its encrypted label information to all hosts, and each host calculates
and sends back its bin statistics.

![Figure 2: Multi-Host Binning
Principle](../images/multiple_host_binning.png)
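Once the per-bin label sums are decrypted, WOE and IV reduce to a
plain computation on event/non-event counts. A minimal local sketch
(no encryption; the `woe_iv` helper is illustrative and assumes every
bin has non-zero counts):

```python
import math

def woe_iv(bin_event_counts, bin_nonevent_counts):
    """Compute per-bin WOE and total IV from event/non-event counts."""
    total_event = sum(bin_event_counts)
    total_nonevent = sum(bin_nonevent_counts)
    woe, iv = [], 0.0
    for e, ne in zip(bin_event_counts, bin_nonevent_counts):
        event_rate = e / total_event
        nonevent_rate = ne / total_nonevent
        w = math.log(event_rate / nonevent_rate)
        woe.append(w)
        iv += (event_rate - nonevent_rate) * w
    return woe, iv
```

A bin whose label distribution matches the overall distribution gets
WOE 0 and contributes nothing to IV; the more predictive the feature,
the larger the IV.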

## Features

1. Support Quantile Binning based on quantile summary algorithm.
2. Support Bucket Binning.
3. Support calculating WOE and IV values.
4. Support transforming data into bin indexes or WOE values (guest only).
5. Support multiple-host binning.
6. Support asymmetric binning methods on Host & Guest sides.

Below lists supported features with links to examples:

| Cases | Scenario |
|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Input Data with Categorical Features | [bucket binning](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_bucket.py) <br> [quantile binning](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_quantile.py) |
| Output Data Transformed | [bin index](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) <br> [woe value(guest-only)](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
| Skip Metrics Calculation | [multi_host](../../../examples/pipeline/hetero_feature_binning/test_feature_binning_multi_host.py) |


17 changes: 17 additions & 0 deletions doc/2.0/components/feature_scale.md
@@ -0,0 +1,17 @@
# Feature Scale

Feature scale is a process that scales each feature along its column.
The Feature Scale module supports min-max scale and standard scale.

1. min-max scale: scales and translates each feature individually
   such that it falls in the given range on the training set, e.g.
   between the min and max values of each feature.
2. standard scale: standardizes features by removing the mean and
   scaling to unit variance.
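Both transforms are simple column-wise arithmetic. A minimal sketch of
the two methods on a plain list (illustrative only; the helper names
are not FATE's API):

```python
def min_max_scale(column):
    # Scale values to [0, 1] using the column's min and max.
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def standard_scale(column):
    # Remove the mean and scale to unit variance (population std).
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]
```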

# Use

| Scale Method | Federated Heterogeneous |
|--------------|------------------------------------------------------------------------|
| Min-Max | [&check;](../../../examples/pipeline/sample/test_sample_unilateral.py) |
| Standard | [&check;](../../../examples/pipeline/sample/test_sample_unilateral.py) |
57 changes: 57 additions & 0 deletions doc/2.0/components/feature_selection.md
@@ -0,0 +1,57 @@
# Hetero Feature Selection

Feature selection is a process that selects a subset of features for
model construction. Taking advantage of feature selection can improve
model performance.

In this version, we provide several filter methods for feature
selection. Note that the module works in a cascade manner: the
selected result of filter A becomes the input of the next filter B.
Users should pay attention to the order of listing when supplying
multiple filters to the `filter_methods` param in job configuration.

## Features

Below lists available input models and their corresponding filter methods with links to examples:

| Input Models | Filter Method |
|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| None | [manual](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_manual.py) |
| Binning | [iv_filter(threshold)](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_binning.py) <br> [iv_filter(top_k)](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_multi_model.py) <br> [iv_filter(top_percentile)](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_multi_host.py) |
| Statistic | [statistic_filter](../../../examples/pipeline/hetero_feature_selection/test_feature_selection_statistics.py) |

Most of the filter methods above share the same set of configurable parameters.
Below lists their acceptable parameter values.

| Filter Method | Parameter Name | metrics | filter_type | take_high |
|-------------------------------------|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------|--------------|
| IV Filter | filter_param | "iv" | "threshold", "top_k", "top_percentile" | True |
| Statistic Filter | statistic_param | "max", "min", "mean", "median", "std", "var", "coefficient_of_variance", "skewness", "kurtosis", "missing_count", "missing_ratio", quantile(e.g."95%") | "threshold", "top_k", "top_percentile" | True/False |

1. iv\_filter: use IV as the criterion to select features. Supports
   three modes: threshold value, top-k, and top-percentile.

   - threshold value: filter out columns whose IV is smaller than
     the threshold. You can also set a different threshold for each
     party.
   - top-k: sort features from larger IV to smaller and take the top
     k features in the sorted result.
   - top-percentile: sort features from larger IV to smaller and
     take the top percentile.

2. statistic\_filter: use statistic values calculated by the
   DataStatistic component. Supports coefficient of variance, missing
   value, percentile value, etc. You can pick columns with higher or
   with smaller statistic values as needed.

3. manual: indicate features that need to be filtered out or kept.
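The threshold and top-k modes amount to simple selections over
per-feature IV (or statistic) values. A hypothetical local sketch
(`iv_threshold_filter` and `top_k_filter` are illustrative names, not
FATE's API):

```python
def iv_threshold_filter(feature_ivs, threshold):
    """Keep features whose IV meets the threshold (take_high=True)."""
    return [name for name, iv in feature_ivs.items() if iv >= threshold]

def top_k_filter(feature_ivs, k):
    # Sort features from larger IV to smaller and take the top k.
    ranked = sorted(feature_ivs, key=feature_ivs.get, reverse=True)
    return ranked[:k]
```

In a cascade, the output of one filter would be fed as the candidate
feature set of the next.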

Besides, we support multi-host federated feature selection for IV
filters. Starting in version 2.0.0-beta, all data sets obtain an
anonymous header during transformation from a local file. Guest uses
the IV filters' logic to judge whether a feature is kept, then sends
the filter result back to the hosts. During this selection process,
Guest never learns the real names of the hosts' features.

![Figure 4: Multi-Host Selection Principle](../images/multi_host_selection.png)