Dev 2.0.0 rc algo doc #5411

Merged
merged 6 commits into from
Dec 29, 2023
4 changes: 2 additions & 2 deletions doc/2.0/fate/components/feature_binning.md
@@ -14,7 +14,7 @@ As for calculating the federated iv and woe values, the following figure
can describe the principle properly.

![Figure 1 (Federated Feature Binning
-Principle)](../images/binning_principle.png)
+Principle)](../../images/binning_principle.png)

As the figure shows, party B, which holds the data labels, encrypts its
labels with additive homomorphic encryption and then sends them to A. A
@@ -26,7 +26,7 @@ encrypted label information to all hosts, and each of the hosts
calculates and sends back the static info.

![Figure 2: Multi-Host Binning
-Principle](../images/multiple_host_binning.png)
+Principle](../../images/multiple_host_binning.png)

## Features

2 changes: 1 addition & 1 deletion doc/2.0/fate/components/feature_selection.md
@@ -54,4 +54,4 @@ whether a feature is left or not. Then guest sends result filter back to hosts.
During this selection process, guest will not know the real name of host(s)' features.

![Figure 4: Multi-Host Selection
-Principle\</div\>](../images/multi_host_selection.png)
+Principle\</div\>](../../images/multi_host_selection.png)
55 changes: 55 additions & 0 deletions doc/2.0/fate/components/hetero_nn.md
@@ -0,0 +1,55 @@
# Hetero NN

In FATE-2.0, we introduce a new Hetero-NN framework that lets you quickly set up a hetero federated NN learning task. Built on PyTorch and transformers, the framework integrates smoothly with your existing datasets and models. For a quick introduction to Hetero-NN, refer to our [quick start](../ml/hetero_nn_tutorial.md).

The architecture of the Hetero-NN framework is depicted in the figure below. In this structure, all submodels from guests and hosts are encapsulated within the HeteroNNModel, enabling independent forward and backward passes. Both guest and host trainers are built on the HuggingFace trainer, allowing rapid configuration of heterogeneous federated learning tasks with your existing datasets and models. These tasks can be run independently, without the need for FATEFlow. The FATE-pipeline Hetero-NN components are built upon this foundational framework.

<div align="center">
<img src="../../images/hetero_nn.png" width="800" height="480" alt="Figure 2 (FedPass)">
</div>

Besides the new framework, we also introduce two new privacy-preserving strategies for federated learning: SSHE and FedPass. These strategies can be configured in the aggregate layer configuration. For more information on these strategies, refer to the [SSHE](#sshe) and [FedPass](#fedpass) sections below.

## SSHE

SSHENN is a privacy-preserving strategy that uses homomorphic encryption and secret sharing to protect the privacy of the model and data. The weights of the guest/host aggregate layer are split into two parts and shared with the cooperating party. The figure below illustrates the process of SSHE. The design of SSHE is inspired by the paper [When Homomorphic Encryption Marries Secret Sharing:
Secure Large-Scale Sparse Logistic Regression and Applications
in Risk Control](https://arxiv.org/pdf/2008.08753.pdf).

![Figure 1 (SSHE)](../../images/sshe.png)
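
The weight-splitting idea can be illustrated with plain additive secret sharing. This is only a minimal sketch: it assumes the weights have already been encoded as integers (for example with a fixed-point encoding), and the names `MODULUS`, `split_into_shares`, and `reconstruct` are hypothetical and are not part of the FATE API. The homomorphic-encryption half of SSHE is not shown.

```python
import random

# Hypothetical ring size for this sketch; not a FATE parameter.
MODULUS = 2**61 - 1


def split_into_shares(weights, modulus=MODULUS):
    """Split each integer-encoded weight into two additive shares."""
    share_a = [[random.randrange(modulus) for _ in row] for row in weights]
    share_b = [
        [(w - a) % modulus for w, a in zip(row, row_a)]
        for row, row_a in zip(weights, share_a)
    ]
    return share_a, share_b


def reconstruct(share_a, share_b, modulus=MODULUS):
    """Recombine two shares; neither share alone reveals the weights."""
    return [
        [(a + b) % modulus for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(share_a, share_b)
    ]


weights = [[3, 14], [15, 92]]          # integer-encoded weights (assumed)
share_a, share_b = split_into_shares(weights)
restored = reconstruct(share_a, share_b)
```

Each party would keep one share and send the other to its cooperating party, so the full weight matrix only exists as the sum of the two shares.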



## FedPass

FedPass works by embedding private passports into a neural network to enhance privacy and obfuscation. It utilizes the DNN passport technique for adaptive obfuscation, which involves inserting a passport layer into the network. This layer adjusts the scale factor and bias term using model parameters and private passports, followed by an autoencoder and averaging. The figure below illustrates the process of FedPass.

<div align="center">
<img src="../../images/fedpass_1.png" alt="Figure 2 (FedPass)">
</div>
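
To make the role of the passport layer more concrete, here is a deliberately simplified sketch. It assumes a single linear layer, derives the scale factor and bias term by applying the layer's own weights to two private passports and averaging, and omits the autoencoder step entirely. The function names and the exact formula are illustrative assumptions, not the FATE implementation or the exact formulation in the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def mean(values):
    return sum(values) / len(values)


def passport_layer(weights, x, passport_scale, passport_bias):
    """Simplified passport layer (hypothetical helper).

    The scale and bias depend on both the trainable weights and the
    private passports, so the output cannot be reproduced without them.
    The autoencoder step described in the text is left out.
    """
    scale = mean(matvec(weights, passport_scale))
    bias = mean(matvec(weights, passport_bias))
    return [scale * z + bias for z in matvec(weights, x)]


weights = [[1.0, 0.0], [0.0, 1.0]]
output = passport_layer(
    weights,
    x=[2.0, 3.0],
    passport_scale=[1.0, 3.0],   # private passport for the scale factor
    passport_bias=[0.5, 1.5],    # private passport for the bias term
)
```

Because the scale and bias are functions of the weights, they change as the weights are trained, which is what makes the obfuscation adaptive rather than fixed.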


In FATE-2.0, you can specify the FedPass strategy for the guest top model and the host bottom model. The figure below shows the architecture of FedPass when running a Hetero-NN task.

<div align="center">
<img src="../../images/fedpass_0.png" width="500" height="400" alt="Figure 2 (FedPass)">
</div>

For more details on FedPass, please refer to the [paper](https://arxiv.org/pdf/2301.12623.pdf).


The features of FedPass are:

- Privacy preserving: Without access to the passports, it is extremely difficult for an attacker to infer inputs from outputs.
- Preserved model performance: The model parameters are optimized through backpropagation, adapting the obfuscation to the model, which offers superior performance compared to fixed obfuscation.
- Speed comparable to plaintext training: FedPass does not require homomorphic encryption or secret sharing, so the training speed is nearly equivalent to that of plaintext training.


## Features

- A brand new Hetero-NN framework built on PyTorch and transformers. It can integrate existing resources, such as models and datasets, into Hetero-NN federated learning. If you are using Hetero-NN in the FATE pipeline, you can configure your customized models and datasets via confs.

- Support SSHE strategy for privacy-preserving training.

- Support FedPass strategy for privacy-preserving training. You can set passports for the host bottom models and the guest
top model. Support single-GPU training.
106 changes: 106 additions & 0 deletions doc/2.0/fate/components/hetero_secureboost.md
@@ -0,0 +1,106 @@
# Hetero SecureBoost

Gradient Boosting Decision Tree (GBDT) is a widely used statistical model
for classification and regression problems. FATE provides a novel
lossless privacy-preserving tree-boosting system known as
[SecureBoost: A Lossless Federated Learning Framework](https://arxiv.org/abs/1901.08755).

This federated learning system allows a learning process to be jointly
conducted over multiple parties with partially common user samples but
different feature sets, which corresponds to a vertically partitioned
data set. An advantage of SecureBoost is that it provides the same level
of accuracy as the non-privacy-preserving approach while revealing no
information on private data.

The following figure shows the proposed Federated SecureBoost framework.

![Figure 1: Framework of Federated SecureBoost](../../images/secureboost.png)

- Active Party

> We define the active party as the data provider who holds both a data
> matrix and the class label. Since the class label information is
> indispensable for supervised learning, there must be an active party
> with access to the label y. The active party naturally takes the
> responsibility as a dominating server in federated learning.

- Passive Party

> We define the data provider which has only a data matrix as a passive
> party. Passive parties play the role of clients in the federated
> learning setting. They are also in need of building a model to predict
> the class label y for their prediction purposes. Thus they must
> collaborate with the active party to build their model to predict y
> for their future users using their own features.

We align the data samples under an encryption scheme by using the
privacy-preserving protocol for inter-database intersections to find the
common shared users or data samples across the parties without
compromising the non-shared parts of the user sets.

To ensure security, passive parties cannot access the gradient and
hessian directly. We use an XGBoost-like tree-learning algorithm. To
keep the gradient and hessian confidential, the active party encrypts
them before sending them to the passive parties. After encryption, the
active party sends the encrypted [gradient] and [hessian] to each passive
party. Each passive party uses [gradient] and [hessian] to
calculate the encrypted feature histograms, then encodes the (feature,
split\_bin\_val) pairs and constructs a (feature, split\_bin\_val) lookup
table; it then sends the encoded values of (feature, split\_bin\_val)
with the feature histograms to the active party. After receiving the feature
histograms from the passive parties, the active party decrypts them and
finds the best gains. If the best-gain feature belongs to a passive
party, the active party sends the encoded (feature, split\_bin\_val)
back to the owner party. The following figure shows the process of
finding a split in federated tree building.

![Figure 2: Process of Federated Split Finding](../../images/split_finding.png)
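
The histogram and gain computation at the heart of split finding can be sketched in plaintext. The example below leaves out encryption, encoding, and communication entirely, and it uses the usual XGBoost-style gain up to a constant factor; `build_histogram`, `best_split`, and `reg_lambda` are illustrative names and should not be read as FATE's actual functions or parameters.

```python
def build_histogram(bin_ids, grads, hess, n_bins):
    """Sum gradients and hessians per feature bin (done by the feature owner)."""
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for b, g, h in zip(bin_ids, grads, hess):
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist


def best_split(g_hist, h_hist, reg_lambda=1.0):
    """Scan bin boundaries and return (best_bin, best_gain)."""
    g_total, h_total = sum(g_hist), sum(h_hist)

    def score(g, h):
        return g * g / (h + reg_lambda)

    best_bin, best_gain = None, 0.0
    g_left = h_left = 0.0
    for b in range(len(g_hist) - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        gain = (
            score(g_left, h_left)
            + score(g_total - g_left, h_total - h_left)
            - score(g_total, h_total)
        )
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


bin_ids = [0, 0, 1, 2, 2]                 # bin index of each sample for one feature
grads = [-0.5, -0.5, 0.5, 0.5, 0.5]
hess = [0.25] * 5
g_hist, h_hist = build_histogram(bin_ids, grads, hess, n_bins=3)
split_bin, split_gain = best_split(g_hist, h_hist)
```

In the federated setting described above, a passive party would run the histogram step on encrypted values, and only the active party, after decryption, would run the gain scan.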

The parties continue the split finding process until tree construction
finishes. Each party only knows the detailed split information of the
tree nodes where the split features are provided by the party. The
following figure shows the final structure of a single decision tree.

![Figure 3: A Single Decision Tree](../../images/tree_structure.png)

To use the learned model to classify a new instance, the active party
first determines which party the current tree node belongs to. If the current node
belongs to the active party, it uses its (feature,
split\_bin\_val) lookup table to decide whether to go to the left child node
or the right; otherwise, the active party sends the node id to the designated
passive party, which checks its lookup table and sends back
which branch the instance should follow. This process continues until
the current node is a leaf. The following figure shows the federated
inference process.

![Figure 4: Process of Federated Inference](../../images/federated_inference.png)
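
The traversal can be sketched as a loop over node ids in which each decision is delegated to the party that owns the node. The tree layout, feature names, thresholds, and the `party_decides_left` helper below are all hypothetical, and the network round trip to a passive party is replaced by a local function call.

```python
# Shared tree skeleton: every party knows the node ids and owners,
# but only the owner knows the split details of a node.
TREE = {
    0: {"owner": "active", "left": 1, "right": 2},
    1: {"owner": "passive", "left": 3, "right": 4},
    2: {"leaf": 0.7},
    3: {"leaf": -0.4},
    4: {"leaf": 0.1},
}

# Private (feature, split value) lookup tables, one per party.
LOOKUP = {
    "active": {0: ("age", 30)},
    "passive": {1: ("income", 5000)},
}


def party_decides_left(party, node_id, features):
    """Stand-in for asking `party` which branch the instance takes."""
    feature, threshold = LOOKUP[party][node_id]
    return features[party][feature] <= threshold


def predict(features):
    node_id = 0
    while "leaf" not in TREE[node_id]:
        node = TREE[node_id]
        go_left = party_decides_left(node["owner"], node_id, features)
        node_id = node["left"] if go_left else node["right"]
    return TREE[node_id]["leaf"]


sample = {"active": {"age": 25}, "passive": {"income": 8000}}
prediction = predict(sample)
```

Only the branch direction crosses party boundaries here; the feature name and split value stay in the owner's lookup table.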

By following the SecureBoost framework, multiple parties can jointly
build a tree ensemble model without leaking private data in federated learning.
If you want to learn more about the algorithm, you can read the paper
linked above.

## HeteroSecureBoost Features

- Support federated machine learning tasks:
    - binary classification, the objective function is binary:bce
    - multi-class classification, the objective function is multi:ce
    - regression, the objective function is regression:l2

- Support multi-host federated machine learning tasks.

- Support Paillier and OU homomorphic encryption schemes.

- Support commonly used XGBoost regularization methods:
    - L1 & L2 regularization
    - Min child weight
    - Min sample split

- Support GOSS Sampling

- Support complete secure tree

- Support hist-subtraction, grad and hess optimization
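
For the binary classification objective, the gradient and hessian that the active party would encrypt are the standard first and second derivatives of binary cross-entropy with respect to the raw score. The sketch below shows that textbook formula only; `bce_grad_hess` is a hypothetical helper, and FATE's own implementation may differ in details such as clipping or sample weighting.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_grad_hess(raw_scores, labels):
    """Per-sample gradient and hessian of binary cross-entropy.

    With p = sigmoid(score): gradient = p - y, hessian = p * (1 - p).
    """
    grads, hess = [], []
    for score, y in zip(raw_scores, labels):
        p = sigmoid(score)
        grads.append(p - y)
        hess.append(p * (1.0 - p))
    return grads, hess


grads, hess = bce_grad_hess(raw_scores=[0.0, 0.0], labels=[1, 0])
```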


21 changes: 21 additions & 0 deletions doc/2.0/fate/components/homo_nn.md
@@ -0,0 +1,21 @@
# Homo NN

Homo (horizontal) federated learning in FATE-2.0 allows multiple parties to collaboratively train a neural network model without sharing their actual data. In this arrangement, different parties possess datasets with the same features but different user samples. Each party locally trains the model on its data subset and shares only the model updates, not the data itself.

Our neural network (NN) framework in FATE-2.0 is built upon the PyTorch and transformers libraries, easing the integration of existing models, including computer vision (CV) models and pretrained large language models (LLMs), and datasets into federated training. The framework is also compatible with GPUs and DeepSpeed for enhanced training efficiency. In the HomoNN module, we support the standard FedAVG algorithm. Using the FedAVGClient and FedAVGServer trainer classes, homo federated learning tasks can be set up quickly and efficiently. The trainers, built on the transformers trainer, allow training and federation parameters to be set consistently via TrainingArguments and FedAVGArguments.

The figure below shows the architecture of the FATE-2.0 Homo-NN framework.

![Figure 1 (Homo-NN)](../../images/homo_nn.png)
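
The aggregation step of FedAVG is a sample-size-weighted average of the clients' parameters. The following sketch shows only that arithmetic on flat parameter lists; the `fedavg` function is a hypothetical helper for illustration and is not the FedAVGServer implementation.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients with flattened parameters and their local sample counts.
global_weights = fedavg(
    client_weights=[[1.0, 2.0], [3.0, 4.0]],
    client_sizes=[1, 3],
)
```

The client with three times as many samples pulls the average three times as strongly, which is why the result lands closer to its parameters.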

## Features

- A new neural network (NN) framework built on PyTorch and transformers. It offers easy integration of existing models, including CV and LLM models, and datasets, and is ready to use out of the box. If you are using Homo-NN in the FATE pipeline, you can configure your customized models and datasets via confs.

- Provides support for the FedAVG algorithm, featuring secure aggregation.

- The Trainer class includes callback support, allowing for customization of the training process.

- FedAVGClient supports a local model mode for local testing.

- Compatible with single and multi-GPU training. The framework also allows for easy integration of DeepSpeed.
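
The text does not say how secure aggregation is realized. One common approach, shown here purely as an assumption, is pairwise additive masking: each pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the sum. The names `mask_updates` and `aggregate` are hypothetical, the updates are assumed to be integer-encoded, and a single seeded generator stands in for the pairwise key agreement a real protocol would need.

```python
import random

MODULUS = 2**32  # hypothetical modulus for integer-encoded updates


def mask_updates(updates, modulus=MODULUS, seed=0):
    """Add pairwise masks so that individual updates are hidden."""
    rng = random.Random(seed)  # stand-in for per-pair shared secrets
    n_clients, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            mask = [rng.randrange(modulus) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] = (masked[i][k] + mask[k]) % modulus
                masked[j][k] = (masked[j][k] - mask[k]) % modulus
    return masked


def aggregate(masked, modulus=MODULUS):
    """Server-side sum; the pairwise masks cancel out."""
    return [sum(column) % modulus for column in zip(*masked)]


updates = [[1, 2], [3, 4], [5, 6]]
total = aggregate(mask_updates(updates))
```

The server only ever sees masked vectors, yet the sum it computes equals the sum of the original updates.
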
2 changes: 1 addition & 1 deletion doc/2.0/fate/components/linear_regression.md
@@ -24,7 +24,7 @@ keys.
The process of HeteroLinR training is shown below:

![Figure 1 (Federated HeteroLinR
-Principle)](../images/HeteroLinR.png)
+Principle)](../../images/HeteroLinR.png)

A sample alignment process is conducted before training. The sample
alignment process identifies overlapping samples in databases of all
10 changes: 5 additions & 5 deletions doc/2.0/fate/components/logistic_regression.md
@@ -28,7 +28,7 @@ alignment process will **not** leak confidential information (e.g.,
sample ids) on the two parties since it is conducted in an encrypted
way.

-![Figure 1 (Federated HeteroLR Principle)](../images/HeteroLR.png)
+![Figure 1 (Federated HeteroLR Principle)](../../images/HeteroLR.png)

In the training process, party A and party B compute the elements
needed for the final gradients. The arbiter aggregates them and computes the
@@ -44,7 +44,7 @@ criterion. Since the arbiter can obtain the completed model weight, the
convergence decision is made by the arbiter.

![Figure 2 (Federated Multi-host HeteroLR
-Principle)](../images/hetero_lr_multi_host.png)
+Principle)](../../images/hetero_lr_multi_host.png)

# Heterogeneous SSHE Logistic Regression

@@ -57,12 +57,12 @@ We have also made some optimizations, so the code may not be exactly
the same as in this paper.
The training process can be divided into
a forward process and a backward process.
-![Figure 3 (forward)](../images/sshe-lr_forward.png)
-![Figure 4 (backward)](../images/sshe-lr_backward.png)
+![Figure 3 (forward)](../../images/sshe-lr_forward.png)
+![Figure 4 (backward)](../../images/sshe-lr_backward.png)

The training process is based on a secure matrix multiplication (SMM) protocol,
which is a hybrid of HE and secret sharing.
-![Figure 5 (SMM)](../images/secure_matrix_multiplication.png)
+![Figure 5 (SMM)](../../images/secure_matrix_multiplication.png)

## Features

2 changes: 1 addition & 1 deletion doc/2.0/fate/components/psi.md
@@ -8,7 +8,7 @@ which offers 128 bits of security with key size of 256 bits.
Below is an illustration of ECDH intersection.

![Figure 1 (ECDH
-PSI)](../images/ecdh_intersection.png)
+PSI)](../../images/ecdh_intersection.png)

For details on how to hash a value to the given curve,
please refer [here](https://datatracker.ietf.org/doc/html/draft-irtf-cfrg-hash-to-curve-10#section-6.7.1).
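
The double-blinding idea behind a Diffie-Hellman style intersection can be sketched with modular exponentiation. Note the substitution: this toy uses the multiplicative group modulo a Mersenne prime instead of an elliptic curve, and a plain SHA-256 reduction instead of a proper hash-to-curve, so it is not secure and is not the FATE implementation. The names `hash_to_group` and `psi` are hypothetical.

```python
import hashlib
import secrets

# Toy group: integers modulo the Mersenne prime 2**127 - 1.
# This replaces the elliptic curve used by the real ECDH protocol.
P = 2**127 - 1


def hash_to_group(item):
    """Map an id to a group element (stand-in for hash-to-curve)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % P


def psi(guest_ids, host_ids):
    """Return the guest ids that also appear on the host side."""
    guest_key = secrets.randbelow(P - 2) + 1
    host_key = secrets.randbelow(P - 2) + 1

    # Guest blinds its ids with its key and sends them to the host.
    guest_once = [pow(hash_to_group(x), guest_key, P) for x in guest_ids]
    # Host blinds them again (order preserved) and blinds its own ids once.
    guest_twice = [pow(v, host_key, P) for v in guest_once]
    host_once = [pow(hash_to_group(y), host_key, P) for y in host_ids]
    # Guest applies its key to the host's values; exponentiation commutes,
    # so shared ids end up with identical doubly blinded values.
    host_twice = {pow(v, guest_key, P) for v in host_once}

    return [x for x, v in zip(guest_ids, guest_twice) if v in host_twice]


common = psi(["a", "b", "c"], ["b", "c", "d"])
```

Neither side sees the other's raw ids; they only compare values that have been blinded with both private keys.
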
1 change: 1 addition & 0 deletions doc/2.0/fate/components/union.md
@@ -11,3 +11,4 @@ Union currently supports concatenation along axis 0.

For tables to be concatenated, their header, including sample id, match id, and label column (if label exists),
should match. Example of such a union task may be found [here](../../../examples/pipeline/union/test_union.py).
Signed-off-by: weijingchen <talkingwallace@sohu.com>