Merge pull request #5411 from FederatedAI/dev-2.0.0-rc-algo-doc
Dev 2.0.0 rc algo doc
dylan-fan authored Dec 29, 2023
2 parents e4d6b3b + 8295a65 commit 59c1785
Showing 26 changed files with 1,105 additions and 13 deletions.
4 changes: 2 additions & 2 deletions doc/2.0/fate/components/feature_binning.md
@@ -14,7 +14,7 @@ As for calculating the federated IV and WOE values, the following figure
illustrates the principle.

![Figure 1 (Federated Feature Binning
Principle)](../images/binning_principle.png)
Principle)](../../images/binning_principle.png)

As the figure shows, party B, which holds the data labels, encrypts its
labels with additive homomorphic encryption and then sends them to A. A
@@ -26,7 +26,7 @@ encrypted label information to all hosts, and each of the hosts
calculates and sends back the statistical information.

![Figure 2: Multi-Host Binning
Principle](../images/multiple_host_binning.png)
Principle](../../images/multiple_host_binning.png)
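
As an illustration of the statistics involved, below is a minimal plaintext sketch of how WOE and IV are derived from per-bin label counts. In the federated setting these counts are accumulated under additive homomorphic encryption, but the arithmetic is the same; function and variable names are ours, and the sign convention for WOE varies between implementations.

```python
import math

def woe_iv(event_counts, non_event_counts):
    """Per-bin WOE and total IV from binned label counts (plaintext sketch)."""
    total_event = sum(event_counts)
    total_non_event = sum(non_event_counts)
    woe, iv = [], 0.0
    for e, ne in zip(event_counts, non_event_counts):
        # The 0.5 floor avoids division by zero / log(0) for empty bins.
        event_rate = max(e, 0.5) / total_event
        non_event_rate = max(ne, 0.5) / total_non_event
        w = math.log(event_rate / non_event_rate)
        woe.append(w)
        iv += (event_rate - non_event_rate) * w
    return woe, iv

# Example: a feature binned into 3 buckets.
woe, iv = woe_iv(event_counts=[10, 30, 60], non_event_counts=[80, 15, 5])
```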

## Features

2 changes: 1 addition & 1 deletion doc/2.0/fate/components/feature_selection.md
@@ -54,4 +54,4 @@ whether a feature is kept or not. Then the guest sends the resulting filter back to the hosts.
During this selection process, the guest never learns the real names of the hosts' features.

![Figure 4: Multi-Host Selection
Principle](../images/multi_host_selection.png)
Principle](../../images/multi_host_selection.png)
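
To make the anonymity property concrete, here is a minimal sketch of the exchange (variable names and the IV-threshold filter are our illustration, not FATE's API): the guest scores features it knows only by anonymous ids, and the host maps the returned filter back to its real columns.

```python
# Host side: features are exposed to the guest under anonymous ids only.
host_features = {"host_0": "income", "host_1": "age", "host_2": "balance"}

# Guest side: sees anonymous ids plus the metric needed for filtering
# (e.g. IV), decides which features survive, and returns only the filter.
iv_by_anon_id = {"host_0": 0.42, "host_1": 0.03, "host_2": 0.27}
selection_filter = {anon: iv >= 0.1 for anon, iv in iv_by_anon_id.items()}

# Host side: translate the filter back to real feature names.
kept = [name for anon, name in host_features.items() if selection_filter[anon]]
print(kept)  # ['income', 'balance']
```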
55 changes: 55 additions & 0 deletions doc/2.0/fate/components/hetero_nn.md
@@ -0,0 +1,55 @@
# Hetero NN

In FATE-2.0, we introduce our new Hetero-NN framework, which allows you to quickly set up a hetero federated NN learning task. Built on PyTorch and transformers, our framework ensures smooth integration of your existing datasets and models. For a quick introduction to Hetero-NN, refer to our [quick start](../ml/hetero_nn_tutorial.md).

The architecture of the Hetero-NN framework is depicted in the figure below. In this structure, all submodels from guests and hosts are encapsulated within the HeteroNNModel, enabling independent forward and backward passes. Both guest and host trainers are developed on top of the HuggingFace trainer, allowing rapid configuration of heterogeneous federated learning tasks with your existing datasets and models. These tasks can be run independently, without the need for FATEFlow. The FATE-pipeline Hetero-NN components are built upon this foundational framework.

<div align="center">
<img src="../../images/hetero_nn.png" width="800" height="480" alt="Figure 2 (FedPass)">
</div>

Besides the new framework, we also introduce two new privacy-preserving strategies for federated learning: SSHE and FedPass. These strategies can be configured in the aggregate layer configuration. For more information on these strategies, refer to the [SSHE](#sshe) and [FedPass](#fedpass) sections below.

## SSHE

SSHE is a privacy-preserving strategy that uses homomorphic encryption and secret sharing to protect the privacy of the model and data. The weights of the guest/host aggregate layer are split into two parts and shared with the cooperating party. The picture below illustrates the process of SSHE. The design of SSHE is inspired by the paper: [When Homomorphic Encryption Marries Secret Sharing:
Secure Large-Scale Sparse Logistic Regression and Applications
in Risk Control](https://arxiv.org/pdf/2008.08753.pdf).

![Figure 1 (SSHE)](../../images/sshe.png)
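
The splitting step can be made concrete with additive secret sharing: each party holds a random-looking share of the aggregate-layer weights, and only the sum of the two shares equals the true matrix. A minimal NumPy sketch (fixed-point arithmetic over a ring; our simplification, not the full SSHE protocol):

```python
import numpy as np

RING = 2 ** 32    # arithmetic ring for the shares
SCALE = 2 ** 16   # fixed-point scaling factor

def share(w):
    """Split a float weight matrix into two additive shares mod RING."""
    fixed = np.round(w * SCALE).astype(np.int64) % RING
    share_a = np.random.randint(0, RING, size=w.shape, dtype=np.int64)
    share_b = (fixed - share_a) % RING   # each share alone looks random
    return share_a, share_b

def reconstruct(share_a, share_b):
    fixed = (share_a + share_b) % RING
    fixed = np.where(fixed >= RING // 2, fixed - RING, fixed)  # re-sign
    return fixed.astype(np.float64) / SCALE

w = np.random.randn(4, 3)   # stands in for an aggregate-layer weight matrix
a, b = share(w)             # one share per party
assert np.allclose(reconstruct(a, b), w, atol=1e-4)
```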



## FedPass

FedPass works by embedding private passports into a neural network to achieve privacy through adaptive obfuscation. It uses the DNN passport technique: a passport layer is inserted into the network, which adjusts the scale factor and bias term from the model parameters and private passports, followed by an autoencoder and averaging. The picture below illustrates
the process of FedPass.
<div align="center">
<img src="../../images/fedpass_1.png" alt="Figure 2 (FedPass)">
</div>
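
A simplified sketch of the passport idea for a linear layer (our own reduction of the paper's construction, which uses convolutional layers, an encoder-decoder pair, and randomized passports): scale and bias are not free parameters but functions of the weights and secret passports, so the obfuscation adapts as the model trains.

```python
import torch
import torch.nn as nn

class PassportLinear(nn.Module):
    """Linear layer whose scale/bias are derived from private passports."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim, bias=False)
        # Private passports: secret random tensors held by the layer owner.
        self.register_buffer("p_gamma", torch.randn(in_dim))
        self.register_buffer("p_beta", torch.randn(in_dim))
        # Stand-in for the paper's autoencoder over passport responses.
        self.encoder = nn.Linear(out_dim, out_dim)

    def forward(self, x):
        # Scale and bias depend on both the weights and the passports,
        # so they adapt during training (adaptive obfuscation). Without
        # the passports, outputs are hard to invert back to inputs.
        gamma = self.encoder(self.fc(self.p_gamma)).mean()
        beta = self.encoder(self.fc(self.p_beta)).mean()
        return gamma * self.fc(x) + beta

layer = PassportLinear(16, 8)
out = layer(torch.randn(4, 16))  # obfuscated activations, shape (4, 8)
```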


In FATE-2.0, you can specify the FedPass strategy for the guest top model and the host bottom model. The picture below shows the architecture of FedPass when running a hetero-NN task.

<div align="center">
<img src="../../images/fedpass_0.png" width="500" height="400" alt="Figure 2 (FedPass)">
</div>

For more details on FedPass, please refer to the [paper](https://arxiv.org/pdf/2301.12623.pdf).


The features of FedPass are:

- Privacy Preserving: Without access to the passports, it is extremely difficult for an attacker to infer inputs from outputs.
- Preserved Model Performance: The model parameters are optimized through backpropagation, adapting the obfuscation to the model, which offers superior performance compared to fixed obfuscation.
- Speed Comparable to Plaintext Training: FedPass requires neither homomorphic encryption nor secret sharing, so the training speed is nearly equivalent to that of plaintext training.


## Features

- A brand-new hetero-NN framework developed on PyTorch and transformers, able to integrate existing resources, such as models and datasets, into hetero-NN federated learning. If you are using Hetero-NN in the FATE pipeline, you can configure your customized models and datasets via confs (see the sketch after this list).

- Support SSHE strategy for privacy-preserving training.

- Support FedPass strategy for privacy-preserving training. You can set passports for host bottom models and the guest top model. Single-GPU training is supported.
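
To make the "existing resources" point concrete: the submodels each party brings are plain `torch.nn.Module`s. A minimal sketch of what a guest/host pair might supply (dimensions are arbitrary; wiring these into the HeteroNNModel and pipeline confs is covered in the quick start):

```python
import torch.nn as nn

# Host: bottom model only; it holds features but never sees labels.
host_bottom = nn.Sequential(nn.Linear(20, 8), nn.ReLU())

# Guest: a bottom model over its own features...
guest_bottom = nn.Sequential(nn.Linear(10, 8), nn.ReLU())

# ...plus a top model consuming the aggregate layer's output (where the
# SSHE / FedPass strategies apply) and producing label predictions.
guest_top = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
```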
106 changes: 106 additions & 0 deletions doc/2.0/fate/components/hetero_secureboost.md
@@ -0,0 +1,106 @@
# Hetero SecureBoost

Gradient Boosting Decision Tree (GBDT) is a widely used statistical model
for classification and regression problems. FATE provides a novel
lossless privacy-preserving tree-boosting system known as
[SecureBoost: A Lossless Federated Learning Framework](https://arxiv.org/abs/1901.08755).

This federated learning system allows a learning process to be jointly
conducted over multiple parties with partially common user samples but
different feature sets, which corresponds to a vertically partitioned
data set. An advantage of SecureBoost is that it provides the same level
of accuracy as the non-privacy-preserving approach while revealing no
information about the private data.

The following figure shows the proposed Federated SecureBoost framework.

![Figure 1: Framework of Federated SecureBoost](../../images/secureboost.png)

- Active Party

> We define the active party as the data provider who holds both a data
> matrix and the class label. Since the class label information is
> indispensable for supervised learning, there must be an active party
> with access to the label y. The active party naturally takes on the
> responsibility of a dominating server in federated learning.
- Passive Party

> We define the data provider which has only a data matrix as a passive
> party. Passive parties play the role of clients in the federated
> learning setting. They also need to build a model to predict
> the class label y for their prediction purposes. Thus they must
> collaborate with the active party to build their model to predict y
> for their future users using their own features.

We align the data samples under an encryption scheme by using the
privacy-preserving protocol for inter-database intersections to find the
common shared users or data samples across the parties without
compromising the non-shared parts of the user sets.

To ensure security, passive parties cannot access the gradient and
hessian directly. We use an "XGBoost"-like tree-learning algorithm. To
keep the gradient and hessian confidential, the active party encrypts
them before sending them to the passive parties, i.e., it sends the
encrypted [gradient] and [hessian]. Each passive party uses [gradient]
and [hessian] to calculate the encrypted feature histograms, encodes the
(feature, split\_bin\_val) pairs and constructs a (feature,
split\_bin\_val) lookup table; it then sends the encoded values of
(feature, split\_bin\_val) together with the feature histograms to the
active party. After receiving the feature histograms from the passive
parties, the active party decrypts them and finds the best gains. If the
best-gain feature belongs to a passive party, the active party sends the
encoded (feature, split\_bin\_val) back to the owner party. The
following figure shows the process of finding a split in federated tree
building.

![Figure 2: Process of Federated Split Finding](../../images/split_finding.png)
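
Once the active party has decrypted a feature's histogram, split scoring is the standard XGBoost-style gain over cumulative gradient/hessian sums. A plaintext sketch (our notation; the constant 1/2 factor and the complexity penalty are omitted, and on the passive side the same per-bin additions happen under HE):

```python
def best_split_gain(grad_hist, hess_hist, reg_lambda=1.0):
    """Scan a feature's histogram and return (best gain, best split bin).

    grad_hist[i] / hess_hist[i]: sums of gradients / hessians of the
    samples falling into bin i, decrypted by the active party.
    """
    g_total, h_total = sum(grad_hist), sum(hess_hist)
    score_parent = g_total ** 2 / (h_total + reg_lambda)
    best_gain, best_bin = 0.0, None
    g_left = h_left = 0.0
    for i in range(len(grad_hist) - 1):
        g_left += grad_hist[i]
        h_left += hess_hist[i]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda)
                - score_parent)
        if gain > best_gain:
            best_gain, best_bin = gain, i  # encodes (feature, split_bin_val)
    return best_gain, best_bin

gain, split_bin = best_split_gain([1.2, -0.5, 2.0, 0.3], [0.8, 1.1, 0.9, 1.0])
```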

The parties continue the split-finding process until tree construction
finishes. Each party only knows the detailed split information of the
tree nodes whose split features it provided. The
following figure shows the final structure of a single decision tree.

![Figure 3: A Single Decision Tree](../../images/tree_structure.png)

To use the learned model to classify a new instance, the active party
first determines which party the current tree node belongs to. If the
node belongs to the active party, it uses its (feature,
split\_bin\_val) lookup table to decide whether to go to the left or
right child node; otherwise, the active party sends the node id to the
designated passive party, which checks its lookup table and sends back
the branch the current node should go to. This process continues until
the current node is a leaf. The following figure shows the federated
inference process.

![Figure 4: Process of Federated Inference](../../images/federated_inference.png)
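
A toy sketch of this routing (class and variable names are ours): the active party walks the tree and, at a node owned by a passive party, sends only the node id; the passive party answers with the branch direction from its private lookup table.

```python
class Node:
    def __init__(self, owner, left=None, right=None, leaf_value=None):
        self.owner = owner              # "active" or a passive party id
        self.left, self.right = left, right
        self.leaf_value = leaf_value    # set only on leaves

def predict(node, sample, branch_oracle):
    """Walk the tree; decisions at passive nodes happen remotely.

    branch_oracle[(owner, node_id)] stands in for asking the owner party,
    which consults its private (feature, split_bin_val) lookup table.
    The active party only ever transmits node ids, never feature values.
    """
    while node.leaf_value is None:
        go_left = branch_oracle[(node.owner, id(node))](sample)
        node = node.left if go_left else node.right
    return node.leaf_value

leaf0, leaf1 = Node("active", leaf_value=0.2), Node("active", leaf_value=0.9)
root = Node("host_0", left=leaf0, right=leaf1)
# host_0 privately knows the root splits on its feature 3 at bin value 5.
branch_oracle = {("host_0", id(root)): lambda s: s[3] <= 5}
print(predict(root, sample=[0, 0, 0, 7], branch_oracle=branch_oracle))  # 0.9
```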

By following the SecureBoost framework, multiple parties can jointly
build a tree ensemble model without leaking privacy in federated
learning. If you want to learn more about the algorithm, you can read
the paper linked above.

## HeteroSecureBoost Features

- Support federated machine learning tasks:
- binary classification, the objective function is binary:bce
- multi classification, the objective function is multi:ce
- regression, the objective function is regression:l2

- Support multi-host federated machine learning tasks.

- Support Paillier and OU (Okamoto-Uchiyama) homomorphic encryption schemes.

- Support commonly used XGBoost regularization methods:
  - L1 & L2 regularization
  - Min child weight
  - Min sample split

- Support GOSS sampling (see the sketch after this list)

- Support complete secure tree

- Support histogram subtraction and grad & hess optimization
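
The GOSS sampling mentioned above, sketched in plain Python following the LightGBM formulation: keep the top-`a` fraction of instances by absolute gradient, randomly sample a `b` fraction of the rest, and up-weight the sampled small-gradient instances by `(1 - a) / b` to keep the gradient estimate unbiased.

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (selected indices, per-index weight multipliers)."""
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k, rand_k = int(a * n), int(b * n)
    top = order[:top_k]                        # keep large-gradient samples
    rest = rng.sample(order[top_k:], rand_k)   # subsample the remainder
    weights = {i: 1.0 for i in top}
    weights.update({i: (1 - a) / b for i in rest})  # compensate the bias
    return top + rest, weights

idx, w = goss_sample([0.9, -0.1, 0.05, 0.7, -0.8, 0.02, 0.3, -0.4, 0.6, 0.01])
```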


21 changes: 21 additions & 0 deletions doc/2.0/fate/components/homo_nn.md
@@ -0,0 +1,21 @@
# Homo NN

The Homo (Horizontal) federated learning in FATE-2.0 allows multiple parties to collaboratively train a neural network model without sharing their actual data. In this arrangement, different parties possess datasets with the same features but different user samples. Each party locally trains the model on its data subset and shares only the model updates, not the data itself.

Our neural network (NN) framework in FATE-2.0 is built upon the PyTorch and transformers libraries, easing the integration of existing models, including computer vision (CV) models, pretrained large language models (LLMs), etc., and datasets into federated training. The framework is also compatible with advanced computing resources like GPUs and DeepSpeed for enhanced training efficiency. In the HomoNN module, we support the standard FedAVG algorithm. Using the FedAVGClient and FedAVGServer trainer classes, homo federated learning tasks can be set up quickly and efficiently. The trainers, developed on the transformers Trainer, facilitate the consistent setting of training and federation parameters via TrainingArguments and FedAVGArguments.

The figure below shows the architecture of the 2.0 Homo-NN framework.

![Figure 1 (Homo-NN Framework)](../../images/homo_nn.png)
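
At the heart of the module is the FedAVG aggregation step, sketched below in plain PyTorch (function names are ours; in FATE the exchange goes through secure aggregation rather than direct state-dict sharing):

```python
import torch

def fed_avg(state_dicts, n_samples):
    """Average client model weights, weighted by local dataset sizes."""
    total = sum(n_samples)
    return {key: sum(sd[key] * (n / total)
                     for sd, n in zip(state_dicts, n_samples))
            for key in state_dicts[0]}

# Two clients with the same architecture but different local data sizes:
m1, m2 = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
global_weights = fed_avg([m1.state_dict(), m2.state_dict()], [600, 400])
```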

## Features

- A new neural network (NN) framework, developed leveraging PyTorch and transformers. This framework offers easy integration of existing models, including CV models, LLMs, etc., and datasets. It's ready to use right out of the box. If you are using Homo-NN in the FATE pipeline, you can configure your customized models and datasets via confs.

- Provides support for the FedAVG algorithm, featuring secure aggregation (see the sketch after this list).

- The Trainer class includes callback support, allowing for customization of the training process.

- FedAVGClient supports a local model mode for local testing.

- Compatible with single and multi-GPU training. The framework also allows for easy integration of DeepSpeed.
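
The secure aggregation mentioned in the list above can be illustrated with the classic pairwise-masking idea: each client pair shares a random mask that one adds and the other subtracts, so the masks cancel and the server recovers only the sum. A toy NumPy sketch (real protocols derive the masks from pairwise key agreement and handle dropouts):

```python
import numpy as np

def masked_updates(updates, seed=42):
    """Each pair (i, j) shares a random mask; i adds it, j subtracts it."""
    rng = np.random.default_rng(seed)  # stands in for pairwise shared seeds
    n = len(updates)
    masked = [u.astype(np.float64).copy() for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.normal(size=updates[0].shape)
            masked[i] += mask   # client i adds the pairwise mask
            masked[j] -= mask   # client j subtracts the same mask
    return masked

updates = [np.ones(3), 2 * np.ones(3), 3 * np.ones(3)]
server_sum = sum(masked_updates(updates))     # masks cancel pairwise
assert np.allclose(server_sum, sum(updates))  # server sees only the sum
```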
2 changes: 1 addition & 1 deletion doc/2.0/fate/components/linear_regression.md
@@ -24,7 +24,7 @@ keys.
The process of HeteroLinR training is shown below:

![Figure 1 (Federated HeteroLinR
Principle)](../images/HeteroLinR.png)
Principle)](../../images/HeteroLinR.png)

A sample alignment process is conducted before training. The sample
alignment process identifies overlapping samples in databases of all
10 changes: 5 additions & 5 deletions doc/2.0/fate/components/logistic_regression.md
@@ -28,7 +28,7 @@ alignment process will **not** leak confidential information (e.g.,
sample ids) on the two parties since it is conducted in an encrypted
way.

![Figure 1 (Federated HeteroLR Principle)](../images/HeteroLR.png)
![Figure 1 (Federated HeteroLR Principle)](../../images/HeteroLR.png)

In the training process, party A and party B compute the elements
needed for the final gradients. The arbiter aggregates them and computes the
@@ -44,7 +44,7 @@ criterion. Since the arbiter can obtain the complete model weights, the
convergence decision happens at the arbiter.

![Figure 2 (Federated Multi-host HeteroLR
Principle)](../images/hetero_lr_multi_host.png)
Principle)](../../images/hetero_lr_multi_host.png)

# Heterogeneous SSHE Logistic Regression

@@ -57,12 +57,12 @@ We have also made some optimizations, so the code may not be exactly
the same as in the paper.
The training process can be described in two parts:
the forward and backward processes.
![Figure 3 (forward)](../images/sshe-lr_forward.png)
![Figure 4 (backward)](../images/sshe-lr_backward.png)
![Figure 3 (forward)](../../images/sshe-lr_forward.png)
![Figure 4 (backward)](../../images/sshe-lr_backward.png)

The training process is based on the secure matrix multiplication protocol (SMM),
a hybrid protocol combining HE and secret sharing.
![Figure 5 (SMM)](../images/secure_matrix_multiplication.png)
![Figure 5 (SMM)](../../images/secure_matrix_multiplication.png)
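
To make the forward pass concrete, here is a plaintext-share sketch (floats instead of ring elements; our simplification): the weights are additively shared, each party computes its share of z = Xw (the cross terms go through SMM in the real protocol), and the sigmoid is evaluated on shares via a degree-1 Taylor approximation, which splits additively.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))         # one party's feature batch
w = np.array([0.5, -1.0, 0.25])     # true model weights

# Additively secret-share the weights: w = w_a + w_b.
w_a = rng.normal(size=3)
w_b = w - w_a

# Each party computes its share of z = X @ w; in SSHE the products that
# mix one party's data with the other's share go through the SMM protocol.
z_a, z_b = X @ w_a, X @ w_b

# Degree-1 Taylor expansion of sigmoid at 0: sigma(z) ~ 0.5 + z / 4.
# It is affine in z, so it splits additively into one share per party.
y_a = 0.25 * z_a + 0.5
y_b = 0.25 * z_b
assert np.allclose(y_a + y_b, 0.5 + (X @ w) / 4)
```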

## Features

2 changes: 1 addition & 1 deletion doc/2.0/fate/components/psi.md
@@ -8,7 +8,7 @@ which offers 128 bits of security with a key size of 256 bits.
Below is an illustration of ECDH intersection.

![Figure 1 (ECDH
PSI)](../images/ecdh_intersection.png)
PSI)](../../images/ecdh_intersection.png)

For details on how to hash a value to a given curve,
please refer to [this draft](https://datatracker.ietf.org/doc/html/draft-irtf-cfrg-hash-to-curve-10#section-6.7.1).
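
The commutative-blinding idea behind ECDH intersection can be sketched with classic Diffie-Hellman over a prime field standing in for the curve (real ECDH PSI replaces the modular exponentiation below with hash-to-curve plus point multiplication on curve25519):

```python
import hashlib

P = 2 ** 255 - 19  # a large prime; a toy stand-in for the curve group

def h(item: str) -> int:
    """Hash an id into the group (stand-in for hash-to-curve)."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P

def blind(items, secret):
    return {pow(h(x), secret, P) for x in items}

a_secret, b_secret = 0x1234567, 0x7654321
a_ids, b_ids = ["u1", "u2", "u3"], ["u2", "u3", "u4"]

# Each party blinds its own ids, exchanges them, then blinds the other's
# once more. Since H(x)^(ab) == H(x)^(ba), doubly blinded values match
# exactly on the common ids and reveal nothing else.
a_view = {pow(v, a_secret, P) for v in blind(b_ids, b_secret)}
b_view = {pow(v, b_secret, P) for v in blind(a_ids, a_secret)}
print(len(a_view & b_view))  # 2: the common ids 'u2' and 'u3'
```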
1 change: 1 addition & 0 deletions doc/2.0/fate/components/union.md
@@ -11,3 +11,4 @@ Union currently supports concatenation along axis 0.

For tables to be concatenated, their headers, including sample id, match id, and label column (if a label exists),
must match. An example of such a union task may be found [here](../../../examples/pipeline/union/test_union.py).
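
The axis-0 concatenation with header checking amounts to the following (a pandas sketch for illustration, not FATE's implementation):

```python
import pandas as pd

t1 = pd.DataFrame({"sample_id": [1, 2], "match_id": ["a", "b"],
                   "x0": [0.1, 0.2], "y": [0, 1]})
t2 = pd.DataFrame({"sample_id": [3, 4], "match_id": ["c", "d"],
                   "x0": [0.3, 0.4], "y": [1, 0]})

assert list(t1.columns) == list(t2.columns)   # headers must match
union = pd.concat([t1, t2], axis=0, ignore_index=True)
```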
Signed-off-by: weijingchen <talkingwallace@sohu.com>