A collection of notes on Secure and Private AI Scholarship Challenge 2019.
Contributions are always welcome!
- Lesson 3: Introducing Differential Privacy
- Lesson 4: Evaluating the Privacy of a Function
- Evaluating The Privacy Of A Function
- Project Intro Evaluating The Privacy Of A Function
- Project Demo Evaluating The Privacy Of A Function
- Project Intro Calculate L1 Sensitivity For Threshold
- Project Demo Calculate L1 Sensitivity For Threshold
- Project Intro Perform A Differencing Attack
- Project Demo Perform A Differencing Attack
- Lesson 5: Introducing Local and Global Differential Privacy
- Introducing Local and Global Differential Privacy
- Making a Function Differentially Private
- Project Intro Implement Local Differential Privacy
- Project Demo Implement Local Differential Privacy
- Project Intro Varying the Amount of Noise
- Project Demo Varying the Amount of Noise
- The Formal Definition of Differential Privacy
- Create a Differentially Private Query
- Project Demo Create a Differentially Private Query
- Lesson 6: Differential Privacy for Deep Learning
- Differential Privacy for Deep Learning
- Project Intro Example Scenario Deep Learning in a Hospital
- Generating Differentially Private Labels For a Dataset
- PATE Analysis
- Where to Go From Here
- Final Project Description
- Guest Interview: Differential Privacy at Apple
- Guest Interview: Privacy and Society - OpenAI
- Lesson 7: Federated Learning
- Introducing Federated Learning
- Introducing PySyft
- Introducing OpenMined and Installing PySyft
- Basic Remote Execution in PySyft
- Playing with Remote Tensors
- Introducing Remote Arithmetic in PySyft
- Learn a Simple Linear Model
- Garbage Collection and Common Errors
- Toy Federated Learning
- Advanced Remote Execution Tools
- PointerChain Operations
- Final Project Description
- Guest Interview: Federated Learning at Google
- Notebook
- Lesson 8: Securing Federated Learning
- Securing Federated Learning
- Project Demo Federated Learning with Trusted Aggregator
- Intro to Additive Secret Sharing
- Fixing Additive Secret Sharing
- Project Intro Build Methods for Encrypt Decrypt and Add
- Project Demo - Build Methods for Encrypt, Decrypt, and Add
- Intro to Fixed Precision Encoding
- Secret Sharing and Fixed Precision in PySyft
- Final Project Description
- Lesson 9: Encrypted Deep Learning
- Introducing Encrypted Deep Learning
- Encrypted Subtraction and Public Multiplication
- Encrypted Computation in PySyft
- Project Intro - Build an Encrypted Database
- Project Demo - Build an Encrypted Database
- Encrypted Deep Learning in PyTorch
- Encrypted Deep Learning in Keras
- Keystone Project Description
- Secure & Private AI Program Conclusion
- Resources
- Credits
- When doing artificial intelligence in the real world, most datasets are siloed (isolated) within large enterprises for two reasons:
- Enterprises face legal risk, which prevents them from wanting to share their datasets outside their organization
- Enterprises have a competitive advantage in holding onto the large datasets collected from/about their customers
- In this lesson we're going to be talking about differential privacy in the context of deep learning.
- In this context, differential privacy is about ensuring that when our neural networks are learning from sensitive data, they're only learning what they're supposed to learn from the data.
- It's a relatively new field: it started with statistical database queries around 2003 and has even more recently been applied in the context of machine learning
- The general goal of DP is to ensure that different kinds of statistical analysis don't compromise privacy
- Privacy is preserved if, after the analysis, the analyzer doesn't know anything about the people in the dataset. They remain "unobserved"
- Dalenius's Ad Omnia Guarantee (1977): anything that can be learned about a participant from the statistical database can be learned without access to the database
- The above definition basically says that anything you actually learn about a person should already be public information
- Cynthia Dwork, Algorithmic Foundations of Differential Privacy: "Differential Privacy" describes a promise, made by a data holder, or curator, to a data subject, and the promise is like this: "You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available"
- The true goal of DP is to provide tools and techniques that allow a data holder to make these promises to the individuals being studied
- We can't just anonymize data: if someone else releases a related anonymized dataset, it is often possible to divulge the private information you're trying to hide by studying the two dataset releases together
- A simple (canonical) database is a database with a single column of 0/1 values, with one row for each person
- If we remove a person from the database, and the query does not change, then that person's privacy is fully protected.
- If the query doesn't change even when we remove someone from the database, then that person wasn't leaking any statistical information into the output of the query
- Write a function that takes this database and creates 5000 other databases, each with one person missing (see the sketch after this list)
- You should end up with 5000 databases of length 4999
- Compare the output of the query on the entire database with the output of the query on each of the parallel databases
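A minimal sketch in PyTorch (the framework used throughout the course); `db` is a random 0/1 database of 5000 entries, and the helper names are just illustrative:

```python
import torch

num_entries = 5000
db = (torch.rand(num_entries) > 0.5).float()   # one 0/1 entry per person

def get_parallel_db(db, remove_index):
    # The same database with one person (row) removed
    return torch.cat((db[:remove_index], db[remove_index + 1:]))

def get_parallel_dbs(db):
    # One parallel database per person, each of length len(db) - 1
    return [get_parallel_db(db, i) for i in range(len(db))]

pdbs = get_parallel_dbs(db)   # 5000 databases of length 4999
```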
- Sensitivity (L1) is the maximum amount that the query changes when removing an individual from the database
- The output of the sum is conditioned on every individual that is a 1 in the database
- Create a single function called sensitivity(query, n_entries) that will (see the sketch after this list):
- Initialize a database of correct size
- Initialize all parallel databases
- Run the query over all databases
- Correctly calculate sensitivity
- Return the sensitivity
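A sketch of that function, reusing `get_parallel_dbs` from the sketch above; `torch.sum` and `torch.mean` serve as example queries:

```python
def sensitivity(query, n_entries=1000):
    # Initialize a database of the correct size and all of its parallel databases
    db = (torch.rand(n_entries) > 0.5).float()
    pdbs = get_parallel_dbs(db)

    full_db_result = query(db)

    # L1 sensitivity: the maximum change in the query when any one person is removed
    max_distance = 0
    for pdb in pdbs:
        distance = torch.abs(query(pdb) - full_db_result)
        if distance > max_distance:
            max_distance = distance
    return max_distance

print(sensitivity(torch.sum))    # ~1 for a 0/1 database
print(sensitivity(torch.mean))   # ~1 / n_entries
```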
- Project 3
- Create the query() function
- Create 10 databases of size 10
- Query each database with a threshold of 5 (calculate sensitivity)
- Print out the sensitivity of each database
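One possible reading of Project 3, where the threshold query returns whether the database's sum exceeds the threshold (reusing `sensitivity` from the previous sketch):

```python
def query(db, threshold=5):
    # 1.0 if the sum of the database exceeds the threshold, otherwise 0.0
    return (db.sum() > threshold).float()

for i in range(10):
    sens = sensitivity(lambda db: query(db, threshold=5), n_entries=10)
    print(sens)   # 0 or 1, depending on how close the random database is to the threshold
```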
- In this concept, we're going to explore how to compromise or attack differential privacy
- All we would have to do is query for the sum of the entire database and then the sum of the entire database without that person. In SQL, this might look something like this
- SELECT count(*) from my_cancer_database;
- SELECT count(*) from my_cancer_database WHERE person_name != "john doe";
- The purpose of this exercise is to give you an intuition for how privacy can fail in these environments.
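A sketch of the same differencing attack in Python, reusing `get_parallel_db` from the sketch above on a synthetic 0/1 database (the target person is at index 10):

```python
import torch

db = (torch.rand(100) > 0.5).float()
pdb = get_parallel_db(db, remove_index=10)   # database with the target person removed

# Differencing attack with a sum query: the difference is exactly the target's value
print(db.sum() - pdb.sum())

# The same attack works with a mean query, since sum = mean * length
print(db.mean() * len(db) - pdb.mean() * len(pdb))
```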
- Local Differential Privacy adds noise to function data points (function inputs)
- Global Differential Privacy adds noise to function outputs
- Trusted Curator is an owner of a database upon which Global Differential Privacy is applied. They are trusted to apply DP correctly.
- Differential Privacy always requires a form of randomness or noise added to the query to protect from things like Differencing Attacks.
- Randomized Response is a technique used in the social sciences when trying to learn about high-level trends for a taboo behavior
- Have you ever jaywalked, perhaps in the last week?
- Plausible Deniability
- Flip a coin two times
- If the first coin flip is heads, answer (yes/no) honestly
- If the first coin flip is tails, answer according to the second coin flip
- Differential Privacy
- Most accurate query with the greatest amount of privacy
- Greatest fit with trust models in the real world (don't waste trust)
- Implement randomized response in our database
- Flip two coins by generating two random 1/0 responses in Python
- Report both the true query and the noised query for database sizes 10, 100, 1000, and 10,000 (see the sketch below)
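A minimal sketch of the two-coin-flip query in PyTorch; because half the answers are honest and half come from a fair coin, the noisy mean is de-skewed by doubling it and subtracting 0.5:

```python
import torch

def query(db):
    true_result = db.float().mean()

    first_coin_flip = (torch.rand(len(db)) > 0.5).float()    # heads: answer honestly
    second_coin_flip = (torch.rand(len(db)) > 0.5).float()   # tails: answer with this coin

    augmented_db = db.float() * first_coin_flip + (1 - first_coin_flip) * second_coin_flip

    # De-skew: E[augmented mean] = 0.5 * true_mean + 0.5 * 0.5
    noised_result = augmented_db.mean() * 2 - 0.5
    return noised_result, true_result

for num_entries in [10, 100, 1000, 10000]:
    db = (torch.rand(num_entries) > 0.5).float()
    noised, true = query(db)
    print(num_entries, "with noise:", noised.item(), "without noise:", true.item())
```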
- Augment the randomized response query from the previous project to allow for varying amounts of randomness to be added
- Varying the amount of noise
- Add a new parameter to the query function. It will now accept the database and a noise parameter, which is a percentage
- Properly rebalance the result of the query given this adjustable parameter (see the sketch below)
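A sketch with an adjustable `noise` parameter, here interpreted as the probability that a person answers randomly instead of honestly; the rebalancing step solves E[noisy mean] = (1 - noise) * true_mean + noise * 0.5 for the true mean:

```python
def query(db, noise=0.2):
    true_result = db.float().mean()

    # With probability (1 - noise) answer honestly, otherwise answer with a fair coin
    first_coin_flip = (torch.rand(len(db)) > noise).float()
    second_coin_flip = (torch.rand(len(db)) > 0.5).float()
    augmented_db = db.float() * first_coin_flip + (1 - first_coin_flip) * second_coin_flip

    # Rebalance the skewed mean back towards the true mean
    noisy_mean = augmented_db.mean()
    private_result = (noisy_mean - 0.5 * noise) / (1 - noise)
    return private_result, true_result
```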
- What we cover so far
- Local Differential Privacy
- Differencing Attack
- Basic queries
- Sensitivity
- Differential Privacy definition
- Global Differential Privacy
- How much noise should we add after the query has been run?
- "Epsilon" and "Delta" measure a threshold for leakage
- How do we actually use epsilon and delta?
- Randomized mechanism is a function with random noise added to its inputs, outputs, and/or inner workings.
- Global Differential Privacy adds noise to the output of a query.
- Local Differential Privacy adds noise to each data input to the query.
- Privacy budget is how much epsilon/delta leakage we allow for our analysis
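For reference, the formal definition that epsilon and delta come from (Dwork & Roth, The Algorithmic Foundations of Differential Privacy):

- A randomized mechanism M is (ε, δ)-differentially private if, for every set of outputs S and for every pair of databases x and y differing in at most one entry:

  Pr[M(x) ∈ S] ≤ e^ε · Pr[M(y) ∈ S] + δ

- ε bounds how much the output distribution can shift when one person is added or removed; δ is the (small) probability that this bound is violated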
- Types of noise
- Gaussian
- Laplacian
- How much noise should we add?
- Type of Noise (Gaussian/Laplacian)
- Sensitivity of Query
- Desired Epsilon (E)
- Desired Delta (d)
- Laplacian noise
- b = sensitivity(query)/epsilon
- delta (δ) is always zero when using Laplacian noise
- Laplace function: `np.random.laplace` (sketch below)
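A sketch of a global DP query with Laplacian noise, using b = sensitivity / epsilon (a sum over a 0/1 database has sensitivity 1, a mean roughly 1 / n):

```python
import numpy as np
import torch

epsilon = 0.5
db = (torch.rand(100) > 0.5).float()

def laplacian_mechanism(db, query, sensitivity):
    b = sensitivity / epsilon          # scale of the Laplacian noise
    noise = np.random.laplace(0, b)    # single draw of Laplacian noise
    return query(db) + noise

print(laplacian_mechanism(db, torch.sum, sensitivity=1))
print(laplacian_mechanism(db, torch.mean, sensitivity=1 / len(db)))
```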
- Perfect Privacy (query): a query to a database returns the same value even if we remove any person from the database
- Perfect Privacy (AI model): training a model on a dataset should return the same model even if we remove any person from the training dataset
- Two points of complexity
- Do we always know where "people" are referenced in the dataset?
- Neural models rarely ever converge to the same location in parameter space, even when trained on the same dataset twice
- Ask each hospital to train a model on their own dataset
- Use each model to predict on your own local dataset, generating 10 labels for each datapoint
- Perform a DP query to generate the final true (DP) label for each datapoint
- Retrain a new model on our local dataset which now has DP labels
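A sketch of the label-generation step, with synthetic predictions standing in for the 10 hospitals' models; each label count gets Laplacian noise before taking the argmax:

```python
import numpy as np

num_teachers, num_examples, num_labels = 10, 1000, 10
epsilon = 0.1

# Synthetic stand-in for the predictions of the 10 remote models on our local dataset
teacher_preds = np.random.randint(0, num_labels, (num_teachers, num_examples))

new_labels = []
for example_preds in teacher_preds.T:                 # all 10 labels for one datapoint
    label_counts = np.bincount(example_preds, minlength=num_labels)
    # Report-noisy-max: add Laplacian noise to each count, then take the argmax
    noisy_counts = label_counts + np.random.laplace(0, 1 / epsilon, num_labels)
    new_labels.append(int(np.argmax(noisy_counts)))

# new_labels are the DP labels used to retrain a model on the local dataset
```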
- Read:
- Topics:
- The Exponential Mechanism
- The Moments Accountant
- Differentially Private Stochastic Gradient Descent
- Advice:
- For deployments - stick with public frameworks!
- Join the Differential Privacy Community
- DP is still in the early days
- Labelled private dataset which you must keep differentially private
- Public unlabeled dataset which you don't need to keep differentially private
- Label public unlabeled data using private dataset and train model based on that public data
- Federated Learning is a technique for training Machine Learning models on data to which you do not have access
- PySyft is an extension of the major deep learning toolkits (such as PyTorch) that adds tools for remote execution, federated learning, and encrypted computation
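A minimal sketch of basic remote execution, assuming the PySyft 0.2.x-era API used in the course (TorchHook, VirtualWorker, .send()/.get()); the worker name "bob" is just an example, and newer PySyft versions use a different API:

```python
import torch
import syft as sy

hook = sy.TorchHook(torch)               # extend torch tensors with .send() / .get()
bob = sy.VirtualWorker(hook, id="bob")   # a simulated remote machine

x = torch.tensor([1, 2, 3, 4, 5])
x_ptr = x.send(bob)        # the data now lives on bob; we only hold a pointer

y_ptr = x_ptr + x_ptr      # arithmetic is executed remotely on bob's machine
print(y_ptr.get())         # retrieve the result back to the local worker
```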
- Trusted Aggregator is a neutral 3rd party who has a machine that we can trust to not look at the gradients when performing the aggregation
- Additive secret sharing allows multiple individuals to add numbers together without any person learning anyone else's inputs to the addition
- We add a modulus Q to the decryption process: the decrypted value is the shares summed together, modulo Q
- encrypt()
- decrypt()
- add()
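A sketch of the three methods for plain Python integers; the field size Q is arbitrary here, and decryption is simply the sum of the shares modulo Q:

```python
import random

Q = 23740629843760239486723   # a large field size; this particular value is arbitrary

def encrypt(x, n_shares=3):
    # Split x into n_shares random values that sum to x modulo Q
    shares = [random.randrange(Q) for _ in range(n_shares - 1)]
    shares.append((x - sum(shares)) % Q)
    return tuple(shares)

def decrypt(shares):
    # Recover the secret by summing the shares modulo Q
    return sum(shares) % Q

def add(a, b):
    # Add two encrypted numbers share-wise; no share holder learns either input
    return tuple((x + y) % Q for x, y in zip(a, b))

assert decrypt(add(encrypt(5), encrypt(10))) == 15
```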
- Federated Learning with Encrypted Gradient Aggregation
- Lesson: Encrypted Deep Learning in Keras
- Step 2: Private Prediction using Syft Keras - Serving (Client)
- Images and notes taken from the lecture videos of the Secure and Private AI course