
Secure and Private AI Scholarship Challenge 2019 Notes


A collection of notes on Secure and Private AI Scholarship Challenge 2019.

Contributions are always welcome!

Lesson 3: Introducing Differential Privacy

Secure & Private AI Program Introduction

  • When doing artificial intelligence in the real world, most datasets are siloed (isolated) within large enterprises for two reasons:
    1. Enterprises face legal risk, which prevents them from wanting to share their datasets outside their organization
    2. Enterprises gain a competitive advantage by holding onto the large datasets collected from/about their customers

Lesson 1 Introduction

  • In this lesson we're going to be talking about differential privacy in the context of deep learning.
  • In this context, differential privacy is about ensuring that when our neural networks are learning from sensitive data, they're only learning what they're supposed to learn from the data.

What Is Differential Privacy (DP)

  • It's a new field; it grew out of work on statistical database queries starting around 2003, and much of the research is even more recent

  • The general goal of DP is to ensure that different kinds of statistical analysis don't compromise privacy

  • Privacy is preserved if

    After the analysis, the analyzer doesn't know anything about the people in the dataset. They remain "unobserved"

  • Dalenius's Ad Omnia Guarantee (1977)

    Anything that can be learned about a participant from the statistical database can be learned without access to the database

  • The definition above basically says that anything you actually learn about a person should only be information that is already public

  • Cynthia Dwork, The Algorithmic Foundations of Differential Privacy:

    "Differential Privacy" describes a promise, made by a data holder, or curator, to a data subject, and the promise is like this: "You will not be affected, adversely or otherwiese, by allowing your data to be used in any study or analysis, no matter what other sudies, data sets, or information sources, are available"

  • The true goal of DP is to provide tools and techniques that allow a data holder to make this promise to the individuals being studied

Can We Just Anonymize Data

  • We can't just anonymize data: if someone else releases a related anonymized dataset, it is often possible to divulge the private information you're trying to hide by studying the two dataset releases together

Introducing The Canonical Database

  • A simple (canonical) database is a database with a single column and one row for each person
  • If we remove a person from the database and the query does not change, then that person's privacy is fully protected
  • If the query doesn't change even when we remove someone from the database, then that person wasn't leaking any statistical information into the output of the query

Project Intro Build A Private Database In Python

  • Write a function that takes this database and creates 5,000 other databases, each with one person missing
  • You should end up with 5,000 databases of length 4,999
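
A minimal sketch of this project in PyTorch, assuming the binary (0/1) database from the lectures; the helper name `create_db_and_parallels` is illustrative, not from the course:

```python
import torch

def create_db_and_parallels(num_entries=5000):
    # The canonical database: one random 0/1 entry per person
    db = (torch.rand(num_entries) > 0.5).float()
    # One parallel database per person, each with that one person removed
    pdbs = [torch.cat((db[:i], db[i + 1:])) for i in range(num_entries)]
    return db, pdbs

db, pdbs = create_db_and_parallels()
print(len(pdbs), len(pdbs[0]))  # 5000 parallel databases, each of length 4999
```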

Notebook Exercise Instructions

Project Demo Build A Private Database In Python

Lesson 4: Evaluating the Privacy of a Function

Evaluating The Privacy Of A Function

  • Compare the output of the query on the entire database with the output of the query on each of the parallel databases
  • Sensitivity (L1) is the maximum amount that the query changes when removing an individual from the database
  • The output of the sum is conditioned on every individual that is a 1 in the database

Project Intro Evaluating The Privacy Of A Function

  • Create a single function called sensitivity(query, n_entries)
    • Initialize a database of correct size
    • Initialize all parallel databases
    • Run the query over all databases
    • Correctly calculate sensitivity
    • Return the sensitivity
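
A sketch of what such a function might look like, assuming the same binary database setup as in the earlier project:

```python
import torch

def sensitivity(query, n_entries=1000):
    # Initialize a random binary database of the correct size
    db = (torch.rand(n_entries) > 0.5).float()
    full_result = query(db)
    # Compare the query over the full database against every parallel database
    max_distance = 0
    for i in range(n_entries):
        pdb = torch.cat((db[:i], db[i + 1:]))  # parallel database without person i
        distance = torch.abs(query(pdb) - full_result).item()
        max_distance = max(max_distance, distance)
    return max_distance

print(sensitivity(torch.sum))   # 1.0: removing a person changes a sum by at most 1
print(sensitivity(torch.mean))  # roughly 1 / n_entries for a mean query
```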

Project Demo Evaluating The Privacy Of A Function

Project Intro Calculate L1 Sensitivity For Threshold

  • Project 3
    • Create the query() function
    • Create 10 databases of size 10
    • Query each database with a threshold of 5 (calculate sensitivity)
    • Print out the sensitivity of each database
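
One possible implementation; treating "exceeds the threshold" as strictly greater than 5 is an assumption of this sketch:

```python
import torch

def query(db, threshold=5):
    # Returns 1 if the sum of the database exceeds the threshold, else 0
    return (db.sum() > threshold).float()

for _ in range(10):
    db = (torch.rand(10) > 0.5).float()
    full_result = query(db)
    # L1 sensitivity of this database: max change over its parallel databases
    sens = max(torch.abs(query(torch.cat((db[:i], db[i + 1:]))) - full_result).item()
               for i in range(len(db)))
    print(sens)  # 0.0 unless removing one person flips the threshold
```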

Project Demo Calculate L1 Sensitivity For Threshold

Project Intro Perform A Differencing Attack

  • In this concept, we're going to explore how the privacy of a simple query can be compromised with a differencing attack
  • All we would have to do is query for the sum of the entire database and then the sum of the entire database without that person. In SQL, this might look something like this:
    • SELECT count(*) from my_cancer_database;
    • SELECT count(*) from my_cancer_database WHERE person_name != "john doe";
  • The purpose of this exercise is to give you an intuition for how privacy can fail in these environments.
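
The same differencing attack in Python, assuming a binary database where each entry is one person's sensitive value:

```python
import torch

db = (torch.rand(100) > 0.5).float()
pdb = torch.cat((db[:10], db[11:]))  # the database without person 10

# The difference between the two sums is exactly that person's value
print((db.sum() - pdb.sum()).item())
```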

Project Demo Perform A Differencing Attack

Lesson 5: Introducing Local and Global Differential Privacy

Introducing Local and Global Differential Privacy

  • Local Differential Privacy adds noise to each individual data point (the function's inputs)
  • Global Differential Privacy adds noise to the function's output
  • A Trusted Curator is the owner of a database upon which Global Differential Privacy is applied; they are trusted to apply DP correctly

Making a Function Differentially Private

  • Differential Privacy always requires a form of randomness or noise added to the query to protect from things like Differencing Attacks.
  • Randomized Response is a technique used in the social sciences when trying to learn about high-level trends for a taboo behavior
    • Have you ever jaywalked, perhaps in the last week?
  • Plausible Deniability
    • Flip a coin two times
    • If the first coin flip is heads, answer (yes/no) honestly
    • If the first coin flip is tails, answer according to the second coin flip
  • Differential Privacy
    • Most accurate query with the greatest amount of privacy
    • Greatest fit with trust models in the actual world (don't waste trust)

Project Intro Implement Local Differential Privacy

  • Implement randomized response in our database
  • Flip two coins by generating two random 1/0 responses in Python
  • Report both the true query and the noised query for database sizes 10, 100, 1000, and 10,000
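
A sketch of randomized response along these lines. The deskewing step follows from the fact that half the answers are honest and the other half have mean 0.5, so the expected noised mean is 0.5 * true_mean + 0.25:

```python
import torch

def randomized_response(db):
    # First coin flip: heads (1) -> answer honestly; tails (0) -> use second flip
    first_flip = (torch.rand(len(db)) > 0.5).float()
    second_flip = (torch.rand(len(db)) > 0.5).float()
    noised_db = db * first_flip + (1 - first_flip) * second_flip
    # Deskew: E[noised mean] = 0.5 * true_mean + 0.25
    return 2 * (noised_db.mean() - 0.25), db.mean()

for size in (10, 100, 1000, 10000):
    db = (torch.rand(size) > 0.5).float()
    noised_result, true_result = randomized_response(db)
    print(size, true_result.item(), noised_result.item())
```

The noised result gets closer to the true result as the database grows, which is why local DP works best with many participants.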

Project Demo Implement Local Differential Privacy

Project Intro Varying the Amount of Noise

  • Augment the randomized response query from the previous project to allow for varying amounts of randomness to be added
  • Varying the amount of noise
    • Add a new parameter to the query function. It will now accept the database and a noise parameter, which is a percentage
    • Properly rebalance the result of the query given this adjustable parameter
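
One way this might look. Here `noise` is taken to be the probability that an answer is random rather than honest, which is an assumption about the parameterization:

```python
import torch

def query(db, noise=0.2):
    # With probability (1 - noise) answer honestly, otherwise answer randomly
    honest = (torch.rand(len(db)) > noise).float()
    random_answer = (torch.rand(len(db)) > 0.5).float()
    noised_db = db * honest + (1 - honest) * random_answer
    # Rebalance: E[noised mean] = (1 - noise) * true_mean + noise * 0.5
    return (noised_db.mean() - 0.5 * noise) / (1 - noise)

db = (torch.rand(1000) > 0.5).float()
for noise in (0.1, 0.3, 0.5, 0.8):
    print(noise, query(db, noise).item())
```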

Project Demo Varying the Amount of Noise

The Formal Definition of Differential Privacy

  • What we have covered so far
    • Local Differential Privacy
    • Differencing Attack
    • Basic queries
    • Sensitivity
    • Differential Privacy definition
    • Global Differential Privacy
  • How much noise should we add after the query has been run?
  • "Epsilon" and "Delta" measure a threshold for leakage

Create a Differentially Private Query

  • How do we actually use epsilon and delta?
  • A randomized mechanism is a function with random noise added to its inputs, outputs, and/or inner workings.
  • Global Differential Privacy adds noise to the output of a query.
  • Local Differential Privacy adds noise to each data input to the query.
  • Privacy budget is how much epsilon/delta leakage we allow for our analysis
  • Types of noise
    • Gaussian
    • Laplacian
  • How much noise should we add?
    • Type of Noise (Gaussian/Laplacian)
    • Sensitivity of Query
    • Desired epsilon (ε)
    • Desired delta (δ)
  • Laplacian noise
    • b = sensitivity(query) / epsilon
    • δ is always zero
  • Laplace function: np.random.laplace
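
A sketch of a globally differentially private query using Laplacian noise with b = sensitivity/epsilon, as above; the function name `laplacian_mechanism` is illustrative:

```python
import numpy as np
import torch

epsilon = 0.5

def laplacian_mechanism(db, query, sensitivity):
    beta = sensitivity / epsilon        # scale b of the Laplace distribution
    noise = np.random.laplace(0, beta)  # noise added to the query output
    return query(db) + noise

db = (torch.rand(100) > 0.5).float()
print(laplacian_mechanism(db, torch.sum, sensitivity=1))         # sum query
print(laplacian_mechanism(db, torch.mean, sensitivity=1 / 100))  # mean query
```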

Project Demo Create a Differentially Private Query

Lesson 6: Differential Privacy for Deep Learning

Differential Privacy for Deep Learning

  • Perfect Privacy (query): a query to a database returns the same value even if we remove any person from the database
  • Perfect Privacy (AI model): training a model on a dataset should return the same model even if we remove any person from the training dataset
  • Two points of complexity
    • Do we always know where "people" are referenced in the dataset?
    • Neural models rarely converge to the same parameters, even when trained on the same dataset twice

Project Intro Example Scenario Deep Learning in a Hospital

  • Ask each hospital to train a model on their own dataset
  • Use each model to predict on your own local dataset, generating 10 labels for each datapoint
  • Perform a DP query to generate the final true (DP) label for each datapoint
  • Retrain a new model on our local dataset which now has DP labels
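
A sketch of the label-generation step, using hypothetical (randomly generated) teacher predictions and a noisy argmax over the vote counts:

```python
import numpy as np

epsilon = 0.1
num_teachers, num_examples, num_labels = 10, 100, 10

# Hypothetical predictions from the 10 hospital models on our local dataset
preds = np.random.randint(0, num_labels, (num_teachers, num_examples))

new_labels = []
for votes in preds.T:  # the 10 predicted labels for one datapoint
    counts = np.bincount(votes, minlength=num_labels)
    # Add Laplacian noise to each vote count; the noisy argmax is the DP label
    noisy_counts = counts + np.random.laplace(0, 1 / epsilon, num_labels)
    new_labels.append(int(np.argmax(noisy_counts)))
```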

Generating Differentially Private Labels For a Dataset

PATE Analysis

Where to Go From Here

Final Project Description

  • Labelled private dataset which you must keep differentially private
  • Public unlabeled dataset which you don't need to keep differentially private
  • Label the public unlabeled data using the private dataset, then train a model on that public data

Guest Interview: Differential Privacy at Apple

Guest Interview: Privacy and Society - OpenAI

Lesson 7: Federated Learning

Introducing Federated Learning

  • Federated Learning is a technique for training Machine Learning models on data to which you do not have access

Introducing PySyft

  • PySyft is an extension to the major deep learning toolkits (such as PyTorch) that supports federated learning and other privacy-preserving techniques
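
A minimal sketch of remote execution using the PySyft 0.x-era API from the course; newer PySyft releases expose a different API:

```python
import torch
import syft as sy

hook = sy.TorchHook(torch)              # extend torch with PySyft's hooks
bob = sy.VirtualWorker(hook, id="bob")  # a simulated remote machine

x = torch.tensor([1, 2, 3, 4, 5]).send(bob)  # the data now lives on bob
y = x + x                                    # executed remotely via a pointer
print(y.get())                               # retrieve the result from bob
```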

Introducing OpenMined and Installing PySyft

Basic Remote Execution in PySyft

Playing with Remote Tensors

Introducing Remote Arithmetic in PySyft

Learn a Simple Linear Model

Garbage Collection and Common Errors

Toy Federated Learning

Advanced Remote Execution Tools

PointerChain Operations

Final Project Description

Guest Interview: Federated Learning at Google

Notebook

Lesson 8: Securing Federated Learning

Securing Federated Learning

  • A Trusted Aggregator is a neutral third party with a machine that we can trust not to look at the gradients while performing the aggregation

Project Demo Federated Learning with Trusted Aggregator

Intro to Additive Secret Sharing

  • Additive secret sharing allows multiple individuals to add numbers together without any person learning anyone else's inputs to the addition

Fixing Additive Secret Sharing

  • We fix additive secret sharing by adding a modulus Q to the decryption process: the shares are summed together and the result is taken modulo Q, so each share can be a uniformly random number that reveals nothing on its own

Project Intro Build Methods for Encrypt Decrypt and Add

  • encrypt()
  • decrypt()
  • add()
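
A sketch of these three methods for integer secrets; the particular modulus Q is arbitrary as long as it is large:

```python
import random

Q = 23740629843760239486723  # a large modulus; shares are integers mod Q

def encrypt(x, n_shares=3):
    # Split x into additive shares that sum to x modulo Q
    shares = [random.randrange(Q) for _ in range(n_shares - 1)]
    shares.append((x - sum(shares)) % Q)
    return tuple(shares)

def decrypt(shares):
    return sum(shares) % Q

def add(a, b):
    # Each party adds its own two shares locally; no plaintext is revealed
    return tuple((x + y) % Q for x, y in zip(a, b))

print(decrypt(encrypt(5)))                    # 5
print(decrypt(add(encrypt(5), encrypt(10))))  # 15
```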

Project Demo - Build Methods for Encrypt, Decrypt, and Add

Intro to Fixed Precision Encoding

Secret Sharing and Fixed Precision in PySyft

Final Project Description

  • Federated Learning with Encrypted Gradient Aggregation

Lesson 9: Encrypted Deep Learning

Introducing Encrypted Deep Learning

Encrypted Subtraction and Public Multiplication

Encrypted Computation in PySyft

Project Intro - Build an Encrypted Database

Project Demo - Build an Encrypted Database

Encrypted Deep Learning in PyTorch

Encrypted Deep Learning in Keras

Keystone Project Description

Secure & Private AI Program Conclusion

Resources

Credits

  1. Images and notes taken from the lecture videos at Secure and Private AI