PrivacyProtectedArtificialGenomes

Generate artificial human genomes using a GAN with privacy-preserving techniques (gradient clipping).

Installation and Preparation

Clone this repository

git clone https://github.com/cBioLab/PrivacyProtectedArtificialGenomes
cd PrivacyProtectedArtificialGenomes

Create conda environment

conda create -n ppag python=3.9
conda activate ppag
pip install -r requirements.txt

Unzip data.zip

unzip GAN_2000/data.zip
unzip GAN_805_random/data.zip
unzip GAN_805_EAS/data.zip

Merge the separately stored model components in the sample directories (2000 SNP only)

Until you run the commands below, the following components are stored as separate files:
- Generator
- Discriminator
- Optimizer of the generator
- Optimizer of the discriminator
```
python model_concat.py baseline ./GAN_2000
python model_concat.py clipping ./GAN_2000
python model_concat.py dp ./GAN_2000
```
Run the experiments

You can experiment with the following:
- Membership inference attacks
- Genotype imputation
- Model training
- Generating artificial genomes from trained models
The following sections describe each experiment.

Note

The commands below are also provided in the scripts directory. You can either run the commands directly or run the corresponding sh file.

Membership Inference Attacks

You can test membership inference attacks.

When executing, specify several arguments:

model_dir: Path to the directory of the target model, relative to work_dir.
model_name: File name of the model in model_dir.
model_type: Type of the target model. Choose from [Baseline, Clipping, DP].

2000 SNP model

The models trained on the 2000 SNP dataset use a dropout layer, so you need to specify the following argument.

dropout: Dropout rate. We used 0.1.

Gradient Clipping Model / Differential Privacy Model

When targeting the gradient clipping model or the differential privacy model, you need to specify the following arguments.

apply_dp: Flag indicating that Opacus is used.
sigma: Amount of noise added during training. Use 0 for the clipping model and 0.04 for the differential privacy model.
c: Clipping value used during training. Use 0.5 for both models.
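
To make the roles of sigma and c concrete, here is a minimal sketch of a clipped and noised gradient step in the style of DP-SGD. It assumes that c acts as a per-sample L2 clipping bound and that sigma is a noise multiplier, so the Gaussian noise has standard deviation sigma * c (the convention Opacus uses). The function names and toy gradients are hypothetical and are not taken from this repository.

```python
import math
import random


def clip_gradient(grad, c):
    """Scale grad so its L2 norm is at most c (unchanged if already within c)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_gradient(per_sample_grads, c, sigma, rng):
    """Sum clipped per-sample gradients, add Gaussian noise (std sigma * c), then average."""
    clipped = [clip_gradient(g, c) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, sigma * c) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


# Toy per-sample gradients (assumed values): L2 norms are 5.0 and about 0.22.
grads = [[3.0, 4.0], [0.1, 0.2]]
rng = random.Random(0)

# sigma = 0 gives clipping only; sigma = 0.04 additionally adds noise.
clipping_only = private_gradient(grads, c=0.5, sigma=0.0, rng=rng)
with_noise = private_gradient(grads, c=0.5, sigma=0.04, rng=rng)
```

With sigma set to 0 the noise term vanishes, which is why the clipping model and the differential privacy model can share the same code path and differ only in this value.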

Examples

Targeting baseline model. (2000 SNP dataset)

python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/baseline --model_name baseline.pt --model_type Baseline --dropout 0.1

Targeting clipping model. (2000 SNP dataset)

python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/clipping --model_name clipping.pt --model_type Clipping --dropout 0.1 --apply_dp --sigma 0 -c 0.5

Targeting differential privacy model. (2000 SNP dataset)

python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/dp --model_name dp.pt --model_type DP --dropout 0.1 --apply_dp --sigma 0.04 -c 0.5

Targeting differential privacy model. (805 SNP dataset, random split)

python main.py --work_dir ./GAN_805_random --wb_attack --model_dir models/samples/dp --model_name dp.pt --model_type DP --apply_dp --sigma 0.04 -c 0.5
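
For readers new to membership inference, the sketch below shows what such an attack measures in general. It is a generic threshold attack, not necessarily what `--wb_attack` implements: each sample gets a score (for example, a discriminator output), and the attacker guesses "member" when the score exceeds a threshold. The scores and function names are hypothetical.

```python
def attack_accuracy(member_scores, nonmember_scores, threshold):
    """Fraction of samples whose membership is guessed correctly by thresholding."""
    true_pos = sum(s > threshold for s in member_scores)
    true_neg = sum(s <= threshold for s in nonmember_scores)
    return (true_pos + true_neg) / (len(member_scores) + len(nonmember_scores))


def best_threshold(member_scores, nonmember_scores):
    """Try every observed score as the threshold and keep the most accurate one."""
    candidates = sorted(set(member_scores) | set(nonmember_scores))
    return max(
        candidates,
        key=lambda t: attack_accuracy(member_scores, nonmember_scores, t),
    )


# Hypothetical scores: an overfitted model tends to rate its training samples higher.
members = [0.9, 0.8, 0.7, 0.4]
nonmembers = [0.6, 0.3, 0.2, 0.1]

t = best_threshold(members, nonmembers)
acc = attack_accuracy(members, nonmembers, t)
print(f"threshold={t}, accuracy={acc}")
```

With equally sized member and non-member sets, an accuracy close to 0.5 means the attack does no better than guessing, which is the outcome a privacy-preserving model aims for.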

Genotype Imputation

You can test genotype imputation using IMPUTE2.

Note

First, please download IMPUTE2 from the official website and place the executable file in the modules/imputation directory.

When executing, specify several arguments:

ref_type: The type of dataset used for reference. Choose from [1KG, GAN, Clipping, DP].
ref_haps_size: The number of haplotypes used for the reference.

If using artificial genomes

model_dir: Path to the directory of the target model, relative to work_dir.
ag_file_name: The file name of the artificial genome in the model_dir. Zip files are also supported.

Examples

Use real data with 4000 haplotypes as a reference. (1KG_4000)

python main.py --work_dir ./GAN_2000/ --imputation --ref_type 1KG --ref_haps_size 4000

Use artificial data with 4000 haplotypes generated by baseline model as a reference. (Baseline_4000)

python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/baseline --ag_file_name 16000_output.hapt --ref_type GAN --ref_haps_size 4000

Use artificial data with 20000 haplotypes generated by clipping model as a reference. (Clipping_20000)

python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/clipping --ag_file_name 16000_output_regen.hapt.zip --ref_type Clipping --ref_haps_size 20000

Use artificial data with 40000 haplotypes generated by differential privacy model as a reference. (DP_40000)

python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/dp --ag_file_name 16000_output_regen.hapt.zip --ref_type DP --ref_haps_size 40000
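
If you want to run several reference configurations in one go, a small driver script can assemble the commands. The sketch below is one possible way to do this. It only uses flags documented in this section, while the helper name `imputation_command` and the `configs` list are hypothetical.

```python
import shlex


def imputation_command(work_dir, ref_type, ref_haps_size,
                       model_dir=None, ag_file_name=None):
    """Build the argument list for one imputation run of main.py."""
    cmd = ["python", "main.py", "--work_dir", work_dir, "--imputation"]
    # model_dir and ag_file_name are only needed for artificial genomes.
    if model_dir is not None:
        cmd += ["--model_dir", model_dir]
    if ag_file_name is not None:
        cmd += ["--ag_file_name", ag_file_name]
    cmd += ["--ref_type", ref_type, "--ref_haps_size", str(ref_haps_size)]
    return cmd


# Three of the configurations from the examples in this section.
configs = [
    dict(ref_type="1KG", ref_haps_size=4000),
    dict(ref_type="GAN", ref_haps_size=4000,
         model_dir="models/samples/baseline", ag_file_name="16000_output.hapt"),
    dict(ref_type="DP", ref_haps_size=40000,
         model_dir="models/samples/dp", ag_file_name="16000_output_regen.hapt.zip"),
]

for config in configs:
    cmd = imputation_command("./GAN_2000/", **config)
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids shell quoting problems if a path ever contains spaces.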

Training

Sample models are provided, so you can run the experiments without training a model yourself. You can also train your own model on one of the datasets.

Specify the training parameters as arguments. See main.py for a description of each parameter.

Example

Create a baseline model using the 805 SNP random-split dataset.

python main.py --train --work_dir ./GAN_805_random --out_dir models/baseline  --g_learn 0.0001 --d_learn 0.0008 --epochs 16000 --save_that 1000 --norm None --ag_size 4000

Create a model with gradient clipping using the 805 SNP dataset that excludes East Asians.

python main.py --train --work_dir ./GAN_805_EAS --out_dir models/clip  --g_learn 0.0001 --d_learn 0.0008 --epochs 16000 --save_that 1000 --norm None --ag_size 4000 --apply_dp --sigma 0 -c 0.5

Create a model applying differential privacy using a 2000 SNP dataset.

python main.py --train --work_dir ./GAN_2000 --out_dir models/dp  --g_learn 0.00008 --d_learn 0.00064 --epochs 16000 --save_that 1000 --dropout 0.1 --norm None --ag_size 4000 --apply_dp --sigma 0.04 -c 0.5 --use_poisson_sampling
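
The last command passes `--use_poisson_sampling`. In differentially private training, Poisson sampling usually means that each training sample joins a batch independently with a fixed probability, so batch sizes vary from step to step. The sketch below illustrates that idea under the assumption that the flag corresponds to this standard scheme. The function name and the sample rate are illustrative only.

```python
import random


def poisson_batches(n_samples, sample_rate, n_steps, rng):
    """Yield index batches; each sample is included independently with prob sample_rate."""
    for _ in range(n_steps):
        yield [i for i in range(n_samples) if rng.random() < sample_rate]


rng = random.Random(0)
batches = list(poisson_batches(n_samples=1000, sample_rate=0.032, n_steps=5, rng=rng))

# Batch sizes fluctuate around n_samples * sample_rate = 32 instead of being fixed.
sizes = [len(batch) for batch in batches]
print(sizes)
```

Unlike shuffling with fixed-size batches, a sample may appear in several consecutive batches or in none of them, and a batch can occasionally be empty.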

Regenerate

Generate new artificial genomes from an already trained model.

Specify the following arguments:

model_dir: Path to the directory of the target model, relative to work_dir.
model_name: File name of the model in model_dir.
ag_size: Number of artificial genomes to be generated.

Example

From a baseline model trained on the 805 SNP random-split dataset, generate 10000 haplotypes.

python main.py --regenerate --work_dir ./GAN_805_random --model_dir models/samples/baseline --model_name baseline.pt --ag_size 10000

From a gradient clipping model using a 2000 SNP dataset, generate 10000 haplotypes.

python main.py --regenerate --work_dir ./GAN_2000 --model_dir models/samples/clipping --model_name clipping.pt --ag_size 10000
