Generating artificial human genomes using GAN with privacy-preserving techniques (gradient clipping).
-
Clone this repository
git clone https://github.com/cBioLab/PrivacyProtectedArtificialGenomes
cd PrivacyProtectedArtificialGenomes
-
Create conda environment
conda create -n ppag python=3.9
conda activate ppag
pip install -r requirements.txt
-
Unzip data.zip
unzip GAN_2000/data.zip
unzip GAN_805_random/data.zip
unzip GAN_805_EAS/data.zip
-
Merge the separately stored model information in the sample directories (2000 SNP only)
Before you run this script, the following components are stored separately:
- Generator
- Discriminator
- Optimizer of the generator
- Optimizer of the discriminator
python model_concat.py baseline ./GAN_2000
python model_concat.py clipping ./GAN_2000
python model_concat.py dp ./GAN_2000
-
Run the experiments
You can experiment with the following:
- Membership inference attacks
- Genotype imputation
- Model training
- Generation of artificial genomes from trained models
The following sections describe each experiment.
Note
The commands below are also provided in the scripts directory. You can either run the commands directly or run the corresponding sh file.
You can test membership inference attacks.
When executing, specify the following arguments:
- model_dir: Path to the directory of the target model. It is under the work_dir.
- model_name: File name of the model in model_dir.
- model_type: Type of the target model. Choose from [Baseline, Clipping, DP].
The models trained on the 2000 SNP dataset use a dropout layer, so for those models you also need to specify the following argument:
- dropout: Dropout rate. We used 0.1.
When targeting the gradient clipping model or the differential privacy model, you also need to specify the following arguments:
- apply_dp: Flag indicating that Opacus is used.
- sigma: The parameter that determines the amount of noise added during training. Use 0 for the clipping model and 0.04 for the differential privacy model.
- c: The parameter that determines the clipping value used during training. Use 0.5 for both models.
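To make the roles of sigma and c concrete, here is a minimal sketch of a clip-then-noise gradient step in the style of DP-SGD. It assumes that c is the per-sample L2 clipping bound and that the noise standard deviation is sigma * c, which is how Opacus combines its noise multiplier and clipping norm. How main.py passes these values to Opacus is not shown here and may differ. The function name clip_and_noise and the toy gradients are hypothetical.

```python
import math
import random


def clip_and_noise(per_sample_grads, c, sigma, rng=None):
    """Clip each per-sample gradient to L2 norm <= c, sum them,
    add Gaussian noise with std sigma * c, and average over the batch."""
    rng = rng or random.Random(0)
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose norm exceeds the bound c.
        scale = min(1.0, c / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    if sigma > 0:
        # sigma = 0 skips this branch: clipping only, no noise (assumed
        # to correspond to the "clipping" model in this README).
        total = [t + rng.gauss(0.0, sigma * c) for t in total]
    return [t / len(per_sample_grads) for t in total]


# Toy per-sample gradients with L2 norms 5.0 and 0.5 (hypothetical values).
grads = [[3.0, 4.0], [0.3, 0.4]]
clipped_only = clip_and_noise(grads, c=0.5, sigma=0.0)
with_noise = clip_and_noise(grads, c=0.5, sigma=0.04)
print(clipped_only)
print(with_noise)
```

With sigma set to 0 the large gradient is scaled down to norm 0.5 and nothing else changes. With sigma set to 0.04 the same clipped sum is perturbed by a small amount of Gaussian noise before averaging.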
Targeting baseline model. (2000 SNP dataset)
python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/baseline --model_name baseline.pt --model_type Baseline --dropout 0.1
Targeting clipping model. (2000 SNP dataset)
python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/clipping --model_name clipping.pt --model_type Clipping --dropout 0.1 --apply_dp --sigma 0 -c 0.5
Targeting differential privacy model. (2000 SNP dataset)
python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/dp --model_name dp.pt --model_type DP --dropout 0.1 --apply_dp --sigma 0.04 -c 0.5
Targeting differential privacy model. (805 SNP dataset, random split)
python main.py --work_dir ./GAN_805_random --wb_attack --model_dir models/samples/dp --model_name dp.pt --model_type DP --apply_dp --sigma 0 -c 0.5
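This README does not describe how the attack result is scored. As a rough illustration only, the sketch below shows one common way to summarize a membership inference attack: each candidate sample gets a score (for a white-box attack on a GAN, this is often the discriminator output), and samples scoring above a threshold are predicted to be training members. The function name attack_accuracy, the threshold, and the scores are all hypothetical and are not taken from main.py.

```python
def attack_accuracy(member_scores, non_member_scores, threshold):
    """Fraction of samples classified correctly when a score above
    the threshold is taken as a prediction of 'training member'."""
    true_pos = sum(s > threshold for s in member_scores)
    true_neg = sum(s <= threshold for s in non_member_scores)
    return (true_pos + true_neg) / (len(member_scores) + len(non_member_scores))


# Hypothetical scores for samples that were / were not in the training set.
member_scores = [0.9, 0.8, 0.6, 0.4]
non_member_scores = [0.7, 0.3, 0.2, 0.1]
acc = attack_accuracy(member_scores, non_member_scores, threshold=0.5)
print(acc)  # 0.75
```

An accuracy close to 0.5 on a balanced set means the attacker does little better than guessing, which is the outcome privacy-preserving training aims for.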
You can test genotype imputation using IMPUTE2.
Note
First, please download IMPUTE2 from the official website and place the executable file in the modules/imputation directory.
When executing, specify the following arguments:
- ref_type: The type of dataset used for reference. Choose from [1KG, GAN, Clipping, DP].
- ref_haps_size: The number of haplotypes used for the reference.
If using artificial genomes, also specify:
- model_dir: Path to the directory of the target model. It is under the work_dir.
- ag_file_name: The file name of the artificial genome in the model_dir. Zip files are also supported.
Use real data with 4000 haplotypes as a reference. (1KG_4000)
python main.py --work_dir ./GAN_2000/ --imputation --ref_type 1KG --ref_haps_size 4000
Use artificial data with 4000 haplotypes generated by baseline model as a reference. (Baseline_4000)
python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/baseline --ag_file_name 16000_output.hapt --ref_type GAN --ref_haps_size 4000
Use artificial data with 20000 haplotypes generated by clipping model as a reference. (Clipping_20000)
python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/clipping --ag_file_name 16000_output_regen.hapt.zip --ref_type Clipping --ref_haps_size 20000
Use artificial data with 40000 haplotypes generated by differential privacy model as a reference. (DP_40000)
python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/dp --ag_file_name 16000_output_regen.hapt.zip --ref_type DP --ref_haps_size 40000
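If you want to run several reference configurations in a row, a small helper that assembles the command line can catch argument mistakes early. The sketch below uses only the flags and ref_type choices listed in this section. The helper name imputation_command is hypothetical, and the rule that artificial references require both model_dir and ag_file_name is an assumption based on the examples above, not a documented behavior of main.py.

```python
import shlex

# ref_type choices as listed in this README.
REF_TYPES = ("1KG", "GAN", "Clipping", "DP")


def imputation_command(work_dir, ref_type, ref_haps_size,
                       model_dir=None, ag_file_name=None):
    """Build the argument list for an imputation run of main.py."""
    if ref_type not in REF_TYPES:
        raise ValueError(f"ref_type must be one of {REF_TYPES}, got {ref_type!r}")
    cmd = ["python", "main.py", "--work_dir", work_dir, "--imputation"]
    if ref_type != "1KG":
        # Assumption: artificial-genome references need both arguments.
        if model_dir is None or ag_file_name is None:
            raise ValueError("artificial references need model_dir and ag_file_name")
        cmd += ["--model_dir", model_dir, "--ag_file_name", ag_file_name]
    cmd += ["--ref_type", ref_type, "--ref_haps_size", str(ref_haps_size)]
    return cmd


real_ref = imputation_command("./GAN_2000/", "1KG", 4000)
dp_ref = imputation_command("./GAN_2000/", "DP", 40000,
                            model_dir="models/samples/dp",
                            ag_file_name="16000_output_regen.hapt.zip")
print(shlex.join(real_ref))
print(shlex.join(dp_ref))
```

The returned list can be passed to subprocess.run from the repository root once IMPUTE2 is in place.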
Sample models are provided, so you can run the experiments without training a model yourself. You can also train a model on the datasets.
Specify the training parameters as arguments. See main.py for a description of each parameter.
Train a baseline model using the 805 SNP random dataset.
python main.py --train --work_dir ./GAN_805_random --out_dir models/baseline --g_learn 0.0001 --d_learn 0.0008 --epochs 16000 --save_that 1000 --norm None --ag_size 4000
Train a model with gradient clipping using the 805 SNP dataset that excludes East Asians.
python main.py --train --work_dir ./GAN_805_EAS --out_dir models/clip --g_learn 0.0001 --d_learn 0.0008 --epochs 16000 --save_that 1000 --norm None --ag_size 4000 --apply_dp --sigma 0 -c 0.5
Train a model with differential privacy using the 2000 SNP dataset.
python main.py --train --work_dir ./GAN_2000 --out_dir models/dp --g_learn 0.00008 --d_learn 0.00064 --epochs 16000 --save_that 1000 --dropout 0.1 --norm None --ag_size 4000 --apply_dp --sigma 0.04 -c 0.5 --use_poisson_sampling
Generate new artificial genomes from a model that has already been created.
Specify the following arguments:
- model_dir: Path to the directory of the target model. It is under the work_dir.
- model_name: File name of the model in model_dir.
- ag_size: Number of artificial genomes to be generated.
From a baseline model trained on the 805 SNP random dataset, generate 10000 haplotypes.
python main.py --regenerate --work_dir ./GAN_805_random --model_dir models/samples/baseline --model_name baseline.pt --ag_size 10000
From a gradient clipping model trained on the 2000 SNP dataset, generate 10000 haplotypes.
python main.py --regenerate --work_dir ./GAN_2000 --model_dir models/samples/clipping --model_name clipping.pt --ag_size 10000