A tutorial for generating and evaluating synthetic health data based on the MIMIC-IV V2.0 dataset and EMR-WGAN.

This repository is paired with the following tutorial paper:

Yan C, Zhang Z, Nyemba S, Li Z. Generating synthetic electronic health record data using generative adversarial networks: A tutorial.

System requirements

OS Requirements

This package is supported on Linux and has been tested on the following system:

Linux: Ubuntu 20.04

Python (3.7.16) Dependencies

tensorflow
pandas
numpy
argparse
matplotlib
scipy
scikit-learn (imported as sklearn)
lightgbm
joblib
shap
requests

Install all Python dependencies

pip install -r requirement.txt
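After installation, a quick check that the listed packages can be found may save time before starting the notebooks. The following is a minimal sketch and is not part of the repository: the function name `missing_modules` is hypothetical, and the module list simply mirrors the dependency list above, using `sklearn` as the import name for scikit-learn.

```python
import importlib.util

# Import names mirroring the dependency list above (an assumption: each
# package is importable under the name shown here).
REQUIRED_MODULES = [
    "tensorflow", "pandas", "numpy", "argparse", "matplotlib", "scipy",
    "sklearn", "lightgbm", "joblib", "shap", "requests",
]


def missing_modules(names):
    """Return the names in `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

The check only locates each module without importing it, so it stays fast even for heavy packages such as tensorflow.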

Descriptions of files

Data preprocessing

Before this step, download the MIMIC-IV V2.0 data from https://physionet.org/content/mimiciv/2.0/ after completing the required application and training steps.

Run the data_extraction_and_preprocessing_process.ipynb notebook step by step to prepare the dataset for the subsequent GAN training.

GAN training

Train an EMR-WGAN model by specifying gpu_id and model_id and then running:

python GAN_training.py --gpu_id xx --model_id xx

Synthetic data generation

Generate synthetic data from a trained EMR-WGAN model by specifying gpu_id, model_id, and load_checkpoint (i.e., the checkpoint ID). We recommend running the training multiple times and selecting the best model based on the subsequent evaluation results.

python GAN_generation.py --gpu_id xx --model_id xx --load_checkpoint xx
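Because several training runs are recommended, it can help to script them. The sketch below only assembles the training command shown above for a range of model IDs. It assumes `--gpu_id` and `--model_id` accept integer-like values (the README shows only `xx` placeholders), and the helper names `build_training_commands` and `run_sequentially` are hypothetical.

```python
import subprocess
import sys


def build_training_commands(model_ids, gpu_id, script="GAN_training.py"):
    """Build one training command per model ID, using the flags shown above."""
    return [
        [sys.executable, script, "--gpu_id", str(gpu_id), "--model_id", str(m)]
        for m in model_ids
    ]


def run_sequentially(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises an error and stops the loop if a run fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Five runs is an illustrative choice, not a requirement of the scripts.
    run_sequentially(build_training_commands(range(5), gpu_id=0))
```

With `dry_run=True` (the default) nothing is launched, so the commands can be inspected before committing GPU time.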

Data quality evaluation using common metrics

Run the utility_evaluation.ipynb notebook step by step to evaluate the utility of the generated synthetic EHR datasets. The notebook includes multiple common utility metrics and demonstrates the evaluation on 5 synthetic datasets generated by 5 different EMR-WGAN models.

Run the privacy_evaluation.ipynb notebook step by step to evaluate the privacy risk of the generated synthetic EHR datasets. The notebook includes multiple common privacy metrics and likewise demonstrates the evaluation on 5 synthetic datasets generated by 5 different EMR-WGAN models.
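To give a feel for what a utility metric can look like, the sketch below compares per-feature prevalence between a real and a synthetic binary dataset and reports the mean absolute gap (lower suggests closer marginal distributions). This is an illustrative example in plain Python under the assumption of binary features; it is not the notebook's implementation, and the function names and toy records are hypothetical.

```python
def feature_prevalence(rows):
    """Fraction of records with value 1 for each binary feature (column)."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]


def mean_abs_prevalence_gap(real_rows, synthetic_rows):
    """Mean absolute difference in per-feature prevalence between datasets."""
    real = feature_prevalence(real_rows)
    synth = feature_prevalence(synthetic_rows)
    return sum(abs(r - s) for r, s in zip(real, synth)) / len(real)


# Toy records (hypothetical): 4 patients x 3 binary features each.
real_records = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
synthetic_records = [[1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 0]]

gap = mean_abs_prevalence_gap(real_records, synthetic_records)
print("Mean absolute prevalence gap:", round(gap, 3))
```

A single number like this covers only marginal distributions; it says nothing about correlations between features or about privacy risk, which is why several metrics are evaluated together.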

Select the optimal synthetic dataset aligned with a target use case

Run the rank_datasets.ipynb notebook step by step to select the best of the previously generated datasets.
You need to provide a weight for each metric considered; the weights should reflect how much the target use case values the evaluation result for that metric.
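The idea of weighting metrics can be sketched as a weighted sum of per-metric ranks. The example below is one possible scheme and is not necessarily what rank_datasets.ipynb does; the metric names, scores, weights, and function names are all hypothetical.

```python
def rank_values(values, higher_is_better):
    """Return 1-based ranks (1 = best); tied values share the smallest rank."""
    ordered = sorted(values, reverse=higher_is_better)
    return [ordered.index(v) + 1 for v in values]


def weighted_rank_scores(results, weights, higher_is_better):
    """Combine per-metric ranks into one score per dataset (lower = better).

    results: {metric name: [score for dataset 0, dataset 1, ...]}
    weights: {metric name: importance of that metric for the use case}
    higher_is_better: {metric name: True if larger scores are better}
    """
    n_datasets = len(next(iter(results.values())))
    totals = [0.0] * n_datasets
    for metric, scores in results.items():
        ranks = rank_values(scores, higher_is_better[metric])
        for i, rank in enumerate(ranks):
            totals[i] += weights[metric] * rank
    return totals


# Hypothetical results for three synthetic datasets on two metrics.
results = {
    "utility_error": [0.10, 0.05, 0.20],
    "privacy_risk": [0.30, 0.40, 0.10],
}
weights = {"utility_error": 0.7, "privacy_risk": 0.3}
higher_is_better = {"utility_error": False, "privacy_risk": False}

totals = weighted_rank_scores(results, weights, higher_is_better)
best_index = min(range(len(totals)), key=totals.__getitem__)
print("Weighted rank totals:", totals)
print("Best dataset index:", best_index)
```

Ranking before weighting keeps metrics with different scales comparable; changing the weights (for example, favoring privacy over utility) can change which dataset comes out on top.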

References to cite

Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A multifaceted benchmarking of synthetic electronic health record generation models. Nature Communications. 2022 Dec 9;13(1):7609.

Zhang Z, Yan C, Mesa DA, Sun J, Malin BA. Ensuring electronic medical record simulation through better training, modeling, and evaluation. Journal of the American Medical Informatics Association. 2020 Jan;27(1):99-108.
