yanchao0222/tutorial_data_synthesis_and_evaluation

A tutorial for generating and evaluating synthetic health data based on the MIMIC-IV V2.0 dataset and EMR-WGAN.

This repository is paired with the following tutorial paper:

Yan C, Zhang Z, Nyemba S, Li Z. Generating synthetic electronic health record data using generative adversarial networks: A tutorial.

System requirements

OS Requirements

This package is supported on Linux. It has been tested on the following system:

  • Linux: Ubuntu 20.04

Python (3.7.16) Dependencies

tensorflow
pandas
numpy
argparse
matplotlib
scipy
sklearn
lightgbm
joblib
shap
requests

Install all Python dependencies

pip install -r requirements.txt
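
After installation, it can help to confirm that the listed modules are visible to the interpreter. The following is a minimal sketch, not part of the repository: the helper name find_missing is hypothetical, and the module names are simply the import names from the dependency list above (argparse ships with Python itself).

```python
import importlib.util

# Import names taken from the dependency list above.
REQUIRED_MODULES = [
    "tensorflow", "pandas", "numpy", "argparse", "matplotlib", "scipy",
    "sklearn", "lightgbm", "joblib", "shap", "requests",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

The check only locates modules; it does not import them or verify their versions.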

Descriptions of files

Data preprocessing

Before this step, download the MIMIC-IV V2.0 data from https://physionet.org/content/mimiciv/2.0/. Access requires completing the required application and training steps.

Run the data_extraction_and_preprocessing_github.ipynb notebook step by step to prepare the dataset for the subsequent GAN training.

GAN training

Train an EMR-WGAN model by specifying gpu_id and model_id and running:

python GAN_training.py --gpu_id xx --model_id xx

Synthetic data generation

Generate synthetic data from a trained EMR-WGAN model by specifying gpu_id, model_id, and load_checkpoint (i.e., the checkpoint ID). We recommend running the training several times and selecting the best run based on the subsequent evaluation results.

python GAN_generation.py --gpu_id xx --model_id xx --load_checkpoint xx
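
The recommendation to train several times can be scripted. The sketch below only builds and prints one training command per model, using the two flags shown above. The helper names, the integer model ids 1 to 5, and GPU 0 are assumptions for illustration; whether model_id accepts integers is not stated here, so adjust the values to your setup.

```python
import subprocess
import sys


def build_training_commands(model_ids, gpu_id):
    """Build one GAN_training.py command per model id, using only the documented flags."""
    return [
        [sys.executable, "GAN_training.py",
         "--gpu_id", str(gpu_id), "--model_id", str(model_id)]
        for model_id in model_ids
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute them one after another only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumed values: five models with ids 1-5 on GPU 0.
    run_all(build_training_commands(range(1, 6), gpu_id=0), dry_run=True)
```

With dry_run=True nothing is executed, so the printed commands can be reviewed before starting any long training job.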

Data quality evaluation using common metrics

  • Run the utility_evaluation.ipynb notebook step by step to evaluate the utility of the generated synthetic EHR datasets.

  • The notebook includes multiple common utility metrics.

  • The notebook demonstrates the evaluation on 5 synthetic datasets generated by 5 different EMR-WGAN models.

  • Run the privacy_evaluation.ipynb notebook step by step to evaluate the privacy risk of the generated synthetic EHR datasets.

  • The notebook includes multiple common privacy metrics.

  • The notebook likewise demonstrates the evaluation on 5 synthetic datasets generated by 5 different EMR-WGAN models.
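
To give a feel for what a utility metric measures, here is a small illustrative sketch. It compares the mean of each feature in the real and synthetic data, then averages the absolute gaps. This is an assumed example of a simple distribution-level check, not necessarily one of the metrics the notebooks compute. The function names and toy records are hypothetical.

```python
def column_means(rows):
    """Mean of each column in a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_absolute_prevalence_gap(real_rows, synthetic_rows):
    """Average absolute difference between real and synthetic column means."""
    real_means = column_means(real_rows)
    synth_means = column_means(synthetic_rows)
    gaps = [abs(r - s) for r, s in zip(real_means, synth_means)]
    return sum(gaps) / len(gaps)


# Toy binary records (rows = patients, columns = hypothetical features).
real = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
synthetic = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 0, 0]]

print(mean_absolute_prevalence_gap(real, synthetic))
```

A value near zero means the per-feature averages match closely. It says nothing about correlations between features or about privacy risk.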

Select the optimal synthetic dataset aligned with a target use case

  • Run the rank_datasets.ipynb notebook step by step to select the best of the previously generated datasets.
  • Provide a weight for each metric considered. The weights should reflect how much the target use case values the evaluation results for that metric.
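
The idea of weighting metrics can be sketched as a weighted rank aggregation. This is a minimal illustration under stated assumptions, not the procedure in rank_datasets.ipynb: each dataset is ranked per metric, and the ranks are combined with the normalized weights. The metric names, values, and weights are all hypothetical.

```python
def rank_values(values, higher_is_better):
    """Rank values from 1 (best) upward; ties share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def weighted_rank_scores(metric_table, higher_is_better, weights):
    """Combine per-metric ranks into one weighted score per dataset (lower is better)."""
    n = len(next(iter(metric_table.values())))
    total_weight = sum(weights.values())
    scores = [0.0] * n
    for metric, values in metric_table.items():
        ranks = rank_values(values, higher_is_better[metric])
        for i, r in enumerate(ranks):
            scores[i] += weights[metric] / total_weight * r
    return scores


# Hypothetical results for three synthetic datasets.
metrics = {
    "utility_auc": [0.80, 0.85, 0.78],   # higher is better
    "privacy_risk": [0.10, 0.30, 0.05],  # lower is better
}
direction = {"utility_auc": True, "privacy_risk": False}
weights = {"utility_auc": 0.7, "privacy_risk": 0.3}

scores = weighted_rank_scores(metrics, direction, weights)
best = min(range(len(scores)), key=lambda i: scores[i])
print(scores, best)
```

Changing the weights changes which dataset is selected: a use case that values privacy more would raise the weight on privacy_risk and may favor a different dataset.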

References to cite

Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A multifaceted benchmarking of synthetic electronic health record generation models. Nature Communications. 2022 Dec 9;13(1):7609.

Zhang Z, Yan C, Mesa DA, Sun J, Malin BA. Ensuring electronic medical record simulation through better training, modeling, and evaluation. Journal of the American Medical Informatics Association. 2020 Jan;27(1):99-108.
