MIMIC-III preprocessing for Dynamic Measurement Scheduling for Event Forecasting using Deep RL (ICML 2019)
- First run this repo by following the instructions below (it was cloned around January 2018). When you reach the dataset-creation step, use mingjie_create_in_hospital_mortality.py to create the dataset.
- Follow the official MIMIC repo (https://github.com/MIT-LCP/mimic-code/tree/master/concepts) to build the comorbidity concept elixhauser_ahrq. You need to set up SQL to run this part. Afterwards, export the resulting table view to a CSV file named elixhauser_ahrq.csv.
- Run the 2 notebooks in notebooks/. Just change the file paths so they point to your generated files and folders.
Then you can run the code in https://github.com/zzzace2000/autodiagnosis.
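For the export step in the notes above, one possible way to dump a table view to CSV from Python is sketched below. It is only an illustration: the helper name export_query_to_csv is hypothetical, it uses nothing beyond the standard csv module, and it assumes you already hold a DB-API connection to the database where the comorbidity view was built.

```python
import csv


def export_query_to_csv(conn, query, csv_path):
    """Run `query` on a DB-API connection and write the result to `csv_path`.

    The first CSV row holds the column names reported by the cursor.
    Returns the number of data rows written.
    """
    cur = conn.cursor()
    cur.execute(query)
    header = [col[0] for col in cur.description]
    rows = cur.fetchall()
    cur.close()

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

With PostgreSQL, conn would come from whichever driver you use, and the query would be something like SELECT * FROM elixhauser_ahrq. The exact view name and schema depend on how you ran the mimic-code concept script, so treat both as assumptions.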
Python suite to construct benchmark machine learning datasets from the MIMIC-III clinical database. Currently, we are focused on building a multitask learning benchmark dataset that includes four key inpatient clinical prediction tasks that map onto core machine learning problems: prediction of mortality from early admission data (classification), real-time detection of decompensation (time series classification), forecasting length of stay (regression), and phenotype classification (multilabel sequence classification).
- 2017 March 23: We are pleased to announce the first official release of these benchmarks. We expect to release a revision within the coming months that will add at least ~50 additional input variables. We are likewise pleased to announce that the manuscript associated with these benchmarks is now available on arXiv.
If you use this code or these benchmarks in your research, please cite the following publication: Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, and Aram Galstyan. Multitask Learning and Benchmarking with Clinical Time Series Data. arXiv:1703.07771. This paper is currently under review for SIGKDD; if it is accepted, the citation will change. Please also be sure to cite the original MIMIC-III paper.
Despite rapid growth in research that applies machine learning to clinical data, progress in the field appears far less dramatic than in other applications of machine learning. In image recognition, for example, the winning error rates in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) plummeted almost 90% from 2010 (0.2819) to 2016 (0.02991). There are many reasonable explanations for this discrepancy: clinical data sets are inherently noisy and uncertain and often small relative to their complexity, and for many problems of interest, ground truth labels for training and evaluation are unavailable.
However, there is another, simpler explanation: practical progress has been difficult to measure due to the absence of community benchmarks like ImageNet. Such benchmarks play an important role in accelerating progress in machine learning research. For one, they focus the community on specific problems and stoke ongoing debate about what those problems should be. They also reduce the startup overhead for researchers moving into a new area. Finally and perhaps most important, benchmarks facilitate reproducibility and direct comparison of competing ideas.
Here we present four public benchmarks for machine learning researchers interested in health care, built using data from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database (paper, website). Our four clinical prediction tasks are critical care variants of four opportunities to transform health care using "big clinical data" as described in Bates et al., 2014:
- early triage and risk assessment, i.e., mortality prediction
- prediction of physiologic decompensation
- identification of high cost patients, i.e. length of stay forecasting
- characterization of complex, multi-system diseases, i.e., acute care phenotyping
In Harutyunyan, Khachatrian, Kale, and Galstyan 2017, we propose a multitask RNN architecture to solve these four tasks simultaneously and show that this model generally outperforms strong single task baselines.
We do not provide the MIMIC-III data itself. You must acquire the data yourself from https://mimic.physionet.org/. Specifically, download the CSVs. Otherwise, generally we make liberal use of the following packages:
- numpy
- pandas
For logistic regression baselines sklearn is required. LSTM models use Theano/Lasagne.
Here are the required steps to build the benchmark. They assume that you already have the MIMIC-III dataset (lots of CSV files) on disk.
- Clone the repo.
git clone https://github.com/YerevaNN/mimic3-benchmarks/
cd mimic3-benchmarks/
- Add the path to the PYTHONPATH (sorry for this).
export PYTHONPATH=$PYTHONPATH:[PATH TO THIS REPOSITORY]
- The following command takes the MIMIC-III CSVs, generates one directory per SUBJECT_ID, and writes ICU stay information to data/[SUBJECT_ID]/stays.csv, diagnoses to data/[SUBJECT_ID]/diagnoses.csv, and events to data/[SUBJECT_ID]/events.csv. This step might take around an hour.
python scripts/extract_subjects.py [PATH TO MIMIC-III CSVs] data/root/
- The following command attempts to fix some issues (such as a missing ICU stay ID) and removes the events that have missing information. 4741761 events (80%) remain after removing all suspicious rows.
python scripts/validate_events.py data/root/
- The next command breaks up per-subject data into separate episodes (pertaining to ICU stays). Time series of events are stored in [SUBJECT_ID]/episode{#}_timeseries.csv (where # counts distinct episodes), while episode-level information (patient age, gender, ethnicity, height, weight) and outcomes (mortality, length of stay, diagnoses) are stored in [SUBJECT_ID]/episode{#}.csv. This script requires two files, one that maps event ITEMIDs to clinical variables and another that defines valid ranges for clinical variables (for detecting outliers, etc.).
python scripts/extract_episodes_from_subjects.py data/root/
- The next command splits the whole dataset into training and testing sets. Note that all benchmarks use the same split:
python scripts/split_train_and_test.py data/root/
- The following commands generate task-specific datasets, which can later be used in models. These commands are independent: if you are going to work on only one benchmark task, you can run only the corresponding command.
python scripts/create_in_hospital_mortality.py data/root/ data/in-hospital-mortality/
python scripts/create_decompensation.py data/root/ data/decompensation/
python scripts/create_length_of_stay.py data/root/ data/length-of-stay/
python scripts/create_phenotyping.py data/root/ data/phenotyping/
python scripts/create_multitask.py data/root/ data/multitask/
- I also add my own dataset, created with the following command:
python scripts/mingjie_create_in_hospital_mortality.py data/root/ data/my-mortality/
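The steps above can also be chained from a small driver script. The sketch below is not part of the repo: build_pipeline and run_pipeline are hypothetical helper names, and the only things taken from this README are the script paths and their arguments. It defaults to a dry run that just prints each command.

```python
import subprocess


def build_pipeline(mimic_csv_dir, root="data/root/"):
    """Return the preprocessing commands from the steps above, in order."""
    return [
        ["python", "scripts/extract_subjects.py", mimic_csv_dir, root],
        ["python", "scripts/validate_events.py", root],
        ["python", "scripts/extract_episodes_from_subjects.py", root],
        ["python", "scripts/split_train_and_test.py", root],
        # Dataset used by this fork; the stock create_* commands listed
        # above could be appended in the same way.
        ["python", "scripts/mingjie_create_in_hospital_mortality.py",
         root, "data/my-mortality/"],
    ]


def run_pipeline(steps, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True stops the pipeline at the first failing step, so later
    steps never run on incomplete output.
    """
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Run it from the repository root with PYTHONPATH set as described above, for example run_pipeline(build_pipeline("[PATH TO MIMIC-III CSVs]"), dry_run=False).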
For each of the 4 main tasks we provide logistic regression and LSTM baselines.
Please note that running linear models can take hours because of extensive grid search. You can change the chunk_size parameter in the code to make them faster (of course, the performance will not be the same).
Use the following command to extract a validation set from the training set. This step is required for running the baseline models.
python mimic3models/split_train_val.py [TASK]
[TASK] is one of in-hospital-mortality, decompensation, length-of-stay, phenotyping, or multitask.
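If you want to run the validation split for several tasks from Python, a thin wrapper can reject misspelled task names before anything is launched. This is only a sketch: split_train_val_command is a hypothetical helper, and the task names and script path are the ones listed above.

```python
TASKS = (
    "in-hospital-mortality",
    "decompensation",
    "length-of-stay",
    "phenotyping",
    "multitask",
)


def split_train_val_command(task):
    """Return the argv list for the validation split of one task."""
    if task not in TASKS:
        raise ValueError(
            "unknown task %r, expected one of: %s" % (task, ", ".join(TASKS))
        )
    return ["python", "mimic3models/split_train_val.py", task]


if __name__ == "__main__":
    # Print the command for every task; pass each list to subprocess.run
    # from the repository root to actually execute it.
    for task in TASKS:
        print(" ".join(split_train_val_command(task)))
```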
Run the following commands to train the in-hospital mortality neural network that gave the best result. We got the best performance on the validation set after 8 epochs.
cd mimic3models/in_hospital_mortality/
python -u main.py --network lstm --dim 256 --timestep 2.0 --mode train --batch_size 8 --log_every 30
To test the model use the following:
python -u main.py --network lstm --dim 256 --timestep 2.0 --mode test --batch_size 8 --log_every 30 --load_state best_model.state
Use the following commands to train logistic regression. The best model we got used L2 regularization with C=0.001:
cd mimic3models/in_hospital_mortality/logistic/
python -u main.py --l2 --C 0.001
The best decompensation model we got was trained for 110 chunks (that is less than one epoch; the model overfits before reaching one epoch because there are many training samples for the same patient with different lengths).
cd mimic3models/decompensation/
python -u main.py --network lstm --dim 256 --mode train --batch_size 8 --log_every 30
Here is the command to test:
python -u main.py --network lstm --dim 256 --mode test --batch_size 8 --log_every 30 --load_state best_model.state
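The overfitting note above says that one patient contributes many training samples of different lengths. The sketch below illustrates how that can happen, under the assumption that each sample is a prefix of the same ICU stay cut at regular intervals. The function name and both default values (a 1-hour step and a 4-hour minimum) are illustrative assumptions, not values read from the repo.

```python
def prefix_lengths(stay_hours, step_hours=1.0, min_hours=4.0):
    """Return the prefix lengths (in hours) sampled from a single ICU stay.

    Every returned length corresponds to one training sample that shares
    all of its earlier measurements with the longer samples of that stay.
    """
    lengths = []
    i = 0
    while min_hours + i * step_hours <= stay_hours:
        lengths.append(min_hours + i * step_hours)
        i += 1
    return lengths


if __name__ == "__main__":
    # A two-day stay already yields dozens of highly correlated samples.
    print(len(prefix_lengths(48.0)))
```

Because these samples overlap almost entirely, a model can see a large part of the information in the training set well before it has finished a full pass over it.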
Use the following command to train a logistic regression. It will do a grid search over a small space of hyperparameters and will report the scores for every case.
cd mimic3models/decompensation/logistic/
python -u main.py
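As a rough picture of what such a grid search does, the sketch below evaluates every combination of hyperparameters and keeps the score for each case. It is not the repo's implementation: grid_search, the toy scoring function, and the grid values are all hypothetical, and in practice evaluate would train a model and return its validation score.

```python
import itertools


def grid_search(evaluate, grid):
    """Call evaluate(**params) for every combination in `grid`.

    `grid` maps a parameter name to a list of candidate values.
    Returns a list of (params, score) pairs, one per combination.
    """
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(**params)))
    return results


def toy_score(penalty, C):
    """Stand-in for a validation score; peaks at penalty='l2', C=0.001."""
    bonus = 0.1 if penalty == "l2" else 0.0
    return bonus - abs(C - 0.001)


if __name__ == "__main__":
    grid = {"penalty": ["l1", "l2"], "C": [1.0, 0.1, 0.01, 0.001]}
    results = grid_search(toy_score, grid)
    for params, score in results:
        print(params, score)
    print("best:", max(results, key=lambda r: r[1])[0])
```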
The best length-of-stay model we got was trained for 15 chunks.
cd mimic3models/length_of_stay/
python -u main.py --network lstm_cf_custom --dim 256 --mode train --batch_size 8 --log_every 30
Run the following command to test the best pretrained neural network.
python -u main.py --network lstm_cf_custom --dim 256 --mode test --batch_size 8 --log_every 30 --load_state best_model.state
Use the following command to train a logistic regression. It will do a grid search over a small space of hyperparameters and will report the scores for every case.
cd mimic3models/length_of_stay/logistic/
python -u main_cf.py
The best phenotyping model we got was trained for 30 epochs.
cd mimic3models/phenotyping/
python -u main.py --network lstm_2layer --dim 512 --mode train --batch_size 8 --log_every 30
Use the following command for testing:
python -u main.py --network lstm_2layer --dim 512 --mode test --batch_size 8 --log_every 30 --load_state best_model.state
Use the following command for logistic regression. It will do a grid search over a small space of hyperparameters and will report the scores for every case.
cd mimic3models/phenotyping/logistic/
python -u main.py
The ihm_C, decomp_C, los_C, and ph_C coefficients control the relative weight of the tasks in the multitask model. The default is 1.0. The best model we got was trained for 12 epochs.
cd mimic3models/multitask/
python -u main.py --network lstm --dim 1024 --mode train --batch_size 8 --log_every 30 --ihm_C 0.02 --decomp_C 0.1 --los_C 0.5
Use the following command for testing:
python -u main.py --network lstm --dim 1024 --mode test --batch_size 8 --log_every 30 --load_state best_model.state
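One common way for such coefficients to enter a multitask model is as weights in a sum of per-task losses. The sketch below assumes exactly that; the README only states that the coefficients control the relative weight of the tasks, so the weighted-sum form and the function name multitask_loss are assumptions rather than a description of the repo's code.

```python
def multitask_loss(ihm_loss, decomp_loss, los_loss, ph_loss,
                   ihm_C=1.0, decomp_C=1.0, los_C=1.0, ph_C=1.0):
    """Combine four per-task losses into one scalar training objective.

    Each coefficient scales its own task's loss; all default to 1.0,
    matching the default stated above.
    """
    return (ihm_C * ihm_loss
            + decomp_C * decomp_loss
            + los_C * los_loss
            + ph_C * ph_loss)


if __name__ == "__main__":
    # Coefficients from the training command above; ph_C keeps its default.
    print(multitask_loss(1.0, 1.0, 1.0, 1.0,
                         ihm_C=0.02, decomp_C=0.1, los_C=0.5))
```

Under this reading, the training command above down-weights in-hospital mortality, decompensation, and length of stay relative to phenotyping, which keeps its default weight of 1.0.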
- Test and debug
- Add comments and documentation
- Refactor, where appropriate, to make code more generally useful
- Expand coverage of variable map and variable range files.
- Decide whether we are missing any other high-priority data (CPT codes, inputs, etc.)
Here are the problems identified by validate_events.py on 1000 randomly chosen subjects:
Type | Description | Number of rows |
---|---|---|
n_events | total number of events | 5937206 |
nohadminstay | HADM_ID does not appear in stays.csv | 836341 |
emptyhadm | HADM_ID is empty | 126480 |
icustaymissinginstays | ICUSTAY_ID does not appear in stays.csv | 232624 |
noicustay | ICUSTAY_ID is empty | 347768 |
recovered | empty ICUSTAY_IDs are recovered according to stays.csv files (given HADM_ID) | 347768 |
couldnotrecover | empty ICUSTAY_IDs that are not recovered. This should be zero, because the unrecoverable ones are counted in icustaymissinginstays | 0 |
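The counts in the table can be cross-checked against the figure of 4741761 remaining events (80%) quoted in the validation step. The sketch below assumes that exactly three categories lead to removal (nohadminstay, emptyhadm, and icustaymissinginstays) and that they do not overlap; under that assumption the arithmetic reproduces the quoted number.

```python
COUNTS = {
    "n_events": 5937206,
    "nohadminstay": 836341,
    "emptyhadm": 126480,
    "icustaymissinginstays": 232624,
    "noicustay": 347768,
    "recovered": 347768,
    "couldnotrecover": 0,
}

# Categories assumed to cause a row to be dropped (see the lead-in).
REMOVED = ("nohadminstay", "emptyhadm", "icustaymissinginstays")


def remaining_events(counts):
    """Return the number of events left after dropping the REMOVED rows."""
    return counts["n_events"] - sum(counts[key] for key in REMOVED)


if __name__ == "__main__":
    kept = remaining_events(COUNTS)
    print(kept, "events kept,", round(100.0 * kept / COUNTS["n_events"]), "percent")
```

Rows with an empty ICUSTAY_ID are not subtracted here because, per the table, all 347768 of them were recovered.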