Summary: As artificial intelligence (AI) rapidly approaches human-level performance in medical imaging, it is crucial that it does not exacerbate or propagate healthcare disparities. Prior research has established AI’s capacity to infer demographic data from chest X-rays, leading to a key concern: do models using demographic shortcuts have unfair predictions across subpopulations? In this study, we conduct a thorough investigation into the extent to which medical AI utilizes demographic encodings, focusing on potential fairness discrepancies within both in-distribution training sets and external test sets. Our analysis covers three key medical imaging disciplines: radiology, dermatology, and ophthalmology, and incorporates data from six global chest X-ray datasets. We confirm that medical imaging AI leverages demographic shortcuts in disease classification. While correcting shortcuts algorithmically effectively addresses fairness gaps to create "locally optimal" models within the original data distribution, this optimality is not true in new test settings. Surprisingly, we find that models with less encoding of demographic attributes are often most "globally optimal", exhibiting better fairness during model evaluation in new test environments. Our work establishes best practices for medical imaging models which maintain their performance and fairness in deployments beyond their initial training contexts, underscoring critical considerations for AI clinical deployments across populations and sites.
To download all the datasets used in this study, please follow instructions in DataSources.md.
As the original image files are often high resolution, we cache downsampled copies of the images to speed up training for certain datasets. To do so, run:
python -m scripts.cache_cxr --data_path <data_path> --dataset <dataset>
where <dataset> can be mimic, vindr, siim, isic, or odir. This process is required for vindr and siim, and is optional for the remaining datasets.
To train a model, run:
python -m train \
--algorithm <algo> \
--dataset <dset> \
--task <task> \
--attr <attr> \
--data_dir <data_path> \
--output_dir <output_path>
- Use sweep_grid.py to run train.py with experiments in experiments.py to train models:
python -m sweep_grid launch \
--experiment <exp_train> \
...
- Use compute_optimal_thres.ipynb to generate optimal thresholds based on F1-score maximization. We provide the thresholds used in the study (the output of this notebook) in opt_thres.pkl.
- Use sweep_grid.py to run eval.py with experiments.py to evaluate trained models at the best thresholds. Be sure to use the --no_output_dir argument when calling sweep_grid.py:
python -m sweep_grid launch \
--experiment <exp_eval> \
--no_output_dir \
...
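For readers unfamiliar with threshold selection by F1-score maximization, the idea can be sketched in a few lines of plain Python. This is only an illustration and not the code in compute_optimal_thres.ipynb: the function names f1_at_threshold and best_f1_threshold, the choice of candidate thresholds (every distinct predicted score), and the toy data are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], y_prob: Sequence[float], thres: float) -> float:
    """F1-score when a score >= thres is predicted as the positive class."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        pred = 1 if prob >= thres else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true: Sequence[int], y_prob: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing F1 over all distinct predicted scores.

    Hypothetical helper for illustration; ties keep the lowest threshold.
    """
    best_thres, best_f1 = 0.5, -1.0
    for thres in sorted(set(y_prob)):
        f1 = f1_at_threshold(y_true, y_prob, thres)
        if f1 > best_f1:
            best_thres, best_f1 = thres, f1
    return best_thres, best_f1


# Toy validation labels and model scores (made up for this sketch).
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
thres, f1 = best_f1_threshold(labels, scores)
print(f"best threshold = {thres}, F1 = {f1:.3f}")
```

In this sketch the threshold would be chosen on held-out validation predictions and then reused unchanged at evaluation time, which matches the role the provided opt_thres.pkl plays in the evaluation step above.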
This code is partly based on the open-source implementations from SubpopBench.
If you find this code or idea useful, please cite our work:
@article{yang2024limits,
title={The limits of fair medical imaging AI in real-world generalization},
author={Yang, Yuzhe and Zhang, Haoran and Gichoya, Judy W and Katabi, Dina and Ghassemi, Marzyeh},
journal={Nature Medicine},
pages={1--11},
year={2024},
publisher={Nature Publishing Group US New York}
}
If you have any questions, feel free to contact us through email (yuzhe@mit.edu & haoranz@mit.edu) or GitHub issues. Enjoy!