Thinking, Fast and Slow: Knowledge Extraction To Facilitate Phenotyping Using Drug Records in Real-World Data 💊
A framework for leveraging large language models (LLMs) to turn real-world medication records into features for improved phenotyping. We show that LLM-derived drug-disease probabilities substantially boost downstream phenotyping tasks, outperforming traditional feature engineering approaches.
This project introduces a novel approach to incorporating medication data into clinical ML pipelines by:
- DualR Score Generation: Extracting drug-disease conditional probabilities P(disease|drug) from LLMs using structured prompting
- Log-odds Aggregation: Converting multi-drug patient profiles into unified risk scores using dataset-specific thresholds (a minimal sketch follows this list)
- Comprehensive Evaluation: Comparing against traditional methods including Poly-drug Risk Score (PdRS), LLM embeddings, and baseline approaches
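As a toy illustration of the aggregation step, here is a minimal log-odds sketch in Python; the real `dualr_post.py` logic additionally handles dataset-specific thresholds.

```python
import math

def log_odds_score(drug_probs, eps=1e-6):
    """Naive sketch: sum the log-odds of LLM-derived P(disease|drug)
    across a patient's drugs into a single risk score."""
    total = 0.0
    for p in drug_probs:
        p = min(max(p, eps), 1.0 - eps)   # clamp away from 0 and 1
        total += math.log(p / (1.0 - p))
    return total

# example: a patient taking three drugs with LLM-derived probabilities
print(log_odds_score([0.70, 0.20, 0.55]))
```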
| Method | Description | Performance |
|---|---|---|
| DualR 🏆 | LLM-derived P(disease\|drug) with log-odds aggregation | Best |
| Embeddings | Direct LLM embeddings with dimensionality reduction | Competitive |
| PdRS | Poly-drug Risk Score: statistical drug-outcome associations from EMR data | Strong (see partial ground truth) |
| Baseline | Traditional feature engineering without drug information | Baseline |
Supported disease phenotypes:
- T2D: Type 2 diabetes mellitus
- BRC: Breast cancer (female)
- HTN: Hypertension
```
nova/
├── dualr.py            # 🧠 LLM drug-disease probability extraction
├── dualr_post.py       # 🔗 patient-level DualR score aggregation
├── pdrs.py             # 📈 statistical drug-outcome modeling
├── emb.py              # 🔢 drug embedding generation and dimensionality reduction
├── ml.py               # 🔬 ML pipeline and evaluation
├── data.yaml           # 📋 All of Us data configuration
├── dualr/{disease}/    # 💾 cached DualR scores by disease/model
├── pdrs/results/       # 📊 cross-validated PdRS coefficients
└── emb/{method}/       # 🎯 reduced embeddings (PCA/UMAP/autoencoder)
```
This project uses two separate computing environments:
```bash
# standalone environment for LLM knowledge extraction and statistical modeling
pip install -r requirements_dualr.txt

# generate drug-disease probabilities
python dualr.py --model_name meta-llama/Llama-3.1-8B-Instruct --disease t2d --cot

# debug mode (first 5 drugs)
python dualr.py --model_name meta-llama/Llama-3.1-8B-Instruct --disease t2d --debug
```
```bash
# main environment for LLM processing and evaluation
pip install -r requirements_aou.txt  # combines pdrs_ml + emb requirements

# combine individual drug probabilities into patient scores
python dualr_post.py --disease t2d --cohort aou \
    --dualr_no_cot_path dualr/t2d_llama-3.1-8b-instruct_seed42.parquet \
    --dualr_cot_path dualr/t2d_llama-3.1-8b-instruct_cot_seed42.parquet \
    --model_name meta-llama/Llama-3.1-8B-Instruct

# compute statistical drug-outcome associations in the standalone environment
python pdrs.py --disease t2d

# generate reduced embeddings
python emb.py --model meta-llama/Llama-3.1-8B --method pca --dim 2

# run comprehensive comparison across all methods
python ml.py --disease t2d --run_experiments --generate_summary
```
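For intuition, here is a minimal sketch of the kind of comparison `ml.py` runs, using placeholder feature matrices and labels; the actual models, splits, and metrics are defined in `ml.py`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=300)          # placeholder binary phenotype labels
feature_sets = {                          # placeholder feature matrices per method
    "baseline": rng.normal(size=(300, 10)),
    "baseline + dualr": rng.normal(size=(300, 11)),
    "baseline + pdrs": rng.normal(size=(300, 11)),
}

for name, X in feature_sets.items():
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```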
The DualR module extracts structured knowledge from LLMs using:
- Multi-seed retry strategy: Automatically retries failed predictions with different seeds (base_seed + round*1000)
- Robust checkpointing: Saves progress every 2 batches with resume capability
- GPU optimization: Multi-GPU tensor parallelism and quantization support
- Chain-of-thought reasoning: Optional CoT prompting for improved accuracy
| Argument | Default | Description |
|---|---|---|
| `--model_name` | - | Hugging Face model identifier |
| `--disease` | - | disease type (`t2d`, `brc`, or `htn`) |
| `--cot` | false | enable chain-of-thought reasoning |
| `--num_gpus` | 4 | GPUs for tensor parallelism |
| `--temperature` | 0.6 | sampling temperature |
| `--max_retries` | 20 | maximum retry rounds with different seeds |
- Round 1: Process all drugs with the base seed (42), collecting failures
- Round 2+: Retry failed drugs with offset seeds (1042, 2042, ...)
- Continue: Up to 20 rounds, until all drugs succeed or retries are exhausted
- Checkpoint: Save progress after each round for robust recovery (see the sketch below)
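A minimal sketch of this retry loop, with hypothetical `query_llm` and `save_checkpoint` callables standing in for the actual LLM call and checkpointing in `dualr.py`:

```python
BASE_SEED = 42
MAX_RETRIES = 20

def extract_with_retries(drugs, query_llm, save_checkpoint):
    """Multi-seed retry: re-query failed drugs with an offset seed each round."""
    results, pending = {}, list(drugs)
    for round_idx in range(MAX_RETRIES):
        seed = BASE_SEED + round_idx * 1000     # 42, 1042, 2042, ...
        failed = []
        for drug in pending:
            prob = query_llm(drug, seed=seed)   # returns P(disease|drug) or None
            if prob is None:
                failed.append(drug)
            else:
                results[drug] = prob
        save_checkpoint(results, failed)        # enables resume after interruption
        if not failed:
            break
        pending = failed
    return results, pending                     # pending: drugs that never succeeded
```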
PdRS provides statistical drug-outcome associations using cross-validation (computed on Montana State University's HPC, Tempest):
- Fits a logistic regression for each drug: P(drug = 1) = logit⁻¹(β₀ + β₁ · standardized_disease)
- Handles both binary and multi-level disease severity measures
- Applies Bonferroni correction for multiple testing
- Generates volcano plots for effect visualization
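A minimal sketch of one such per-drug fit with statsmodels, assuming a pandas DataFrame with a binary drug-exposure column and a disease severity column (hypothetical column names); the real `pdrs.py` pipeline adds cross-validation and the volcano plots.

```python
import pandas as pd
import statsmodels.api as sm

def drug_outcome_association(df: pd.DataFrame, drug_col: str, disease_col: str):
    """Fit P(drug = 1) = logit^-1(b0 + b1 * standardized_disease)
    and return the effect size b1 with its p-value."""
    z = (df[disease_col] - df[disease_col].mean()) / df[disease_col].std()
    fit = sm.Logit(df[drug_col], sm.add_constant(z)).fit(disp=0)
    return fit.params.iloc[1], fit.pvalues.iloc[1]

def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni: flag tests significant at the corrected threshold."""
    return [p <= alpha / len(p_values) for p in p_values]
```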
Supports multiple embedding approaches:
- vLLM: Fast embedding generation from generative models (LLaMA, Mistral)
- HuggingFace: BERT-like encoder models (Gatortron)
- Dimensionality Reduction: PCA, UMAP, or Autoencoder compression
- Dynamic Context: Estimates optimal max_model_len from text distribution
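A minimal sketch of the PCA path (`--method pca --dim 2`), with a random placeholder matrix standing in for the real drug embeddings; the UMAP and autoencoder paths follow the same reduce-and-cache pattern.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 4096))   # placeholder: (n_drugs, hidden_dim)

pca = PCA(n_components=2, random_state=42)
reduced = pca.fit_transform(embeddings)     # shape: (500, 2)
print(reduced.shape, pca.explained_variance_ratio_.round(4))
```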
The `data.yaml` file specifies data paths hosted on the All of Us (AoU) platform to avoid exposing data buckets and PII. Key sections:
- `features.emr`: patient medication records
- `{disease}.cohort`: disease-specific cohorts
- `dualr.{disease}.{model}`: cached probability scores
- `pdrs.{disease}.cv_dir`: cross-validation results
- `emb.{method}.{model}`: reduced embeddings
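For illustration, a minimal sketch of reading these sections with PyYAML; the key layout follows the list above, and the model key shown is hypothetical.

```python
import yaml

with open("data.yaml") as f:
    cfg = yaml.safe_load(f)

emr_path = cfg["features"]["emr"]         # patient medication records
t2d_cohort = cfg["t2d"]["cohort"]         # disease-specific cohort
cv_dir = cfg["pdrs"]["t2d"]["cv_dir"]     # cross-validation results
# cached DualR scores (the model key here is hypothetical)
dualr_scores = cfg["dualr"]["t2d"]["llama-3.1-8b-instruct"]
```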
Typical workflow:
- Setup Environment: Install the requirements for each component
- Generate DualR Scores: Run `dualr.py` for each disease with/without CoT (on any HPC, since raw DualR scores are not specific to individuals)
- Create Patient Features: Use `dualr_post.py` to aggregate patient-level scores (on All of Us)
- Calculate PdRS: Run `pdrs.py` for statistical benchmarks (on All of Us)
- Generate Embeddings: Use `emb.py` with the desired reduction method (on All of Us)
- Run Experiments: Execute `ml.py` for comprehensive evaluation (on All of Us)
- Compare Results: Review the generated summaries and visualizations (on All of Us)
```
results/
├── ml/{disease}/experiments/   # individual experiment results
├── ml/{disease}/summaries/     # method comparison tables
├── dualr/{disease}/{model}/    # aggregated patient scores
├── pdrs/results/{disease}/     # cross-validated coefficients
└── emb/{method}/               # reduced embeddings by technique
```
For reproduction, contact Haining Wang (hw56@iu.edu)
For data questions, contact Chenxi Xiong (xiongc@iu.edu)
For general questions, contact PI Dr. Jing Su (su1@iu.edu)
MIT