Su-informatics-lab/nova

Thinking, Fast and Slow: Knowledge Extraction To Facilitate Phenotyping Using Drug Records in Real-World Data 💊

A framework for better leveraging large language models (LLMs) to use real-world medication records for improved phenotyping. We demonstrate that LLM-derived drug-disease probabilities greatly boost downstream phenotyping tasks, significantly outperforming traditional feature engineering approaches.

🎯 Overview

This project introduces an approach for incorporating medication data into clinical ML pipelines through three components:

  1. DualR Score Generation: Extracting drug-disease conditional probabilities P(disease|drug) from LLMs using structured prompting
  2. Log-odds Aggregation: Converting multi-drug patient profiles into unified risk scores using dataset-specific thresholds
  3. Comprehensive Evaluation: Comparing against traditional methods including Poly-drug Risk Score (PdRS), LLM embeddings, and baseline approaches
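
The log-odds aggregation in step 2 can be sketched as follows. This is a minimal illustration and not the code in dualr_post.py: the function names, the clipping constant, and the rule of summing log-odds only for drugs whose probability exceeds a single threshold are all assumptions. The repository describes its thresholds as dataset-specific and does not spell out the rule here.

```python
import math

def logit(p, eps=1e-6):
    """Convert a probability to log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def aggregate_log_odds(drug_probs, patient_drugs, threshold=0.5):
    """Sum the log-odds of P(disease|drug) over a patient's distinct drugs.

    drug_probs: dict mapping drug name -> LLM-derived P(disease|drug)
    patient_drugs: iterable of drug names found in the patient's record
    threshold: hypothetical cutoff; drugs at or below it are ignored
    """
    score = 0.0
    for drug in set(patient_drugs):
        p = drug_probs.get(drug)
        if p is not None and p > threshold:
            score += logit(p)
    return score

# illustrative probabilities, not real model output
probs = {"metformin": 0.9, "lisinopril": 0.3, "insulin glargine": 0.8}
score = aggregate_log_odds(probs, ["metformin", "insulin glargine", "lisinopril"])
```

Working in log-odds rather than raw probabilities lets evidence from several drugs add up, so a patient on two strongly indicative drugs scores higher than a patient on either one alone.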

📊 Methodology Comparison

Method        Description                                                                Performance
------------  -------------------------------------------------------------------------  ---------------------------------
DualR 🏆      LLM-derived P(disease|drug) with log-odds aggregation                      Best
Embeddings    Direct LLM embeddings with dimensionality reduction                        Matching up
PdRS          Poly-drug Risk Score: Statistical drug-outcome associations from EMR data  Strong (see partial ground truth)
Baseline      Traditional feature engineering without drug information                   Baseline

🏥 Supported Diseases

  • T2D: Type 2 diabetes mellitus
  • BRC: Breast cancer (female)
  • HTN: Hypertension

📁 Project Structure

nova/
├── dualr.py              # 🧠 llm drug-disease probability extraction
├── dualr_post.py         # 🔗 patient-level dualr score aggregation  
├── pdrs.py               # 📈 statistical drug-outcome modeling
├── emb.py                # 🔢 drug embedding generation and dimension reduction
├── ml.py                 # 🔬 ml pipeline and evaluation
├── data.yaml             # 📋 All of Us data configuration 
├── dualr/{disease}/      # 💾 cached dualr scores by disease/model
├── pdrs/results/         # 📊 cross-validated pdrs coefficients
└── emb/{method}/         # 🎯 reduced embeddings (pca/umap/autoencoder)

🚀 Quick Start

Requirements

This project uses two separate computing environments:

1. Montana State University HPC Tempest (DualR raw probabilities derivation)

# standalone environment for llm probability extraction
pip install -r requirements_dualr.txt

# generate drug-disease probabilities  
python dualr.py --model_name meta-llama/Llama-3.1-8B-Instruct --disease t2d --cot

# debug mode (first 5 drugs)
python dualr.py --model_name meta-llama/Llama-3.1-8B-Instruct --disease t2d --debug

2. All of Us Platform (DualR Aggregation, PdRS, Embeddings, ML Pipeline)

# main environment for aggregation, pdrs, embeddings, and evaluation
pip install -r requirements_aou.txt    # combines pdrs_ml + emb requirements
2. Patient-Level DualR Aggregation
# combine individual drug probabilities into patient scores
python dualr_post.py --disease t2d --cohort aou \
    --dualr_no_cot_path dualr/t2d_llama-3.1-8b-instruct_seed42.parquet \
    --dualr_cot_path dualr/t2d_llama-3.1-8b-instruct_cot_seed42.parquet \
    --model_name meta-llama/Llama-3.1-8B-Instruct
3. PdRS Calculation (Montana State University HPC Tempest)
# compute statistical drug-outcome associations in standalone environment
python pdrs.py --disease t2d
4. Drug Embeddings
# generate reduced embeddings
python emb.py --model meta-llama/Llama-3.1-8B --method pca --dim 2
5. ML Pipeline Evaluation
# run comprehensive comparison across all methods
python ml.py --disease t2d --run_experiments --generate_summary

🔧 DualR Generation Details

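The DualR module extracts structured drug-disease knowledge from LLMs. Its key features, arguments, and retry strategy are described in the subsections that follow.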

Key Features

  • Multi-seed retry strategy: Automatically retries failed predictions with different seeds (base_seed + round*1000)
  • Robust checkpointing: Saves progress every 2 batches with resume capability
  • GPU optimization: Multi-GPU tensor parallelism and quantization support
  • Chain-of-thought reasoning: Optional CoT prompting for improved accuracy
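
The checkpoint-and-resume behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the logic in dualr.py: the JSON file format, the function name process_with_checkpoints, and the process_batch callable are all hypothetical. Only the "save every 2 batches" interval comes from the feature list above.

```python
import json
import os
import tempfile

def process_with_checkpoints(batches, process_batch, path, save_every=2):
    """Process batches in order, saving progress every `save_every` batches.

    On restart, batches already recorded in the checkpoint file are skipped.
    """
    state = {"done": 0, "results": []}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    for i in range(state["done"], len(batches)):
        state["results"].extend(process_batch(batches[i]))
        state["done"] = i + 1
        if state["done"] % save_every == 0 or state["done"] == len(batches):
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            # rename into place so a crash mid-write cannot corrupt
            # the previous checkpoint
            os.replace(tmp, path)
    return state["results"]

# toy run: "processing" a batch just upper-cases its items
ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
batches = [["a", "b"], ["c"], ["d", "e"]]
out = process_with_checkpoints(batches, lambda b: [x.upper() for x in b], ckpt)
```

Calling the function again with the same checkpoint path returns the saved results without reprocessing any batch.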

Key Arguments

Argument         Default   Description
--model_name     -         hugging face model identifier
--disease        -         disease type (t2d|brc|htn)
--cot            false     enable chain-of-thought reasoning
--num_gpus       4         gpus for tensor parallelism
--temperature    0.6       sampling temperature
--max_retries    20        maximum retry rounds with different seeds

Multi-Seed Strategy

  1. Round 1: Process all drugs with base seed (42), collect failures
  2. Round 2+: Retry failed drugs with offset seeds (1042, 2042, ...)
  3. Continue: Up to 20 rounds until all drugs succeed or exhaust retries
  4. Checkpoint: Save progress after each round for robust recovery
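
The retry rounds above can be sketched as a simple loop. This is a minimal illustration, not the implementation in dualr.py: run_with_retries and the predict callable are hypothetical names, and treating the first pass as one of the 20 rounds is an assumption. The seed schedule (base_seed + round*1000) is taken from the feature list.

```python
def run_with_retries(drugs, predict, base_seed=42, max_retries=20):
    """Retry failed drugs with offset seeds: base_seed + round * 1000.

    predict(drug, seed) is a hypothetical callable that returns a probability,
    or None when the model output could not be parsed.
    """
    results = {}
    pending = list(drugs)
    for round_idx in range(max_retries):
        if not pending:
            break
        seed = base_seed + round_idx * 1000  # 42, 1042, 2042, ...
        failed = []
        for drug in pending:
            prob = predict(drug, seed)
            if prob is None:
                failed.append(drug)
            else:
                results[drug] = prob
        pending = failed
        # a checkpoint of `results` and `pending` would be written here
    return results, pending

def fake_predict(drug, seed):
    # stand-in for an LLM call: "drug_b" fails on the first seed only
    if drug == "drug_b" and seed == 42:
        return None
    return 0.5

results, unresolved = run_with_retries(["drug_a", "drug_b"], fake_predict)
```

Only the failed drugs are re-queried in later rounds, so the cost of a retry round shrinks as more drugs succeed. Any drugs still in `unresolved` after the last round have exhausted their retries.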

📈 PdRS (Poly-drug Risk Score)

PdRS provides statistical drug-outcome associations using cross-validation (computed on Montana State University HPC Tempest):

  • Fits logistic regression: P(drug = 1) = logit^(-1)(β₀ + β₁ * standardized_disease)
  • Handles both binary and multi-level disease severity measures
  • Uses Bonferroni correction for multiple testing
  • Generates volcano plots for effect visualization
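
The per-drug regression and Bonferroni step can be sketched in plain Python. This is an illustration under stated assumptions, not the code in pdrs.py: the Newton-Raphson fitter, the Wald test, and the toy exposure data are all stand-ins, and cross-validation is omitted. A statistics library would normally do the fitting.

```python
import math
import statistics

def fit_logistic_1d(x, y, iters=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson.

    Returns (b0, b1, se_b1); the standard error comes from the inverse
    Fisher information evaluated on the final iteration.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1, math.sqrt(h00 / det)

def wald_p_value(b1, se_b1):
    """Two-sided p-value for H0: b1 = 0 under a normal approximation."""
    return math.erfc(abs(b1 / se_b1) / math.sqrt(2.0))

def standardize(values):
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# toy data: 10 patients without the disease, then 10 with it
disease = standardize([0] * 10 + [1] * 10)
exposures = {
    "drug_a": [1, 1] + [0] * 8 + [1] * 8 + [0, 0],    # 2/10 vs 8/10 exposed
    "drug_b": [1] * 5 + [0] * 5 + [1] * 5 + [0] * 5,  # 5/10 vs 5/10 exposed
}

alpha = 0.05 / len(exposures)  # Bonferroni: divide alpha by the number of tests
report = {}
for drug, taken in exposures.items():
    _, b1, se = fit_logistic_1d(disease, taken)
    p = wald_p_value(b1, se)
    report[drug] = (b1, p, p < alpha)
```

In this toy example drug_a passes the corrected threshold and drug_b does not. The fitter assumes the data are not perfectly separable; with perfect separation the coefficients diverge and a regularized fit would be needed.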

🎯 Drug Embeddings

Supports multiple embedding approaches:

  • vLLM: Fast embedding generation from generative models (LLaMA, Mistral)
  • HuggingFace: BERT-like encoder models (Gatortron)
  • Dimensionality Reduction: PCA, UMAP, or Autoencoder compression
  • Dynamic Context: Estimates optimal max_model_len from text distribution
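
One way to estimate a context length from the text distribution is sketched below. This is a guess at the general idea, not the logic in emb.py: the percentile, the safety margin, the rounding multiple, and the use of whitespace splitting in place of a real tokenizer are all assumptions.

```python
import math

def estimate_max_model_len(token_counts, percentile=99.0, margin=8, multiple=64):
    """Pick a context length that covers most inputs.

    Uses the nearest-rank percentile of the token counts, adds a margin for
    special tokens, and rounds up to a multiple (all values hypothetical).
    """
    counts = sorted(token_counts)
    rank = max(1, math.ceil(percentile / 100.0 * len(counts)))
    needed = counts[rank - 1] + margin
    return multiple * math.ceil(needed / multiple)

# whitespace splitting stands in for the model's tokenizer here
texts = [
    "metformin 500 mg oral tablet",
    "insulin glargine 100 units/ml subcutaneous solution",
    "lisinopril 10 mg",
]
token_counts = [len(t.split()) for t in texts]
max_len = estimate_max_model_len(token_counts)
```

Sizing the context to the data rather than to the model's maximum reduces the memory reserved per sequence, which matters because drug descriptions are typically short.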

🔄 Reproduction & Data

Data Configuration

The data.yaml file specifies data paths hosted on the All of Us (AoU) platform. Keeping the paths in this file avoids exposing data buckets and PII in the code. Key sections:

  • features.emr: patient medication records
  • {disease}.cohort: disease-specific cohorts
  • dualr.{disease}.{model}: cached probability scores
  • pdrs.{disease}.cv_dir: cross-validation results
  • emb.{method}.{model}: reduced embeddings
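
The key layout above can be illustrated with a nested dictionary, as it might look once data.yaml is parsed. This is a sketch only: the lookup helper is hypothetical, the model keys are guesses, and every path is a placeholder rather than a real location.

```python
def lookup(config, *keys):
    """Walk nested keys, raising a readable error when one is missing."""
    node = config
    for depth, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            raise KeyError("missing config key: " + ".".join(keys[: depth + 1]))
        node = node[key]
    return node

# placeholder values; the real file holds AoU-hosted paths
config = {
    "features": {"emr": "<emr table path>"},
    "t2d": {"cohort": "<t2d cohort path>"},
    "dualr": {"t2d": {"llama-3.1-8b-instruct": "<dualr scores path>"}},
    "pdrs": {"t2d": {"cv_dir": "<pdrs cv directory>"}},
    "emb": {"pca": {"llama-3.1-8b": "<embedding path>"}},
}

scores_path = lookup(config, "dualr", "t2d", "llama-3.1-8b-instruct")
```

Keys are passed as separate arguments rather than as one dotted string because model names such as llama-3.1-8b-instruct contain dots themselves.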

Reproduction Steps

  1. Setup Environment: Install requirements for each component
  2. Generate DualR Scores: Run dualr.py for each disease with and without CoT (on any HPC, since raw DualR scores are not specific to individuals)
  3. Create Patient Features: Use dualr_post.py to aggregate patient-level scores (on All of Us)
  4. Calculate PdRS: Run pdrs.py for statistical benchmarks (on All of Us)
  5. Generate Embeddings: Use emb.py with desired reduction method (on All of Us)
  6. Run Experiments: Execute ml.py for comprehensive evaluation (on All of Us)
  7. Compare Results: Review generated summaries and visualizations (on All of Us)

📋 Output Structure

results/
├── ml/{disease}/experiments/     # individual experiment results  
├── ml/{disease}/summaries/       # method comparison tables
├── dualr/{disease}/{model}/      # aggregated patient scores
├── pdrs/results/{disease}/       # cross-validated coefficients  
└── emb/{method}/                 # reduced embeddings by technique

📞 Contact

For reproduction, contact Haining Wang (hw56@iu.edu)
For data questions, contact Chenxi Xiong (xiongc@iu.edu)
For general questions, contact PI Dr. Jing Su (su1@iu.edu)

License

MIT

About

Medication-enhanced phenotyping on AoU (v7) and INPC data.
