ContextGem-based extractor that pulls immune checkpoint inhibitor (ICI) drug exposures and acute kidney injury (AKI) context from clinical notes. It runs against a local vLLM-hosted OSS-120B model.
# install
pip install -U contextgem pandas pyarrow tqdm requests
export OPENAI_API_KEY=not-needed
# init oss endpoint
sbatch serve.sbatch
# run pipeline (path_to_notes_parquet is a placeholder for your notes Parquet file)
python run.py --notes path_to_notes_parquet
# or keep only drug exposures from the ICI drug list
python run.py --notes path_to_notes_parquet --filter-ici
Input notes Parquet columns: SERVICE_NAME, PHYSIOLOGIC_TIME, OBFUSCATED_GLOBAL_PERSON_ID, ENCOUNTER_ID, REPORT_TEXT
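Before a long run it can help to confirm that the notes file carries every column listed above. The sketch below is not part of run.py; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the check only compares column names, not types.

```python
# Hypothetical pre-flight check; the column names are copied from the list above.
REQUIRED_COLUMNS = (
    "SERVICE_NAME",
    "PHYSIOLOGIC_TIME",
    "OBFUSCATED_GLOBAL_PERSON_ID",
    "ENCOUNTER_ID",
    "REPORT_TEXT",
)


def missing_columns(available):
    """Return the required input columns absent from `available`, in listed order."""
    present = set(available)
    return [name for name in REQUIRED_COLUMNS if name not in present]


if __name__ == "__main__":
    # Example with two columns missing.
    example = ["SERVICE_NAME", "ENCOUNTER_ID", "REPORT_TEXT"]
    print(missing_columns(example))
```

One way to get the column names without loading the notes is `pyarrow.parquet.read_schema(path).names`, which reads only the file's schema.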
outputs/.../drug_mentions.parquet (one row per ICI exposure)
- patient_id(Int64), encounter_id(Int64), note_id(str), note_date(ts), note_type(str)
- drug_text(str), normalized_name(str), canonical_generic(str|null), class(str|null)
- rxcui(str|null), dose_text(str|null), route_text(str|null), when_text(str|null)
- sentence(str)
outputs/.../note_concepts.parquet (one row per note)
- patient_id(Int64), encounter_id(Int64), note_id(str), note_date(ts), note_type(str)
- has_alt_cause(bool), n_drug_exposures_raw(int), n_drug_exposures(int), n_aki_mentions(int)
- concepts_json(str) with keys: drug_exposures, aki_mentions, attributions, alt_causes
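Since concepts_json is stored as a string, it has to be decoded before use. The sketch below counts entries under each documented key. It assumes each key maps to a JSON list (the layout of the individual entries is not documented here, so the sample uses empty placeholder objects), and `summarize_concepts` is a hypothetical helper, not part of the pipeline.

```python
import json

# The four keys documented for concepts_json.
CONCEPT_KEYS = ("drug_exposures", "aki_mentions", "attributions", "alt_causes")


def summarize_concepts(concepts_json):
    """Count entries under each documented key of one concepts_json string."""
    concepts = json.loads(concepts_json)
    counts = {}
    for key in CONCEPT_KEYS:
        # Assumption: each key holds a list; a missing or null key counts as empty.
        value = concepts.get(key) or []
        counts[key] = len(value)
    return counts


if __name__ == "__main__":
    # Placeholder payload; real entries carry fields not shown here.
    sample = '{"drug_exposures": [{}, {}], "aki_mentions": [{}], "attributions": [], "alt_causes": []}'
    print(summarize_concepts(sample))
```

Applied to each row of note_concepts.parquet, this gives per-note counts that can be compared against the n_drug_exposures and n_aki_mentions columns as a sanity check.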
run.py flags:
- --debug: process only the first 10 notes
- --offset N: skip the first N notes, then process the remaining notes
- --filter-ici: keep only drug exposures that match the ICI vocab