CT-ADE: An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results
@article{yazdani2024ct,
title={CT-ADE: An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results},
author={Yazdani, Anthony and Bornet, Alban and Zhang, Boya and Khlebnikov, Philipp and Amini, Poorya and Teodoro, Douglas},
journal={arXiv preprint arXiv:2404.12827},
year={2024}
}
- Operating System: Ubuntu 22.04.3 LTS
- Kernel: Linux 4.18.0-513.18.1.el8_9.x86_64
- Architecture: x86_64
- Python: 3.10.12
- Set up your environment and install the necessary Python libraries as specified in requirements.txt. Note that you will need to install the development versions of certain libraries from their respective Git repositories.
- Place your unzipped MedDRA files in the directory ./data/MedDRA_25_0_English and your DrugBank XML database in the directory ./data/drugbank.
Ensure you clone and install the following libraries directly from their Git repositories for the development versions:
.
├── a0_download_clinical_trials.py
├── a1_extract_completed_or_terminated_interventional_results_clinical_trials.py
├── a2_extract_and_preprocess_monopharmacy_clinical_trials.py
├── b0_download_pubchem_cids.py
├── b1_download_pubchem_cid_details.py
├── c0_extract_drugbank_dbid_details.py
├── d0_extract_chembl_approved_CHEMBL_details.py
├── data
│   ├── MedDRA_25_0_English
│   │   └── empty.null
│   ├── chembl_approved
│   │   └── empty.null
│   ├── chembl_usan
│   │   └── empty.null
│   ├── clinicaltrials_gov
│   │   └── empty.null
│   ├── drugbank
│   │   └── empty.null
│   └── pubchem
│       └── empty.null
├── e0_extract_chembl_usan_CHEMBL_details.py
├── f0_create_unified_chemical_database.py
├── g0_create_ct_ade_raw.py
├── g1_create_ct_ade_meddra.py
├── g2_create_ct_ade_classification_datasets.py
├── g3_create_ct_ade_friendly_labels.py
├── modeling
│   ├── DLLMs
│   │   ├── config.py
│   │   ├── custom_metrics.py
│   │   ├── model.py
│   │   ├── train.py
│   │   └── utils.py
│   └── GLLMs
│       ├── config-llama3.py
│       ├── config-meditron.py
│       ├── config-openbiollm.py
│       ├── config.py
│       ├── train_S.py
│       ├── train_SG.py
│       └── train_SGE.py
├── requirements.txt
└── src
    └── meddra_graph.py
You can download the publicly available CT-ADE-SOC and CT-ADE-PT versions from HuggingFace. These datasets contain standardized annotations from ClinicalTrials.gov:
Alternatively, the datasets are also available on Figshare:
The above datasets are identical to the SOC and PT versions you will produce in the Typical Pipeline from Checkpoint section.
Follow this procedure if you aim to recreate the dataset detailed in our paper (CT-ADE-SOC, CT-ADE-PT).
Place your unzipped MedDRA files in the directory ./data/MedDRA_25_0_English and your DrugBank XML database in the directory ./data/drugbank.
Download the chembl_approved, chembl_usan, clinicaltrials_gov, and pubchem files and place each set in its matching directory under ./data.
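Before running the pipeline, it can help to confirm that the data directories are populated. The following is a minimal sketch, not part of the repository: the helper name `missing_data_dirs` is hypothetical, the directory names are taken from the repository tree, and the check only verifies that each directory contains something other than the `empty.null` placeholder (it does not validate file contents).

```python
from pathlib import Path

# Directory names taken from the repository tree; adjust if your layout differs.
EXPECTED_DIRS = [
    "MedDRA_25_0_English",
    "drugbank",
    "chembl_approved",
    "chembl_usan",
    "clinicaltrials_gov",
    "pubchem",
]


def missing_data_dirs(data_root="./data"):
    """Return the expected sub-directories that are absent or hold only the placeholder."""
    root = Path(data_root)
    missing = []
    for name in EXPECTED_DIRS:
        folder = root / name
        # Ignore the empty.null placeholder that ships with the repository.
        has_files = folder.is_dir() and any(
            entry.name != "empty.null" for entry in folder.iterdir()
        )
        if not has_files:
            missing.append(name)
    return missing
```

Calling `missing_data_dirs()` from the repository root returns an empty list once every directory holds at least one real file.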
Extract drug details from the DrugBank database.
python c0_extract_drugbank_dbid_details.py
Create a unified database combining information from PubChem, DrugBank, and ChEMBL.
python f0_create_unified_chemical_database.py
Generate the raw CT-ADE dataset from the processed clinical trials data.
python g0_create_ct_ade_raw.py
Annotate the CT-ADE dataset with MedDRA terms.
python g1_create_ct_ade_meddra.py
Generate the final classification datasets for modeling.
python g2_create_ct_ade_classification_datasets.py
As an optional step, you can create a version of the dataset where MedDRA codes are replaced with user-friendly text labels. To do this, run the following command:
python g3_create_ct_ade_friendly_labels.py
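The steps above can also be chained in a small driver script. This is a sketch under stated assumptions: the function names `build_commands` and `run_pipeline` are hypothetical, the script names are copied from the commands above, and it assumes the scripts take no arguments and are run from the repository root.

```python
import subprocess
import sys

# Script names as listed in the steps above, in execution order.
CHECKPOINT_STEPS = [
    "c0_extract_drugbank_dbid_details.py",
    "f0_create_unified_chemical_database.py",
    "g0_create_ct_ade_raw.py",
    "g1_create_ct_ade_meddra.py",
    "g2_create_ct_ade_classification_datasets.py",
]
# Optional step that replaces MedDRA codes with text labels.
OPTIONAL_STEP = "g3_create_ct_ade_friendly_labels.py"


def build_commands(friendly_labels=False):
    """Return one [interpreter, script] command per pipeline step."""
    steps = CHECKPOINT_STEPS + ([OPTIONAL_STEP] if friendly_labels else [])
    return [[sys.executable, step] for step in steps]


def run_pipeline(friendly_labels=False, dry_run=False):
    """Run each step in order; with dry_run=True only print the commands."""
    for cmd in build_commands(friendly_labels):
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

For example, `run_pipeline(friendly_labels=True, dry_run=True)` prints the six commands without executing them.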
Navigate to the modeling/DLLMs directory and run the training scripts with the desired configuration.
cd modeling/DLLMs
For single-GPU training, use this command:
export CUDA_VISIBLE_DEVICES="0"; \
export MIXED_PRECISION="bf16"; \
FIRST_GPU=$(echo $CUDA_VISIBLE_DEVICES | cut -d ',' -f 1); \
BASE_PORT=29500; \
PORT=$(( $BASE_PORT + $FIRST_GPU )); \
accelerate launch \
--mixed_precision=$MIXED_PRECISION \
--num_processes=$(( $(echo $CUDA_VISIBLE_DEVICES | grep -o "," | wc -l) + 1 )) \
--num_machines=1 \
--dynamo_backend=no \
--main_process_port=$PORT \
train.py
For multi-GPU training, use this command:
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"; \
export MIXED_PRECISION="bf16"; \
FIRST_GPU=$(echo $CUDA_VISIBLE_DEVICES | cut -d ',' -f 1); \
BASE_PORT=29500; \
PORT=$(( $BASE_PORT + $FIRST_GPU )); \
accelerate launch \
--mixed_precision=$MIXED_PRECISION \
--num_processes=$(( $(echo $CUDA_VISIBLE_DEVICES | grep -o "," | wc -l) + 1 )) \
--num_machines=1 \
--dynamo_backend=no \
--main_process_port=$PORT \
train.py
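Both commands derive two values from CUDA_VISIBLE_DEVICES: the number of processes (one per listed GPU, computed as the comma count plus one) and the main process port (BASE_PORT plus the index of the first listed GPU). Offsetting the port by the first GPU index presumably lets several jobs on different GPUs of the same machine avoid a port collision. The arithmetic can be sketched in Python as follows; the function name `launch_settings` is hypothetical, and the sketch assumes numeric GPU indices rather than GPU UUIDs.

```python
def launch_settings(cuda_visible_devices, base_port=29500):
    """Mirror the shell arithmetic: return (num_processes, main_process_port)."""
    gpu_ids = [g.strip() for g in cuda_visible_devices.split(",") if g.strip()]
    if not gpu_ids:
        raise ValueError("CUDA_VISIBLE_DEVICES must list at least one GPU index")
    # One process per visible GPU (same as comma count + 1 in the shell version).
    num_processes = len(gpu_ids)
    # Port is offset by the first GPU index, as in PORT=$(( BASE_PORT + FIRST_GPU )).
    port = base_port + int(gpu_ids[0])
    return num_processes, port


print(launch_settings("0"))                # (1, 29500)
print(launch_settings("0,1,2,3,4,5,6,7"))  # (8, 29500)
print(launch_settings("2,3"))              # (2, 29502)
```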
Navigate to the modeling/GLLMs directory and run the training scripts for different configurations.
cd modeling/GLLMs
Example configurations for Llama3, OpenBioLLM, and Meditron are provided in the folder. Copy the desired configuration into config.py and adjust it as needed. You can then run the SGE configuration with:
python train_SGE.py
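Copying an example configuration into config.py can be scripted. The helper below is a hypothetical sketch, not part of the repository: the name `activate_config` is an assumption, and it relies only on the `config-<model>.py` file naming shown in the repository tree. It overwrites any existing config.py, so keep a backup if you have local edits.

```python
import shutil
from pathlib import Path


def activate_config(model_name, config_dir="."):
    """Copy config-<model_name>.py over config.py and return the target path."""
    config_dir = Path(config_dir)
    source = config_dir / f"config-{model_name}.py"
    if not source.is_file():
        raise FileNotFoundError(f"No example configuration found at {source}")
    target = config_dir / "config.py"
    # Overwrites an existing config.py with the chosen example.
    shutil.copyfile(source, target)
    return target
```

From inside modeling/GLLMs, `activate_config("llama3")` would copy config-llama3.py to config.py; edit the result before launching training.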