This repository contains supplementary materials for our papers "Medical Concept Normalization in Clinical Trials with Drug and Disease Representation Learning" (Bioinformatics) and "Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer" (ECIR 2021). We investigate the effectiveness of transferring concept normalization from the general biomedical domain to the clinical trials domain in a zero-shot setting, i.e., in the absence of labeled data. We propose a simple and effective two-stage neural approach based on fine-tuned BERT architectures. In the first stage, we train a metric learning model that optimizes the relative similarity of mentions and concepts via a triplet loss. The model is trained on available labeled corpora of scientific abstracts to obtain vector embeddings of concept names and entity mentions. In the second stage, we map each clinical mention to the concept whose name representation is closest in the embedding space. We evaluated several models, including state-of-the-art architectures, on a dataset of abstracts and a real-world dataset of trial records with interventions and conditions mapped to drug and disease terminologies. Extensive experiments validate the effectiveness of our approach in knowledge transfer from the scientific literature to clinical trials.
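Concretely, the first stage minimizes a standard triplet loss over (mention, positive concept name, negative concept name) triples. The notation below is illustrative: $e_x$ denotes a BERT embedding, $d$ a distance in the embedding space, and $m$ a margin hyperparameter:

$$\mathcal{L}(a, p, n) = \max\bigl(0,\; d(e_a, e_p) - d(e_a, e_n) + m\bigr)$$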
Table 1. Out-of-domain performance of the proposed DILBERT model and baselines in terms of Acc@1 on the filtered test set of clinical trials (CT).
| Model | CT Condition: single concept | CT Condition: full set | CT Intervention: single concept | CT Intervention: full set |
| --- | --- | --- | --- | --- |
| BioBERT ranking | 72.6 | 71.74 | 77.83 | 56.97 |
| BioSyn | 86.36 | - | 79.58 | - |
| DILBERT with different ranking strategies | | | | |
| random sampling | 85.73 | 84.85 | 82.54 | 81.16 |
| random + 2 parents | 86.74 | 86.36 | 81.84 | 79.14 |
| random + 5 parents | 87.12 | 86.74 | 81.67 | 79.14 |
| resampling | 85.22 | 84.63 | 81.67 | 80.21 |
| resampling + 5 siblings | 84.84 | 84.26 | 80.62 | 76.16 |
Table 2. In-domain performance of the proposed DILBERT model in terms of Acc@1 on the refined test set of the BioCreative V CDR corpus. For more details about the refined CDR corpus, please see our paper "Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models".
| Model | CDR Disease | CDR Chemical |
| --- | --- | --- |
| BioBERT ranking | 66.4 | 80.7 |
| BioSyn | 74.1 | 83.8 |
| DILBERT, random sampling | 75.5 | 81.4 |
| DILBERT, random + 2 parents | 75.0 | 81.2 |
| DILBERT, random + 5 parents | 73.5 | 81.4 |
| DILBERT, resampling | 75.8 | 83.3 |
| DILBERT, resampling + 5 siblings | 75.3 | 82.1 |
Figure 1. In-domain performance of the proposed DILBERT model in terms of Acc@1 on the refined test set of the BioCreative V CDR corpus using reduced dictionaries.
To install the dependencies, run:

$ pip install -r requirements.txt
We use the Hugging Face version of BioBERT v1.1 so that the pretrained model can be run with the PyTorch framework.
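For instance, the checkpoint can be loaded via the transformers library; the snippet below assumes the dmis-lab/biobert-v1.1 model card on the Hugging Face Hub:

```python
from transformers import AutoModel, AutoTokenizer

# Load the PyTorch BioBERT v1.1 checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")
```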
We have made all datasets available.
To run the full training and evaluation procedure, use the run.sh script:
$ ./run.sh
To build the triplet training file from labeled data, run:

$ python data_utils/convert_to_triplet_dataset.py --input_data path/to/labeled/files \
--vocab path/to/vocabulary \
--save_to path/to/save/triplets/file \
--path_to_bert_model path/to/bert/model \
--hard \
--hierarchy path/to/hierarchy/file \
--hierarchy_aware
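The --hierarchy and --hierarchy_aware flags presumably correspond to the parent/sibling negative-sampling strategies from Table 1. Below is a hedged sketch of what hierarchy-aware triplet construction can look like; the function and data structures are illustrative, not the script's actual API:

```python
import random

# Illustrative-only sketch: pair each mention's gold concept name (positive)
# with negatives drawn at random and from the gold concept's parents in the
# terminology hierarchy (the "random + N parents" strategy from Table 1).
def make_triplets(mentions, vocab, parents, n_random=1, n_parents=2):
    triplets = []
    all_names = list(vocab.values())
    for mention, gold_id in mentions:            # (mention text, gold concept id)
        positive = vocab[gold_id]                # preferred concept name
        negatives = random.sample([n for n in all_names if n != positive], n_random)
        negatives += [vocab[p] for p in parents.get(gold_id, [])[:n_parents]]
        triplets.extend((mention, positive, neg) for neg in negatives)
    return triplets
```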
To train the metric learning model on the triplet file, run:

$ python train_sentence_bert.py --path_to_bert_model path/to/bert/model \
--data_folder path/to/folder/containing/triplet/file \
--triplets_file triplet_file_name \
--output_dir path/to/save/model
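The script name suggests the sentence-transformers library; the following is a hedged sketch of the metric-learning stage under that assumption, with toy triplet texts, not the script's exact logic:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("dmis-lab/biobert-v1.1")  # BioBERT + mean pooling

# Each example is (mention, positive concept name, negative concept name).
train_examples = [
    InputExample(texts=["heart attack", "Myocardial Infarction", "Cardiac Arrest"]),
    InputExample(texts=["high blood sugar", "Hyperglycemia", "Hypoglycemia"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
triplet_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_loader, triplet_loss)],
          epochs=1,
          output_path="path/to/save/model")
```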
To evaluate the model, run:
$ python eval_bert_ranking.py --model_dir path/to/bert/model \
--data_folder path/to/labeled/files \
--vocab path/to/vocabulary
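For intuition, here is a minimal sketch of the second stage that the evaluation measures: embed all concept names once, then map each mention to the nearest concept by cosine similarity. The tab-separated vocabulary format (concept_id, concept_name) is an assumption for illustration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path/to/save/model")  # model from the training step

# Assumed vocabulary format: one "concept_id\tconcept_name" pair per line.
with open("path/to/vocabulary") as f:
    concept_ids, concept_names = zip(*(line.rstrip("\n").split("\t") for line in f))

concept_emb = model.encode(list(concept_names), normalize_embeddings=True)
mention_emb = model.encode(["heart attack", "type 2 diabetes"], normalize_embeddings=True)

nearest = np.argmax(mention_emb @ concept_emb.T, axis=1)  # cosine sim on unit vectors
predictions = [concept_ids[i] for i in nearest]           # Acc@1 compares these to gold ids
print(predictions)
```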
Miftahutdinov Z., Kadurin A., Kudrin R., Tutubalina E. Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer // Advances in Information Retrieval. 2021. pp. 451-466. paper, preprint
@InProceedings{10.1007/978-3-030-72113-8_30,
author="Miftahutdinov, Zulfat and Kadurin, Artur and Kudrin, Roman and Tutubalina, Elena",
title="Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer",
booktitle="Advances in Information Retrieval",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="451--466",
isbn="978-3-030-72113-8"
}
Miftahutdinov Z., Kadurin A., Kudrin R., Tutubalina E. Medical concept normalization in clinical trials with drug and disease representation learning // Bioinformatics. 2021. Vol. 37, No. 21. pp. 3856-3864. paper
@article{10.1093/bioinformatics/btab474,
author = {Miftahutdinov, Zulfat and Kadurin, Artur and Kudrin, Roman and Tutubalina, Elena},
title = "{Medical concept normalization in clinical trials with drug and disease representation learning}",
journal = {Bioinformatics},
volume = {37},
number = {21},
pages = {3856-3864},
year = {2021},
month = {07},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btab474},
url = {https://doi.org/10.1093/bioinformatics/btab474},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/37/21/3856/41091512/btab474.pdf},
}
Tutubalina E., Kadurin A., Miftahutdinov Z. Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models // Proceedings of the 28th International Conference on Computational Linguistics. 2020. pp. 6710-6716. paper, git
@inproceedings{tutubalina2020fair,
title={Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models},
author={Tutubalina, Elena and Kadurin, Artur and Miftahutdinov, Zulfat},
booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
pages={6710--6716},
year={2020}
}