Resources of pre-trained language models on clinical texts.
As of July 8, 2019, the following models have been made available:
- ELMo

  Each `.tar.gz` file contains two items: a `.json` file with the pre-training architecture and a `.hdf5` file with the pre-trained weights (see the loading sketch after this list).

- BERT
  - Large Cased Models
  - Base Cased Models

  Each `.tar.gz` file contains a TensorFlow checkpoint (`model.ckpt.*`, which is actually 3 files) holding the pre-trained weights (see the conversion sketch after this list). We followed the authors' detailed instructions to set up the pre-training parameters, so the pre-training architecture files (`bert_config.json`) are the same as those of the corresponding released BERT models. The vocabulary list (`vocab.txt`) released by the Google team, consisting of 28,996 word-piece tokens, is also adopted.
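For convenience, here is a minimal loading sketch for the ELMo files, assuming an older `allennlp` release (≤0.9) that still ships `ElmoEmbedder`; the file names below are placeholders for the `.json` and `.hdf5` files extracted from the archive:

```python
# Minimal sketch: embedding a tokenized clinical sentence with the pre-trained ELMo.
# Assumes allennlp <= 0.9 (which provides ElmoEmbedder); the file names are
# placeholders for the .json and .hdf5 files extracted from the .tar.gz archive.
from allennlp.commands.elmo import ElmoEmbedder

options_file = "elmo_options.json"  # pre-training architecture
weight_file = "elmo_weights.hdf5"   # pre-trained weights

elmo = ElmoEmbedder(options_file=options_file, weight_file=weight_file)

# embed_sentence returns a numpy array of shape (3, num_tokens, 1024):
# one 1024-dimensional representation per token for each of the 3 ELMo layers.
tokens = ["The", "patient", "denies", "chest", "pain", "."]
vectors = elmo.embed_sentence(tokens)
print(vectors.shape)
```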
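Likewise, a hedged sketch of converting the BERT TensorFlow checkpoint into a Hugging Face `transformers` model for downstream fine-tuning; the paths are placeholders for the files inside the extracted archive, and the weight conversion requires TensorFlow to be installed alongside `transformers`:

```python
# Minimal sketch: converting the released TensorFlow checkpoint to a PyTorch model
# with Hugging Face transformers. Paths are placeholders for the files found in the
# extracted .tar.gz archive; TensorFlow must be installed for the weight conversion.
from transformers import (
    BertConfig,
    BertForPreTraining,
    BertTokenizer,
    load_tf_weights_in_bert,
)

config = BertConfig.from_json_file("bert_config.json")  # pre-training architecture
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "model.ckpt")     # checkpoint prefix (3 files)

# Cased vocabulary of 28,996 word-piece tokens, so do_lower_case=False.
tokenizer = BertTokenizer("vocab.txt", do_lower_case=False)

# Save in the Hugging Face format for later fine-tuning on clinical NLP tasks.
model.save_pretrained("clinical_bert_pytorch")
tokenizer.save_pretrained("clinical_bert_pytorch")
```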
We are grateful to the authors of BERT and ELMo for making their pre-training code and instructions publicly available. We are also thankful to the MIMIC-III team for providing valuable resources on clinical text. Please follow the instructions to get access to the MIMIC-III data before downloading the above pre-trained models.
If you use the models available in this repository, we would be grateful if you would cite the following paper:
- Si, Yuqi, Jingqi Wang, Hua Xu, and Kirk Roberts. 2019. “Enhancing Clinical Concept Extraction with Contextual Embeddings.” Journal of the American Medical Informatics Association, July, ocz096. https://doi.org/10.1093/jamia/ocz096.
```bibtex
@article{si_enhancing_2019,
  title = {Enhancing clinical concept extraction with contextual embeddings},
  issn = {1527-974X},
  url = {https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocz096/5527248},
  doi = {10.1093/jamia/ocz096},
  language = {en},
  urldate = {2019-07-09},
  journal = {Journal of the American Medical Informatics Association},
  author = {Si, Yuqi and Wang, Jingqi and Xu, Hua and Roberts, Kirk},
  month = jul,
  year = {2019},
  pages = {ocz096}
}
```