This repository contains the supplementary material for the paper "Automatic Document Screening of Medical Literature Using Word and Text Embeddings in an Active Learning Setting".
The dataset can be downloaded here.
The dataset includes the following files:
- Active Learning Epistemonikos raw results
- Active Learning HealthCLEF raw results
- CLEF active learning ML pre-trained models
- Epistemonikos active learning ML pre-trained models
- CLEF BERT embeddings
- CLEF BioBERT embeddings
- CLEF Word2Vec embeddings
- CLEF TF-IDF representations
- CLEF GloVe embeddings
- Epistemonikos BERT embeddings
- Epistemonikos BioBERT embeddings
- Epistemonikos Word2Vec embeddings
- Epistemonikos TF-IDF representations
- Epistemonikos GloVe embeddings
To run the scripts, download all of these files into the directory $DATASET_DIR.
We assume the dataset has been downloaded to $DATASET_DIR.
- Unzip all files.
- Choose Epistemonikos Active Learning or HealthCLEF Active Learning to start training.
- Instructions to run each script are included in each Jupyter notebook.
Note: To replicate the relevance feedback results, the documents and medical questions must first be indexed in Elasticsearch.
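The "Unzip all files" step can be scripted. The following is a minimal sketch, assuming the downloaded archives are `.zip` files located directly inside $DATASET_DIR; the helper name `unzip_all` is hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path


def unzip_all(dataset_dir):
    """Extract every .zip archive found directly inside dataset_dir, in place.

    Assumption: the dataset files are plain .zip archives. Returns the names
    of the archives that were extracted, in alphabetical order.
    """
    dataset_dir = Path(dataset_dir)
    extracted = []
    for archive in sorted(dataset_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dataset_dir)
        extracted.append(archive.name)
    return extracted


# Example call (reads the directory from the environment):
#   import os
#   unzip_all(os.environ["DATASET_DIR"])
```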
We assume the dataset has been downloaded to $DATASET_DIR.
- Unzip all files.
- Choose Epistemonikos pre-trained models or HealthCLEF pre-trained models.
- Instructions to run each script are included in each Jupyter notebook.
We assume the dataset has been downloaded to $DATASET_DIR.
- Unzip all files.
- Run raw predictions Epistemonikos or raw predictions HealthCLEF.
- Instructions to run each script are included in each Jupyter notebook.
To replicate the plots reported in the paper, run plot results HealthCLEF or plot results Epistemonikos.
The active learning process starts with a set of candidate documents, which are selected for labeling according to an active learning strategy (uncertainty or random sampling). The oracle (a domain expert) then labels these documents, the system trains a machine learning model on all labels collected so far, and the newly trained model makes predictions. These predictions are used to sample the next set of candidate documents, and the cycle repeats.
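The cycle described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code from the notebooks: every function name here is hypothetical, the oracle is simulated by a callable, and the toy centroid model is a dependency-free stand-in for the classifiers actually used (for example Logistic Regression or Random Forest). Uncertainty sampling is implemented here as "pick the documents whose predicted relevance is closest to 0.5", which is one common formulation and is an assumption about the exact variant.

```python
import math
import random


def uncertainty_sampling(pool, scores, batch_size):
    """Select the pool items whose predicted relevance is closest to 0.5."""
    ranked = sorted(zip(pool, scores), key=lambda pair: abs(pair[1] - 0.5))
    return [doc_id for doc_id, _ in ranked[:batch_size]]


def random_sampling(pool, scores, batch_size):
    """Select pool items uniformly at random (scores are ignored)."""
    return random.sample(pool, min(batch_size, len(pool)))


def train_centroid(xs, ys):
    """Toy stand-in model: the mean feature value of each class."""
    overall = sum(xs) / len(xs)
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    # Fall back to the overall mean while a class has no labeled example.
    pos_mean = sum(pos) / len(pos) if pos else overall
    neg_mean = sum(neg) / len(neg) if neg else overall
    return pos_mean, neg_mean


def predict_centroid(model, xs):
    """Return a relevance score in (0, 1) for each feature value."""
    pos_mean, neg_mean = model
    midpoint = (pos_mean + neg_mean) / 2
    direction = 1.0 if pos_mean >= neg_mean else -1.0
    return [1 / (1 + math.exp(-direction * (x - midpoint))) for x in xs]


def active_learning_loop(features, oracle, train, predict, strategy,
                         first_batch, batch_size=10, rounds=5):
    """Run label -> train -> predict -> sample for a fixed number of rounds."""
    labels = {}
    batch = list(first_batch)
    model = None
    for _ in range(rounds):
        # 1. The oracle labels the current candidate documents.
        for doc_id in batch:
            labels[doc_id] = oracle(doc_id)
        # 2. Train on every label collected so far.
        ids = list(labels)
        model = train([features[i] for i in ids], [labels[i] for i in ids])
        # 3. Predict on the documents that are still unlabeled.
        pool = [i for i in range(len(features)) if i not in labels]
        if not pool:
            break
        scores = predict(model, [features[i] for i in pool])
        # 4. Use the predictions to sample the next candidates.
        batch = strategy(pool, scores, batch_size)
    return model, labels
```

Passing `random_sampling` instead of `uncertainty_sampling` as `strategy` gives the random baseline without changing the rest of the loop.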
dataset | embedding | model | active learning | recall@10 | MAP | lastrel% |
---|---|---|---|---|---|---|
Epistemonikos | Word2Vec | Logistic Regression | Uncertainty Sampling | .717 | .768 | 14.8% |
HealthCLEF | BioBERT | Random Forest | Uncertainty Sampling | .571 | .910 | 4.5% |
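
For readers unfamiliar with the metrics in the table, the sketch below shows how they are commonly computed from a ranked list of binary relevance labels (1 = relevant, 0 = not relevant). These are the standard information retrieval definitions and are an assumption about how the reported numbers were obtained. In particular, lastrel% is interpreted here as the rank of the last relevant document expressed as a percentage of the collection size. All function names are hypothetical.

```python
def recall_at_k(ranked_labels, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0
    return sum(ranked_labels[:k]) / total_relevant


def average_precision(ranked_labels):
    """Mean of the precision values at the rank of each relevant document."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: average precision averaged over several rankings (one per question)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def last_relevant_percent(ranked_labels):
    """Rank of the last relevant document as a percentage of the list length."""
    ranks = [rank for rank, rel in enumerate(ranked_labels, start=1) if rel]
    if not ranks:
        return 0.0
    return 100.0 * ranks[-1] / len(ranked_labels)
```

Under this reading, a lower lastrel% means that all relevant documents are found after screening a smaller share of the collection.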