Active Learning setting for medical documents embeddings

This repository contains the supplementary material for the paper "Automatic Document Screening of Medical Literature Using Word and Text Embeddings in an Active Learning Setting".

Dataset

The dataset can be downloaded here.

Files included in the dataset are:

Active Learning Epistemonikos raw results
Active Learning HealthCLEF raw results
CLEF active learning ML pre-trained models
Epistemonikos active learning ML pre-trained models
CLEF BERT embeddings
CLEF BioBERT embeddings
CLEF Word2Vec embeddings
CLEF TF-IDF representations
CLEF GloVe embeddings
Epistemonikos BERT embeddings
Epistemonikos BioBERT embeddings
Epistemonikos Word2Vec embeddings
Epistemonikos TF-IDF representations
Epistemonikos GloVe embeddings

To run the scripts, download all of these files into the directory $DATASET_DIR.

Training Machine Learning Models from scratch using Active Learning

We assume the dataset has been downloaded to $DATASET_DIR.

Unzip all files.
Choose the Epistemonikos Active Learning or the HealthCLEF Active Learning notebook to start training.
Instructions to run each script are included in each Jupyter notebook.
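
The unzip step can also be scripted. The following is a minimal sketch that assumes the downloaded files are standard .zip archives placed directly inside $DATASET_DIR; the helper name unzip_all is hypothetical and not part of this repository.

```python
import os
import zipfile
from pathlib import Path


def unzip_all(dataset_dir):
    """Extract every .zip archive found directly inside dataset_dir, in place.

    Assumption: the dataset files are plain .zip archives.
    Returns the names of the archives that were extracted.
    """
    dataset_dir = Path(dataset_dir)
    extracted = []
    for archive in sorted(dataset_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dataset_dir)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    # Only run when DATASET_DIR is set in the environment.
    target = os.environ.get("DATASET_DIR")
    if target:
        print(unzip_all(target))
```

The same helper applies to the pre-trained model and raw prediction workflows, since they start from the same downloaded files.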

Note: To replicate the relevance feedback results, the documents and medical questions need to be indexed in Elasticsearch.

Use pre-trained Machine Learning models

We assume the dataset has been downloaded to $DATASET_DIR.

Unzip all files.
Choose the Epistemonikos pre-trained models or the HealthCLEF pre-trained models notebook.
Instructions to run each script are included in each Jupyter notebook.

Use raw prediction files

We assume the dataset has been downloaded to $DATASET_DIR.

Unzip all files.
Run the raw predictions Epistemonikos or the raw predictions HealthCLEF notebook.
Instructions to run each script are included in each Jupyter notebook.

Replicate results plots

To replicate the plots reported in the paper, run plot results HealthCLEF or plot results Epistemonikos.

Illustration of the Active Learning approach

The process starts with a set of candidate documents, which are selected for labeling according to an active learning strategy (uncertainty sampling or random sampling). The oracle (a domain expert) then labels these documents, the system uses the labels to train a machine learning model, and the latest trained model makes predictions. These predictions are used to sample the next set of candidate documents.
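
The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in the notebooks: documents are reduced to a single numeric feature, a toy threshold model stands in for the classifiers (such as Logistic Regression or Random Forest), and all function names are hypothetical.

```python
import math


def fit_threshold(xs, ys):
    """Toy stand-in model: midpoint between the mean positive and mean negative feature."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_proba(threshold, x):
    """Predicted probability of relevance: a logistic curve centred on the threshold."""
    return 1.0 / (1.0 + math.exp(-(x - threshold)))


def uncertainty_sampling(scores, batch_size):
    """Return the document ids whose predicted probability is closest to 0.5."""
    return sorted(scores, key=lambda i: abs(scores[i] - 0.5))[:batch_size]


def active_learning(features, oracle, seed_ids, batch_size=2, rounds=3):
    """Run the label -> train -> predict -> sample cycle for a number of rounds.

    seed_ids must contain at least one relevant and one non-relevant document,
    otherwise the toy model cannot be fitted.
    """
    labels = {}
    candidates = list(seed_ids)  # initial candidate documents
    for _ in range(rounds):
        for i in candidates:  # the oracle adds new labels
            labels[i] = oracle(i)
        ids = list(labels)
        model = fit_threshold([features[i] for i in ids], [labels[i] for i in ids])
        # Predict on every document that has not been labeled yet.
        scores = {i: predict_proba(model, x)
                  for i, x in enumerate(features) if i not in labels}
        if not scores:  # nothing left to label
            break
        candidates = uncertainty_sampling(scores, batch_size)
    return labels


# Hypothetical toy data: documents with a feature of 3.5 or more are relevant.
features = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
labels = active_learning(features, lambda i: int(features[i] >= 3.5),
                         seed_ids=[0, 7], rounds=2)
print(sorted(labels))
```

Swapping uncertainty_sampling for a function that draws ids at random would give the random sampling strategy.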

Results

dataset	embedding	model	active learning	recall@10	MAP	lastrel%
Epistemonikos	Word2Vec	Logistic Regression	Uncertainty Sampling	.717	.768	14.8%
HealthCLEF	BioBERT	Random Forest	Uncertainty Sampling	.571	.910	4.5%
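
For readers unfamiliar with the metric columns, the sketch below shows one common way to compute them from a ranked list of binary relevance labels. The definitions are assumptions, since they are not spelled out here: recall@10 is taken as recall within the top 10 ranked documents (it could instead denote a 10% cutoff), MAP as the mean of per-question average precision, and lastrel% as the rank of the last relevant document expressed as a percentage of the ranking length. The function names are hypothetical.

```python
def recall_at_k(ranked_labels, k):
    """Fraction of all relevant documents found in the top k positions."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0


def average_precision(ranked_labels):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average precision averaged over several rankings (one per question)."""
    return sum(average_precision(r) for r in rankings) / len(rankings) if rankings else 0.0


def last_relevant_pct(ranked_labels):
    """Rank of the last relevant document as a percentage of the ranking length."""
    if not ranked_labels:
        return 0.0
    last = max((rank for rank, rel in enumerate(ranked_labels, start=1) if rel), default=0)
    return 100.0 * last / len(ranked_labels)


# Hypothetical ranking: 1 marks a relevant document, 0 a non-relevant one.
ranking = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(recall_at_k(ranking, 1), average_precision(ranking), last_relevant_pct(ranking))
```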
