Skip to content

Latest commit

 

History

History
82 lines (49 loc) · 3.29 KB

README.md

File metadata and controls

82 lines (49 loc) · 3.29 KB

Automated Updates for Scoping Reviews of Environmental Drivers of Human and Animal Diseases

This study aims to compare three NLP methods for extracting named entities with complex labels and very limited training data.

We compare:

  • Fine-tuning BERT on NER classification
  • Data augmentation with GPT-3.5 and fine-tuning BERT on both the original and data-augmented training datasets
  • OpenAI (GPT-3.5 and GPT-4) with RAG based on the same training dataset

We trained our methods on an Influenza corpus and evaluated the ability of these approaches to generalize to other diseases (Leptospirosis and Chikungunya).

Reproduce the Article

0. Download the Papers and Their Manual Annotations

  1. Download the manual annotations:
  1. Download the papers used for the manual annotations:

    • Download all the papers mentioned in the manual annotation using their DOI.
  2. Convert PDFs into TEI:

1. Generate SpaCy-like Annotations

The two methods described below can be run using this notebook: generate_annotation.ipynb.

Work in progress: the notebook needs to be adapted to the data from the Zenodo repository.

From Manual Annotations:

The manual annotations are at the document level. To fine-tune BERT-like pre-trained models, we need to generate a SpaCy annotation schema.

From Data Augmentation Using GPT-3.5:

Use GPT-3.5 to create synthetic data from the manual annotations.

2. Train BERT-like Models and Create the RAG Process for LLMs

Train BERT-like Models:

Train 3 models:

  • roberta-base
  • microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  • FacebookAI/xlm-roberta-base

On two datasets:

  • From the manual annotations
  • From the manual annotations + synthetic data

All these 6 trainings can be done using this notebook: train_models.ipynb.

Work in progress: the path to the training dataset needs to be adapted to the current environment.

Then infer with the models trained on the whole datasets (the 3 diseases), using this script: full_article_inference.py.

RAG Process for LLMs:

Create a RAG database (FAISS) and a Langchain pipeline for:

  • GPT-3.5
  • GPT-4

Using this notebook: RAG.ipynb.

3. Evaluate the Results

Compare cosine similarity between pairs (annotation/prediction). Extract only the best match for each article (even if some articles have several covariates annotated).

Run this script: Evaluate_at_document_level.py.


Acknowledgement:

This study was partially funded by EU grant 874850 MOOD. The contents of this publication are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission

mood