Image courtesy of Stable Diffusion 2.1
This repo contains the code for the Baseline Fact Extraction and VERification System (BEVERS). The pipeline uses standard approaches for each of its components. Despite its simplicity, BEVERS achieves SOTA performance on FEVER (old leaderboard, new leaderboard) and the highest label F1 score on SciFact (leaderboard).
- conda
To create the `bevers` conda environment and install the Python and other dependencies, run the `setup.sh` script. The script requires `sudo` access to set up SQLite as a fuzzy string search engine.
bash setup.sh
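SQLite has no fuzzy matching built in; the usual way to get it is the `spellfix1` extension, and the sketch below shows the general idea in Python. Whether BEVERS actually uses `spellfix1`, along with the extension path and table layout here, are assumptions rather than the repo's actual setup; `setup.sh` contains the real build steps.

```python
# Hypothetical sketch of fuzzy title lookup via SQLite's spellfix1 extension.
# The extension path and table layout are assumptions, not BEVERS's actual code.
import sqlite3

conn = sqlite3.connect("titles.db")
conn.enable_load_extension(True)
conn.load_extension("./spellfix")  # compiled spellfix1 shared library

conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS titles USING spellfix1")
conn.executemany(
    "INSERT INTO titles(word) VALUES (?)",
    [("Telemundo",), ("Soul Food (film)",), ("The Beatles",)],
)

# Return the closest matches to a (possibly misspelled) query string.
rows = conn.execute(
    "SELECT word, distance FROM titles WHERE word MATCH ? AND top=3",
    ("Telemundu",),
).fetchall()
print(rows)
```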
There are run scripts for FEVER, SciFact, and PubMed. The general BEVERS pipeline is as follows (PubMed is an exception); a minimal sketch of the TF-IDF step is shown after the list:
- TF-IDF setup (and fuzzy string search for FEVER)
- Sentence selection dataset generation, model training, and final dumping of sentence scores.
- Claim classification training and dumping of claim scores.
- Training of XGBoost classifier
- Generating final output files for submission to leaderboards for scoring.
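As a rough illustration of the first step, here is a minimal TF-IDF retrieval sketch using scikit-learn. The n-gram range and other settings are placeholders rather than the repo's actual configuration.

```python
# Minimal sketch of the TF-IDF retrieval step: rank documents against a claim.
# Illustration only; the preprocessing and hyperparameters in the repo may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = [
    "Telemundo is an American broadcast television network.",
    "Soul Food is a 1997 American comedy-drama film.",
    "The Beatles were an English rock band formed in Liverpool.",
]
claim = "Telemundo is an English-language television network."

vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
doc_matrix = vectorizer.fit_transform(docs)   # (n_docs, n_features)
claim_vec = vectorizer.transform([claim])     # (1, n_features)

# Cosine similarity (vectors are L2-normalized by TfidfVectorizer by default).
scores = linear_kernel(claim_vec, doc_matrix).ravel()
top_k = scores.argsort()[::-1][:2]
for idx in top_k:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```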
# Run FEVER
bash run_fever.sh
# Run PubMed (NOTE: manual effort is needed here to download required dataset files)
bash run_pubmed.sh
# Run SciFact (running PubMed is a prerequisite here)
bash run_scifact.sh
Results on the FEVER test set:

System | Test Label Accuracy | Test FEVER Score |
---|---|---|
LisT5 | 79.35 | 75.87 |
Stammbach | 79.16 | 76.78 |
ProoFVer | 79.47 | 76.82 |
Ours (RoBERTa Large MNLI) mixed | 79.39 | 76.89 |
Ours (DeBERTa v2 XL MNLI) mixed | 80.24 | 77.70 |
Results on the SciFact test set:

System | Sentence Selection + Label (SS+L) F1 | Abstract Label-Only F1 |
---|---|---|
VerT5erini | 58.8 | 64.9 |
ASRJoint | 63.1 | 68.1 |
MultiVers | 67.2 | 72.5 |
Ours | 58.1 | 73.2 |
- Release models - Done via Docker image (03/05/23)
- Finish cleaning up code (started on this but didn't finish)
- Update demo - Done (03/02/23)
- Properly document the source of code that is not mine; some of it was copied directly from the evaluation repos for ease of use.
- Improve retrieval for SciFact utilizing neural re-rankers, as most other systems do (a rough sketch of the idea is shown after this list).
- Release easy-to-use predictions for sentence selection. This helps people who only want to focus on the claim classification portion of the task. - Done via Docker image (03/05/23)
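For the neural re-ranking item above, the usual recipe is a cross-encoder that re-scores the TF-IDF candidates. A minimal sketch with `sentence-transformers` follows; the model name is just one common choice, and none of this is implemented in the repo yet.

```python
# Hypothetical re-ranking sketch (not implemented in BEVERS): re-score TF-IDF
# candidates for a claim with an off-the-shelf cross-encoder.
from sentence_transformers import CrossEncoder

claim = "Aspirin reduces the risk of cardiovascular events."
candidates = [
    "Low-dose aspirin lowered the incidence of myocardial infarction in the trial.",
    "The study evaluated ibuprofen for post-operative pain management.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(claim, passage) for passage in candidates])

# Keep the highest-scoring passages for downstream claim classification.
reranked = sorted(zip(scores, candidates), reverse=True)
for score, passage in reranked:
    print(f"{score:.3f}  {passage}")
```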
During my initial code cleanup I changed a fair amount of code, so prior to release I ran regression tests on FEVER and SciFact to make sure the results were still reproducible.
FEVER regression results:

Run | Test Label Accuracy | Test FEVER Score |
---|---|---|
Published (RoBERTa Large MNLI) | 79.39 | 76.89 |
Regression (02/20/23) | 79.31 | 76.91 |
Published (DeBERTa v2 XL MNLI) | 80.24 | 77.70 |
Regression (02/22/23) | 80.35 | 77.86 |
SciFact regression results:

Run | Sentence Selection + Label (SS+L) F1 | Abstract Label-Only F1 |
---|---|---|
Published | 58.1 | 73.2 |
Regression (02/26/23) | 58.3 | 73.8 |
As a means of distributing the system, BEVERS is made available as Docker images:
- BEVERS:
docker pull mitchelldehaven/bevers
- BEVERS frontend:
docker pull mitchelldehaven/bevers_frontend
For running the demo, the following must be done for the backend Flask API:
docker run -p 5000:5000 -it --gpus all mitchelldehaven/bevers
conda activate bevers
export DATASET=fever
export PYTHONPATH=.
export FLASK_APP=demo/backend/app.py
flask run --host=0.0.0.0
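Once the backend is up, it can also be queried directly without the UI. The snippet below is only a guess at the interface: the route name and the JSON keys are hypothetical, so check demo/backend/app.py for the actual endpoint and payload.

```python
# Hypothetical example of hitting the demo backend directly.
# The route name and payload/response keys are guesses, not the actual API.
import requests

resp = requests.post(
    "http://localhost:5000/api/predict",  # hypothetical route
    json={"claim": "Telemundo is an English-language television network."},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # expected: predicted label plus retrieved evidence sentences
```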
For running the demo, the following must be done for the frontend Angular UI:
docker run -p 4200:4200 -it mitchelldehaven/bevers_frontend
After both Docker containers are running, the demo is accessible at http://localhost:4200/.
There is a simple UI for demoing the model. The current setup is a lighter version of the one used for the best results, in order to reduce compute requirements.
For running the backend Flask API:
export DATASET=fever
export PYTHONPATH=.
python demo/src/app.py
For running the frontend Angular UI:
cd demo/frontend
npm i
ng serve
A short GIF showing the demo is included below, so you can see what it does without having to set up the demo yourself.