forked from FindZebra/fz-openqa

Variational Open-Domain Question Answering and Language Modelling


VodLM/vod-qa


Warning: work in progress.

Variational Open-Domain Question Answering


Abstract

Retrieval-augmented models have proven to be effective in natural language processing tasks, yet there remains a lack of research on their optimization using variational inference. We introduce the Variational Open-Domain (VOD) framework for end-to-end training and evaluation of retrieval-augmented models, focusing on open-domain question answering and language modelling. The VOD objective, a self-normalized estimate of the Rényi variational bound, is a lower bound on the task marginal likelihood and is evaluated using samples drawn from an auxiliary sampling distribution (cached retriever and/or approximate posterior). It remains tractable even for retriever distributions defined over large corpora. We demonstrate VOD's versatility by training reader-retriever BERT-sized models on multiple-choice medical exam questions. On the MedMCQA dataset, we outperform the domain-tuned Med-PaLM by +5.3% despite using 2,500× fewer parameters. Our retrieval-augmented BioLinkBERT model scored 62.9% on MedMCQA and 55.0% on MedQA-USMLE. Lastly, we show the effectiveness of our learned retriever component in the context of medical semantic search.
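To build intuition for the objective described above, the snippet below sketches a plain Monte Carlo estimate of the Rényi variational bound for a handful of sampled documents. The function name, the toy log-probabilities, and the pure-Python implementation are illustrative assumptions, not the repository's actual code: the paper's VOD objective uses a *self-normalized* variant of this estimator so that it stays tractable when the sampling distribution is only known up to a normalization constant over a large corpus.

```python
import math

def renyi_bound(log_p, log_q, alpha=0.5):
    """Monte Carlo estimate of the Rényi variational bound (illustrative sketch).

    log_p: log p(answer, document) for each sampled document (target model)
    log_q: log q(document) for each sample, under the auxiliary sampling
           distribution (e.g. a cached retriever)
    alpha: Rényi parameter in [0, 1); lower alpha gives a tighter bound
    """
    assert 0 <= alpha < 1, "this sketch assumes alpha in [0, 1)"
    # importance log-weights, raised to the power (1 - alpha)
    log_w = [(1 - alpha) * (lp - lq) for lp, lq in zip(log_p, log_q)]
    # numerically stable log-mean-exp over the samples
    m = max(log_w)
    log_mean = m + math.log(sum(math.exp(x - m) for x in log_w) / len(log_w))
    return log_mean / (1 - alpha)
```

At alpha = 0 this reduces to the importance-weighted (IWAE-style) bound, and as alpha → 1 it approaches the standard ELBO; the bound is monotonically non-increasing in alpha.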

Setup

  1. Set up the Python environment:

```shell
curl -sSL https://install.python-poetry.org | python -
poetry install
```
  2. Set up ElasticSearch:

```shell
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.14.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.14.1-linux-x86_64.tar.gz
```

To run ElasticSearch, navigate to the `elasticsearch-7.14.1` folder in the terminal and run `./bin/elasticsearch`.
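Before launching experiments, it can help to confirm that the ElasticSearch node is actually reachable. The sketch below assumes the standard default endpoint `http://localhost:9200`; adjust the host if your deployment differs. The function name is illustrative, not part of this repository.

```python
import json
import urllib.request

def elasticsearch_is_up(host="http://localhost:9200", timeout=5):
    """Return the node's info dict if ElasticSearch answers at `host`, else None."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # connection refused, timeout, or a non-JSON response
        return None
```

If the node is running, the returned dict contains cluster metadata (name, version, etc.); `None` means the host did not answer with valid JSON.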

  3. Run the main script:

```shell
poetry run python run.py
```

Citation

```bibtex
@misc{lievin2022vod,
  doi       = {10.48550/ARXIV.2210.06345},
  url       = {https://arxiv.org/abs/2210.06345},
  author    = {Liévin, Valentin and Motzfeldt, Andreas Geert and Jensen, Ida Riis and Winther, Ole},
  keywords  = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, I.2.7; H.3.3; I.2.1},
  title     = {Variational Open-Domain Question Answering},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

Credits

The package relies on:

  • Lightning to simplify training management, including distributed computing, logging, checkpointing, early stopping, and half-precision training
  • Hydra for clean experiment management (hyper-parameter configuration, etc.)
  • Weights and Biases for logging and experiment tracking
  • Poetry for strict dependency management and easy packaging
  • The original template was copied from ashleve
