This is the original implementation of the following paper.
Sewon Min, Danqi Chen, Hannaneh Hajishirzi, Luke Zettlemoyer. A Discrete Hard EM Approach for Weakly Supervised Question Answering. In: Proceedings of EMNLP (long). 2019
```
@inproceedings{ min2019discrete,
    title={ A Discrete Hard EM Approach for Weakly Supervised Question Answering },
    author={ Min, Sewon and Chen, Danqi and Hajishirzi, Hannaneh and Zettlemoyer, Luke },
    booktitle={ EMNLP },
    year={ 2019 }
}
```
You can use hard EM updates for any weakly-supervised QA task where a precomputed solution set can be obtained, and they can be used with any model architecture. This repository contains example code for open-domain question answering with a BERT QA model. The base code is from Huggingface's Pytorch-Transformers.
Code for other tasks is coming soon, stay tuned!
In the paper, we experiment on six QA datasets in three different categories.
- Multi-mention reading comprehension
  - Distantly-supervised reading comprehension: TriviaQA
  - Reading comprehension with free-form answers: NarrativeQA
  - Open-domain QA: TriviaQA-open, NaturalQuestions-open
- Discrete reasoning task: DROP
- Semantic parsing: WikiSQL
Below are the results reported in the paper (all on the test set).
Method | TriviaQA | NarrativeQA | TriviaQA-open | NaturalQuestions-open | DROP | WikiSQL
---|---|---|---|---|---|---
First-only | 64.9 | 57.4 | 48.1 | 23.6 | 42.9 | -
MML | 65.5 | 56.1 | 47.4 | 25.8 | 39.7 | 70.5
Hard-EM (Ours) | 67.1 | 58.8 | 50.9 | 28.1 | 52.8 | 83.9
SOTA | 71.4 | 54.7 | 47.1 | 26.5 | 43.8 | 74.8
SOTA results are from Wang et al. 2018, Nishida et al. 2019, Lee et al. 2019, Lee et al. 2019, Dua et al. 2019 and Agarwal et al. 2019, respectively.
- python 3.5
- PyTorch 1.1.0
Download Data and BERT, and unzip them in the current directory.
- BERT: BERT Base Uncased in PyTorch
- Preprocessed Data: Preprocessed open-domain QA datasets with paragraphs retrieved through TF-IDF and BM25 (details below).
- Data: Original data before preprocessing, which contains `id`, `question` and `answers` only (details below). This is not required for running the model, but is provided in case you want the data before preprocessing.
Then, you can run:
```bash
# NQ
./run.sh nq first-only
./run.sh nq mml
./run.sh nq hard-em 8000

# TriviaQA
./run.sh triviaqa first-only
./run.sh triviaqa mml
./run.sh triviaqa hard-em 4000
```
Here we release the preprocessed data and source code for our experiments on two open-domain QA datasets, NaturalQuestions-open (Kwiatkowski et al. 2019) and TriviaQA-open (Joshi et al. 2017).
For both datasets, we treat the dev set as the test set, and split the train set 90/10 for training and development, following the conventions also used in Chen et al. 2017 and Lee et al. 2019. For NaturalQuestions, following Lee et al. 2019, we take the subset of questions with short answers of up to 5 tokens.
You can download this data from here. Each datapoint contains
- `id`: a string, example id matching the original data
- `question`: a string
- `answers`: a list of strings
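
The 90/10 split and the short-answer filter described above are straightforward to reproduce. Below is a minimal sketch (not the exact script used for the paper), assuming the per-line JSON format just described; the random seed and whitespace tokenization are our own assumptions.

```python
import json
import random

def split_and_filter(path, max_answer_tokens=5, seed=42):
    """Split original data 90/10 into train/dev, keeping only questions
    with at least one short answer (the NaturalQuestions filter)."""
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    # Whitespace token count approximates the tokenization used in the paper.
    examples = [ex for ex in examples
                if any(len(a.split()) <= max_answer_tokens for a in ex["answers"])]
    random.Random(seed).shuffle(examples)  # seed is an arbitrary choice
    n_dev = len(examples) // 10
    return examples[n_dev:], examples[:n_dev]  # 90% train, 10% dev
```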
For preprocessing, we retrieve paragraphs for each question through TF-IDF (for document retrieval; using DrQA from Chen et al. 2017) and BM25 (for further paragraph retrieval). We filter out train examples where the retriever fails to retrieve any paragraph containing the answer text. Preprocessed data with retrieved paragraphs can be downloaded from here.
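
As an illustration of the BM25 step, here is a hedged sketch using the third-party `rank_bm25` package rather than the retrieval code used for the paper; the paragraph pool is assumed to come from the TF-IDF document retrieval stage, and `k=80` mirrors the largest value used with `--n_paragraphs` below.

```python
from rank_bm25 import BM25Okapi  # pip install rank_bm25

def retrieve_paragraphs(question, paragraphs, answers, k=80):
    """Rank candidate paragraphs for a question with BM25 and apply the
    train-time filter: drop the example if no paragraph contains an answer."""
    tokenized = [p.lower().split() for p in paragraphs]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(question.lower().split())
    ranked = [p for _, p in sorted(zip(scores, paragraphs),
                                   key=lambda x: -x[0])][:k]
    if not any(a.lower() in p.lower() for p in ranked for a in answers):
        return None  # at training time, this example would be filtered out
    return ranked
```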
To use your own data, each line of the data file should be a dictionary (decodable by json) containing
- `id`: example id
- `question`: question (string)
- `context`: a list of tokenized paragraphs, where each paragraph is a list of words (strings)
- `answers`: a list where the i-th item is a list of answer entries for the i-th paragraph of `context`; each answer entry is a dictionary containing (1) `text`: answer text, and (2) `word_start`/`word_end`: index of the first/last answer word in the paragraph
- `final_answers`: a list of answer texts for evaluation; `text` values in `answers` are always included in `final_answers`
Example:
```
{
  'id': 'user-input-0',
  'question': 'Which city is University of Washington located in?',
  'context': [["The", "University", "of", "Washington", "is", "a", "public", "research", "university", "in", "Seattle", ",", "Washington", ...],
              ["University", "of", "Washington", "has", "been", "affiliated", "with", "many", "notable", "alumni", "and", "faculty", ",", "including", ...]],
  'answers': [[{'text': 'Seattle', 'word_start': 10, 'word_end': 10}, {'text': 'Seattle, Washington', 'word_start': 10, 'word_end': 12}],
              []],
  'final_answers': ["Seattle", "Seattle, Washington"]
}
```
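
If you build this file yourself, a quick sanity check of the span indices can catch off-by-one errors. The following sketch is our own helper, not part of the repository:

```python
import json

def check_line(line):
    """Validate one line of a data file in the format described above."""
    ex = json.loads(line)
    assert len(ex["context"]) == len(ex["answers"])
    for paragraph, entries in zip(ex["context"], ex["answers"]):
        for entry in entries:
            span = paragraph[entry["word_start"]:entry["word_end"] + 1]
            # Detokenization is lossy, so only check the boundary words.
            assert entry["text"].startswith(span[0])
            assert entry["text"].endswith(span[-1])
            assert entry["text"] in ex["final_answers"]
```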
The model architecture is exactly the same as the model from Min et al. 2019. We only modify the loss function to implement the different learning objectives.
You can check the exact command lines for training and evaluating the model in `run.sh`. Some useful flags are as follows.
- `--train_batch_size`: batch size for training; experiments reported in the paper use a batch size of 192
- `--predict_batch_size`: batch size for evaluation
- `--loss_type`: learning method, one of (i) `first-only`, which only considers the first answer span, (ii) `mml`, which uses the maximum marginal likelihood objective, and (iii) `hard-em`, which uses the hard EM objective (our main objective)
- `--tau`: hyperparameter for the hard EM objective; it only matters when `loss_type` is `hard-em`; experiments reported in the paper use 4000 for TriviaQA-unfiltered and 8000 for NaturalQuestions
- `--init_checkpoint`: model checkpoint to load; for training it should be a BERT checkpoint, and for evaluation it should be a trained model
- `--output_dir`: directory for storing the trained model and predictions
- `--debug`: run the experiment with only the first 50 examples; useful for making sure the code runs
- `--eval_period`: interval at which to evaluate the model on the dev data
- `--n_paragraphs`: number of paragraphs per question for evaluation; you can specify multiple numbers (`"10,20,40,80"`) to see scores with different numbers of paragraphs
- `--prefix`: prefix for stored predictions during evaluation
- `--verbose`: show progress bars for loading data, training and evaluating
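
To make the three `--loss_type` options concrete, here is a minimal PyTorch sketch of the objectives over one question's precomputed candidate spans. This is a simplification, not the repository's exact code; in particular, the paper anneals from MML toward the hard objective during training, with `--tau` controlling the schedule.

```python
import torch

def qa_loss(span_log_probs, loss_type):
    """span_log_probs[i] = model log-likelihood of the i-th candidate span."""
    if loss_type == "first-only":
        # Only the first candidate span is treated as correct.
        return -span_log_probs[0]
    if loss_type == "mml":
        # Maximum marginal likelihood: -log sum_z p(z | x).
        return -torch.logsumexp(span_log_probs, dim=0)
    if loss_type == "hard-em":
        # Hard EM: train only on the currently most likely candidate,
        # i.e. -max_z log p(z | x).
        return -span_log_probs.max()
    raise ValueError("unknown loss_type: " + loss_type)
```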
For any questions, please contact Sewon Min or open a GitHub issue.