This is the code for the work *Study of Attention Mechanisms and Adversarial Training for Question Answering*.
- If your machine doesn't have a GPU, change `tensorflow-gpu==1.4.1` to `tensorflow==1.4.1` in `requirement.txt`.
- Run the startup script `./get_started.sh` to create a conda environment named `squad`. This downloads the GloVe word embeddings, downloads and pre-processes the SQuAD 1.1, SQuAD 2.0, and Adversarial SQuAD (AddSent) datasets, and stores them all in the `data` directory.
- Activate the created environment with `source activate squad`.
- The main script is `main.py`. You can check all the available parameters with `python main.py --help`.
- `python main.py --experiment_name=baseline --mode=train` should start training the baseline model.
- `python main.py --train_dir=baseline --mode=eval` will evaluate the model (reporting F1 and EM scores) trained in the experiment named `baseline`.
- `python main.py --train_dir=baseline --mode=show_examples` will output 10 randomly selected samples of (context, question, predicted answer, true answer) for the `baseline` model.
There are a couple of switches to configure the attention mechanism for the model:

- `--attention_model` takes two values: `uni-dir` (default) and `bi-dir`.
- `--attention_weight` takes two values: `weighted` and `unweighted` (default).

Two more switches control training on SQuAD 2.0:

- The `--eval_squad_2` flag sets the model to train on the SQuAD 2.0 dataset.
- `--na_bias` takes two values: `b` (default) and `w`, for the simple-bias and aggregated-bias variants described in the paper.
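For example, to train a bi-directional, weighted-attention model (the experiment name here is just a placeholder): `python main.py --experiment_name=bidir_weighted --attention_model=bi-dir --attention_weight=weighted --mode=train`.

To make the `--attention_model` values concrete, here is a minimal numpy sketch of what uni-directional (context-to-question only) versus bi-directional (context-to-question plus question-to-context) attention typically looks like in BiDAF-style QA models. The function, shapes, and output layout are illustrative assumptions, not this repository's exact implementation, and the sketch does not cover the `--attention_weight` variants:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(C, Q, attention_model="uni-dir"):
    """C: context hidden states (N, d); Q: question hidden states (M, d)."""
    S = C @ Q.T                            # similarity matrix, (N, M)
    c2q = softmax(S, axis=1) @ Q           # context-to-question attention, (N, d)
    if attention_model == "uni-dir":
        return np.concatenate([C, c2q], axis=1)
    # "bi-dir": additionally attend from the question back to the context (Q2C)
    b = softmax(S.max(axis=1))             # one weight per context position, (N,)
    q2c = np.tile(b @ C, (C.shape[0], 1))  # summary of C, broadcast to all positions
    return np.concatenate([C, c2q, C * c2q, C * q2c], axis=1)
```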
Big shout-out to the authors of cs224n-win18-squad! This code is based on it.
The reading comprehension task can expose a model's understanding of language and meaning. Below are some of the relevant datasets and the ways in which they fall short of evaluating a model's true understanding of language and meaning.
- Squad 1.1
  - SOTA models trained on this dataset fail miserably in the face of adversarial examples.
  - A simple heuristic-based model performs near SOTA, putting the increasingly complex models in perspective.
  - Contains only answerable questions, which forces the model to produce an answer even when no correct answer exists.
  - One could perturb the dataset by pairing the questions with other paragraphs to artificially generate unanswerable questions and make the model more robust (see the sketch after this list).
Possible experiment:

- Shuffle the sentences of each paragraph and study the behaviour of the models. Shuffling the sentences in the paragraph might change the answers or, in most cases, make the answer ambiguous (see the sketch below).
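Below is a minimal Python sketch of both ideas: shuffling the sentences of a context, and pairing questions with other paragraphs to manufacture unanswerable examples. The function names and the use of `nltk`'s `sent_tokenize` are illustrative assumptions, not part of this repository. Note that SQuAD answer spans are character offsets into the original context, so they would have to be re-located (or dropped) after shuffling:

```python
import random
from nltk.tokenize import sent_tokenize  # assumes nltk with the 'punkt' data installed

def shuffle_context(context, seed=0):
    """Shuffle the sentences of a paragraph; answer spans must be re-located afterwards."""
    sentences = sent_tokenize(context)
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)

def make_unanswerable(examples, seed=0):
    """Pair each question with the context of a different example, so the question
    (almost certainly) has no answer in its new paragraph."""
    rng = random.Random(seed)
    contexts = [ex["context"] for ex in examples]
    perturbed = []
    for ex in examples:
        other = rng.choice([c for c in contexts if c != ex["context"]])
        perturbed.append({"context": other, "question": ex["question"], "answers": []})
    return perturbed
```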
- Squad 2.0 (with unanswerable questions)
  - Makes the dataset harder by including questions with high token overlap with the context paragraph and with a plausible answer (of the same POS type) but no correct answer in the passage. This pushes the model towards a greater understanding of language than mere pattern matching.
Possible experiments:

- Evaluate current SOTA models on Squad 2.0 and find the percentage of negative examples that the model correctly deemed unanswerable (see the sketch at the end of this section). Then add answers to those correctly predicted negative examples and check whether the model answers them now. (Still need to concretely formulate how to add answers to previously unanswerable questions.)
- A variant of the BiDAF model was trained on both Squad 1.1 and Squad 2.0, and the two models were compared on the adversarial test set released by Jia et al. If the model trained on Squad 2.0 performs better, this supports the hypothesis that Squad 2.0 improves the language-understanding capabilities of the models.
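Here is a sketch of the first check above: compute the fraction of unanswerable questions that a model correctly leaves blank. It assumes the official SQuAD 2.0 JSON layout (an `is_impossible` field per question) and a predictions file mapping question ids to answer strings, with the empty string meaning "no answer":

```python
import json

def unanswerable_recall(dataset_path, pred_path):
    """Fraction of is_impossible questions for which the model predicted no answer."""
    with open(dataset_path) as f:
        data = json.load(f)["data"]
    with open(pred_path) as f:
        preds = json.load(f)  # {question_id: answer_string}; "" means no answer
    total = correct = 0
    for article in data:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    total += 1
                    if preds.get(qa["id"], "") == "":
                        correct += 1
    return correct / total if total else 0.0
```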