Repo of code and data for ECIR 2021 short paper "PGT: Pseudo Relevance Feedback Using a Graph-based Transformer"
Abstract: Most research on pseudo relevance feedback (PRF) has been done in vector space and probabilistic retrieval models. This paper shows that Transformer-based rerankers can also benefit from the extra context that PRF provides. It presents PGT, a graph-based Transformer that sparsifies attention between graph nodes to enable PRF while avoiding the high computational complexity of most Transformer architectures. Experiments show that PGT improves upon non-PRF Transformer rerankers, and it is at least as accurate as Transformer PRF models that use full attention, but with lower computational costs.
python 3.7.0
pytorch 1.6.0
dgl 0.5.2
transformers 1.0.0
Please download data.zip from our virtual appendix here. Unzip the file, and place the unzipped data folder directly in this repo (so it becomes pgt/data).
The unzipped data folder is structured as follows, where ${i} is the file index. The training set is too large to fit into RAM, so we break it into 26 blocks and read one block at a time to save working memory.
data
- trec
  - manual-qrels-pass.txt  # gold standard query relevance judgements for the TREC 2019 dev set
- top7  # the raw data and the tokenized data where 7 feedback documents are used (k=7)
  - bm25_train  # training data obtained using BM25 as the initial ranker
    - train.graph.top7.json${i}  # raw training data
  - bm25_test  # test data obtained using BM25 as the initial ranker
    - trec.top7.test.graph.json  # raw test data
    - pids.tsv  # qid \t pid (each qid corresponds to the 1000 top initial retrieval pids)
  - crm_test  # same as bm25_test, but with CRM as the initial ranker
  - ...
train.graph.top7.json${i} stores the graph inputs as JSON structures. Each line in the file is one graph corresponding to one candidate document of one query. For example:
{"qid": "597347", # query id
"query": "what color are the four circles ...",
"candidate": ["bee", "##t", "varieties", ...], # byte-pair-encoded candidate document
"label": 0, # binary relevance label of the candidate document
"node": [
{"node_id": 0,
"passage": ["each", "row", "should", ...], # byte-pair-encoded feedback document 0
"label": 0}, # binary relevance label of the feedback document (not used in our experiments)
{"node_id": 1, ...}
{"node_id": 2, ...},
...
{"node_id": 6, ...}
]
}
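For reference, the blocks can be read one graph at a time with the standard json module. The snippet below is a minimal sketch for inspecting the raw training data (the path follows the layout above; the iter_graphs helper is our own illustration, not part of the released code):

import json

def iter_graphs(path):
    # yield one candidate-document graph per line of a block file
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# inspect the first graph of training block 0
graph = next(iter_graphs("./data/top7/bm25_train/train.graph.top7.json0"))
print(graph["qid"], graph["label"])      # query id and binary label of the candidate
print(len(graph["node"]))                # 7 feedback documents when k=7
print(graph["node"][0]["passage"][:5])   # first byte-pair tokens of feedback document 0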
Tokenizing the data takes time. The following script tokenizes the training data and saves it into .cache files, which the software reads automatically once detected. You may run the script on CPU (which requires the CPU version of dgl) and in parallel across values of i.
for i in {0..25}
do
python ./main.py \
--cf=config.json \
--load_train \
--data_path=./data/top7/bm25_train/train.graph.top7.json${i}
done
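Because the blocks are independent, the loop above can also be parallelized. One way to do so from Python is sketched below (the wrapper and the worker count are our own illustration, not part of the repository):

import subprocess
from concurrent.futures import ProcessPoolExecutor

def tokenize_block(i):
    # same command as the shell loop above, for one block index i
    subprocess.run(
        ["python", "./main.py", "--cf=config.json", "--load_train",
         f"--data_path=./data/top7/bm25_train/train.graph.top7.json{i}"],
        check=True,
    )

if __name__ == "__main__":
    # 4 workers is an arbitrary choice; scale it to your CPU cores and RAM budget
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(tokenize_block, range(26)))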
Similarly, to tokenize test data, run:
python ./main.py \
--cf=config.json \
--load_test \
--data_path=./data/top7/bm25_test/trec.top7.test.graph.json
The following script trains the model in a distributed manner on 2 GPUs.
python -m torch.distributed.launch --nproc_per_node=2 ./main.py \
--checkpoint=25000 \
--train \
--cf=config.json \
--distributed
Single-GPU training is also possible with
python ./main.py \
--checkpoint=25000 \
--train \
--cf=config.json
The following script tests the model on the TREC dev set. Change config["system"]["test_data"] to use either BM25 or CRM as the initial ranker.
python ./main.py \
--cf=config.json \
--test \
--model_path=./top7/epoch1_final.pt \
--test_output=./top7/epoch1_final.out
Then run
python ./pred_to_ranking.py \
--prefix=./top7/epoch1_final \
--initial_ranker=bm25 \
--trec_eval_path=your_trec_eval_path/trec_eval \
--gold_path=./data/trec/manual-qrels-pass.txt
which produces ./top7/epoch1_final.csv, reporting MAP and NDCG scores at different reranking depths.
config.json sets the hyperparameters and dataset paths. config["ablation"] controls the graph structure as described in Section 4.3 of our paper. You can use one of the following five strings: base (PGT base), exp1 (base w/o pre d_c), exp2 (base w/o pre q, d_c), exp3 (base w/o node d_c), and exp4 (base w/o node q, d_c).
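Instead of editing config.json by hand, the two settings documented above can also be switched programmatically. The sketch below touches only config["ablation"] and config["system"]["test_data"]; the CRM test file name is an assumption that mirrors bm25_test:

import json

# load the existing configuration
with open("config.json") as f:
    config = json.load(f)

# pick one of: base, exp1, exp2, exp3, exp4 (graph structures from Section 4.3)
config["ablation"] = "base"

# point testing at the CRM-ranked data (file name assumed to mirror bm25_test)
config["system"]["test_data"] = "./data/top7/crm_test/trec.top7.test.graph.json"

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)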
We provide models trained using 7 feedback documents (k=7) with different graph structures here.