The KILT benchmark is described in the following paper:
```bibtex
@inproceedings{petroni-etal-2021-kilt,
    title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks",
    author = {Petroni, Fabio and Piktus, Aleksandra and
      Fan, Angela and Lewis, Patrick and
      Yazdani, Majid and De Cao, Nicola and
      Thorne, James and Jernite, Yacine and
      Karpukhin, Vladimir and Maillard, Jean and
      Plachouras, Vassilis and Rockt{\"a}schel, Tim and
      Riedel, Sebastian},
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.200",
    doi = "10.18653/v1/2021.naacl-main.200",
    pages = "2523--2544",
}
```
https://arxiv.org/abs/2009.02252
```bash
conda create -n kilt37 -y python=3.7 && conda activate kilt37
pip install -e .
```
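As a quick sanity check (not part of the official setup), you can verify that the package is importable:

```bash
# should print the path of the installed kilt package without errors
python -c "import kilt; print(kilt.__file__)"
```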
The KILT knowledge source can be downloaded here: [kilt_knowledgesource.json](http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json) (34.76 GiB).
It is based on the 2019/08/01 Wikipedia dump.
We use MongoDB to index the knowledge base (but you can use any JSON-based database).
To import the knowledge source into MongoDB, run:
```bash
wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json
mongoimport --db kilt --collection knowledgesource --file kilt_knowledgesource.json
```
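As a quick sanity check (an optional step, not part of the official instructions), you can count the imported documents from the mongo shell; the total should match the number of pages reported below (5,903,530):

```bash
# count documents in the kilt.knowledgesource collection
mongo kilt --eval "db.knowledgesource.count()"
```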
Each record in the knowledge source has the following structure:

```python
{
    'wikipedia_title': 'Email marketing',
    'wikipedia_id': 1101759,
    'text': ['p1', 'p2', ..., 'pn'],  # list of paragraph texts
    'anchors': [{"text":, "href":, "paragraph_id":, "start":, "end":}],  # hyperlinks
    'categories': 'comma separated list of categories',
    'history':  # some info from Wikipedia, including the original url
    'wikidata_info':  # wikidata info
}
```
```python
from kilt.knowledge_source import KnowledgeSource

# get the knowledge source
ks = KnowledgeSource()

# count entries - 5903530
ks.get_num_pages()

# get a page by id
page = ks.get_page_by_id(27097632)

# get a page by title
page = ks.get_page_by_title("Michael Jordan")
```
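The returned page is a dictionary with the fields described above. For illustration, a minimal sketch of how one might inspect it (assuming the record structure shown earlier):

```python
# inspect fields of the retrieved record
print(page["wikipedia_title"])      # "Michael Jordan"
print(page["wikipedia_id"])         # Wikipedia page id
print(len(page["text"]))            # number of paragraphs
for anchor in page["anchors"][:3]:  # first few hyperlinks in the page
    print(anchor["text"], "->", anchor["href"])
```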
```bash
mkdir data
python scripts/download_all_kilt_data.py
python scripts/get_triviaqa_input.py
```
You can also download and use the KILT data through HuggingFace's nlp library.
Note that we release only the input for the test sets, without answers. Test answers are used for the KILT challenge on EvalAI, where participants can upload their models' predictions and be listed on the public leaderboard (there are strict submission limits to discourage overfitting on test data).
Each data point follows this format:

```python
{
    'id':     # original data point id if available, otherwise a unique id
    'input':  # question / claim / sentence / etc.
    'output': [  # each element might contain an answer, a provenance, or both
        {
            'answer':  # answer in textual form
            'provenance': [
                # evidence set for the answer from the KILT knowledge source
                {
                    'wikipedia_id':        # *mandatory*
                    'title':
                    'section':
                    'start_paragraph_id':
                    'start_character':
                    'end_paragraph_id':
                    'end_character':
                    'bleu_score':          # wrt the original evidence
                    'meta':                # dataset/task specific
                }
            ]
        }
    ],
    'meta':  # dataset/task specific
}
```
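For illustration, a minimal sketch of how one might iterate over a KILT-formatted `.jsonl` file (the file name here is an assumption; any KILT dataset file has one JSON object per line in the format above):

```python
import json

# hypothetical file name; substitute any downloaded KILT dataset file
with open("data/nq-dev-kilt.jsonl") as f:
    for line in f:
        datapoint = json.loads(line)
        question = datapoint["input"]
        for output in datapoint["output"]:
            answer = output.get("answer")  # may be absent
            for prov in output.get("provenance", []):
                # 'wikipedia_id' is the only mandatory provenance field
                print(question, answer, prov["wikipedia_id"])
```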
* run `python scripts/get_triviaqa_input.py` to get the question associated with each id
For Entity Linking, in addition to the AIDA CoNLL-YAGO train set, the whole knowledge source can be used as training data by exploiting hyperlinks. To facilitate experimentation, we release such data in KILT format following the splits of BLINK:
- `blink-train-kilt.jsonl` (9M lines)
- `blink-dev-kilt.jsonl` (10,000 lines)
We also provide a script to map the TAC-KBP 2010 dataset to the knowledge source and format of KILT.
Please follow this README.
Mapping scripts are located in `kilt/datasets/`. See `scripts/map_datasets.py` for an example.
If the module cannot be found, preface the python command with `PYTHONPATH=.`
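For example (using one of the scripts above):

```bash
# run a script with the repository root on the module search path
PYTHONPATH=. python scripts/download_all_kilt_data.py
```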
If experiments fail with GPU out-of-memory errors, try reducing the batch size.
KILT is MIT licensed. See the LICENSE file for details.