In order to address the need for a high-quality QA dataset for Persian language, we propose a model for creating dataset for deep-learning-based QA systems. We deploy the proposed model to create PersianQuAD, the first native question answering dataset for the Persian language. PersianQuAD contains approximately 20,000 "question, paragraph, answer" triplets on Persian Wikipedia articles and is the first large-scale native QA dataset for the Persian language which is created by native annotators.
The proposed model consists of four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. We analysed PersianQuAD and showed that it contains questions of varying types and difficulties and hence, it is a good presenter of real-world questions in the Persian language. We built three QA systems using MBERT, ALBERT-FA and ParsBERT. The best system uses MBERT and achieves a F1 score of 82.97% and an Exact Match of 78.8%. The results show that the resulted dataset performs well for training deep-learning-based QA systems. We have made our dataset and QA models freely available and hope that it encourages the development of new QA datasets and systems for different languages, and leads to further advances in machine comprehension.
The dataset is available for download from the Dataset
directory. The statistics of the PersianQuAD is shown below:
Split | No. of questions | No. of Candidate Answers | Avg. of question length | avg. answer length |
---|---|---|---|---|
Train | 18567 | 1 | 10.7 | 2.6 |
Test | 1000 | 3 | 10.5 | 2.3 |
In the following, question type distribution over PersianQuAD dataset is illustrated:
Question Word | Distribution |
---|---|
What | 28.14% |
How | 15.24% |
When | 10.70% |
Where | 13.60% |
Who | 16.50% |
Which | 15.26% |
Why | 00.92% |
You can train and test the proposed model by running Main.ipynb
in the Google Colab enviroment. You must download the repository
and extract it to your Google Drive. Then, run Main.ipynb
by Google Colab and train your models.
We build three QA systems according to the pre-trained language models examined (MBERT, ALBERT-FA, ParsBERT). We trained each of the QA systems using the training part of PersianQuAD and evaluate them using the test part. We evaluate each of the QA systems according to two widely used automatic evaluation metrics Exact Match and F1.
Dataset | Model | Exact Match | F1 measure |
---|---|---|---|
PersianQuAD | Human | 95.00% | 96.49% |
PersianQuAD | Albert-FA | 74.90% | 79.25% |
PersianQuAD | ParsBERT | 73.80% | 79.08% |
PersianQuAD | MBERT | 78.80% | 82.97% |
A. Kazemi, J. Mozafari and M. A. Nematbakhsh, "PersianQuAD: The Native Question Answering Dataset for the Persian Language," in IEEE Access, vol. 10, pp. 26045-26057, 2022, doi: 10.1109/ACCESS.2022.3157289.
@ARTICLE{PersianQuAD-Access,
author={Kazemi, Arefeh and Mozafari, Jamshid and Nematbakhsh, Mohammad Ali},
journal={IEEE Access},
title={PersianQuAD: The Native Question Answering Dataset for the Persian Language},
year={2022},
volume={10},
number={},
pages={26045-26057},
doi={10.1109/ACCESS.2022.3157289}
}