ArabicaQA is a large-scale dataset designed to support and advance the development of Arabic Question Answering (QA) systems. It covers both Machine Reading Comprehension (MRC) and open-domain question answering, and is structured to facilitate the training, validation, and testing of Arabic QA models.
Try Our Demo here
```bash
# for inference (pins are indicative; any recent torch / transformers 4.x release should work)
pip install torch==1.13.1
pip install faiss-cpu==1.7.3
pip install transformers==4.30.0
```
To use our AraDPR for question answering, follow the steps below:
First, download the AraDPR model by cloning the repository:
```bash
git clone https://huggingface.co/abdoelsayed/AraDPR
```
After cloning, move the AraDPR model directory to `DPR/Model` within your project structure.
Next, download the DPR index required for running AraDPR:
```bash
git clone https://huggingface.co/abdoelsayed/AraDPR_index
```
Once downloaded, move the AraDPR index directory to `DPR/DPR_index` within your project structure.
Next, download the Wikipedia passages TSV file. Once downloaded, move `wikiAr.tsv` to the `wiki` directory within your project structure.
With the AraDPR model and index in place, you can run inference to answer questions. Edit the `inference.py` script to include your questions, or use the example provided in the script.
To run the inference, execute:
```bash
python inference.py
```
The inference results will be saved in `result.json`. Open this file to review the answers the AraDPR model gave to your questions.
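For reference, the sketch below shows the general shape of DPR-style retrieval that an `inference.py` of this kind performs. It is a minimal illustration, not the repository's script: it assumes AraDPR loads as a standard BERT-style encoder via `AutoModel`, that the FAISS index file is named as shown (check the cloned `AraDPR_index` repository for the actual file name), and that index rows correspond to the passage order in `wikiAr.tsv`.

```python
# Minimal DPR-style retrieval sketch (illustrative; use inference.py for real runs).
# Assumptions: AraDPR loads as a BERT-style encoder with a pooler, and the
# index file name below matches what the AraDPR_index repository contains.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DPR/Model")
encoder = AutoModel.from_pretrained("DPR/Model").eval()

index = faiss.read_index("DPR/DPR_index/index.faiss")  # placeholder file name

question = "..."  # your Arabic question here
with torch.no_grad():
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=256)
    q_emb = encoder(**inputs).pooler_output  # (1, hidden_size) question embedding

scores, passage_ids = index.search(q_emb.numpy(), 5)  # retrieve top-5 passages
print(list(zip(passage_ids[0].tolist(), scores[0].tolist())))
```

The retrieved ids can then be mapped back to passage text via the `id` column of `wikiAr.tsv`.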
ArabicaQA is divided into several segments to address different QA challenges:
- Machine Reading Comprehension (MRC): Contains questions paired with context paragraphs and annotated answer spans. It includes both answerable and unanswerable questions to mimic real-world scenarios where the given context may not contain an answer.
- Open-Domain QA: Designed for scenarios where questions are asked in an open context, encouraging models to retrieve relevant information from a broad dataset.
- Retriever Training Data: Offers structured data to train retriever models, which are crucial for identifying relevant context or documents from a large corpus.
| Category | Training | Validation | Test |
|---|---|---|---|
| MRC (with answers) | 62,186 | 13,483 | 13,426 |
| MRC (unanswerable) | 2,596 | 561 | 544 |
| Open-Domain | 62,057 | 13,475 | 13,414 |
| Open-Domain (Human) | 58,676 | 12,715 | 12,592 |
The MRC data is distributed as JSON files: `train.json`, `val.json`, and `test.json` for the training, validation, and testing phases, respectively, along with a metadata CSV file.
- Data Structure:
```json
{
  "data": [
    {
      "title": "",
      "paragraphs": [
        {
          "context": "",
          "qas": [
            {
              "question": "",
              "id": "",
              "answers": [
                {
                  "answer_start": 0,
                  "text": ""
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```
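As a quick sanity check, the nested structure can be flattened with a few lines of Python (file name as above; this assumes unanswerable questions simply carry an empty `answers` list, following the usual SQuAD-style convention):

```python
import json

# Flatten the MRC split into (id, question, answers) triples.
with open("train.json", encoding="utf-8") as f:
    articles = json.load(f)["data"]

for article in articles:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            answer_texts = [a["text"] for a in qa["answers"]]  # empty if unanswerable
            print(qa["id"], qa["question"], answer_texts)
```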
Available in both JSON and JSONL formats, this part of the dataset is annotated by humans for realistic QA scenarios.
- Data Structure:
```json
[
  {
    "question_id": "",
    "answer_id": "",
    "question": "",
    "answer": ""
  }
]
```
- JSON Format: Train | Validation | Test
- JSONL Format: Train | Validation | Test
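Reading either format is straightforward; here is a minimal sketch for the JSONL variant (the file name is an assumption):

```python
import json

# Each line of the JSONL file is one question/answer record.
with open("train.jsonl", encoding="utf-8") as f:
    qa_pairs = [json.loads(line) for line in f if line.strip()]

first = qa_pairs[0]
print(first["question_id"], first["question"], "->", first["answer"])
```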
This section provides data for training retrieval models, which are crucial for efficient information extraction and context identification.
- Data Structure:
```json
[
  {
    "question": "...",
    "answers": ["...", "...", "..."],
    "positive_ctxs": [
      {
        "title": "...",
        "text": "..."
      }
    ],
    "negative_ctxs": ["..."],
    "hard_negative_ctxs": ["..."]
  }
]
```
- Haystack Annotated: Train | Validation | Test
- Human Annotated: Train | Validation | Test
- CSV Format: Train | Validation | Test
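A record can be inspected as below (placeholder file name); each question comes with positive contexts and two flavors of negatives, the structure expected by DPR-style trainers:

```python
import json

# Inspect one retriever training record: a question plus its positive,
# negative, and hard-negative contexts.
with open("retriever_train.json", encoding="utf-8") as f:  # placeholder name
    records = json.load(f)

record = records[0]
print(record["question"])
print("positives:", len(record["positive_ctxs"]),
      "| negatives:", len(record["negative_ctxs"]),
      "| hard negatives:", len(record["hard_negative_ctxs"]))
```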
This section contains outputs from the retrieval models, showcasing the effectiveness of different retrieval strategies (DPR and BM25) in context selection.
- Data Structure:
```json
[
  {
    "question": "...",
    "answers": ["...", "..."],
    "ctxs": [
      {
        "id": "...",
        "title": "...",
        "text": "...",
        "score": "...",
        "has_answer": true|false
      }
    ]
  }
]
```
- DPR Output: Train | Validation | Test
- BM25 Output: Train | Validation | Test
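These files make it easy to score a retriever by top-k accuracy, i.e. the fraction of questions with at least one answer-bearing context among the top k results, using the `has_answer` flag. A minimal sketch (file name assumed):

```python
import json

def top_k_accuracy(path, k):
    """Fraction of questions with an answer-bearing context in the top-k."""
    with open(path, encoding="utf-8") as f:
        results = json.load(f)
    hits = sum(any(ctx["has_answer"] for ctx in r["ctxs"][:k]) for r in results)
    return hits / len(results)

# Example (placeholder file name):
# print(top_k_accuracy("dpr_test_output.json", 20))
```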
- Data Structure:
```
id	text	title
```
Each row contains three tab-separated columns: a passage `id`, the passage `text`, and the `title` of its source article.
- Wikipedia: TSV
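The passage file can be streamed with the standard `csv` module (path as used in the inference setup above):

```python
import csv

# Stream the Wikipedia passage file; each row is (id, text, title).
with open("wiki/wikiAr.tsv", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        print(row["id"], row["title"])
        break  # print only the first passage
```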
Will be available soon.
If you find our code or data useful, please consider citing our paper:
```bibtex
@inproceedings{10.1145/3626772.3657889,
  author = {Abdallah, Abdelrahman and Kasem, Mahmoud and Abdalla, Mahmoud and Mahmoud, Mohamed and Elkasaby, Mohamed and Elbendary, Yasser and Jatowt, Adam},
  title = {ArabicaQA: A Comprehensive Dataset for Arabic Question Answering},
  year = {2024},
  isbn = {9798400704314},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3626772.3657889},
  doi = {10.1145/3626772.3657889},
  abstract = {In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research https://github.com/DataScienceUIBK/ArabicaQA.},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages = {2049–2059},
  numpages = {11},
  keywords = {arabic question answering, information retrieval, llm, question generation},
  location = {Washington DC, USA},
  series = {SIGIR '24}
}
```