This repository collects Moroccan Darija datasets in one place so that you do not have to spend time hunting for them individually. Each dataset is listed with its name, data source, region, and size, to give you enough information to select the best dataset for the task at hand.
# | Dataset | Data source | Region | Size | Link | Reference
---|---|---|---|---|---|---
1 | Moroccan Arabic Sentiment Analysis Corpus | - | Maghrebi (Moroccan) | 2000 entries | source | 2018 [1]
2 | IADD: An integrated Arabic dialect identification dataset | Varied | Maghrebi, Levantine, Egyptian and Gulf | 135,804 texts | source | 2022 [2] |
3 | Dialectal Arabic Datasets | - | Maghrebi, Levantine, Egyptian and Gulf | 350 tweets per region | source | 2018 [3]
4 | MSDA Open Datasets | Social media posts | Arabic | - | source | 2020 [4] |
5 | Moroccan Dialect Darija Open Dataset | Open source contributions | Maghrebi (Moroccan) | More than 13K | source | 2021 [5] |
6 | Goud.ma: a News Dataset for Summarization in Moroccan Darija | goud.ma | Maghrebi (Moroccan) | 158k news articles | source | 2022 [6] |
7 | MNAD: Moroccan News Articles Dataset | Moroccan news websites | Maghrebi (Moroccan) | 418,563 documents | source | 2021 [7]
8 | QADI: QCRI Arabic Dialect Identification | - | Maghrebi, Levantine, Egyptian and Gulf | 540k tweets | source | 2020 [8]
9 | Dvoice : An open source dataset for Automatic Speech Recognition on Moroccan dialectal Arabic | Voice recordings + text transcriptions | Maghrebi (Moroccan) | 2392 training and 600 testing files | source | 2021 [9] |
10 | ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels | Images collected on different Moroccan highways, annotated manually. | Maghrebi (Moroccan) | 1763 images | source | 2020 [10] |
11 | OMCD: Offensive Moroccan Comments Dataset | A collection of comments from YouTube that have been labeled for offensive content. | Maghrebi (Moroccan) | 8024 comments written in Moroccan dialect | source | 2023 [11] |
12 | MORED: A Moroccan Buildings’ Electricity Consumption Dataset | A dataset that comprises electricity consumption data of various Moroccan premises | Maghrebi (Moroccan) | - | source | 2020 [12] |
13 | DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect | DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan Dialect, or Darija | Maghrebi (Moroccan) | 65,905 tokens | source | 2023 [13]
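
As a minimal sketch of how the table above might be used to shortlist datasets, the snippet below copies a few rows into an in-memory catalog and filters them by region and publication year. The `DatasetEntry` class, the `CATALOG` list, and both helper functions are hypothetical names introduced only for this illustration; they are not part of this repository or of any of the listed datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the table above (hypothetical structure, for illustration)."""
    name: str
    region: str
    size: str
    year: int


# A few rows copied by hand from the table; extend as needed.
CATALOG = [
    DatasetEntry("Moroccan Arabic Sentiment Analysis Corpus",
                 "Maghrebi (Moroccan)", "2000 entries", 2018),
    DatasetEntry("IADD", "Maghrebi, Levantine, Egyptian and Gulf",
                 "135,804 texts", 2022),
    DatasetEntry("Goud.ma", "Maghrebi (Moroccan)",
                 "158k news articles", 2022),
    DatasetEntry("QADI", "Maghrebi, Levantine, Egyptian and Gulf",
                 "540k tweets", 2020),
]


def moroccan_only(entries):
    """Keep entries whose Region column is exactly 'Maghrebi (Moroccan)'."""
    return [e for e in entries if e.region == "Maghrebi (Moroccan)"]


def published_since(entries, year):
    """Keep entries whose reference year is `year` or later."""
    return [e for e in entries if e.year >= year]


if __name__ == "__main__":
    for entry in published_since(moroccan_only(CATALOG), 2020):
        print(f"{entry.name}: {entry.size} ({entry.year})")
```

The filters are deliberately simple string and integer comparisons on the table's own columns, so they can be adapted to whichever criteria matter for a given task.
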
- [1] Ahmed Oussous, Ayoub Ait Lahcen, and Samir Belfkih. Improving sentiment analysis of Moroccan tweets using ensemble learning. In BDCA, 2018.
- [2] Jihad Zahir. IADD: An integrated Arabic dialect identification dataset. Data in Brief, 40:107777, 2022.
- [3] Kareem Darwish, Hamdy Mubarak, Ahmed Abdelali, Mohamed Eldesouki, Younes Samih, Randah Alharbi, Mohammed Attia, Walid Magdy, and Laura Kallmeyer. Multi-dialect Arabic POS tagging: A CRF approach. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), May 2018.
- [4] Elmehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun, Ikram Chairi, and Ismail Berrada. An open access NLP dataset for Arabic dialects: data collection, labeling, and model construction. MENACIS 2020 conference, in press.
- [5] Outchakoucht Aissam and Es-Samaali Hamza. Moroccan dialect -darija- open dataset, 2021.
- [6] Abderrahmane Issam, Khalil Mrini. Goud.ma: a News Article Dataset for Summarization in Moroccan Darija, 2022.
- [7] Jbene Mourad, Smail Tigani, Saadane Rachid, and Abdellah Chehri. A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization, 2021.
- [8] Ahmed Abdelali, Hamdy Mubarak, Younes Samih, Sabit Hassan, Kareem Darwish. Arabic Dialect Identification in the Wild, 2020.
- [9] Imade Benelallam, Anass Allak, and Abdou Mohamed Naira. Dvoice : An open source dataset for Automatic Speech Recognition on Moroccan dialectal Arabic, September 2021.
- [10] Mohammed Akallouch, Kaoutar Sefrioui Boujemaa, Afaf Bouhoute, Khalid Fardousse, and Ismail Berrada. ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels, 2020.
- [11] Kabil Essefar, Hassan Ait Baha, Abdelkader El Mahdaouy, Abdellah El Mekki, and Ismail Berrada. OMCD: Offensive Moroccan Comments Dataset, 2023.
- [12] Mohamed Aymane Ahajjam, Daniel Bonilla Licea, Chaimaa Essayeh, Mounir Ghogho, and Abdellatif Kobbane. MORED: A Moroccan Buildings’ Electricity Consumption Dataset, 2020.
- [13] Mousa, Hanane Nour; Mourhir, Asmaa (2023), “DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect”, Mendeley Data, V4, doi: 10.17632/286sss4k9v.4