PerSpellData is a comprehensive parallel dataset designed for the task of spell checking in Persian. Misspelled sentences are paired with their correct forms, and the misspellings are produced using a massive confusion matrix gathered from many sources. The dataset contains both informal and formal sentences, with texts from diverse topics. Both non-word and real-word errors are included.
Our approach is based on a large corpus of Persian texts in addition to the confusion matrix. A confusion matrix is a set of words that may mistakenly be replaced with each other, like ‘there’ and ‘their’ in English. We gathered a confusion matrix containing 2,072,396 pairs of words from various sources, which are explained below. Given the confusion matrix, we build our parallel dataset by replacing correct words in corpus sentences with the words they can be confused with.
The following table shows some statistics of PerSpellData:
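The replacement step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the `(correct, wrong)` tuple layout, whitespace tokenization, and the choice to inject exactly one error per generated sentence are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_confusion_index(pairs):
    """Map each correct word to the wrong forms it may be confused with.

    `pairs` is assumed to be an iterable of (correct, wrong) tuples;
    the real confusion matrix files may use a different layout.
    """
    index = defaultdict(list)
    for correct, wrong in pairs:
        index[correct].append(wrong)
    return index


def make_parallel_pairs(sentence, index, rng):
    """Return (wrong_sentence, correct_sentence) pairs for one sentence.

    One pair is produced per confusable token, with exactly one token
    replaced in each wrong sentence (an assumption of this sketch).
    """
    tokens = sentence.split()  # naive whitespace tokenization
    results = []
    for i, token in enumerate(tokens):
        candidates = index.get(token)
        if not candidates:
            continue  # token has no known confusable form
        corrupted = tokens.copy()
        corrupted[i] = rng.choice(candidates)
        results.append((" ".join(corrupted), sentence))
    return results


# Hypothetical toy data, reusing an English example from this README.
toy_index = build_confusion_index([("see", "sea")])
toy_pairs = make_parallel_pairs("I cannot see you", toy_index, random.Random(0))
print(toy_pairs)
```

Passing a seeded `random.Random` keeps the sampling reproducible when a word has several confusable forms.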
Errors | Confusion Matrix | PerSpellData |
---|---|---|
non-word errors | 643,849 | 3.8M |
real-word errors | 1,428,547 | 2.5M |
Total | 2,072,396 | 6.4M |
Examples of real-word and non-word errors in Persian and English:
Error category | Error type | English correct form | English wrong form | Persian correct form | Persian wrong form |
---|---|---|---|---|---|
non-word | insertion | This story is embracing | This storey is embracing | خوشبختانه همه هنوز دچار نشده اند | خوشبخنانه همه هنوز دچار نشده اند |
non-word | deletion | She is an actress | She is an acress | مردم آن شهر خیلی خسته بودند | مردم آن شهر خیی خسته بودند |
non-word | substitution | Tehran is the capital of Iran | Tehran is the capitol of Iran | ساعت هفت بیدار میشوم | صاعت هفت بیدار میشوم |
non-word | transposition | He is afraid of bears | He is afraid of bares | از آنجا تاکسی گرفتیم | از آنجا تاکسی گرتفیم |
real-word | insertion | Good jobs are found in big cities | Good jobs are found ink big cities | در این مکان اسکان کنید | در این مکان استکان کنید |
real-word | deletion | They live on their own | They live on their on | گرادیان این زاویه چند است؟ | گدایان این زاویه چند است؟ |
real-word | substitution | I cannot see you | I cannot sea you | این مبل گران است | این مبل میان است |
real-word | transposition | I live here | I live heer | این عدد بر مبنای دو است | ین عدد بر مبانی دو است |
real-word | same pronunciation | This is too much money | This is two much money | این میوه پرتقال است | این میوه پرتغال است |
real-word | word boundary | You can do it | Youcan do it | به خانه می روم | به خانه میروم |
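The two error categories differ in whether the wrong form is itself a valid word. A non-word error produces a string that is not in the language's vocabulary, while a real-word error produces another valid word. The sketch below illustrates that distinction only. The `error_category` helper and the tiny English lexicon are hypothetical and are not part of the dataset or its tooling.

```python
def error_category(wrong_word, lexicon):
    """Label a misspelling as 'real-word' or 'non-word'.

    `lexicon` is assumed to be a set of valid word forms; a real system
    would need a full vocabulary for the target language.
    """
    return "real-word" if wrong_word in lexicon else "non-word"


# Hypothetical toy lexicon for illustration only.
toy_lexicon = {"see", "sea", "actress", "too", "two"}

print(error_category("sea", toy_lexicon))     # 'sea' is a valid word
print(error_category("acress", toy_lexicon))  # 'acress' is not in the lexicon
```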
For some error types we provide two files: one is the confusion matrix and the other is the PerSpellData parallel corpus. All of PerSpellData is uploaded and can be downloaded. Here are the statistics and links for the different types of errors:
Type | Error-Type | Confused-words | PerSpellData |
---|---|---|---|
Real-word | Virastman's logs | 1,034 | 7,753 |
Real-word | Synthetic | 1,425,693 | 2,959,054 |
Real-word | Make informal plural again plural | 165 | 2,968 |
Real-word | Common mistakes | 87 | 847 |
Real-word | Gozar | 296 | 2,088 |
Real-word | Tanvin | 79 | 448 |
Non-word | Be | 515 | 1,520 |
Non-word | FaSepell | 5,063 | 8,953 |
Non-word | Virastman's logs | 136,164 | 467,946 |
Non-word | Close words | 502,107 | 1,440,854 |
Non-word | CPG | - | 707 |
If you use or discuss this dataset in your work, please cite our paper:
@inproceedings{persian-2021-romina-oji,
title = "PerSpellData: An Exhaustive Parallel Spell Dataset For Persian",
author = "Oji, Romina and Taghizadeh, Nasrin and Faili, Heshaam",
booktitle = "Proceedings of The Second International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2021) co-located with ICNLSP 2021",
month = "12--13 " # nov,
year = "2021",
address = "Trento, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.nsurl-1.2",
pages = "8--14",
}
If you have any technical questions regarding the dataset or the publication, please create an issue in this repository.