The dataset of Kuzgunlar Turkish Electra NER Model
Sahin, H. Bahadir; Eren, Mustafa Tolga; Tirkaz, Caglar; Sonmez, Ozan; Yildiz, Eray (2017), “English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset”, Mendeley Data, v1 http://dx.doi.org/10.17632/cdcztymf4k.1
It was created by reducing the classes of the dataset above to 48. It is shared on Kaggle because of GitHub file size restrictions.
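The class reduction described above can be pictured as a lookup from fine-grained labels to a smaller target set. The following is a minimal sketch only: the label names, the `LABEL_MAP` table, and the `reduce_labels` helper are hypothetical and are not taken from the dataset or from the authors' preprocessing, which is not described here.

```python
from typing import Dict, List

# Hypothetical mapping: these label names are illustrative placeholders,
# not the actual classes of the source dataset or of the reduced 48-class set.
LABEL_MAP: Dict[str, str] = {
    "PERSON_POLITICIAN": "PERSON",
    "PERSON_ATHLETE": "PERSON",
    "LOCATION_CITY": "LOCATION",
    "LOCATION_COUNTRY": "LOCATION",
}


def reduce_labels(tags: List[str], label_map: Dict[str, str], default: str = "O") -> List[str]:
    """Map each fine-grained tag to its reduced class.

    The outside tag "O" is kept as-is; tags missing from the mapping
    fall back to `default`.
    """
    reduced = []
    for tag in tags:
        if tag == "O":
            reduced.append("O")
        else:
            reduced.append(label_map.get(tag, default))
    return reduced


example = reduce_labels(["PERSON_ATHLETE", "O", "LOCATION_CITY", "UNKNOWN"], LABEL_MAP)
```

Sending unmapped tags to a default value is one possible design choice; raising an error instead would make gaps in the mapping visible earlier.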
The dataset of Kuzgunlar Turkish Electra Question-Answer Model
It was prepared from Wikipedia content for use with TQUAD, an open-source Turkish question-answering dataset.
It is a sentence dataset created by processing ~251 GB of online Turkish PDF data for masked LM applications. It is shared on Kaggle because of its file size.
- Using TurkishDeasciifier, words with misspelled Turkish characters were restored to their correct forms.
- Using Zemberek, only sentences in which at least 80% of the words are Turkish were kept. This was meant to filter out, as far as possible, non-Turkish PDFs that had been overlooked and quotations in foreign languages.
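The 80% threshold above can be sketched as a per-sentence ratio check. This is an illustration under stated assumptions, not the authors' pipeline: Zemberek is a Java library, so the word check is passed in here as a generic `is_turkish_word` callable, and `TOY_VOCAB` with `toy_is_turkish` is a hypothetical stand-in for a real morphological analysis. The tokenization (whitespace split plus punctuation stripping) is also an assumption.

```python
from typing import Callable, Iterable, List


def turkish_ratio(sentence: str, is_turkish_word: Callable[[str], bool]) -> float:
    """Return the fraction of tokens that the predicate accepts as Turkish."""
    # Split on whitespace and strip common punctuation from token edges.
    tokens = [t.strip(".,;:!?\"'()") for t in sentence.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    accepted = sum(1 for t in tokens if is_turkish_word(t))
    return accepted / len(tokens)


def filter_sentences(
    sentences: Iterable[str],
    is_turkish_word: Callable[[str], bool],
    threshold: float = 0.8,
) -> List[str]:
    """Keep only sentences whose Turkish word ratio is at least `threshold`."""
    return [s for s in sentences if turkish_ratio(s, is_turkish_word) >= threshold]


# Hypothetical stand-in for a morphological analyzer: a tiny fixed vocabulary.
TOY_VOCAB = {"bu", "bir", "cümle", "türkçe", "ve", "kitap"}


def toy_is_turkish(word: str) -> bool:
    return word.lower() in TOY_VOCAB


sentences = [
    "Bu bir Türkçe cümle.",
    "This is an English sentence.",
    "Bu bir kitap ve the end.",
]
kept = filter_sentences(sentences, toy_is_turkish)
```

With the toy vocabulary, only the first sentence passes: the third has four accepted tokens out of six, which falls below the 0.8 threshold.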