kuzgnlar/datasets

Kuzgunlar Turkish Datasets

GPL 3.0

NER

The dataset used for the Kuzgunlar Turkish Electra NER model. It is derived from:

Sahin, H. Bahadir; Eren, Mustafa Tolga; Tirkaz, Caglar; Sonmez, Ozan; Yildiz, Eray (2017), “English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset”, Mendeley Data, v1 http://dx.doi.org/10.17632/cdcztymf4k.1

It was created by reducing the entity classes of the dataset cited above to 48 classes. It is shared on Kaggle because of GitHub file size restrictions.

Question Answer

The dataset used for the Kuzgunlar Turkish Electra Question-Answer model.

It was prepared from Wikipedia content for use together with TQUAD, an open-source Turkish question-answer dataset.

Sentence

A sentence dataset created by processing about 251 GB of Turkish PDF data collected online, intended for masked LM applications. It is shared on Kaggle because of its file size.

  • Using TurkishDeasciifier, words misspelled without their Turkish characters were restored to their original forms.

  • Using Zemberek, only sentences in which at least 80% of the words are Turkish were kept. This was done to exclude, as far as possible, non-Turkish PDFs that had been overlooked and quotations in foreign languages.
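
The 80% rule above can be illustrated with a minimal sketch. This is not the repository's actual pipeline: the function names `turkish_ratio` and `filter_sentences`, the tokenization by whitespace, and the toy lexicon are all assumptions made for illustration. In practice the `is_turkish_word` check would be backed by a morphological analyzer such as Zemberek; here it is a plain callable so the example stays self-contained.

```python
from typing import Callable, Iterable, List

PUNCTUATION = ".,;:!?\"'()[]"


def turkish_ratio(sentence: str, is_turkish_word: Callable[[str], bool]) -> float:
    """Return the fraction of whitespace-separated words judged to be Turkish."""
    words = [w.strip(PUNCTUATION) for w in sentence.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return 0.0
    return sum(1 for w in words if is_turkish_word(w)) / len(words)


def filter_sentences(
    sentences: Iterable[str],
    is_turkish_word: Callable[[str], bool],
    threshold: float = 0.8,  # "80% and above", as described in the text
) -> List[str]:
    """Keep only sentences whose Turkish word ratio meets the threshold."""
    return [s for s in sentences if turkish_ratio(s, is_turkish_word) >= threshold]


# Hypothetical stand-in for a real analyzer: a tiny hand-written lexicon.
TOY_LEXICON = {"bu", "bir", "türkçe", "cümledir", "kitap", "çok", "güzel"}


def toy_is_turkish(word: str) -> bool:
    return word.lower() in TOY_LEXICON


sentences = [
    "Bu bir Türkçe cümledir.",       # 4 of 4 words recognized
    "This is an English sentence.",  # 0 of 5 words recognized
    "Bu kitap very güzel",           # 3 of 4 words recognized, below threshold
]
kept = filter_sentences(sentences, toy_is_turkish)
```

With the toy lexicon, only the first sentence is kept. Passing the word check in as a callable keeps the threshold logic independent of whichever analyzer is used.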
