Kuzgunlar Turkish Datasets

NER

The dataset for the Kuzgunlar Turkish Electra NER model.

Sahin, H. Bahadir; Eren, Mustafa Tolga; Tirkaz, Caglar; Sonmez, Ozan; Yildiz, Eray (2017), “English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset”, Mendeley Data, v1 http://dx.doi.org/10.17632/cdcztymf4k.1

It was created by reducing the classes of the dataset cited above to 48 classes. It is shared on Kaggle because of GitHub file size restrictions.
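Class reduction of this kind can be sketched as a lookup from fine-grained labels to a smaller label set. The snippet below is only an illustration: the label names, the mapping, and the `reduce_labels` helper are hypothetical and are not taken from the dataset or from the authors' preprocessing.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from fine-grained labels to coarser classes.
# These names are placeholders, not the dataset's real label set.
LABEL_MAP: Dict[str, str] = {
    "CITY_FINE": "LOCATION",
    "SOCCER_CLUB_FINE": "ORGANISATION",
}


def reduce_labels(
    tagged: List[Tuple[str, str]],
    mapping: Dict[str, str],
    outside: str = "O",
) -> List[Tuple[str, str]]:
    """Map each (token, label) pair onto the reduced label set.

    Labels missing from the mapping fall back to the outside tag.
    """
    return [(token, mapping.get(label, outside)) for token, label in tagged]


example = [
    ("Ankara", "CITY_FINE"),
    ("'da", "O"),
    ("Galatasaray", "SOCCER_CLUB_FINE"),
]
reduced = reduce_labels(example, LABEL_MAP)
```

Sending unmapped labels to the outside tag is one possible design choice; dropping such tokens or raising an error would be equally reasonable depending on how the reduced set is defined.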

Question Answer

The dataset for the Kuzgunlar Turkish Electra Question-Answer model.

It was prepared from Wikipedia content for use alongside TQUAD, an open-source Turkish question-answer dataset.
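As a rough sketch of how such data might be read, the snippet below walks a SQuAD-style JSON structure and yields question, answer, and context triples. It assumes the files follow the SQuAD v1.1 layout (`data` → `paragraphs` → `qas` → `answers`), which is not confirmed here; the `iter_qa_pairs` helper and the sample record are made up for illustration.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_qa_pairs(dataset: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (question, answer_text, context) from a SQuAD-style dict.

    Answers whose recorded offset does not match the context are skipped.
    """
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    text = answer["text"]
                    # int() guards against offsets stored as strings.
                    start = int(answer["answer_start"])
                    if context[start:start + len(text)] != text:
                        continue
                    yield qa["question"], text, context


# Made-up sample record in the assumed layout, not taken from the dataset.
SAMPLE = {
    "data": [{
        "title": "Ankara",
        "paragraphs": [{
            "context": "Ankara, Türkiye'nin başkentidir.",
            "qas": [{
                "id": "sample-1",
                "question": "Türkiye'nin başkenti neresidir?",
                "answers": [{"text": "Ankara", "answer_start": 0}],
            }],
        }],
    }]
}

pairs = list(iter_qa_pairs(SAMPLE))
```

Checking each answer against its offset is a cheap way to catch misaligned spans before training an extractive model.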

Sentence

A sentence dataset for masked LM applications, created by processing ~251 GB of Turkish PDF data collected online. It is shared on Kaggle because of its file size.

Words misspelled with ASCII substitutes for Turkish characters were restored to their original spelling using TurkishDeasciifier.
Using Zemberek, only sentences in which at least 80% of the words are Turkish were kept. This was done to exclude, as far as possible, overlooked non-Turkish PDFs and quotations in foreign languages.
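The 80% rule can be sketched as a ratio filter over the words of each sentence. This is a minimal illustration, not the authors' pipeline: the word check is passed in as a plain predicate, and the toy lexicon below merely stands in for a real morphological analyzer such as Zemberek. The names `turkish_ratio`, `filter_sentences`, and `toy_is_turkish` are hypothetical.

```python
import re
from typing import Callable, Iterable, List

# Runs of letters only (Unicode-aware), so digits and punctuation are ignored.
WORD_RE = re.compile(r"[^\W\d_]+")


def turkish_ratio(sentence: str, is_turkish_word: Callable[[str], bool]) -> float:
    """Return the fraction of words in the sentence accepted by the predicate."""
    words = WORD_RE.findall(sentence)
    if not words:
        return 0.0
    hits = sum(1 for word in words if is_turkish_word(word))
    return hits / len(words)


def filter_sentences(
    sentences: Iterable[str],
    is_turkish_word: Callable[[str], bool],
    threshold: float = 0.8,
) -> List[str]:
    """Keep sentences whose Turkish word ratio is at or above the threshold."""
    return [s for s in sentences if turkish_ratio(s, is_turkish_word) >= threshold]


# Toy stand-in for an analyzer-backed check; a real setup would ask Zemberek.
TOY_LEXICON = {"bu", "bir", "türkçe", "cümledir"}


def toy_is_turkish(word: str) -> bool:
    return word.lower() in TOY_LEXICON


candidates = [
    "Bu bir Türkçe cümledir.",
    "This is an English sentence.",
    "Bu bir book.",
]
kept = filter_sentences(candidates, toy_is_turkish)
```

With the toy lexicon, only the first sentence passes: the third contains two Turkish words out of three, which falls below the 0.8 threshold.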

