Resources

5 000 yes-no questions
Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
Machine translated
Can also be used for summarization
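
The (question, passage, answer) triplet with an optional page title can be modelled as a small record type. The following is a minimal sketch, not the dataset's actual schema: the class name `YesNoExample`, the field names, the `as_model_input` helper, and the sample values are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class YesNoExample:
    """One yes-no QA example; field names are hypothetical."""
    question: str
    passage: str
    answer: bool                 # True = "yes", False = "no"
    title: Optional[str] = None  # optional page title used as extra context

    def as_model_input(self) -> str:
        """Join the optional title, the passage and the question into one string."""
        context = f"{self.title}. {self.passage}" if self.title else self.passage
        return f"{context}\n{self.question}"


# Invented sample, not taken from the dataset.
example = YesNoExample(
    question="Je Bratislava hlavné mesto Slovenska?",
    passage="Bratislava je hlavné mesto Slovenskej republiky.",
    answer=True,
    title="Bratislava",
)
model_input = example.as_model_input()
```

Keeping the title optional mirrors the description above: an example remains valid when only the question, passage, and answer are present.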

Morpho-syntactic

Corpus of Slovak legislative documents (Korpus právnych predpisov v slovenčine)

The corpus contains the texts of Slovak legal regulations (both current and past). In addition to automatic lemmatization and morphological annotation, the corpus is also syntactically annotated.
Citation: GARABÍK, Radovan: Corpus of Slovak legislative documents. Jazykovedný časopis, 2022, Vol. 73, No 2, pp. 175-189.
45 million tokens

Morphological vocabulary (old version)

form, lemma, POS+MSD (SNK)
source: SNK

Morphological vocabulary (web demo)

form, lemma, POS+MSD (SNK)
source: SNK

Lemmatization, Morphological analysis, disambiguation (tagging), web demo

form, lemma, POS+MSD
source: JÚĽŠ SAV

Lemmatization, Morphological analysis, disambiguation (tagging), web API

form, lemma, POS+MSD
source: JÚĽŠ SAV

Lemmatization, Morphological analysis, disambiguation (tagging), diacritics-less Slovak, web demo

form, lemma, POS+MSD
source: JÚĽŠ SAV

Slovak Dependency Treebank

tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
manual annotation
format: conllu, PDT tagset
source: SNK

Reference:

Gajdošová, Katarína; Šimková, Mária; et al., 2016, Slovak Dependency Treebank, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1822.
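
The treebank is distributed in the CoNLL-U format, where each token is one line of ten tab-separated fields, comment lines start with `#`, and sentences are separated by blank lines. The following is a minimal reading sketch using only the standard library. The function name `parse_conllu` and the sample sentence are made up for illustration (the sample is not taken from the treebank, and its XPOS and FEATS columns are left empty as `_`).

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment, skipped here
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:                      # file may not end with a blank line
        yield sentence


# Invented sample sentence for illustration only.
SAMPLE = (
    "# text = Pes spí.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspí\tspať\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(parse_conllu(SAMPLE))
lemmas = [token["lemma"] for token in sentences[0]]
```

For real work, a dedicated CoNLL-U library is likely a better choice, since this sketch ignores comment metadata and does not treat multiword-token ranges or empty nodes specially.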

Slovak Universal Dependencies

A conversion of the Slovak Dependency Treebank to the Universal Dependencies tagset.

GitHub page
tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
manual annotation
format: conllu, UD tagset
source: SNK

Reference:

Zeman, Daniel. (2017). Slovak Dependency Treebank in Universal Dependencies. Journal of Linguistics/Jazykovedný časopis. 68. 10.1515/jazcas-2017-0048.

Artificial Treebank with Ellipsis

tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
format: conllu
source: Slovak UD, SNK

MULTEXT-East free lexicons 4.0

form, lemma, POS (Multext East)

Parallel

ŠarišSet

Corpus of the Šariš dialect
4.7k examples.
authors: Viktória Ondrejová and Marek Šuppa

OpenSubtitles

62 languages, 1,782 bitexts
Slovak part contains 100 mil. tokens

VoxPopuli

source: Europarl
speech, vectors, language

Czech-Slovak Parallel Corpus

automatic POS (SNK)
source: Acquis, Europarl, EU-journal, EC-Europa, OPUS

English-Slovak Parallel Corpus

automatic POS (SNK)
source: Acquis, Europarl, EU-journal, EC-Europa, OPUS

MULTEXT-East "1984" annotated corpus

sentence aligned, POS
Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian
source: "1984" novel

Reference:

Erjavec, Tomaž; et al., 2010, MULTEXT-East "1984" annotated corpus 4.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1043.

Paracrawl

Parallel web corpus with a Slovak part
3.3 million English-Slovak sentences

WikiMatrix

Unsupervised processing of Wikipedia to obtain parallel corpora
Uses LASER embeddings.
85 different languages, 1620 language pairs, 134M parallel sentences, out of which 34M are aligned with English

Semantic textual similarity

STSB-sk

Machine translated by OPUS-en-sk model
Sentence similarity dataset: each example contains two sentences and a floating-point target between 0 and 5, where a higher number means higher similarity. The dataset contains train: 5 749, validation: 1 500 and test: 1 379 examples.
Referenced from this report by J. Agarský.
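
A record in a dataset of this shape (two sentences plus a similarity score between 0 and 5) can be sketched as below. The class name `SentencePair`, the field names, and the sample sentences are assumptions for illustration and may not match the dataset's actual column names. Rescaling the score to the range 0 to 1 is a common convenience when comparing against cosine similarity, not something the dataset itself prescribes.

```python
from dataclasses import dataclass

MAX_SCORE = 5.0  # upper bound of the similarity scale described above


@dataclass(frozen=True)
class SentencePair:
    """One similarity example; field names are hypothetical."""
    sentence1: str
    sentence2: str
    score: float  # 0 = unrelated, 5 = equivalent in meaning

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0-5 range.
        if not 0.0 <= self.score <= MAX_SCORE:
            raise ValueError(f"score must be in [0, {MAX_SCORE}], got {self.score}")

    def normalized_score(self) -> float:
        """Rescale the 0-5 score to the 0-1 range."""
        return self.score / MAX_SCORE


# Invented sample pair, not taken from the dataset.
pair = SentencePair(
    sentence1="Muž hrá na gitare.",
    sentence2="Muž hrá na hudobnom nástroji.",
    score=2.5,
)
normalized = pair.normalized_score()
```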

Sentiment

Sentiment Analysis Data for the Slovak Language

5k items
positive and negative class
Reference: Samuel Pecar, Marian Simko, and Maria Bielikova. 2019. Improving Sentiment Classification in Slovak Language. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 114–119, Florence, Italy. Association for Computational Linguistics.

Slovak_sentiment

Unknown/undocumented source
positive/negative

Twitter sentiment for 15 European languages

source: Twitter
3 categories - positive, negative, neutral

SentiGrade

The dataset contains a total of 1 588 Slovak-language comments from various Facebook pages. The texts are annotated with 5 categories.

STS2-sk

Machine translated
Sentiment analysis dataset, binary classification task: positive sentiment, negative sentiment. It includes reviews from 7 categories with positive, neutral and negative sentiment labels.
Source: Slovakbert auxiliary repository by Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik, 2021.
Referenced from this report by J. Agarský.

sk csfd movie reviews

CSFD Movie Reviews
25k items

Hate Speech

Hate Speech Slovak

13k items
Crowdsourced hate and offensive speech in Facebook comments
binary classification

Fact Checking

MultiClaim

Multilingual fact checking database with Slovak part
Contains 28k posts in 27 languages from social media, 206k fact-checks in 39 languages written by professional fact-checkers, as well as 31k connections between these two groups.

Demagog

9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)

qacg-sk

Machine translated facts with evidence represented as references to Wikipedia pages.
350k items

Instructions

SlovAlpaca

Machine translation of the Stanford Alpaca
40k annotations

Named Entity Recognition

CNEC 2.0 cs2sk

CNEC 2.0 Czech model machine translated to Slovak & filtered
CNEC entity hierarchy
source: JÚĽŠ SAV

NER web demo

models: CNEC 2.0 cs2sk, morphodita SNK
source: JÚĽŠ SAV, SNK

Universal NER Slovak

8.48k sentences
Annotated by a large language model
PER, ORG, LOC annotations

Universal NER (UNER) Slovak SNK

8.48k train, 1k dev and 2k test sentences from Universal Dependencies
Human annotated by 2 annotators as part of the Universal NER project
PER, ORG, LOC annotations

WikiGold

10k manually annotated items from Wikipedia

ju-bezdek/conll2003-SK-NER

Translated version of the original CONLL2003 dataset (translated from English to Slovak via Google Translate)

Polyglot NER

automatically annotated Wikipedia for Named Entities
massively multilingual
Slovak part has 500k sentences.
Reference: Al-Rfou, Rami, et al. "Polyglot-NER: Massive multilingual named entity recognition." Proceedings of the 2015 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2015.

Cross-lingual Name Tagging and Linking for 282 Languages

download data
automatic annotation
source: Wikipedia

Contextualized Language Model-based Named Entity Recognition in Slovak Texts

Manually annotated set
Diploma thesis at Comenius University
PER, ORG, LOC, MISC annotations
approx. 7k sentences.

Spelling

CHIBI

corpus of spelling errors created from edits in Wikipedia
spelling errors are sorted into 5 categories

Wordnet

Slovak Wordnet

Summarization

Eur Lex Sum

Multilingual Summarization Dataset
Slovak part has 1.3k rows.

A Summarization Dataset of Slovak News Articles

200k news article summaries
Reference: Marek Suppa and Jergus Adamec. 2020. A Summarization Dataset of Slovak News Articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6725–6730, Marseille, France. European Language Resources Association.
