- Contains floret word vectors.
- Tagger module uses Slovak National Corpus Tagset.
- Morphological analyzer uses the Universal Dependencies tagset and is trained on the Slovak Dependency Treebank.
- Lemmatizer is trained on the Slovak Dependency Treebank.
- Named entity recognizer is trained separately on WikiAnn database.
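A minimal sketch of using such a spaCy pipeline. The trained Slovak package is not named above, so this uses only `spacy.blank("sk")`, which provides the rule-based Slovak tokenizer without the tagger, lemmatizer, or NER components:

```python
# Sketch: tokenizing Slovak text with spaCy's blank pipeline.
# spacy.blank("sk") gives only the rule-based tokenizer; the full
# trained pipeline described above would add tagging, lemmas and NER.
import spacy

nlp = spacy.blank("sk")
doc = nlp("Ahoj, ako sa máš?")
tokens = [t.text for t in doc]
print(tokens)
```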
- word form, POS+lemma, fastText embeddings
- source: JÚĽŠ + SNK (prim), also from older prim-* corpora
- description: https://www.juls.savba.sk/semä.html
- source: Wikipedia, Common Crawl
- source: Common Crawl
- source: Wikipedia
- Language-agnostic BERT Sentence Embedding (LaBSE) is a BERT-based model trained to produce sentence embeddings for 109 languages.
- Language-agnostic sentence embeddings.
- Multilingual document embeddings, based on Sentence Transformers.
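Sentence and document embedding models like these are typically compared by cosine similarity. A self-contained sketch, using random dummy vectors as stand-ins for real model outputs (LaBSE, for example, produces 768-dimensional sentence vectors):

```python
# Sketch: comparing sentence embeddings by cosine similarity.
# The vectors below are dummy stand-ins for real embedding model outputs.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)                 # stand-in for embed(sentence_1)
emb_b = emb_a + 0.1 * rng.normal(size=768)   # a slightly perturbed "paraphrase"
print(round(cosine(emb_a, emb_b), 3))
```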
- A Slovak version of the Mistral-7B-v0.1 large language model with 7 billion parameters.
- Obtained by full-parameter fine-tuning of Mistral-7B-v0.1 on data from the Araneum Slovacum VII Maximum web corpus.
- Monolingual Slovak T5 model with 300 million parameters
- Trained from scratch on a large web corpus
- Slovak RoBERTa base language model
- trained on a web corpus
- Slovak BERT by Ardevop SK
- Slovak T5 small, created by fine-tuning mT5 small.
- VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
- Facebook's Wav2Vec2 base model, pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on the transcribed Slovak (sk) data
- multilingual BERT, trained on Wikipedia
- Bidirectional translation models between Slovak and multiple other languages
- Also available for HF Transformers
- Contains SentencePiece tokenization models
- For MarianNMT
- English, German, Finnish, French, Swedish
- Multilingual translation model for Fairseq
- Also provides language detection models
- Original Fairseq repo
- HuggingFace Transformers integration - distilled 600M version
- Uses T5 architecture
- https://arxiv.org/abs/2309.04662
- Supports 400 languages, including Slovak
- Previously used for Google Translate
- Multilingual translation model with Slovak support.
- Built for Fairseq
- HuggingFace Transformers model
- Flores101: Large-Scale Multilingual Machine Translation
- Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition.
- Includes Slovak language
- For fairseq
- Spelling Dictionary
- List of common names, abbreviations, pejoratives and neologisms.
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in Python/PyTorch, command-line interface, web service interface
- license: Apache v2.0
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in Python/dyNET, command-line interface, web service interface
- license: Apache v2.0
- tokenization, stemming
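What tokenization plus stemming means in practice can be sketched with the standard library alone. The suffix list below is illustrative, not the tool's actual rules:

```python
# Sketch: regex tokenization and naive suffix-stripping stemming.
# The Slovak suffix list is illustrative, not a real stemmer's rules.
import re

SUFFIXES = ["ami", "och", "ov", "mi", "ou", "a", "e", "y", "u", "o"]

def tokenize(text: str) -> list[str]:
    # words and standalone punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def stem(word: str) -> str:
    # strip the first matching suffix, keeping a stem of at least 3 chars
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

tokens = tokenize("Slováci hovoria slovami.")
print([stem(t.lower()) for t in tokens])
```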
- tokenization, segmentation
- implementation in C++
- license: GPL v3.0
- UPOS, UD
- models trained on UD
- implementation in Python/PyTorch, command-line interface
- license: MIT
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in C++, bindings in Java, Python, Perl, C#, command-line interface, web service interface
- license: MPL v2.0
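Tools like the ones above emit Universal Dependencies annotations in the CoNLL-U format, which is plain tab-separated text and easy to consume with the standard library. The sample annotation below is illustrative:

```python
# Sketch: reading CoNLL-U output (the format UD taggers/parsers emit).
# The two-token sample and its tags are illustrative, not real tool output.
conllu = """\
1\tAhoj\tahoj\tINTJ\t_\t_\t0\troot\t_\t_
2\tsvet\tsvet\tNOUN\t_\t_\t1\tvocative\t_\t_
"""

rows = []
for line in conllu.splitlines():
    # skip comment lines and sentence-separating blank lines
    if not line or line.startswith("#"):
        continue
    cols = line.split("\t")
    rows.append({"form": cols[1], "lemma": cols[2], "upos": cols[3]})

print(rows)
```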
- tokenization, stemming, lemmatization, diacritic restoration, POS (SNK), NER
- web service interface only
- license: ?
- tokenization, segmentation, lemmatization, POS (OpenNLP, SNK), UD (CoreNLP), NER
- web interface at http://nlp.bednarik.top/
- Swagger REST API
- implementation in Java/DL4J
- source code available
- license: GNU AGPLv3
- Web-based Visualisation of Slovak word vectors
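Visualising word vectors typically means projecting them from hundreds of dimensions down to 2-D, commonly via PCA. A sketch of that projection step, with random vectors standing in for real Slovak embeddings:

```python
# Sketch: projecting word vectors to 2-D plot coordinates via PCA (SVD).
# Random vectors stand in for real word embeddings.
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 300))   # 100 "words", 300-dim embeddings

# centre the data, then keep the top-2 principal directions
centred = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
coords = centred @ vt[:2].T             # (100, 2) coordinates for plotting
print(coords.shape)
```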
- Lemmatization for 25 languages
- In Python
- Slovak trained on UDP corpus