
Models and Tools

General Models

Slovak Spacy Model

Contains Floret Word Vectors.
Tagger module uses Slovak National Corpus Tagset.
Morphological analyzer uses the Universal Dependencies tagset and is trained on the Slovak Dependency Treebank.
Lemmatizer is trained on the Slovak Dependency Treebank.
Named entity recognizer is trained separately on the WikiAnn dataset.

Word embeddings

JÚĽŠ word embeddings

word form, POS+lemma, fastText embeddings
source: JÚĽŠ + SNK (prim), also from older prim-* corpora
description: https://www.juls.savba.sk/semä.html

ELMo word embeddings

source: Wikipedia, Common Crawl

fastText word embeddings - Common Crawl

source: Common Crawl

fastText word embeddings - Wikipedia

source: Wikipedia

Document Embeddings

Language Agnostic BERT model

Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.
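
As a rough illustration of how a sentence encoder such as LaBSE can be used for cross-lingual similarity, the sketch below embeds a Slovak and an English sentence and compares them. It assumes the `sentence-transformers` package and the Hugging Face model id `sentence-transformers/LaBSE`, neither of which is named in this list; the helper names `cosine`, `embed` and `demo` are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed(sentences, model_name="sentence-transformers/LaBSE"):
    """Return one embedding per sentence.

    Assumption: the sentence-transformers package is installed and the
    model id above is available on the Hugging Face Hub.
    """
    # Imported lazily so that cosine() works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(sentences)


def demo():
    """Compare a Slovak sentence with its English translation."""
    sk, en = embed(["Dnes je pekný deň.", "Today is a nice day."])
    return cosine(sk, en)
```

Calling `demo()` downloads the model on first use. A translation pair would be expected to score noticeably higher than two unrelated sentences.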

LASER

Language agnostic sentence embeddings.

E5

Multilingual document embeddings, based on Sentence Transformers.

Transformers

Slovak Mistral 7B

A Slovak language version of the Mistral-7B-v0.1 large language model with 7 billion parameters.
Obtained by full-parameter fine-tuning of Mistral-7B-v0.1 on data from the Araneum Slovacum VII Maximum web corpus.

Slovak T5 Base

Monolingual Slovak T5 model with 300 million parameters
Trained from scratch on large web corpus

SlovakBert

Slovak RoBERTa base language model
trained on web corpus

sk-bert

Slovak BERT by Ardevop SK

ApoTro/slovak-t5-small

Slovak T5 small, created by fine-tuning mT5 small.

VoxPopuli Slovak model

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Facebook's Wav2Vec2 base model pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on the transcribed Slovak (sk) data

m-BERT

multilingual BERT, trained on Wikipedia

Translation models

Helsinki Opus NLP

Bidirectional translation models between Slovak and multiple other languages
Also available for HF Transformers
Contains SentencePiece tokenization models
For MarianNMT
English, German, Finnish, French, Swedish
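
A minimal sketch of using one of these models through HF Transformers, assuming the `transformers` package (with PyTorch and SentencePiece) is installed. The naming pattern `Helsinki-NLP/opus-mt-<src>-<tgt>` is an assumption about how the checkpoints are published on the Hugging Face Hub, and not every language pair necessarily exists; `opus_model_name` and `translate` are hypothetical helper names.

```python
def opus_model_name(src, tgt):
    """Build a Hub model id for a language pair.

    Assumption: checkpoints follow the 'Helsinki-NLP/opus-mt-<src>-<tgt>'
    pattern; check the Hub to confirm the pair you need is published.
    """
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(texts, src="sk", tgt="en"):
    """Translate a list of sentences with a MarianMT checkpoint."""
    # Imported lazily so that opus_model_name() works without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_model_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)

    # Tokenize as a padded batch of PyTorch tensors.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

For example, `translate(["Dobrý deň, ako sa máte?"])` would download the Slovak-to-English checkpoint on first use and return a list with one English string.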

NLLB - No Language Left Behind

Multilingual translation model for Fairseq
Also provides language detection models
Original Fairseq repository
HuggingFace Transformers integration - distilled 600M version

MadLad400

Uses T5 architecture
https://arxiv.org/abs/2309.04662
Supports 400 languages, including Slovak
Previously used for Google Translate

M2M 100

Multilingual translation model with Slovak support.
Built for Fairseq
HuggingFace Transformers model

Flores

Flores101: Large-Scale Multilingual Machine Translation
Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition.
Includes Slovak language
For fairseq

Tools and demos

Slovak Hunspell

Spelling Dictionary
List of common names, abbreviations, pejoratives and neologisms.

Stanza

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in Python/PyTorch, command-line interface, web service interface
license: Apache v2.0
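
The annotation layers listed above can be read out of a Stanza pipeline roughly as follows. This is a sketch that assumes the `stanza` package is installed and that the Slovak models are fetched under the language code `sk`; `format_row` and `analyze` are hypothetical helper names.

```python
def format_row(text, lemma, upos, xpos, head, deprel):
    """Render one word as a tab-separated, CoNLL-U-like line."""
    return "\t".join([text, lemma, upos, xpos, str(head), deprel])


def analyze(text):
    """Run tokenization, tagging, lemmatization and parsing on Slovak text."""
    # Imported lazily so that format_row() works without stanza installed.
    import stanza

    stanza.download("sk")  # fetches the Slovak models on first use
    nlp = stanza.Pipeline("sk", processors="tokenize,pos,lemma,depparse")
    doc = nlp(text)

    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append(
                format_row(word.text, word.lemma, word.upos,
                           word.xpos, word.head, word.deprel)
            )
    return rows
```

Each returned line carries the word form, lemma, UPOS tag, XPOS (SNK) tag, head index and dependency relation for one word.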

NLP-Cube

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in Python/dyNET, command-line interface, web service interface
license: Apache v2.0

Slovak Elasticsearch

tokenization, stemming

Slovak lexer

tokenization, segmentation
implementation in C++
license: GPL v3.0

dl4dp

UPOS, UD
models trained on UD
implementation in Python/PyTorch, command-line interface
license: MIT

UDPipe

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in C++, bindings in Java, Python, Perl, C#, command-line interface, web service interface
license: MPL v2.0

NLP4SK

tokenization, stemming, lemmatization, diacritic restoration, POS (SNK), NER
web service interface only
license: ?

NLP Tools

tokenization, segmentation, lemmatization, POS (OpenNLP, SNK), UD (CoreNLP), NER
web interface at http://nlp.bednarik.top/
Swagger REST API
implementation in Java/DL4J
source code available
license: GNU AGPLv3

Semä

Web-based Visualisation of Slovak word vectors

Simplemma

Lemmatization for 25 languages
In Python
Slovak trained on UDP corpus
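
A minimal sketch of lemmatizing Slovak text with Simplemma, assuming a recent version of the `simplemma` package in which `simplemma.lemmatize(token, lang=...)` is available. The regex tokenizer is a stand-in chosen for this example, not part of Simplemma, and `simple_tokens` and `lemmatize_text` are hypothetical helper names.

```python
import re


def simple_tokens(text):
    """Split text into word tokens (Unicode-aware, drops punctuation)."""
    return re.findall(r"\w+", text)


def lemmatize_text(text, lang="sk"):
    """Return one lemma per token of the input text."""
    # Imported lazily so that simple_tokens() works without simplemma.
    import simplemma

    return [simplemma.lemmatize(token, lang=lang) for token in simple_tokens(text)]
```

Because Simplemma works token by token without context, ambiguous word forms may receive a lemma that does not fit the sentence.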