Models and Tools

General Models

  • Contains Floret Word Vectors.
  • Tagger module uses the Slovak National Corpus tagset.
  • Morphological analyzer uses the Universal Dependencies tagset and is trained on the Slovak Dependency Treebank.
  • Lemmatizer is trained on the Slovak Dependency Treebank.
  • Named entity recognizer is trained separately on the WikiAnn dataset.

Word embeddings

  • source: Wikipedia, Common Crawl
  • source: Common Crawl
  • source: Wikipedia

Document Embeddings

  • Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained to produce sentence embeddings for 109 languages.
  • Language-agnostic sentence embeddings.
  • Multilingual document embeddings, based on Sentence Transformers.
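
As a minimal sketch of how sentence embeddings such as LaBSE are typically used, the snippet below ranks candidate sentences by cosine similarity to a query. The helper names (`cosine_similarity`, `rank_by_similarity`, `labse_embed`) are hypothetical. `labse_embed` assumes that the `sentence-transformers` package is installed and that the checkpoint is published as `sentence-transformers/LaBSE` on the Hugging Face Hub; neither is stated in this list.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: str,
    candidates: Sequence[str],
    embed: Callable[[List[str]], List[Vector]],
) -> List[Tuple[str, float]]:
    """Embed the query and candidates together, then sort candidates by similarity."""
    vectors = embed([query] + list(candidates))
    query_vec, candidate_vecs = vectors[0], vectors[1:]
    scored = [
        (text, cosine_similarity(query_vec, vec))
        for text, vec in zip(candidates, candidate_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def labse_embed(texts: List[str]) -> List[Vector]:
    """Assumption: sentence-transformers is installed and the model id below is valid."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer("sentence-transformers/LaBSE")
    return model.encode(list(texts)).tolist()
```

Calling `rank_by_similarity("Kde je stanica?", ["Where is the station?", "I like cheese."], labse_embed)` would download the model on first use. Any other function that maps a list of strings to a list of vectors can be passed in place of `labse_embed`.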

Transformers

  • Slovak-language version of the Mistral-7B-v0.1 large language model with 7 billion parameters.
  • Obtained by full-parameter fine-tuning of the Mistral-7B-v0.1 large language model on data from the Araneum Slovacum VII Maximum web corpus.
  • Monolingual Slovak T5 model with 300 million parameters
  • Trained from scratch on a large web corpus
  • Slovak RoBERTa base language model
  • Trained on a web corpus
  • Slovak BERT by Ardevop SK

Slovak T5 small, created by fine-tuning mT5 small.

  • VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
  • Facebook's Wav2Vec2 base model, pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on the transcribed Slovak (sk) data
  • Multilingual BERT, trained on Wikipedia

Translation models

  • Bidirectional translation models between Slovak and multiple other languages
  • Also available for HF Transformers
  • Contains SentencePiece tokenization models
  • For MarianNMT
  • English, German, Finnish, French, Swedish
  • Flores101: Large-Scale Multilingual Machine Translation
  • Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition.
  • Includes Slovak language
  • For fairseq
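
For the MarianNMT models that are also available for HF Transformers, a minimal usage sketch might look like the following. It assumes the checkpoints follow the `Helsinki-NLP/opus-mt-{source}-{target}` naming pattern used for OPUS-MT models on the Hugging Face Hub. This list does not confirm that pattern, and whether a checkpoint exists for a given language pair has to be checked on the Hub. The helper names are hypothetical.

```python
from typing import List


def opus_mt_model_id(source_lang: str, target_lang: str) -> str:
    """Build a Hub model id under the assumed OPUS-MT naming pattern."""
    source = source_lang.strip().lower()
    target = target_lang.strip().lower()
    if not source or not target:
        raise ValueError("both language codes are required")
    if source == target:
        raise ValueError("source and target languages must differ")
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


def translate(texts: List[str], source_lang: str, target_lang: str) -> List[str]:
    """Translate a batch of sentences; downloads the checkpoint on first use."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    model_id = opus_mt_model_id(source_lang, target_lang)
    tokenizer = MarianTokenizer.from_pretrained(model_id)
    model = MarianMTModel.from_pretrained(model_id)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

For example, `translate(["Dobrý deň."], "sk", "en")` would request the Slovak-to-English checkpoint, provided one is published under that id.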

Tools and demos

  • Spelling Dictionary
  • List of common names, abbreviations, pejoratives and neologisms.
  • tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
  • models trained on UD
  • implementation in Python/PyTorch, command-line interface, web service interface
  • license: Apache v2.0
  • tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
  • models trained on UD
  • implementation in Python/dyNET, command-line interface, web service interface
  • license: Apache v2.0
  • tokenization, stemming
  • tokenization, segmentation
  • implementation in C++
  • license: GPL v3.0
  • UPOS, UD
  • models trained on UD
  • implementation in Python/PyTorch, command-line interface
  • license: MIT
  • tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
  • models trained on UD
  • implementation in C++, bindings in Java, Python, Perl, C#, command-line interface, web service interface
  • license: MPL v2.0
  • tokenization, stemming, lemmatization, diacritic restoration, POS (SNK), NER
  • web service interface only
  • license: ?
  • tokenization, segmentation, lemmatization, POS (OpenNLP, SNK), UD (CoreNLP), NER
  • web interface at http://nlp.bednarik.top/
  • Swagger REST API
  • implementation in Java/DL4J
  • source code available
  • license: GNU AGPLv3
  • Web-based Visualisation of Slovak word vectors
  • Lemmatization for 25 languages
  • In Python
  • Slovak trained on UDP corpus