This repository contains experiments related to dense representations of sentences in Polish. It includes code for evaluating different sentence representation methods such as aggregated word embeddings or neural sentence encoders, both multilingual and language-specific. This source code has been used in the following publications:
[1] Evaluation of Sentence Representations in Polish
The paper contains an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks. Datasets for these tasks are distributed with the repository, and two of them were released specifically for this evaluation: the SICK (Sentences Involving Compositional Knowledge) corpus translated to Polish and the 8TAGS classification dataset. Pre-trained models used in this study are available for download in a separate repository: Polish NLP Resources.
BibTeX
@inproceedings{dadas-etal-2020-evaluation,
title = "Evaluation of Sentence Representations in {P}olish",
author = "Dadas, S{\l}awomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.207",
pages = "1674--1680",
language = "English",
ISBN = "979-10-95546-34-4",
}
[2] Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases
In this publication, we show a simple method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer.
BibTeX
@inproceedings{9945218,
author={Dadas, S{\l}awomir},
booktitle={2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
title={Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases},
year={2022},
volume={},
number={},
pages={371-378},
doi={10.1109/SMC53654.2022.9945218}
}
- 29.12.2022 - Our supervised datasets are now available on the Hugging Face Hub.
- 20.01.2022 - New code example added: training sentence encoders on paraphrase pairs mined from the OPUS parallel corpus.
- 23.10.2020 - Added pre-trained multilingual models from the Sentence-Transformers library
- 02.09.2020 - Added LaBSE multilingual sentence encoder
- 09.05.2020 - Added new Polish RoBERTa models
- 03.03.2020 - Added XLM-RoBERTa (base) model
- 02.02.2020 - Added detailed results of static word embedding models with dimensionalities from 300 to 800
- 01.02.2020 - Added Polish RoBERTa model and multilingual XLM-RoBERTa (large) model
# | Method | Language | WCCRS Hotels | WCCRS Medicine | SICK‑E | SICK‑R | 8TAGS |
---|---|---|---|---|---|---|---|
Word embeddings | |||||||
1 | Random | n/a | 65.83 | 60.64 | 72.77 | 0.628 | 31.95 |
2.a | Word2Vec (300d) | Polish | 78.19 | 73.23 | 75.42 | 0.746 | 70.27 |
2.b | Word2Vec (500d) | Polish | 81.72 | 73.98 | 76.25 | 0.764 | 70.56 |
2.c | Word2Vec (800d) | Polish | 82.24 | 73.88 | 75.60 | 0.772 | 70.79 |
3.a | GloVe (300d) | Polish | 80.05 | 72.54 | 73.81 | 0.756 | 69.78 |
3.b | GloVe (500d) | Polish | 80.76 | 72.54 | 75.09 | 0.761 | 70.27 |
3.c | GloVe (800d) | Polish | 81.79 | 74.32 | 76.48 | 0.779 | 70.63 |
4.a | FastText (300d) | Polish | 80.31 | 72.64 | 75.19 | 0.729 | 69.24 |
4.b | FastText (500d) | Polish | 80.31 | 73.88 | 76.66 | 0.755 | 70.22 |
4.c | FastText (800d) | Polish | 80.95 | 72.94 | 77.09 | 0.768 | 69.95 |
Language models | |||||||
5.a | ELMo (all) | Polish | 85.52 | 78.42 | 77.15 | 0.789 | 71.41 |
5.b | ELMo (top) | Polish | 83.20 | 78.17 | 74.05 | 0.756 | 71.41 |
6 | Flair | Polish | 80.82 | 75.46 | 78.43 | 0.743 | 65.62 |
7.a | RoBERTa-base (all) | Polish | 85.78 | 78.96 | 78.82 | 0.799 | 70.27 |
7.b | RoBERTa-base (top) | Polish | 84.62 | 79.36 | 76.09 | 0.750 | 70.33 |
7.c | RoBERTa-large (all) | Polish | 89.12 | 84.74 | 78.13 | 0.820 | 75.75 |
7.d | RoBERTa-large (top) | Polish | 88.93 | 83.11 | 75.56 | 0.767 | 76.67 |
8.a | XLM-RoBERTa-base (all) | Multilingual | 85.52 | 78.81 | 75.25 | 0.734 | 68.78 |
8.b | XLM-RoBERTa-base (top) | Multilingual | 82.37 | 75.26 | 64.47 | 0.579 | 69.81 |
8.c | XLM-RoBERTa-large (all) | Multilingual | 87.39 | 83.60 | 74.34 | 0.764 | 73.33 |
8.d | XLM-RoBERTa-large (top) | Multilingual | 85.07 | 78.91 | 61.50 | 0.568 | 73.35 |
9 | BERT | Multilingual | 76.83 | 72.54 | 73.83 | 0.698 | 65.05 |
Sentence encoders | |||||||
10 | LASER | Multilingual | 81.21 | 78.17 | 82.21 | 0.825 | 64.91 |
11 | USE | Multilingual | 79.47 | 73.78 | 82.14 | 0.833 | 69.92 |
12 | LaBSE | Multilingual | 85.52 | 80.89 | 81.57 | 0.825 | 72.35 |
13.a | Sentence-Transformers (distiluse-base-multilingual-cased-v2) | Multilingual | 79.99 | 75.80 | 78.90 | 0.807 | 70.86 |
13.b | Sentence-Transformers (xlm-r-distilroberta-base-paraphrase-v1) | Multilingual | 82.63 | 80.84 | 81.35 | 0.839 | 70.61 |
13.c | Sentence-Transformers (xlm-r-bert-base-nli-stsb-mean-tokens) | Multilingual | 81.02 | 79.95 | 79.09 | 0.820 | 69.12 |
13.d | Sentence-Transformers (distilbert-multilingual-nli-stsb-quora-ranking) | Multilingual | 80.05 | 74.64 | 79.41 | 0.817 | 69.28 |
Table: Evaluation of sentence representations on four classification tasks and one semantic relatedness task (SICK-R). For classification, we report the accuracy of each model. For semantic relatedness, we report the Pearson correlation between true and predicted relatedness scores.
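The two metrics named in the table caption can be sketched in plain Python as follows. This is only an illustration of the definitions, not the evaluation code shipped with the repository; the function names and the toy data are assumptions made for this example.

```python
import math


def accuracy(gold_labels, predicted_labels):
    """Percentage of predictions that match the gold labels."""
    matches = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return 100.0 * matches / len(gold_labels)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return covariance / denominator


# Toy data (hypothetical, not taken from any of the datasets).
gold_classes = [0, 1, 2, 1]
predicted_classes = [0, 1, 1, 1]
gold_relatedness = [1.0, 2.5, 4.0, 5.0]
predicted_relatedness = [1.2, 2.0, 4.1, 4.8]

print("accuracy:", accuracy(gold_classes, predicted_classes))
print("pearson:", pearson_correlation(gold_relatedness, predicted_relatedness))
```

Accuracy is expressed here on a 0 to 100 scale and the correlation on a -1 to 1 scale, which matches the scales of the numbers in the table.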
- Randomly initialized word embeddings.
- Word2Vec (Distributed Representations of Words and Phrases and their Compositionality) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
- GloVe (GloVe: Global Vectors for Word Representation) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings. [Download]
- FastText (Enriching Word Vectors with Subword Information) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
- ELMo language model described in the paper Deep contextualized word representations, pre-trained by us for Polish. In the `all` variant, we construct the word representation by concatenating all hidden states of the LM. In the `top` variant, only the top LM layer is used as the word representation. [Download]
- Flair language model described in Contextual String Embeddings for Sequence Labeling. We concatenate the outputs of the original `pl-forward` and `pl-backward` pre-trained language models available in the Flair framework.
- RoBERTa language model described in RoBERTa: A Robustly Optimized BERT Pretraining Approach, pre-trained by us for Polish. [Download]
- XLM-RoBERTa is a large, multilingual language model trained by Facebook on 2.5 TB of text extracted from CommonCrawl. We evaluate two pre-trained architectures: the base and the large model. More information can be found in the paper Unsupervised Cross-lingual Representation Learning at Scale. [Download]
- Original BERT language model by Google, described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. We use the `bert-base-multilingual-cased` version. [Download]
- LASER, a multilingual sentence encoder by Facebook, presented in Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. [Download]
- USE, a multilingual sentence encoder by Google, presented in Multilingual Universal Sentence Encoder for Semantic Retrieval.
- The language-agnostic BERT sentence embedding (LaBSE).
- Pre-trained models from the Sentence-Transformers library.
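The difference between the `all` and `top` variants described for ELMo above can be illustrated with a small sketch. It assumes the hidden states are already available as nested lists indexed by layer and then by token, with the top layer stored last; the function name and the data layout are hypothetical and do not reflect how the repository actually accesses the models.

```python
def word_representations(layer_states, variant="all"):
    """Build one vector per token from per-layer hidden states.

    layer_states[l][t] is the hidden state (a list of floats) of token t
    at layer l; the last entry is assumed to be the top layer.
    """
    if variant == "top":
        # Use only the top layer as the word representation.
        return [list(vector) for vector in layer_states[-1]]
    if variant == "all":
        # Concatenate the hidden states of every layer for each token.
        return [
            [value for vector in token_states for value in vector]
            for token_states in zip(*layer_states)
        ]
    raise ValueError("variant must be 'all' or 'top'")


# Three layers, two tokens, two dimensions per hidden state (toy values).
states = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
]
print(word_representations(states, "all"))
print(word_representations(states, "top"))
```

With three layers of size two, the `all` variant yields six-dimensional token vectors, while the `top` variant keeps the original two dimensions.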
Figure: Evaluation of aggregation techniques for word embedding models with different dimensionalities. Baseline models use simple averaging, SIF is a method proposed by Arora et al. (2017), and Max Pooling is a concatenation of the arithmetic mean and the max-pooled vector of the word embeddings.
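The aggregation techniques named in the caption can be sketched roughly as below. This is a simplified illustration rather than the repository's implementation: the function names, the smoothing constant `a` and the `word_probability` table are assumptions, and the SIF function covers only the weighted-average step (the subsequent removal of the first principal component described by Arora et al. is omitted).

```python
def mean_pool(vectors):
    """Baseline: arithmetic mean of the word vectors, per dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def max_pool(vectors):
    """Per-dimension maximum over the word vectors."""
    return [max(column) for column in zip(*vectors)]


def mean_max_pool(vectors):
    """'Max Pooling' in the caption: mean vector concatenated with max vector."""
    return mean_pool(vectors) + max_pool(vectors)


def sif_weighted_mean(words, vectors, word_probability, a=1e-3):
    """SIF weighting step only: each word is scaled by a / (a + p(word)).

    word_probability maps a word to its estimated unigram probability
    (hypothetical input; it would normally come from corpus counts).
    """
    weights = [a / (a + word_probability[word]) for word in words]
    count = len(vectors)
    return [
        sum(weight * value for weight, value in zip(weights, column)) / count
        for column in zip(*vectors)
    ]


# Toy sentence with two words and two-dimensional embeddings.
toy_words = ["dobry", "hotel"]
toy_vectors = [[1.0, 4.0], [3.0, 2.0]]
toy_probability = {"dobry": 0.001, "hotel": 0.001}

print(mean_pool(toy_vectors))
print(mean_max_pool(toy_vectors))
print(sif_weighted_mean(toy_words, toy_vectors, toy_probability))
```

Note that concatenating the mean and max vectors doubles the dimensionality of the sentence representation, whereas plain averaging and SIF keep the dimensionality of the word embeddings.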
`evaluate_all.py` is used to evaluate all available models. Run `evaluate.py [model_name] [model_params]` to evaluate a single model. For example, `evaluate.py word2vec` runs the evaluation on the `word2vec_100_3_polish.bin` model.
Please note that for static embeddings and ELMo, you need to manually download the model from Polish NLP Resources and place it in the `resources` directory.
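As a rough illustration of the workflow above, the following sketch checks whether a manually downloaded model file is present in the `resources` directory and assembles the command line for `evaluate.py`. The helper names are hypothetical, and launching the script with the current Python interpreter is an assumption; the script name, the `word2vec` argument and the model file name are taken from the description above.

```python
import sys
from pathlib import Path


def missing_models(resources_dir, filenames):
    """Return the model files that are not yet present in resources_dir."""
    root = Path(resources_dir)
    return [name for name in filenames if not (root / name).is_file()]


def build_command(model_name, *model_params):
    """Assemble the argument list for: evaluate.py [model_name] [model_params]."""
    return [sys.executable, "evaluate.py", model_name, *map(str, model_params)]


missing = missing_models("resources", ["word2vec_100_3_polish.bin"])
if missing:
    print("Download these files into resources/ first:", ", ".join(missing))
else:
    # The list can be passed to subprocess.run() to start the evaluation.
    print("Ready to run:", " ".join(build_command("word2vec")))
```

The sketch only prints what it would do; it does not start the evaluation itself.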
This evaluation is based on SentEval, which we modified to support models, tasks and preprocessing for the Polish language. We would like to thank the authors of the SentEval toolkit for making their code available.
Two tasks in this study are based on the Wroclaw Corpus of Consumer Reviews. We would like to thank the authors for making this data collection available.