Evaluation of Sentence Representations in Polish

This repository contains experiments related to dense representations of sentences in Polish. It includes code for evaluating different sentence representation methods such as aggregated word embeddings or neural sentence encoders, both multilingual and language-specific. This source code has been used in the following publications:

[1] Evaluation of Sentence Representations in Polish

The paper presents an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks. The datasets for these tasks are distributed with the repository, and two of them are released specifically for this evaluation: the SICK (Sentences Involving Compositional Knowledge) corpus translated to Polish and the 8TAGS classification dataset. The pre-trained models used in this study are available for download in a separate repository: Polish NLP Resources.

BibTeX
@inproceedings{dadas-etal-2020-evaluation,
  title = "Evaluation of Sentence Representations in {P}olish",
  author = "Dadas, Slawomir  and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}",
  booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
  month = may,
  year = "2020",
  address = "Marseille, France",
  publisher = "European Language Resources Association",
  url = "https://aclanthology.org/2020.lrec-1.207",
  pages = "1674--1680",
  language = "English",
  ISBN = "979-10-95546-34-4",
}

[2] Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases

In this publication, we show a simple method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer.

BibTeX
@inproceedings{9945218,
  author={Dadas, S{\l}awomir},
  booktitle={2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC)}, 
  title={Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases}, 
  year={2022},
  volume={},
  number={},
  pages={371-378},
  doi={10.1109/SMC53654.2022.9945218}
}
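
As a rough illustration of the approach from [2], the sketch below fine-tunes a Polish transformer on paraphrase pairs with the sentence-transformers library. This is not the training code used in the paper: the base model name, the paraphrases.tsv file and the hyperparameters are placeholders, and mean pooling stands in for the recurrent pooling layer described above.

```python
# Rough sketch (not the authors' training code) of fine-tuning a sentence
# encoder on mined paraphrase pairs with the sentence-transformers library.
# "paraphrases.tsv", the base model name and all hyperparameters are
# placeholders; mean pooling stands in for the recurrent pooling layer.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap a pre-trained Polish transformer with a pooling layer.
word_model = models.Transformer("sdadas/polish-roberta-base-v2", max_seq_length=128)
pooling = models.Pooling(word_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_model, pooling])

# Each line of the (hypothetical) TSV file holds one mined paraphrase pair.
train_examples = []
with open("paraphrases.tsv", encoding="utf-8") as f:
    for line in f:
        sent_a, sent_b = line.rstrip("\n").split("\t")
        train_examples.append(InputExample(texts=[sent_a, sent_b]))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
# Contrastive objective that treats the other in-batch sentences as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("polish-paraphrase-encoder")
```

The repository also contains a complete code example for training sentence encoders on paraphrase pairs mined from the OPUS parallel corpus (see the updates below).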

Updates:

  • 29.12.2022 - Our supervised datasets are now available on the Hugging Face Hub
  • 20.01.2022 - New code example added: training sentence encoders on paraphrase pairs mined from the OPUS parallel corpus
  • 23.10.2020 - Added pre-trained multilingual models from the Sentence-Transformers library
  • 02.09.2020 - Added LaBSE multilingual sentence encoder
  • 09.05.2020 - Added new Polish RoBERTa models
  • 03.03.2020 - Added XLM-RoBERTa (base) model
  • 02.02.2020 - Added detailed results of static word embedding models with dimensionalities from 300 to 800
  • 01.02.2020 - Added Polish RoBERTa model and multilingual XLM-RoBERTa (large) model

Evaluation results:

| # | Method | Language | WCCRS Hotels | WCCRS Medicine | SICK-E | SICK-R | 8TAGS |
|---|--------|----------|--------------|----------------|--------|--------|-------|
| | **Word embeddings** | | | | | | |
| 1 | Random | n/a | 65.83 | 60.64 | 72.77 | 0.628 | 31.95 |
| 2.a | Word2Vec (300d) | Polish | 78.19 | 73.23 | 75.42 | 0.746 | 70.27 |
| 2.b | Word2Vec (500d) | Polish | 81.72 | 73.98 | 76.25 | 0.764 | 70.56 |
| 2.c | Word2Vec (800d) | Polish | 82.24 | 73.88 | 75.60 | 0.772 | 70.79 |
| 3.a | GloVe (300d) | Polish | 80.05 | 72.54 | 73.81 | 0.756 | 69.78 |
| 3.b | GloVe (500d) | Polish | 80.76 | 72.54 | 75.09 | 0.761 | 70.27 |
| 3.c | GloVe (800d) | Polish | 81.79 | 74.32 | 76.48 | 0.779 | 70.63 |
| 4.a | FastText (300d) | Polish | 80.31 | 72.64 | 75.19 | 0.729 | 69.24 |
| 4.b | FastText (500d) | Polish | 80.31 | 73.88 | 76.66 | 0.755 | 70.22 |
| 4.c | FastText (800d) | Polish | 80.95 | 72.94 | 77.09 | 0.768 | 69.95 |
| | **Language models** | | | | | | |
| 5.a | ELMo (all) | Polish | 85.52 | 78.42 | 77.15 | 0.789 | 71.41 |
| 5.b | ELMo (top) | Polish | 83.20 | 78.17 | 74.05 | 0.756 | 71.41 |
| 6 | Flair | Polish | 80.82 | 75.46 | 78.43 | 0.743 | 65.62 |
| 7.a | RoBERTa-base (all) | Polish | 85.78 | 78.96 | 78.82 | 0.799 | 70.27 |
| 7.b | RoBERTa-base (top) | Polish | 84.62 | 79.36 | 76.09 | 0.750 | 70.33 |
| 7.c | RoBERTa-large (all) | Polish | 89.12 | 84.74 | 78.13 | 0.820 | 75.75 |
| 7.d | RoBERTa-large (top) | Polish | 88.93 | 83.11 | 75.56 | 0.767 | 76.67 |
| 8.a | XLM-RoBERTa-base (all) | Multilingual | 85.52 | 78.81 | 75.25 | 0.734 | 68.78 |
| 8.b | XLM-RoBERTa-base (top) | Multilingual | 82.37 | 75.26 | 64.47 | 0.579 | 69.81 |
| 8.c | XLM-RoBERTa-large (all) | Multilingual | 87.39 | 83.60 | 74.34 | 0.764 | 73.33 |
| 8.d | XLM-RoBERTa-large (top) | Multilingual | 85.07 | 78.91 | 61.50 | 0.568 | 73.35 |
| 9 | BERT | Multilingual | 76.83 | 72.54 | 73.83 | 0.698 | 65.05 |
| | **Sentence encoders** | | | | | | |
| 10 | LASER | Multilingual | 81.21 | 78.17 | 82.21 | 0.825 | 64.91 |
| 11 | USE | Multilingual | 79.47 | 73.78 | 82.14 | 0.833 | 69.92 |
| 12 | LaBSE | Multilingual | 85.52 | 80.89 | 81.57 | 0.825 | 72.35 |
| 13a | Sentence-Transformers (distiluse-base-multilingual-cased-v2) | Multilingual | 79.99 | 75.80 | 78.90 | 0.807 | 70.86 |
| 13b | Sentence-Transformers (xlm-r-distilroberta-base-paraphrase-v1) | Multilingual | 82.63 | 80.84 | 81.35 | 0.839 | 70.61 |
| 13c | Sentence-Transformers (xlm-r-bert-base-nli-stsb-mean-tokens) | Multilingual | 81.02 | 79.95 | 79.09 | 0.820 | 69.12 |
| 13d | Sentence-Transformers (distilbert-multilingual-nli-stsb-quora-ranking) | Multilingual | 80.05 | 74.64 | 79.41 | 0.817 | 69.28 |

Table: Evaluation of sentence representations on four classification tasks and one semantic relatedness task (SICK-R). For the classification tasks, we report the accuracy of each model; for semantic relatedness, we report the Pearson correlation between true and predicted relatedness scores.
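
For reference, the two reported metrics can be computed as in the minimal sketch below; the toy arrays stand in for the predictions produced by the SentEval-based evaluation and are not results from this study.

```python
# Minimal sketch of the two reported metrics; the toy arrays below stand in
# for the outputs of the SentEval-based evaluation and are not real results.
import numpy as np
from scipy.stats import pearsonr

# Classification tasks (WCCRS Hotels, WCCRS Medicine, SICK-E, 8TAGS): accuracy.
y_true = np.array([0, 1, 1, 2, 0])
y_pred = np.array([0, 1, 0, 2, 0])
accuracy = 100.0 * np.mean(y_true == y_pred)

# Semantic relatedness (SICK-R): Pearson correlation between true and
# predicted relatedness scores.
gold = np.array([4.5, 1.2, 3.8, 2.9])
pred = np.array([4.1, 1.5, 3.6, 3.2])
pearson, _ = pearsonr(gold, pred)

print(f"accuracy = {accuracy:.2f}, pearson = {pearson:.3f}")
```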

Evaluated methods:

  1. Randomly initialized word embeddings
  2. Word2Vec (Distributed Representations of Words and Phrases and their Compositionality) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
  3. GloVe (Glove: Global Vectors for Word Representation) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings. [Download]
  4. FastText (Enriching Word Vectors with Subword Information) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
  5. ELMo language model described in Deep contextualized word representations paper, pre-trained by us for Polish. In the all variant, we construct the word representation by concatenating all hidden states of the LM. In the top variant, only the top LM layer is used as word representation. [Download]
  6. Flair language model described in Contextual String Embeddings for Sequence Labeling. We concatenate the outputs of the original pl-forward and pl-backward pre-trained language models available in the Flair framework.
  7. RoBERTa language model described in RoBERTa: A Robustly Optimized BERT Pretraining Approach, pre-trained by us for Polish. [Download]
  8. XLM-RoBERTa is a large multilingual language model trained by Facebook on 2.5 TB of text extracted from CommonCrawl. We evaluate two pre-trained architectures: the base and the large model. More information can be found in the paper Unsupervised Cross-lingual Representation Learning at Scale. [Download]
  9. Original BERT language model by Google described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. We use the bert-base-multilingual-cased version. [Download]
  10. Multilingual sentence encoder by Facebook, presented in Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. [Download]
  11. Multilingual sentence encoder by Google, presented in Multilingual Universal Sentence Encoder for Semantic Retrieval.
  12. The language-agnostic BERT sentence embedding (LaBSE).
  13. Pre-trained models from the Sentence-Transformers library (a short usage sketch follows this list).
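
The following minimal sketch, assuming the sentence-transformers package is installed, shows how one of the models from method 13 can be used to embed Polish sentences; the example sentences and the cosine-similarity step are illustrative only.

```python
# Minimal usage sketch for method 13 above, assuming the sentence-transformers
# package is installed; the example sentences are illustrative only.
from sentence_transformers import SentenceTransformer, util

# One of the multilingual models evaluated in the table above.
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

sentences = [
    "Wczoraj byłem w kinie.",            # "I went to the cinema yesterday."
    "Obejrzałem wczoraj film w kinie.",  # "I watched a film at the cinema yesterday."
]
embeddings = model.encode(sentences, convert_to_tensor=True)  # shape: (2, dim)
similarity = util.cos_sim(embeddings[0], embeddings[1])       # cosine similarity
print(float(similarity))
```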


Figure: Evaluation of aggregation techniques for word embedding models with different dimensionalities. Baseline models use simple averaging; SIF is the weighting method proposed by Arora et al. (2017); Max Pooling denotes the concatenation of the arithmetic mean and the max-pooled vector of the word embeddings.
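
A minimal sketch of the two simplest aggregation schemes from the figure is given below; the random matrix stands in for the word embeddings of a single sentence, and the SIF weighting of Arora et al. (2017) is omitted.

```python
# Sketch of the two simplest aggregation schemes from the figure, assuming
# `vectors` holds the word embeddings of a single sentence as a
# (num_words, dim) matrix; SIF weighting (Arora et al., 2017) is omitted.
import numpy as np

vectors = np.random.rand(7, 300)   # placeholder word embeddings

# Baseline: simple averaging of word vectors, dim = 300.
mean_pooled = vectors.mean(axis=0)

# "Max Pooling" variant: concatenation of the mean and the element-wise
# maximum, dim = 600.
mean_max_pooled = np.concatenate([mean_pooled, vectors.max(axis=0)])
```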

Usage

evaluate_all.py is used to evaluate all available models.
Run evaluate.py [model_name] [model_params] to evaluate a single model. For example, evaluate.py word2vec runs the evaluation on the word2vec_100_3_polish.bin model. Please note that in the case of static embeddings and ELMo, you need to manually download the model from Polish NLP Resources and place it in the resources directory.

Acknowledgements

This evaluation is based on SentEval, which we modified to support models, tasks, and preprocessing for the Polish language. We would like to thank the authors of the SentEval toolkit for making their code available.

Two tasks in this study are based on the Wroclaw Corpus of Consumer Reviews. We would like to thank the authors for making this data collection available.