GitHub - sdadas/polish-sentence-evaluation: Evaluation of Sentence Representations in Polish

Evaluation of Sentence Representations in Polish

This repository contains experiments related to dense representations of sentences in Polish. It includes code for evaluating different sentence representation methods such as aggregated word embeddings or neural sentence encoders, both multilingual and language-specific. This source code has been used in the following publications:

[1] Evaluation of Sentence Representations in Polish

The paper evaluates eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks. Datasets for these tasks are distributed with the repository, and two of them were released specifically for this evaluation: the SICK (Sentences Involving Compositional Knowledge) corpus translated into Polish and the 8TAGS classification dataset. The pre-trained models used in this study are available for download in a separate repository: Polish NLP Resources.

BibTeX

@inproceedings{dadas-etal-2020-evaluation,
  title = "Evaluation of Sentence Representations in {P}olish",
  author = "Dadas, S{\l}awomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}",
  booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
  month = may,
  year = "2020",
  address = "Marseille, France",
  publisher = "European Language Resources Association",
  url = "https://aclanthology.org/2020.lrec-1.207",
  pages = "1674--1680",
  language = "English",
  ISBN = "979-10-95546-34-4",
}

[2] Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases

In this publication, we show a simple method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer.

BibTeX

@inproceedings{9945218,
  author={Dadas, S{\l}awomir},
  booktitle={2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC)}, 
  title={Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases}, 
  year={2022},
  volume={},
  number={},
  pages={371-378},
  doi={10.1109/SMC53654.2022.9945218}
}

Updates:

29.12.2022 - Our supervised datasets are now available on the Huggingface Hub.
20.01.2022 - New code example added: training sentence encoders on paraphrase pairs mined from OPUS parallel corpus.
23.10.2020 - Added pre-trained multilingual models from the Sentence-Transformers library
02.09.2020 - Added LaBSE multilingual sentence encoder
09.05.2020 - Added new Polish RoBERTa models
03.03.2020 - Added XLM-RoBERTa (base) model
02.02.2020 - Added detailed results of static word embedding models with dimensionalities from 300 to 800
01.02.2020 - Added Polish RoBERTa model and multilingual XLM-RoBERTa (large) model

Evaluation results:

#	Method	Language	WCCRS Hotels	WCCRS Medicine	SICK‑E	SICK‑R	8TAGS
Word embeddings
1	Random	n/a	65.83	60.64	72.77	0.628	31.95
2.a	Word2Vec (300d)	Polish	78.19	73.23	75.42	0.746	70.27
2.b	Word2Vec (500d)	Polish	81.72	73.98	76.25	0.764	70.56
2.c	Word2Vec (800d)	Polish	82.24	73.88	75.60	0.772	70.79
3.a	GloVe (300d)	Polish	80.05	72.54	73.81	0.756	69.78
3.b	GloVe (500d)	Polish	80.76	72.54	75.09	0.761	70.27
3.c	GloVe (800d)	Polish	81.79	74.32	76.48	0.779	70.63
4.a	FastText (300d)	Polish	80.31	72.64	75.19	0.729	69.24
4.b	FastText (500d)	Polish	80.31	73.88	76.66	0.755	70.22
4.c	FastText (800d)	Polish	80.95	72.94	77.09	0.768	69.95
Language models
5.a	ELMo (all)	Polish	85.52	78.42	77.15	0.789	71.41
5.b	ELMo (top)	Polish	83.20	78.17	74.05	0.756	71.41
6	Flair	Polish	80.82	75.46	78.43	0.743	65.62
7.a	RoBERTa-base (all)	Polish	85.78	78.96	78.82	0.799	70.27
7.b	RoBERTa-base (top)	Polish	84.62	79.36	76.09	0.750	70.33
7.c	RoBERTa-large (all)	Polish	89.12	84.74	78.13	0.820	75.75
7.d	RoBERTa-large (top)	Polish	88.93	83.11	75.56	0.767	76.67
8.a	XLM-RoBERTa-base (all)	Multilingual	85.52	78.81	75.25	0.734	68.78
8.b	XLM-RoBERTa-base (top)	Multilingual	82.37	75.26	64.47	0.579	69.81
8.c	XLM-RoBERTa-large (all)	Multilingual	87.39	83.60	74.34	0.764	73.33
8.d	XLM-RoBERTa-large (top)	Multilingual	85.07	78.91	61.50	0.568	73.35
9	BERT	Multilingual	76.83	72.54	73.83	0.698	65.05
Sentence encoders
10	LASER	Multilingual	81.21	78.17	82.21	0.825	64.91
11	USE	Multilingual	79.47	73.78	82.14	0.833	69.92
12	LaBSE	Multilingual	85.52	80.89	81.57	0.825	72.35
13.a	Sentence-Transformers (distiluse-base-multilingual-cased-v2)	Multilingual	79.99	75.80	78.90	0.807	70.86
13.b	Sentence-Transformers (xlm-r-distilroberta-base-paraphrase-v1)	Multilingual	82.63	80.84	81.35	0.839	70.61
13.c	Sentence-Transformers (xlm-r-bert-base-nli-stsb-mean-tokens)	Multilingual	81.02	79.95	79.09	0.820	69.12
13.d	Sentence-Transformers (distilbert-multilingual-nli-stsb-quora-ranking)	Multilingual	80.05	74.64	79.41	0.817	69.28

Table: Evaluation of sentence representations on four classification tasks and one semantic relatedness task (SICK-R). For classification, we report the accuracy of each model. For semantic relatedness, we report the Pearson correlation between true and predicted relatedness scores.
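
The two metrics named in the table caption can be illustrated with a minimal sketch. This is not the repository's evaluation code; the function names and toy data below are assumptions made for illustration, and accuracy is scaled to a percentage only because the table values appear to be percentages.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Percentage of predictions that match the gold labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return 100.0 * float(np.mean(y_true == y_pred))


def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both vectors
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


# Toy data (hypothetical, not taken from the benchmark).
gold_labels = [0, 1, 2, 1]
pred_labels = [0, 1, 1, 1]
gold_scores = [1.0, 2.5, 4.0, 5.0]
pred_scores = [1.2, 2.0, 3.9, 4.6]

acc = accuracy(gold_labels, pred_labels)
r = pearson(gold_scores, pred_scores)
print(f"accuracy = {acc:.2f}")
print(f"pearson r = {r:.3f}")
```

Accuracy applies to the classification columns (WCCRS Hotels, WCCRS Medicine, SICK-E, 8TAGS), while the Pearson coefficient applies to SICK-R, which is why that column lies in a different value range.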

Evaluated methods:

1. Randomly initialized word embeddings.
2. Word2Vec (Distributed Representations of Words and Phrases and their Compositionality) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
3. GloVe (Glove: Global Vectors for Word Representation) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings. [Download]
4. FastText (Enriching Word Vectors with Subword Information) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
5. ELMo language model described in the paper Deep contextualized word representations, pre-trained by us for Polish. In the "all" variant, we construct the word representation by concatenating all hidden states of the LM. In the "top" variant, only the top LM layer is used as the word representation. [Download]
6. Flair language model described in Contextual String Embeddings for Sequence Labeling. We concatenate the outputs of the original pl-forward and pl-backward pre-trained language models available in the Flair framework.
7. RoBERTa language model described in RoBERTa: A Robustly Optimized BERT Pretraining Approach, pre-trained by us for Polish. [Download]
8. XLM-RoBERTa, a large multilingual language model trained by Facebook on 2.5 TB of text extracted from CommonCrawl. We evaluate two pre-trained architectures: the base and the large model. More information can be found in the paper Unsupervised Cross-lingual Representation Learning at Scale. [Download]
9. Original BERT language model by Google, described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. We use the bert-base-multilingual-cased version. [Download]
10. LASER, a multilingual sentence encoder by Facebook, presented in Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. [Download]
11. USE, a multilingual sentence encoder by Google, presented in Multilingual Universal Sentence Encoder for Semantic Retrieval.
12. The language-agnostic BERT sentence embedding (LaBSE).
13. Pre-trained models from the Sentence-Transformers library.
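
The difference between the "all" and "top" variants of the language models can be sketched as follows. This is an illustration only, not the repository's implementation: the function name `word_vectors`, the layer count, and the convention that the last list element is the top layer are assumptions.

```python
import numpy as np


def word_vectors(hidden_states, variant="all"):
    """Build per-token vectors from a list of per-layer hidden states.

    hidden_states: list of arrays, one per layer, each of shape
    (num_tokens, dim). The last element is assumed to be the top layer.
    """
    if variant == "all":
        # Concatenate every layer along the feature axis.
        return np.concatenate(hidden_states, axis=-1)
    if variant == "top":
        # Keep only the top layer.
        return hidden_states[-1]
    raise ValueError(f"unknown variant: {variant!r}")


# Toy hidden states: 3 layers, 5 tokens, 8 dimensions per layer.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(5, 8)) for _ in range(3)]

all_vecs = word_vectors(layers, "all")  # shape (5, 24)
top_vecs = word_vectors(layers, "top")  # shape (5, 8)
print(all_vecs.shape, top_vecs.shape)
```

The "all" variant yields vectors that grow with the number of layers, so the classifier trained on top of it sees a wider input than in the "top" variant.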

Figure: Evaluation of aggregation techniques for word embedding models with different dimensionalities. Baseline models use simple averaging; SIF is the method proposed by Arora et al. (2017); Max Pooling is a concatenation of the arithmetic mean and the max-pooled vector of the word embeddings.
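
The three aggregation techniques compared in the figure can be sketched as below. This is a minimal illustration under stated assumptions, not the code used to produce the figure: the function names, the toy vocabulary with its unigram probabilities, and the smoothing constant `a=1e-3` are all hypothetical. The SIF sketch follows the general recipe of Arora et al. (2017): a frequency-weighted average followed by removal of the projection onto the first singular vector.

```python
import numpy as np


def mean_pooling(vectors):
    """Baseline: arithmetic mean of the word vectors (num_tokens, dim)."""
    return vectors.mean(axis=0)


def mean_max_pooling(vectors):
    """Concatenate the mean vector with the element-wise max vector."""
    return np.concatenate([vectors.mean(axis=0), vectors.max(axis=0)])


def sif_embeddings(sentences, word_probs, a=1e-3):
    """SIF-style sentence embeddings.

    sentences: list of (tokens, vectors) pairs, vectors of shape
    (num_tokens, dim). word_probs maps a token to its unigram probability.
    """
    weighted = []
    for tokens, vectors in sentences:
        # Frequent words get small weights: a / (a + p(w)).
        weights = np.array([a / (a + word_probs.get(t, 0.0)) for t in tokens])
        weighted.append((weights[:, None] * vectors).sum(axis=0) / len(tokens))
    matrix = np.stack(weighted)
    # First right singular vector = dominant common direction.
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    u = vt[0]
    # Remove each embedding's projection onto that direction.
    return matrix - np.outer(matrix @ u, u)


# Toy data: hypothetical vocabulary with made-up unigram probabilities.
rng = np.random.default_rng(0)
dim = 4
vocab = {"kot": 0.05, "pies": 0.04, "i": 0.30, "śpi": 0.01, "biega": 0.01}
emb = {w: rng.normal(size=dim) for w in vocab}
docs = [["kot", "śpi"], ["pies", "biega"], ["kot", "i", "pies"]]
sentences = [(toks, np.stack([emb[t] for t in toks])) for toks in docs]

sif = sif_embeddings(sentences, vocab)
print(mean_pooling(sentences[0][1]).shape)      # (4,)
print(mean_max_pooling(sentences[0][1]).shape)  # (8,)
print(sif.shape)                                # (3, 4)
```

Note that the mean and max concatenation doubles the dimensionality of the sentence vector, while SIF keeps it unchanged but requires word frequency statistics and a pass over the whole set of sentences.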

Usage

Run evaluate_all.py to evaluate all available models.
Run evaluate.py [model_name] [model_params] to evaluate a single model. For example, evaluate.py word2vec runs the evaluation on the word2vec_100_3_polish.bin model. Note that for static embeddings and ELMo, you need to manually download the model from Polish NLP Resources and place it in the resources directory.
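
Since a missing model file is an easy mistake to make, a small pre-flight wrapper can check for the file before launching the script. The sketch below is an assumption-laden illustration: the helpers `build_command` and `run_evaluation` are hypothetical, the model file is assumed to sit directly in the resources directory, and any model parameters are passed through verbatim because their format is not described here.

```python
import subprocess
import sys
from pathlib import Path


def build_command(model_name, *model_params):
    """Assemble the command line: evaluate.py [model_name] [model_params]."""
    return [sys.executable, "evaluate.py", model_name, *model_params]


def run_evaluation(model_name, model_file, *model_params, resources_dir="resources"):
    """Run evaluate.py only if the manually downloaded model file exists."""
    path = Path(resources_dir) / model_file
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; download it from Polish NLP Resources first"
        )
    return subprocess.run(build_command(model_name, *model_params), check=True)


# Show the command that would be executed for the word2vec example.
print(" ".join(build_command("word2vec")))
```

A call such as `run_evaluation("word2vec", "word2vec_100_3_polish.bin")` would then fail early with a clear message instead of partway through the evaluation.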

Acknowledgements

This evaluation is based on SentEval, which we modified to support models, tasks and preprocessing for the Polish language. We would like to thank the authors of the SentEval toolkit for making their code available.

Two tasks in this study are based on the Wroclaw Corpus of Consumer Reviews. We would like to thank the authors for making this data collection available.
