A benchmark for testing embedding models on Azerbaijani sentence similarity tasks. This project evaluates a range of pre-trained embedding models on multiple Azerbaijani sentence similarity datasets by computing cosine similarities between sentence pairs and assessing performance with Pearson correlation.
Understanding sentence similarity is crucial for various Natural Language Processing (NLP) applications such as information retrieval, paraphrase detection, and semantic search. This project provides a benchmarking framework to evaluate different embedding models' effectiveness in capturing semantic similarities between Azerbaijani sentences.
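To make the evaluation procedure concrete, here is a minimal sketch of the core loop, assuming `sentence-transformers`, `numpy`, and `scipy` are installed. The model name, sentences, and gold scores are illustrative; this is not the exact code in `main.py`.

```python
from sentence_transformers import SentenceTransformer
from scipy.stats import pearsonr
import numpy as np

# Illustrative sketch of the evaluation loop; not the exact code in main.py.
model = SentenceTransformer("sentence-transformers/LaBSE")

sentences1 = ["Pişik həyətdə oynayır.", "O, kitab oxuyur.", "Hava bu gün soyuqdur."]
sentences2 = ["Pişik bayırda oynayır.", "Uşaq top oynayır.", "Bu gün hava soyuqdur."]
gold_scores = [4.5, 1.0, 4.8]  # hypothetical human-annotated similarity scores

# Encode both sides of each pair.
emb1 = model.encode(sentences1, convert_to_numpy=True, normalize_embeddings=True)
emb2 = model.encode(sentences2, convert_to_numpy=True, normalize_embeddings=True)

# With normalized embeddings, the row-wise dot product is the cosine similarity.
cosine = np.sum(emb1 * emb2, axis=1)

# Pearson correlation between predicted similarities and gold scores.
correlation, _ = pearsonr(cosine, gold_scores)
print(f"Pearson correlation: {correlation:.4f}")
```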
- Support for Multiple Models: Evaluate a wide range of pre-trained embedding models.
- Multiple Datasets: Test models on various Azerbaijani sentence similarity datasets.
- Efficient Processing: Utilize batching and GPU acceleration for faster computations.
- Comprehensive Metrics: Calculate Pearson correlation to assess model performance.
- Easy Integration: Simple command-line interface for seamless benchmarking.
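Models from `sentence-transformers` handle pooling internally, but plain encoders such as `bert-base-multilingual-cased` do not ship with a pooling head, so a pooling strategy is needed to obtain one vector per sentence. The sketch below shows batched mean pooling with `torch` and `transformers`; it illustrates the batching and GPU features above but is not necessarily the pooling strategy used in `main.py`.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed_batched(sentences, model_name="bert-base-multilingual-cased",
                  batch_size=32, device="cpu"):
    """Mean-pool token embeddings over the attention mask, batch by batch.

    Pass device="cuda" to run on GPU. Illustrative sketch, not main.py itself.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    embeddings = []
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            inputs = tokenizer(batch, padding=True, truncation=True,
                               return_tensors="pt").to(device)
            hidden = model(**inputs).last_hidden_state               # (B, T, H)
            mask = inputs["attention_mask"].unsqueeze(-1).float()    # (B, T, 1)
            pooled = (hidden * mask).sum(1) / mask.sum(1)            # mean over real tokens
            embeddings.append(pooled.cpu())
    return torch.cat(embeddings)
```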
The benchmarking framework utilizes several Azerbaijani sentence similarity datasets. Each dataset contains pairs of sentences along with human-annotated similarity scores.
- **Azerbaijani STS Benchmark**
  - Identifier: `LocalDoc/Azerbaijani-STSBenchmark`
  - Description: A standard benchmark for sentence similarity in Azerbaijani, containing diverse sentence pairs.
- **Azerbaijani BIOSSES STS**
  - Identifier: `LocalDoc/Azerbaijani-biosses-sts`
  - Description: Based on the BIOSSES dataset, adapted for the Azerbaijani language.
- **Azerbaijani SICK-R STS**
  - Identifier: `LocalDoc/Azerbaijani-sickr-sts`
  - Description: Adapted from the SICK-R dataset for Azerbaijani sentence similarity tasks.
- **Azerbaijani STS12 STS**
  - Identifier: `LocalDoc/Azerbaijani-sts12-sts`
  - Description: Part of the STS12 series, tailored for Azerbaijani.
- **Azerbaijani STS13 STS**
  - Identifier: `LocalDoc/Azerbaijani-sts13-sts`
  - Description: Part of the STS13 series, tailored for Azerbaijani.
- **Azerbaijani STS15 STS**
  - Identifier: `LocalDoc/Azerbaijani-sts15-sts`
  - Description: Part of the STS15 series, tailored for Azerbaijani.
- **Azerbaijani STS16 STS**
  - Identifier: `LocalDoc/Azerbaijani-sts16-sts`
  - Description: Part of the STS16 series, tailored for Azerbaijani.
Each dataset should be structured in a format compatible with the Hugging Face `datasets` library, containing:

- `sentence1`: The first sentence in the pair.
- `sentence2`: The second sentence in the pair.
- `score`: Human-annotated similarity score (use `scaled_score` if applicable).
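For example, one of the datasets can be loaded and inspected as follows. This sketch assumes a `train` split; check each dataset card on Hugging Face for the actual split names.

```python
from datasets import load_dataset

# Assumes a "train" split; check the dataset card for the actual split names.
dataset = load_dataset("LocalDoc/Azerbaijani-STSBenchmark", split="train")

pair = dataset[0]
print(pair["sentence1"])
print(pair["sentence2"])
print(pair["score"])
```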
To evaluate different embedding models, provide a text file (`models.txt`) listing the Hugging Face model names you wish to benchmark, one per line (see the example `models.txt` under Usage below).
The benchmarking script accepts several parameters to customize the evaluation process:

- `--models_file`: (Required) Path to the text file containing the list of model names to evaluate.
- `--output`: (Optional) Output CSV file to save results. Defaults to `benchmark_results.csv`.
- `--batch_size`: (Optional) Batch size for processing sentences. Defaults to `32`.
- `--device`: (Optional) Device to run the model on (`cpu` or `cuda`). Defaults to `cpu`.
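For reference, a CLI like this is typically wired up with `argparse`. The sketch below mirrors the flags and defaults documented above; it is an illustration, not necessarily the actual parser in `main.py`.

```python
import argparse

# Illustrative parser mirroring the documented flags; not necessarily
# the exact code in main.py.
parser = argparse.ArgumentParser(
    description="Benchmark embedding models on Azerbaijani STS datasets."
)
parser.add_argument("--models_file", required=True,
                    help="Text file with one Hugging Face model name per line.")
parser.add_argument("--output", default="benchmark_results.csv",
                    help="CSV file to save results to.")
parser.add_argument("--batch_size", type=int, default=32,
                    help="Number of sentences encoded per batch.")
parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu",
                    help="Device to run the models on.")
args = parser.parse_args()
```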
1. **Clone the Repository**

   ```bash
   git clone https://github.com/LocalDoc-Azerbaijan/STS-Benchmark.git
   cd STS-Benchmark
   ```

2. **Create a Virtual Environment**

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. **Install Dependencies** (see the command after this list)
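The dependency install command is not shown in this excerpt; assuming the repository ships a standard `requirements.txt`, installation would look like:

```bash
pip install -r requirements.txt
```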
To evaluate models on Azerbaijani sentence similarity datasets, follow these steps:

Create a `models.txt` file where each line contains the Hugging Face model name you want to evaluate. Example:

```
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
sentence-transformers/distiluse-base-multilingual-cased-v1
bert-base-multilingual-cased
xlm-roberta-base
```

Then run the benchmark:

```bash
python main.py --models_file models.txt --output results.csv --batch_size 32 --device cuda
```
- `--models_file`: (Required) Path to the `models.txt` file containing a list of model names to evaluate.
- `--output`: (Optional) Name of the output CSV file to save results. Defaults to `benchmark_results.csv`.
- `--batch_size`: (Optional) Batch size for processing sentences. Larger sizes may improve processing speed but require more memory. Defaults to `32`.
- `--device`: (Optional) Device to run the model on (`cpu` or `cuda`). Use `cuda` to enable GPU acceleration if available.
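If you are unsure whether a GPU is visible, a quick check like this (assuming PyTorch is installed) tells you which value to pass to `--device`:

```python
import torch

# Prints "cuda" if a GPU is available, otherwise "cpu";
# pass the printed value to --device.
print("cuda" if torch.cuda.is_available() else "cpu")
```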
To run the benchmark on GPU with a batch size of 64 and save results to `custom_results.csv`, use the following command:

```bash
python main.py --models_file models.txt --output custom_results.csv --batch_size 64 --device cuda
```
Once the benchmarking completes, the results are saved in the specified output file (`benchmark_results.csv` by default). This file contains the Pearson correlation score for each model-dataset pair along with an average score for each model.
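To rank models after a run, the CSV can be loaded with pandas. The column names below are an assumption about the exact CSV header, based on the results table that follows.

```python
import pandas as pd

# "Model" and "Average Pearson" are assumed to match the CSV header,
# based on the results table in the README.
results = pd.read_csv("benchmark_results.csv")
ranked = results.sort_values("Average Pearson", ascending=False)
print(ranked[["Model", "Average Pearson"]].to_string(index=False))
```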
| Model | STSBenchmark | biosses-sts | sickr-sts | sts12-sts | sts13-sts | sts15-sts | sts16-sts | Average Pearson |
|---|---|---|---|---|---|---|---|---|
| sentence-transformers/LaBSE | 0.7363 | 0.8148 | 0.7067 | 0.7050 | 0.6535 | 0.7514 | 0.7070 | 0.7250 |
| antoinelouis/colbert-xm | 0.5830 | 0.2486 | 0.5921 | 0.5593 | 0.5559 | 0.5404 | 0.5289 | 0.5155 |
| intfloat/multilingual-e5-large-instruct | 0.7572 | 0.8139 | 0.7328 | 0.7646 | 0.6318 | 0.7542 | 0.7092 | 0.7377 |
| intfloat/multilingual-e5-large | 0.7485 | 0.7714 | 0.7271 | 0.7170 | 0.6496 | 0.7570 | 0.7255 | 0.7280 |
| intfloat/multilingual-e5-base | 0.6960 | 0.8185 | 0.6950 | 0.6752 | 0.5899 | 0.7186 | 0.6790 | 0.6960 |
| intfloat/multilingual-e5-small | 0.7376 | 0.7917 | 0.7190 | 0.7441 | 0.6286 | 0.7461 | 0.7026 | 0.7242 |
| BAAI/bge-m3 | 0.7927 | 0.6672 | 0.7758 | 0.8122 | 0.7312 | 0.7831 | 0.7416 | 0.7577 |
- Individual Scores: Each dataset column represents the Pearson correlation score between predicted and true similarity scores for the given model.
- Average Pearson: This column shows the average Pearson correlation score across all datasets for each model, providing an overall measure of performance.
Use these scores to compare the effectiveness of different models in capturing sentence similarity for Azerbaijani text.