The goal of this repository is to provide a useful framework for evaluating and comparing different Machine Translation engines against each other on a variety of datasets.
Evaluating Machine Translation quality involves several complexities, such as finding suitable test data, agreeing on a metric, attaching different translation engines, and so on.
Our own goal was to build a system that allows us to compare several MT engines against each other on a variety of datasets and to repeat the evaluation at a regular cadence. Surprisingly, we could not find any existing open-source solution that fit our requirements, so we created our own and decided to publish it in case somebody finds it useful.
We also provide the current NMT evaluation results that we obtained during our own evaluation.
First, clone our repository, cd to the root folder, and install all the required libraries
pip install -r requirements.txt
Evaluation happens in 3 stages:
- Download and prepare the test datasets
- Translate all datasets with the required MT engines
- Evaluate the translations with the required metrics
All evaluation scripts are stored in the evaluation folder.
# Scripts are called as modules (python -m ...) rather than plain scripts
# to simplify imports from sibling folders
python -m evaluation.01_download_datasets
python -m evaluation.02_translate_datasets
python -m evaluation.03_evaluate_translations
# Final evaluation results will be in benchmarks folder:
# benchmarks/main_evaluation_set.tsv
# Careful – it took about a day to run the whole evaluation, even on a good GPU machine
# Careful – if you are providing your own keys for cloud services, it will cost you money (about $30 for Azure and $50 for Google for this evaluation).
These three commands will download the same datasets that we used and evaluate them with the same engines that we evaluated.
- Create your own yaml config in configs/dataset_to_download
The config file should contain the source and target languages and the list of datasets that you want to download from OPUS:
source_language: en
target_language: es
datasets:
# Name of the dataset should be one from
# https://opus.nlpl.eu/opusapi/?corpora=True
- name: TED2013
# Type is your own tag to differentiate between datasets
type: General
# test, dev, train are fixed subsets
# You can skip any of them in the config but cannot add your own
# We guarantee that there are no data duplicates between these three sets
test:
# Required number of lines in the dataset
# Actual size may be smaller due to removal of duplicates and short lines
size: 10
# Whether to remove duplicates from the dataset
no_duplicates: true
# Minimal length in characters of the string to be included in the dataset
min_str_length: 30
dev:
size: 100
no_duplicates: true
min_str_length: 30
train:
size: 100
no_duplicates: true
min_str_length: 30
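If you are not sure which dataset names OPUS accepts for the name field, you can query the OPUS API mentioned in the comments above. A minimal sketch using the requests library (this helper is not part of the repository, and the exact response format is defined by OPUS):

# Hypothetical helper for listing the corpus names that OPUS accepts
# for the "name" field in the config above
import requests

response = requests.get('https://opus.nlpl.eu/opusapi/', params={'corpora': 'True'})
response.raise_for_status()
# Inspect the returned JSON to pick valid corpus names
print(response.json())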
- Run the evaluation scripts, specifying your config file
# Using quick_evaluation_set.yaml as an example
# configs for translation and evaluation steps will be created automatically
python -m evaluation.01_download_datasets datasets_to_download=quick_evaluation_set
python -m evaluation.02_translate_datasets datasets_to_translate=quick_evaluation_set
python -m evaluation.03_evaluate_translations datasets_to_evaluate=quick_evaluation_set
# Final evaluation results will be in benchmarks folder:
# benchmarks/quick_evaluation_set.tsv
You may want to evaluate translation quality on your own dataset or public data not from OPUS.
- Create two parallel files from your dataset, one with the source data and one with the reference translations.
- Create your own yaml config in configs/dataset_to_translate
The config file should contain the source and target languages and the list of datasets that you want to translate:
source_language: en
target_language: es
datasets:
# For your own datasets you may use whatever name you want
- name: PrivateDataset
# Type is your own tag to differentiate between datasets
type: General
# test, dev, train are fixed subsets
# You can skip any of them in the config but cannot add your own
test:
# Paths to the source and reference files
# They should be either absolute or relative to the project root folder
source: NMT_datasets/en.es/PrivateDataset/test.en
target: NMT_datasets/en.es/PrivateDataset/test.es
# Number of lines in source and reference files
size: 9
dev:
source: NMT_datasets/en.es/PrivateDataset/dev.en
target: NMT_datasets/en.es/PrivateDataset/dev.es
size: 87
train:
source: NMT_datasets/en.es/PrivateDataset/train.en
target: NMT_datasets/en.es/PrivateDataset/train.es
size: 90
- Run the evaluation scripts starting from the second script, specifying your config file
# Using my_private_dataset as an example
# config for evaluation step will be created automatically
python -m evaluation.02_translate_datasets datasets_to_translate=my_private_dataset
python -m evaluation.03_evaluate_translations datasets_to_evaluate=my_private_dataset
# Final evaluation results will be in benchmarks folder:
# benchmarks/my_private_dataset.tsv
If your MT engine is callable from Python code, you can add it to this evaluation framework.
- Create a new class for your MT engine in the translator folder.
translator/translator_empty.py can be used as a reference.
from translator.translator_base import TranslatorBase
# Empty translator to test overall translator architecture
class TranslatorEmpty( TranslatorBase ):
# required init params:
# max_batch_lines - how many lines your engine can translate in one go
# max_batch_chars - how many characters should be in one translation batch
# max_file_lines - if we process a text file,
# how many lines we can read from the file in one go to split into batches
# (rule of thumb is 10*max_batch_lines)
# verbose - do we want info output or only errors
# You can add your own additional parameters here
def __init__( self,
max_batch_lines = 4, max_batch_chars = 1000, max_file_lines = 40,
verbose = False ):
super(TranslatorEmpty, self).__init__(
max_batch_lines = max_batch_lines, max_batch_chars = max_batch_chars, max_file_lines = max_file_lines,
verbose = verbose )
# Your translator object should be created here
self.logger.info(f'Created Empty translator engine')
# Main function to translate lines
# We guarantee that len(lines) <= max_batch_lines and
# len( ''.join(lines) ) <= max_batch_chars
def _translate_lines( self, lines ):
result = [line for line in lines]
return result
# Setting source and target language for next calls of _translate_lines
# Function is needed if we are using multilingual translation engines
def _set_language_pair( self, source_language, target_language ):
# Config your engine for the language pair
return
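For a less trivial illustration, here is a hypothetical sketch of an engine that calls a REST-based MT service. The endpoint URL and its request/response format are invented for this example; only the TranslatorBase pattern mirrors the repository.

import requests
from translator.translator_base import TranslatorBase

# Hypothetical translator wrapping a REST-based MT service
# (endpoint and payload format are made up for illustration)
class TranslatorCustom( TranslatorBase ):
    def __init__( self,
                  endpoint = 'https://example.com/translate',
                  max_batch_lines = 8, max_batch_chars = 2000, max_file_lines = 80,
                  verbose = False ):
        super(TranslatorCustom, self).__init__(
            max_batch_lines = max_batch_lines, max_batch_chars = max_batch_chars,
            max_file_lines = max_file_lines, verbose = verbose )
        self.endpoint = endpoint
        self.source_language = None
        self.target_language = None
        self.logger.info('Created Custom translator engine')

    # Batches passed here already respect max_batch_lines / max_batch_chars
    def _translate_lines( self, lines ):
        response = requests.post( self.endpoint, json = {
            'source': self.source_language,
            'target': self.target_language,
            'lines': list(lines) } )
        response.raise_for_status()
        return response.json()['translations']

    def _set_language_pair( self, source_language, target_language ):
        self.source_language = source_language
        self.target_language = target_language

Any extra __init__ parameters (like endpoint here) can then be supplied through the settings section of the engine config described in the steps below.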
- Add your engine to the Translator enum in translator/create_translator.py
You can set any name for the enum member, and the value should be the full import path of your translator class:
class Translator(Enum):
Custom = 'translator.translator_custom.TranslatorCustom'
...
- Add your engine to the configs/translation_engines/engines.yaml config
# Class name that you set in the previous step
- class: Custom
# Any name you want; it will identify this MT engine
# with the current settings
name: Custom.option.42
# Type is your own tag to differentiate between engines
type: CustomEngine
settings:
# Any settings that you need for the engine;
# they will be passed by name to your engine's __init__ function
option: 42
- Run the evaluation scripts as usual.
Or, alternatively, at step 3 you can create your own engine config file and specify it when calling the translation script:
# Using my_private_dataset and my_private_engine as an example
# config for evaluation step will be created automatically
python -m evaluation.02_translate_datasets datasets_to_translate=my_private_dataset translation_engines=my_private_engine
python -m evaluation.03_evaluate_translations datasets_to_evaluate=my_private_dataset
# Final evaluation results will be in benchmarks folder:
# benchmarks/my_private_dataset.tsv
We evaluated several translation engines for English-to-Spanish translation.
All evaluation results presented here are valid as of 7 July 2022, with the library versions defined in requirements.txt and the datasets and models downloaded on that day.
Please keep in mind that if you try to reproduce our results at a later date, they may differ due to updated models, libraries, and cloud MT engines.
- Azure MT engine https://azure.microsoft.com/en-us/services/cognitive-services/translator/
For the Azure MT engine to work, you will need your own subscription to the Azure MT service, and you will need to provide the key for this service in the AZURE_MT_KEY environment variable.
- Google MT engine https://cloud.google.com/translate/
For the Google MT engine to work, you will need your own subscription to the Google MT service. You will need to create a service account for your service, download its key.json file, and provide the path to this file in the GOOGLE_APPLICATION_CREDENTIALS environment variable.
More information on using Google Cloud with service account can be found here: https://cloud.google.com/translate/docs/setup#creating_service_accounts_and_keys
- Marian NMT + Opus MT
Marian NMT is a very efficient machine translation framework. https://marian-nmt.github.io/
Opus MT is a set of Marian models trained on public data from OPUS. https://github.com/Helsinki-NLP/Opus-MT
In our evaluation, we used the Opus models converted to PyTorch and included in the transformers library. https://huggingface.co/transformers/model_doc/marian.html
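For reference, this is roughly how an Opus MT model is loaded through the transformers library (a minimal standalone sketch; the Helsinki-NLP/opus-mt-en-es checkpoint name is our assumption for the en-es pair):

from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-es'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize, generate and decode a single test sentence
batch = tokenizer(['Machine translation is fun.'], return_tensors='pt', padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))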
- NeMo Machine Translation
NeMo is NVIDIA's toolkit for conversational AI models. It includes a number of pre-trained models for different tasks, including Machine Translation. https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/machine_translation.html
For NeMo, we evaluated two models: a large one (24 encoder layers and 6 decoder layers) and a small one (12 encoder layers and 2 decoder layers).
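A pre-trained NeMo MT model can be loaded and queried roughly like this (a minimal sketch; the nmt_en_es_transformer24x6 model name is our assumption, so check NGC for the exact identifiers):

from nemo.collections.nlp.models import MTEncDecModel

# Model name is an assumption – check NGC for the exact pre-trained identifiers
model = MTEncDecModel.from_pretrained('nmt_en_es_transformer24x6')
print(model.translate(['Machine translation is fun.'], source_lang='en', target_lang='es'))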
- M2M100 and MBart50 are two massive multilingual models from Facebook Research. They support translation between 100 and 50 languages respectively. The usually declared strength of such models is their ability to handle low-resource languages, but we included them in our evaluation to see how well they perform on the common English-Spanish translation pair.
For both models we used their PyTorch Transformers versions.
M2M100 description: Beyond English-Centric Multilingual Machine Translation
We evaluated two M2M100 model versions, with 418M and 1.2B parameters:
MBart50 description: Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
We evaluated two MBart50 model versions, one truly multilingual, supporting any possible translation direction, and another trained to translate only from English:
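As a rough illustration, the smaller M2M100 checkpoint can be driven through transformers like this (a minimal sketch, not the framework's own wrapper):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M')
model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')

# The source language is set on the tokenizer, the target via forced_bos_token_id
tokenizer.src_lang = 'en'
encoded = tokenizer('Machine translation is fun.', return_tensors='pt')
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id('es'))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))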
Our work would be impossible without the OPUS project: https://opus.nlpl.eu/
OPUS provides collections of different public translation datasets, with an API that allows searching for and downloading datasets in one common format.
For our evaluation, we used 7 datasets available at OPUS. For each dataset, we tried to create a test set of approximately 5000 lines.
Dataset | Test set size | Links | Description |
---|---|---|---|
EMEA | 4320 | https://opus.nlpl.eu/EMEA.php, http://www.emea.europa.eu/ | Parallel corpus made out of PDF documents from the European Medicines Agency |
WikiMatrix | 5761 | https://opus.nlpl.eu/WikiMatrix.php, https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix | Parallel corpora from Wikimedia compiled by Facebook Research |
TED2020 | 5249 | https://opus.nlpl.eu/TED2020.php, Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation | A crawl of nearly 4000 TED and TED-X transcripts from July 2020 |
OpenSubtitles | 3097 | https://opus.nlpl.eu/OpenSubtitles-v2018.php, http://www.opensubtitles.org/ | Collection of translated movie subtitles from opensubtitles.org |
EUbookshop | 5313 | https://opus.nlpl.eu/EUbookshop.php, Parallel Data, Tools and Interfaces in OPUS | Corpus of documents from the EU bookshop |
ParaCrawl | 5227 | https://opus.nlpl.eu/ParaCrawl.php, http://paracrawl.eu/download.html | Parallel corpora from Web Crawls collected in the ParaCrawl project |
CCAligned | 5402 | https://opus.nlpl.eu/CCAligned.php, CCAligned: A Massive Collection of Cross-lingual Web-Document Pairs | Parallel corpora from Commoncrawl Snapshots |
While BLEU is considered the most common metric in Machine Translation, there are other options available that may be better tuned for different use cases.
In our evaluation, we implemented several metrics to be able to compare machine translation engines across different dimensions.
In our use case, it turned out that all metrics generally agree with each other (if one engine was better than the other by one metric, it was better by all metrics). For that reason, we use only BLEU to show our final evaluation results.
But it is important to note that in other use cases (specifically for fine-grained machine translation quality comparison) other metrics may prove to be more useful.
Metric | Implementation | Paper |
---|---|---|
BLEU | https://github.com/mjpost/sacrebleu | https://www.aclweb.org/anthology/P02-1040.pdf |
TER | https://github.com/mjpost/sacrebleu | https://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf |
CHRF | https://github.com/mjpost/sacrebleu | https://aclanthology.org/W15-3049.pdf |
ROUGE | https://github.com/pltrdy/rouge | https://aclanthology.org/W04-1013.pdf |
BERTScore | https://github.com/Tiiiger/bert_score | https://arxiv.org/abs/1904.09675 |
COMET | https://github.com/Unbabel/COMET | https://aclanthology.org/2020.emnlp-main.213/ |
All results can be found in benchmarks/main_evaluation_set.(tsv|yaml|xlsx)
When we sorted all MT engines by one metric (say, BLEU), they turned out to be sorted by all the other metrics as well.
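For reference, a corpus-level BLEU score can be computed with sacrebleu roughly like this (a minimal sketch with toy data; the framework's evaluation script applies the metrics to the real translation outputs):

from sacrebleu.metrics import BLEU

# Toy data – hypotheses are MT outputs, references come from the target-side test files
hypotheses = ['La traducción automática es divertida.']
references = [['La traducción automática es divertida.']]

bleu = BLEU()
print(bleu.corpus_score(hypotheses, references))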
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the Apache License 2.0. See LICENSE for more information.
- Anton Masalovich
- GitHub: TonyMas
- Email: anton.masalovich@optum.com
This repo wouldn't be possible without: