Data and code for the paper "On the Limits of Minimal Pairs in Contrastive Evaluation" (BlackboxNLP 2021), containing contrastive translation pairs for targeted evaluation of English→German MT systems.
The evaluation protocol is identical to that of LingEval97 (https://github.com/rsennrich/lingeval97). The difference is that the target sequences of LingEval97 are human-written references, whereas DistilLingEval additionally provides contrastive test sets built from machine translations. A high-level explanation can be found in Jannis' blog, and a more detailed description in the paper.
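As in LingEval97, a system is evaluated on minimal pairs: it assigns a score (e.g., a log-probability) to the correct target and to a minimally different, incorrect variant, and it counts as correct on a pair if the correct target receives the higher score; accuracy is the fraction of such pairs. The sketch below illustrates this accuracy computation; the function is illustrative and not part of this repository's code.

```python
def contrastive_accuracy(correct_scores, incorrect_scores):
    """Illustrative only: fraction of minimal pairs where the correct target
    is scored higher than its contrastive (incorrect) variant."""
    assert len(correct_scores) == len(incorrect_scores)
    num_better = sum(
        1 for good, bad in zip(correct_scores, incorrect_scores) if good > bad
    )
    return num_better / len(correct_scores)


# Example with made-up log-probability scores for three minimal pairs:
print(contrastive_accuracy([-1.2, -0.8, -2.0], [-1.5, -0.7, -2.4]))  # ≈ 0.67
```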
The table below compares the phenomena or error types that are covered by the different test set variants:
| Phenomenon | LingEval97 | DistilLingEval (this repo), human references | DistilLingEval (this repo), machine references |
| --- | --- | --- | --- |
| auxiliary | ✓ | | |
| compound | ✓ | | |
| np_agreement | ✓ | ✓ | ✓ |
| polarity_affix_del | ✓ | ✓ | ✓ |
| polarity_affix_ins | ✓ | | |
| polarity_particle_kein_del | ✓ | ✓ | ✓ |
| polarity_particle_kein_ins | ✓ | | |
| polarity_particle_nicht_del | ✓ | ✓ | ✓ |
| polarity_particle_nicht_ins | ✓ | | |
| subj_adequacy | ✓ | | |
| subj_verb_agreement | ✓ | ✓ | ✓ |
| transliteration | ✓ | | |
| verb_particle | ✓ | | |
| clause_omission | | ✓ | ✓ |
| hypercorrect_genitive¹ ² | | ✓ | ✓ |
| placeholder_ding¹ | | ✓ | ✓ |
Some remarks:
- ¹ The `hypercorrect_genitive` and `placeholder_ding` test sets are based on implausible hypotheses – they were created to demonstrate how human references and machine references can lead to different evaluation results. Beyond that, the two test sets are not very useful, since anything below very high accuracy would be a surprise.
- ² The `hypercorrect_genitive` test sets are based on a variety of parallel corpora.
- Otherwise, LingEval97 and DistilLingEval draw from the same distribution (the wmt09–wmt16 test sets). However, LingEval97 and the human-reference variants of DistilLingEval do not overlap perfectly, because different implementations were used to select sentence pairs and to create the contrastive variants.
- Use DistilLingEval with machine references if you are interested in the likely behavior of your system.
- Use DistilLingEval with human references to evaluate the robustness of your system against hypotheses or target contexts written by humans.
- Use LingEval97 to compare your system to previous work that reports LingEval97 results, or to analyze linguistic phenomena that are only covered by LingEval97.
- Requires Python >= 3.7
- Requires PyTorch (tested with 1.9.0)

```
pip install -r requirements.txt
```

- Optional dependencies for Fairseq models (see the install command below):
  - fairseq==0.10.2
  - fastBPE==0.1.0
  - sacremoses==0.0.45
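If you want to run the Fairseq example below, the optional dependencies can be installed with pip, for example:

```
pip install fairseq==0.10.2 fastBPE==0.1.0 sacremoses==0.0.45
```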
The code sample below uses an MT model trained with Fairseq v0.x (https://github.com/pytorch/fairseq). However, it should be fairly easy to extend the code to another MT framework by wrapping your model into a subclass of `translation_models.TranslationModel`.
```python
from pathlib import Path
from contrastive_evaluation import MTContrastiveEvaluationTask
from translation_models.fairseq_models import load_sota_model

# Warning: This will download a very large model from PyTorch Hub
model = load_sota_model()

testset_dir = Path("data") / "subj_verb_agreement.mt"
task = MTContrastiveEvaluationTask(
    src_path=testset_dir / "src.en",
    ref_path=testset_dir / "tgt.correct.de",
    contrastive_path=testset_dir / "tgt.incorrect.de",
)
result = task.evaluate(model)
print(result)
```
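If you want to use another MT framework, a wrapper could look roughly like the following sketch. The abstract interface of `translation_models.TranslationModel` is defined in this repository's code; the `score` method name and signature shown here are assumptions for illustration, not the actual API.

```python
from typing import List

from translation_models import TranslationModel


class MyFrameworkModel(TranslationModel):
    """Hypothetical wrapper around another MT framework.

    The method name and signature below are assumed for illustration;
    check translation_models.TranslationModel for the actual interface.
    """

    def __init__(self, model):
        self.model = model

    def score(self, source_sentences: List[str], hypothesis_sentences: List[str]) -> List[float]:
        # Return one score per source–hypothesis pair, e.g. the sum of token
        # log-probabilities that the wrapped model assigns to the hypothesis.
        return [
            self.model.log_probability(src, hyp)  # placeholder for your framework's scoring call
            for src, hyp in zip(source_sentences, hypothesis_sentences)
        ]
```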
- Use your MT system to score the translation variants in the data directory. For a given test set (e.g., `subj_verb_agreement.mt`), write the scores line by line into a file, similar to the *.scores files in https://github.com/rsennrich/lingeval97/tree/master/baselines. The first half of the file should contain the scores for the correct translation variants (`tgt.correct.de`), the second half the scores for the incorrect ones (`tgt.incorrect.de`). See the sketch after the command below for one way to assemble such a file.
- Run the following command with the path to the scores file as an argument:
```
python contrastive_evaluation.py \
  --testset-name subj_verb_agreement.mt \
  --scores-path myoutput.scores
```
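For example, a scores file could be assembled as in the sketch below. The `score_sentences` function is a placeholder for your own system's scoring step (e.g., computing a per-sentence log-probability for each source–target pair); it is not provided by this repository.

```python
from pathlib import Path


def score_sentences(sources, targets):
    # Placeholder: replace with your own system's scoring,
    # e.g. one log-probability per source–target pair.
    raise NotImplementedError


testset_dir = Path("data") / "subj_verb_agreement.mt"
sources = (testset_dir / "src.en").read_text(encoding="utf-8").splitlines()
correct = (testset_dir / "tgt.correct.de").read_text(encoding="utf-8").splitlines()
incorrect = (testset_dir / "tgt.incorrect.de").read_text(encoding="utf-8").splitlines()

correct_scores = score_sentences(sources, correct)
incorrect_scores = score_sentences(sources, incorrect)

# First half: scores for the correct variants; second half: scores for the incorrect ones.
with open("myoutput.scores", "w", encoding="utf-8") as f:
    for score in list(correct_scores) + list(incorrect_scores):
        f.write(f"{score}\n")
```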
- Code: MIT License
- Data: Please refer to OPUS for the licenses of the `hypercorrect_genitive` data, and to the WMT19 shared task website for the license of the other data.
```bibtex
@inproceedings{vamvas-sennrich-2021-limits,
    title = "On the Limits of Minimal Pairs in Contrastive Evaluation",
    author = "Vamvas, Jannis and
      Sennrich, Rico",
    booktitle = "Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.blackboxnlp-1.5",
    pages = "58--68",
}
```