Tools for filtering and cleaning parallel and monolingual corpora in order to train better (neural) machine translation systems.
Inspired by the Data Filtering and Data Pre-processing sections of Tilde's WMT17 paper, this repository includes some of the more basic scripts, which help remove most of the junk from parallel corpora.
Requirements:
- Python with langid.py
- PHP
- Moses scripts
- Subword NMT

The Python packages can be installed with pip:

    pip install subword-nmt
    pip install langid
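As a rough illustration of what language-identification filtering with langid.py can look like, the sketch below keeps only the sentence pairs whose source and target sides are detected as the expected languages. It is not one of the scripts shipped in this repository: the function name `filter_by_language` and its parameters are hypothetical, and the only langid.py call it relies on is `langid.classify`, which returns a `(language, score)` tuple.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical helper for illustration only; not part of this repository.
def filter_by_language(
    pairs: Iterable[Tuple[str, str]],
    src_lang: str,
    tgt_lang: str,
    classify: Optional[Callable[[str], Tuple[str, float]]] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs whose detected languages match."""
    if classify is None:
        # Imported lazily so a stub classifier can be passed in instead.
        import langid
        classify = langid.classify

    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # drop pairs with an empty side
        # classify() returns (language_code, score); only the code is used here.
        if classify(src)[0] == src_lang and classify(tgt)[0] == tgt_lang:
            yield src, tgt
```

With langid installed, something like `list(filter_by_language(zip(src_lines, tgt_lines), "en", "lv"))` would return the pairs that pass the check, where `src_lines` and `tgt_lines` are assumed to be aligned lists of sentences. Language identification on very short segments tends to be unreliable, so a real pipeline may want to combine it with other checks.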
If you use this tool, please cite the following paper:
Matīss Rikters (2018). "Impact of Corpora Quality on Neural Machine Translation." In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018).
@inproceedings{Rikters2018BalticHLT,
  author = {Rikters, Matīss},
  booktitle = {Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018)},
  title = {{Impact of Corpora Quality on Neural Machine Translation}},
  address = {Tartu, Estonia},
  year = {2018}
}