Corpora Cleaning Tools

Tools for filtering and cleaning parallel and monolingual corpora to train better (neural) machine translation systems.

These tools are inspired by the Data Filtering and Data Pre-processing sections of Tilde's WMT17 paper. This repository includes some of the more basic scripts, which can help remove most of the junk from parallel corpora.
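To give a feel for what "junk" in a parallel corpus can mean, here is a minimal Python sketch of a few generic sanity filters: empty sides, untranslated (identical) pairs, extreme length mismatch, and exact duplicates. It is an illustration only and not the code from this repository; the function name `clean_parallel` and the `max_ratio` threshold are assumptions made for the example.

```python
def clean_parallel(pairs, max_ratio=3.0):
    """Yield (source, target) pairs that pass a few basic sanity checks.

    `max_ratio` is a hypothetical threshold on the character-length ratio
    between the longer and the shorter side of a pair.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the "translation" is just a copy of the source.
        if src == tgt:
            continue
        # Drop pairs whose lengths differ too much to be plausible translations.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        # Drop exact duplicates, keeping the first occurrence.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt


sample = [
    ("Hello world", "Sveika, pasaule"),
    ("Hello world", "Sveika, pasaule"),    # duplicate pair
    ("Copyright 2018", "Copyright 2018"),  # untranslated
    ("Yes", ""),                           # empty target
    ("Ok", "Šis ir ļoti garš teikums bez atbilstoša avota teksta"),  # length mismatch
]
cleaned = list(clean_parallel(sample))
```

Only the first pair in `sample` passes all four checks. A sensible `max_ratio` depends on the language pair and on whether lengths are counted in characters or tokens.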

Tools included

parallel - tools for parallel corpora
mono - tools for monolingual corpora

Requirements

pip install subword-nmt
pip install langid
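The `langid` requirement suggests that language identification is one of the filtering steps. How the repository's scripts call it is not shown here, so the following is only a hedged sketch: the function `filter_by_language` and its parameters are hypothetical, and the classifier is passed in as a callable so that the example does not depend on any particular library.

```python
def filter_by_language(pairs, classify, src_lang, tgt_lang):
    """Keep pairs whose sides are identified as the expected languages.

    `classify` is any callable that maps a string to a
    (language_code, score) tuple; only the language code is used here.
    """
    kept = []
    for src, tgt in pairs:
        src_ok = classify(src)[0] == src_lang
        tgt_ok = classify(tgt)[0] == tgt_lang
        if src_ok and tgt_ok:
            kept.append((src, tgt))
    return kept
```

With `langid` installed, `langid.classify` returns such a `(language, score)` tuple, so it could be passed in directly, for example `filter_by_language(pairs, langid.classify, "en", "lv")`. Language identification tends to be unreliable on very short segments, so a real pipeline may want to treat those separately.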

Publications

If you use this tool, please cite the following paper:

Matīss Rikters (2018). "Impact of Corpora Quality on Neural Machine Translation." In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018).

@inproceedings{Rikters2018BalticHLT,
	author = {Rikters, Matīss},
	booktitle={Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018)},
	title = {{Impact of Corpora Quality on Neural Machine Translation}},
	address={Tartu, Estonia},
	year = {2018}
}
