
Parallel Corpora Tools

Tools for filtering and cleaning parallel corpora in order to train better neural machine translation systems.

0-do-it-all.sh
- Calls all of the following scripts in order.
- Parameters: the directory of files to clean, followed by the two-letter language codes of the source and target languages. The directory must contain at least one file ending in each language code (for example, parallel.en and parallel.et).
  - 0-do-it-all.sh /home/matiss/data/english-estonian-parallel-data en et
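
The directory requirement above can be checked before any cleaning starts. The following is a minimal sketch, not the actual contents of 0-do-it-all.sh; the function name `check_corpus_dir` is hypothetical and only illustrates the rule that each language code needs at least one matching file.

```shell
# Hypothetical helper (not part of the repository): verify that the
# directory holds at least one file ending in each language code.
check_corpus_dir() {
  local dir="$1" src="$2" tgt="$3" lang
  if [ ! -d "$dir" ]; then
    echo "Not a directory: $dir" >&2
    return 1
  fi
  for lang in "$src" "$tgt"; do
    # ls fails when the glob matches nothing, which signals a missing side.
    if ! ls "$dir"/*."$lang" >/dev/null 2>&1; then
      echo "No file ending in .$lang found in $dir" >&2
      return 1
    fi
  done
  return 0
}
```

Calling `check_corpus_dir /home/matiss/data/english-estonian-parallel-data en et` would return a non-zero status if either side of the corpus is missing.
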
1-find-equal-lines.sh
- Removes sentence pairs whose source and target sides are identical.
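
One way to implement this kind of filter is sketched below. It is an illustration only and may differ from what 1-find-equal-lines.sh does. The function name `drop_equal_lines` and the `.filtered` output suffix are hypothetical, and the sketch assumes that neither file contains tab characters.

```shell
# Hypothetical helper: drop pairs whose source and target lines are identical.
# Assumes the two files are line-aligned and contain no tab characters.
drop_equal_lines() {
  local src="$1" tgt="$2" tmp
  tmp="$(mktemp)"
  # Join the two sides with a tab, then keep only pairs that differ.
  # Appending "" forces a string comparison, so "1" and "1.0" count as different.
  paste "$src" "$tgt" | awk -F'\t' '$1 "" != $2 ""' > "$tmp"
  # Split the surviving pairs back into two aligned files.
  cut -f1 "$tmp" > "$src.filtered"
  cut -f2 "$tmp" > "$tgt.filtered"
  rm -f "$tmp"
}
```

Filtering both sides through one pasted stream keeps the files aligned, since a pair is always kept or dropped as a whole.
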
2-unique-parallel.sh
- Removes duplicate parallel sentences.
- Removes repeating source sentences aligned to multiple target sentences and repeating target sentences aligned to multiple source sentences.
- Removes sentences that contain more non-alphabetical symbols than alphabetical ones.
- Removes sentence pairs where one side has significantly more non-alphabetical symbols than the other.
- Removes sentence pairs that have repeating tokens. This filter is mostly useful for back-translated data produced by NMT systems.
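
The first two filters in this list can be sketched as follows. This is a rough illustration under stated assumptions, not the code in 2-unique-parallel.sh: the function name `dedup_pairs` and the `.unique` suffix are hypothetical, the files are assumed to be tab-free, and the sketch assumes that every pair involved in a one-to-many alignment is dropped (the actual script may keep one of them instead).

```shell
# Hypothetical helper: remove duplicate pairs and one-to-many alignments.
# Assumes line-aligned input files without tab characters.
dedup_pairs() {
  local src="$1" tgt="$2" tmp
  tmp="$(mktemp)"
  # Step 1: keep only the first occurrence of each exact source/target pair.
  paste "$src" "$tgt" | awk '!seen[$0]++' > "$tmp"
  # Step 2: read the deduplicated pairs twice. The first pass counts how often
  # each source and each target sentence occurs; the second pass keeps only
  # pairs where both sides occur exactly once.
  awk -F'\t' '
    NR == FNR { src_count[$1]++; tgt_count[$2]++; next }
    src_count[$1] == 1 && tgt_count[$2] == 1
  ' "$tmp" "$tmp" > "$tmp.kept"
  cut -f1 "$tmp.kept" > "$src.unique"
  cut -f2 "$tmp.kept" > "$tgt.unique"
  rm -f "$tmp" "$tmp.kept"
}
```

Because exact duplicates are collapsed first, a pair that merely appears twice survives once, while a source sentence paired with two different targets is removed entirely.
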
3-identify-language.sh
- Removes sentences that are not in the specified source or target language.
4-moses-scripts-subword-nmt.sh
- Runs the regular Moses pipeline (tokenizer -> cleaner -> truecaser), followed by Subword NMT.