Tools for filtering and cleaning parallel corpora in order to train better neural machine translation systems.
- Calls all of the following scripts in order.
- Parameters: the directory of files to clean, and the two-letter language codes of the source and target languages. The directory must contain at least one file ending with each language code (for example, parallel.en).
- Example arguments: /home/matiss/data/english-estonian-parallel-data en et
- Removes sentence pairs whose source and target sides are identical.
- Removes duplicate parallel sentences.
- Removes repeating source sentences aligned to multiple target sentences and repeating target sentences aligned to multiple source sentences.
- Removes sentences that contain more non-alphabetical symbols than alphabetical ones.
- Removes sentence pairs where one side contains significantly more non-alphabetical symbols than the other.
- Removes sentence pairs that contain repeating tokens. This filter is mainly useful for back-translated data produced by NMT systems.
- Removes sentences that are not in the specified source or target language.
- Runs the regular Moses tokenizer -> cleaner -> truecaser pipeline, followed by subword NMT.
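
The filtering steps listed above can be illustrated with a short Python sketch. This is not the code of the scripts themselves, only a minimal approximation under stated assumptions: all function names (`clean_pairs`, `mostly_non_alpha`, `non_alpha_mismatch`, `has_repeating_tokens`) and the thresholds `max_ratio` and `max_run` are hypothetical, sentence pairs aligned one-to-many are assumed to be dropped entirely, and whitespace tokenization stands in for real tokenization. Language identification and the Moses/subword steps are left out because they depend on external tools.

```python
from collections import Counter


def alpha_counts(text):
    """Return (alphabetic, non-alphabetic) character counts, ignoring whitespace."""
    alpha = sum(ch.isalpha() for ch in text)
    other = sum(not ch.isalpha() and not ch.isspace() for ch in text)
    return alpha, other


def mostly_non_alpha(text):
    """True if the sentence has more non-alphabetical than alphabetical symbols."""
    alpha, other = alpha_counts(text)
    return other > alpha


def non_alpha_mismatch(src, tgt, max_ratio=3.0):
    """True if one side has far more non-alphabetical symbols than the other.

    max_ratio is a hypothetical threshold chosen for illustration.
    """
    _, src_other = alpha_counts(src)
    _, tgt_other = alpha_counts(tgt)
    low, high = min(src_other, tgt_other), max(src_other, tgt_other)
    return high > max_ratio * max(low, 1)


def has_repeating_tokens(text, max_run=3):
    """True if any token occurs more than max_run times in a row (assumed rule)."""
    tokens = text.split()
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


def clean_pairs(pairs):
    """Apply the filters in sequence to a list of (source, target) pairs."""
    # Drop pairs with identical sides and exact duplicate pairs, keeping order.
    seen = set()
    unique = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if src == tgt or (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        unique.append((src, tgt))

    # Count how often each side occurs to detect one-to-many alignments.
    src_counts = Counter(src for src, _ in unique)
    tgt_counts = Counter(tgt for _, tgt in unique)

    kept = []
    for src, tgt in unique:
        if src_counts[src] > 1 or tgt_counts[tgt] > 1:
            continue  # assumption: drop every pair involved in a one-to-many alignment
        if mostly_non_alpha(src) or mostly_non_alpha(tgt):
            continue
        if non_alpha_mismatch(src, tgt):
            continue
        if has_repeating_tokens(src) or has_repeating_tokens(tgt):
            continue
        kept.append((src, tgt))
    return kept
```

Calling `clean_pairs` on a list of `(source, target)` string tuples returns the pairs that survive every filter, in their original order. Keeping each check in its own function mirrors the one-filter-per-step layout described in the list, so individual thresholds can be tuned or a step skipped without touching the others.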