Machine Translation (MT) Preparation Scripts
The filtering and subwording scripts depend on a number of Python packages. To install these dependencies with pip, run the following command in Terminal/CMD:
pip3 install --user -r requirements.txt
There is one script for cleaning your Machine Translation dataset. You must have two files, one for the source and one for the target. If you have a TMX file instead, you can first convert it with the TMX2MT converter.
The filter script performs the following steps (a rough code sketch of the same logic follows the list):
- Deleting empty rows;
- Deleting duplicate rows;
- Deleting source-copied rows (i.e. rows where the source and target are identical);
- Deleting segments that are too long (source/target length ratio above 200% and more than 200 words);
- Removing HTML tags;
- Keeping segments in their true case, unless lower is set to True;
- Shuffling rows; and
- Writing the output files.
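For illustration, here is a rough sketch of this kind of filtering with pandas. It is not the filter.py implementation itself; the file names, column names, and exact thresholds are assumptions for the example.

import pandas as pd

# Hypothetical parallel files; filter.py takes the real paths as arguments.
src_lines = open("train.src", encoding="utf-8").read().splitlines()
tgt_lines = open("train.tgt", encoding="utf-8").read().splitlines()
df = pd.DataFrame({"source": src_lines, "target": tgt_lines})

df = df[(df.source.str.strip() != "") & (df.target.str.strip() != "")]  # empty rows
df = df.drop_duplicates()                                               # duplicates
df = df[df.source != df.target]                                         # source-copied rows

src_len = df.source.str.split().str.len()
tgt_len = df.target.str.split().str.len()
keep = (src_len <= 200) & (tgt_len <= 200) \
       & (src_len / tgt_len <= 2) & (tgt_len / src_len <= 2)
df = df[keep]                                                           # too-long rows and skewed length ratios

df = df.replace(r"<[^<>]+>", "", regex=True)                            # remove HTML tags
df = df.sample(frac=1).reset_index(drop=True)                           # shuffle rows

with open("train.src.filtered", "w", encoding="utf-8") as s_out, \
     open("train.tgt.filtered", "w", encoding="utf-8") as t_out:
    s_out.write("\n".join(df.source) + "\n")
    t_out.write("\n".join(df.target) + "\n")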
Run the filtering script in the Terminal/CMD as follows:
python3 filter.py <source_file_path> <target_file_path> <source_lang> <target_lang>
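For example, assuming an English source file and a French target file, and assuming the language arguments are codes such as en and fr:
python3 filter.py train.en train.fr en fr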
It is recommended to run the subwording process, as it helps your Machine Translation engine avoid out-of-vocabulary tokens. The subwording scripts apply SentencePiece to your source and target Machine Translation files. Three scripts are provided: one to train the subwording models (train.py), one to subword your files (subword.py), and one to desubword, i.e. decode, the translated output (desubword.py).
You need to create two subwording models to learn the vocabulary of your source and target.
python3 train.py <train_source_file_tok> <train_target_file_tok>
By default, the subwording model type is unigram. You can change it to BPE by adding --model_type=bpe to these lines in the script as follows:
source_train_value = '--input='+train_source_file_tok+' --model_prefix=source --vocab_size='+str(source_vocab_size)+' --hard_vocab_limit=false --model_type=bpe'
target_train_value = '--input='+train_target_file_tok+' --model_prefix=target --vocab_size='+str(target_vocab_size)+' --hard_vocab_limit=false --model_type=bpe'
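These option strings are presumably passed to SentencePiece's Python trainer; a minimal sketch of that call, assuming train.py uses the sentencepiece package:

import sentencepiece as spm

# source_train_value and target_train_value are the option strings shown above.
spm.SentencePieceTrainer.train(source_train_value)
spm.SentencePieceTrainer.train(target_train_value)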
Optionally, you can add more options, such as --split_digits=true to split all digits (0-9) into separate pieces, or --byte_fallback=true to decompose unknown pieces into UTF-8 byte pieces, which might help avoid out-of-vocabulary tokens.
Notes for big corpora:
- You can use --train_extremely_large_corpus=true for a big corpus to avoid memory issues (all of these options are combined in a sketch after these notes).
- The default SentencePiece value for --input_sentence_size is 0, i.e. the whole corpus. You can change it to a value between 1 and 10 million sentences, which will be enough for creating a good SentencePiece model.
- When the value of --input_sentence_size is less than the size of the corpus, it is recommended to set --shuffle_input_sentence=true to make your sample representative of the distribution of your data.
- The default SentencePiece value for --vocab_size is 8,000. You can go for a higher value between 30,000 and 50,000, and up to 100,000 for a big corpus. Still, note that smaller values will encourage the model to make more splits on words, which might be better in the case of a multilingual model if the languages share the alphabet.
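Putting these options together, here is a minimal sketch of a single training call using the keyword-argument form of the SentencePiece Python API. The file name, vocabulary size, and sample size are illustrative assumptions, not values required by train.py:

import sentencepiece as spm

# Illustrative settings for a large source-side corpus; adjust to your data.
spm.SentencePieceTrainer.train(
    input="train.src.filtered",         # hypothetical training file
    model_prefix="source",
    model_type="bpe",                   # or the default "unigram"
    vocab_size=50000,
    hard_vocab_limit=False,
    split_digits=True,                  # split digits 0-9 into separate pieces
    byte_fallback=True,                 # decompose unknown pieces into UTF-8 bytes
    input_sentence_size=10000000,       # train on a 10M-sentence sample, not the whole corpus
    shuffle_input_sentence=True,        # make the sample representative of the data
    train_extremely_large_corpus=True,  # avoid memory issues on big corpora
)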
In this step, you use the models you created in the previous step to subword your source and target Machine Translation files. You also have to apply the same step to any source file you later translate with the Machine Translation model.
python3 subword.py <sp_source_model_path> <sp_target_model_path> <source_file_path> <target_file_path>
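Under the hood, subwording amounts to encoding each line with the corresponding SentencePiece model. A minimal sketch for the source side with the sentencepiece Python API, using illustrative file names (the exact output format of subword.py may differ):

import sentencepiece as spm

# Load the trained source model and segment a file line by line.
sp = spm.SentencePieceProcessor(model_file="source.model")
with open("train.src.filtered", encoding="utf-8") as f_in, \
     open("train.src.subword", "w", encoding="utf-8") as f_out:
    for line in f_in:
        pieces = sp.encode(line.strip(), out_type=str)  # list of subword pieces
        f_out.write(" ".join(pieces) + "\n")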
Notes for OpenNMT users:
- If you are using OpenNMT, you can add <s> and </s> to the source only. Remove <s> and </s> from the target, as they are already added by default (reference). Alternatively, in OpenNMT-tf, there is an option called source_sequence_controls to add start and/or end tokens to the source.
- After you segment your source and target files with the generated SentencePiece models, you must build the vocab with OpenNMT-py to generate vocab files compatible with it. OpenNMT-tf has an option that allows converting the SentencePiece vocab to a compatible format. Similarly, you can convert the SentencePiece vocab file into a format compatible with OpenNMT-py as follows:
pip3 install --upgrade OpenNMT-py
wget https://raw.githubusercontent.com/OpenNMT/OpenNMT-py/master/tools/spm_to_vocab.py
cat spm.vocab | python3 spm_to_vocab.py > spm.onmt_vocab
- Before you start training with OpenNMT-py, you must configure src_vocab_size and tgt_vocab_size to exactly match the value you used for --vocab_size in SentencePiece. The default is 32768, which is usually good for medium-sized data.
This step is useful after training your Machine Translation model and translating files with it, as you need to decode/desubword the generated target (i.e. translated) files.
python3 desubword.py <target_model_file> <target_pred_file>
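Desubwording is the reverse operation. A minimal sketch with the sentencepiece Python API, using illustrative file names (the exact behavior of desubword.py may differ):

import sentencepiece as spm

# Load the target model and join the subword pieces back into plain text.
sp = spm.SentencePieceProcessor(model_file="target.model")
with open("translation.subword", encoding="utf-8") as f_in, \
     open("translation.desubword", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(sp.decode(line.split()) + "\n")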
In this step, you split the parallel dataset into training and development datasets. The first argument is the number of segments you want in the development dataset; the script randomly selects this number of segments for the dev set and keeps the rest for the train set.
python3 train_dev_split.py <dev_segment_number> <source_file_path> <target_file_path>
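Conceptually, the split samples the requested number of line indices at random for the development set and keeps everything else for training. A rough sketch, not the script itself, with illustrative file names:

import random

dev_size = 2000  # hypothetical number of development segments
src_lines = open("corpus.src", encoding="utf-8").read().splitlines()
tgt_lines = open("corpus.tgt", encoding="utf-8").read().splitlines()

# Randomly choose which line indices go to the dev set.
dev_ids = set(random.sample(range(len(src_lines)), dev_size))
splits = {"train": [], "dev": []}
for i, pair in enumerate(zip(src_lines, tgt_lines)):
    splits["dev" if i in dev_ids else "train"].append(pair)

# Write the train and dev source/target files.
for name, pairs in splits.items():
    with open(f"{name}.src", "w", encoding="utf-8") as s_out, \
         open(f"{name}.tgt", "w", encoding="utf-8") as t_out:
        s_out.write("\n".join(s for s, _ in pairs) + "\n")
        t_out.write("\n".join(t for _, t in pairs) + "\n")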
If you have questions or suggestions, please feel free to contact me. If you use these scripts in your work, please cite the following paper:
@inproceedings{moslem-etal-2022-domain,
title = "Domain-Specific Text Generation for Machine Translation",
author = "Moslem, Yasmin and
Haque, Rejwanul and
Kelleher, John and
Way, Andy",
booktitle = "Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)",
month = sep,
year = "2022",
address = "Orlando, USA",
publisher = "Association for Machine Translation in the Americas",
url = "https://aclanthology.org/2022.amta-research.2",
pages = "14--30",
}