Luke Song (Notre Dame NLP Group)
This code explores a novel approach to pre-processing a corpus before training.
If a corpus is compressed using a Shannon-optimal code, the compressed size (in bits) would be

    ∑_σ c(σ) log (N / c(σ))

where c(σ) is the number of occurrences of wordpiece σ and N = ∑_σ c(σ) is the total number of wordpiece tokens.
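As a quick illustration (not part of the repository), the following Python sketch computes this quantity from a list of wordpiece tokens; the function name compressed_size is made up for this example.

```python
import math
from collections import Counter

def compressed_size(tokens):
    """Bits needed for the corpus under a Shannon-optimal code:
    sum over wordpieces sigma of c(sigma) * log2(N / c(sigma))."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return sum(c * math.log2(n / c) for c in counts.values())

# Tiny example corpus of single-character wordpieces (~22.4 bits):
print(compressed_size(list("abracadabra")))
```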
For this code, if wordpieces σ1 and σ2 are merged, let δ be the count of the new merged wordpiece σ1σ2. The code updates all the variables as follows:

    c(σ1σ2) ← δ
    c(σ1) ← c(σ1) − δ
    c(σ2) ← c(σ2) − δ
    N ← N − δ
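A minimal sketch of this bookkeeping, assuming counts are kept in a plain dict; the function name merge_counts and its signature are illustrative, not the repository's:

```python
def merge_counts(counts, n, s1, s2, delta):
    """Update wordpiece counts after merging s1 and s2 into s1+s2,
    where delta is the number of occurrences that were merged."""
    counts[s1 + s2] = delta   # the merged wordpiece did not exist before
    counts[s1] -= delta
    counts[s2] -= delta
    # Each merge replaces two tokens with one, so N shrinks by delta.
    return n - delta
```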
The code merges the two wordpieces that lead to the greatest decrease in compressed size, that is, the two wordpieces that maximize:

    c(σ1σ2) log ( p(σ1σ2) / (p(σ1) p(σ2)) )

where p(σ) = c(σ)/N.
Standard BPE chooses the two wordpieces that maximize c(σ1σ2). The formula above multiplies this count by a correction factor, the pointwise mutual information (PMI) of σ1 and σ2, which measures "how much σ1 and σ2 have to do with each other." It therefore favors wordpiece pairs with high PMI; a word and an adjacent punctuation mark, for example, would be expected to have low PMI and so are less likely to be merged. The formula also suggests a natural stopping criterion: stop merging once its maximum becomes negative, since any further merge would increase the compressed size.
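A hedged sketch of one selection step under this criterion, assuming the corpus is a flat list of tokens and candidate pairs are adjacent tokens; all names here are illustrative, not the repository's:

```python
import math
from collections import Counter

def best_merge(tokens):
    """Return the adjacent pair maximizing c(s1 s2) * PMI(s1, s2),
    or None once the best score is negative (the stopping criterion)."""
    counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    best, best_score = None, float("-inf")
    for (s1, s2), c12 in pair_counts.items():
        # PMI = log2( p(s1 s2) / (p(s1) p(s2)) ), which with p(s) = c(s)/N
        # simplifies to log2( c(s1 s2) * N / (c(s1) * c(s2)) ).
        pmi = math.log2(c12 * n / (counts[s1] * counts[s2]))
        score = c12 * pmi
        if score > best_score:
            best, best_score = (s1, s2), score
    return best if best_score > 0 else None
```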
Usage:

    bpe_modified.py -s <number of operations> [-orig] < text > codes_file   [1]
    apply_bpe.py -c codes_file < text > out_file   [2]

[1] The -orig flag runs the standard BPE mode from subword_nmt.
[2] apply_bpe.py is adapted from subword_nmt.
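For example, to learn 10,000 merge operations and then apply them (the operation count and file names here are placeholders, not from the repository):

    bpe_modified.py -s 10000 < train.txt > codes_file
    apply_bpe.py -c codes_file < train.txt > train.bpe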