chanhee-luke/BPE_PMI

Byte Pair Encoding with Pointwise Mutual Information (PMI)

Luke Song (Notre Dame NLP Group)

Overview

This code explores a novel approach to pre-processing a corpus before training: a variant of byte pair encoding (BPE) whose merge criterion is weighted by pointwise mutual information (PMI).

Mechanism

If a corpus is compressed using a Shannon-optimal code, the compressed size (in bits) would be

$$B = \sum_{\sigma} c(\sigma) \log_2 \frac{N}{c(\sigma)}$$

where c(σ) is the count of wordpiece σ in the corpus and N = Σ_σ c(σ) is the total number of wordpiece tokens.

For this code, if wordpieces σ1 and σ2 are merged, then let δ be the count of the merged wordpiece σ1σ2. The merge updates all the variables as follows:

$$c(\sigma_1) \leftarrow c(\sigma_1) - \delta, \qquad c(\sigma_2) \leftarrow c(\sigma_2) - \delta, \qquad c(\sigma_1\sigma_2) \leftarrow \delta, \qquad N \leftarrow N - \delta$$

The code merges the two wordpieces that lead to the greatest decrease in compressed size, that is, the two wordpieces that maximize

$$c(\sigma_1\sigma_2)\,\log_2 \frac{N\,c(\sigma_1\sigma_2)}{c(\sigma_1)\,c(\sigma_2)}$$

Standard BPE chooses the two wordpieces that maximize c(σ1σ2). The formula above multiplies this count by a correction factor, the pointwise mutual information of σ1 and σ2,

$$\mathrm{PMI}(\sigma_1, \sigma_2) = \log_2 \frac{p(\sigma_1\sigma_2)}{p(\sigma_1)\,p(\sigma_2)} = \log_2 \frac{N\,c(\sigma_1\sigma_2)}{c(\sigma_1)\,c(\sigma_2)},$$

which measures how much σ1 and σ2 have to do with each other. The criterion therefore favors wordpiece pairs with high PMI; a word and an adjacent punctuation mark, for example, would be expected to have low PMI. The formula also suggests a natural stopping point: stop merging once the maximum of the score becomes negative, i.e. once even the best pair has negative PMI.
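To make the selection rule concrete, here is a minimal sketch of the scoring step in plain Python. It assumes the vocabulary is a dict mapping each word (as a tuple of wordpieces) to its corpus frequency; all names are illustrative and do not correspond to the actual functions in bpe_modified.py.

```python
import math
from collections import Counter

def pmi_merge_score(pair_count, c1, c2, total):
    """Score of a candidate merge: its count times its pointwise
    mutual information. Standard BPE would rank pairs by pair_count
    alone; the PMI factor discounts pairs (e.g. word + punctuation)
    that co-occur no more often than chance."""
    return pair_count * math.log2(pair_count * total / (c1 * c2))

def best_merge(vocab):
    """Pick the adjacent wordpiece pair with the highest score.

    vocab maps each word, given as a tuple of wordpieces, to its
    corpus frequency. Returns (pair, score), or None once every
    score is negative -- the stopping criterion described above.
    """
    unigrams, bigrams = Counter(), Counter()
    for pieces, freq in vocab.items():
        for piece in pieces:
            unigrams[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            bigrams[a, b] += freq
    total = sum(unigrams.values())  # N, the total number of wordpiece tokens
    pair, score = max(
        ((p, pmi_merge_score(n, unigrams[p[0]], unigrams[p[1]], total))
         for p, n in bigrams.items()),
        key=lambda item: item[1],
    )
    return (pair, score) if score > 0 else None
```

The sketch recomputes all counts from scratch for clarity; the δ update rule above describes how the counts can instead be adjusted incrementally after each merge, as the actual code does.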

Usage

bpe_modified.py -s <number of operations> [-orig] < text > codes_file   [1]

apply_bpe.py -c codes_file < text > out_file   [2]

[1] The -orig flag enables the standard BPE mode from subword_nmt.

[2] apply_bpe.py is adapted from subword_nmt.
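For illustration, a hypothetical end-to-end run might look like the following; the corpus file names and the operation count are placeholders, not files shipped with the repository:

```
python bpe_modified.py -s 10000 < train.txt > codes_file
python apply_bpe.py -c codes_file < train.txt > train.bpe
```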
