Luke Song (Notre Dame NLP Group)
This code explores a novel approach to pre-processing a corpus before training.
If a corpus is compressed using a Shannon-optimal code, the compressed size (in bits) would be

    ∑_σ c(σ) log (N / c(σ))

where c(σ) is the number of occurrences of wordpiece σ and N = ∑_σ c(σ) is the total number of wordpiece tokens.
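As a quick illustration (not part of the repository), the following Python sketch computes this quantity from a list of wordpiece tokens; the function name compressed_size is made up for this example.

```python
import math
from collections import Counter

def compressed_size(tokens):
    """Bits needed for the corpus under a Shannon-optimal code:
    sum over wordpieces sigma of c(sigma) * log2(N / c(sigma))."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return sum(c * math.log2(n / c) for c in counts.values())

# Tiny example corpus of single-character wordpieces (~22.4 bits):
print(compressed_size(list("abracadabra")))
```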
For this code, if wordpieces σ1 and σ2 are merged, let δ be the count of the new merged wordpiece σ1σ2. The code updates all the variables as follows:

    c(σ1σ2) ← δ
    c(σ1) ← c(σ1) − δ
    c(σ2) ← c(σ2) − δ
    N ← N − δ
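A minimal sketch of this bookkeeping, assuming counts are kept in a plain dict; the function name merge_counts and its signature are illustrative, not the repository's:

```python
def merge_counts(counts, n, s1, s2, delta):
    """Update wordpiece counts after merging s1 and s2 into s1+s2,
    where delta is the number of occurrences that were merged."""
    counts[s1 + s2] = delta   # the merged wordpiece did not exist before
    counts[s1] -= delta
    counts[s2] -= delta
    # Each merge replaces two tokens with one, so N shrinks by delta.
    return n - delta
```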
The code merges the two wordpieces that lead to the greatest decrease in compressed size, that is, the two wordpieces that maximize:

    c(σ1σ2) log ( p(σ1σ2) / (p(σ1) p(σ2)) )

where p(σ) = c(σ)/N.
Standard BPE chooses the two wordpieces that maximize c(σ1σ2). The formula above multiplies this count by a correction factor, the pointwise mutual information (PMI) of σ1 and σ2, which measures "how much σ1 and σ2 have to do with each other." It therefore favors wordpiece pairs with high PMI; a word and an adjacent punctuation mark, for example, would be expected to have low PMI and so are less likely to be merged. The formula also suggests a natural stopping criterion: stop merging once its maximum becomes negative, since any further merge would increase the compressed size.
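A hedged sketch of one selection step under this criterion, assuming the corpus is a flat list of tokens and candidate pairs are adjacent tokens; all names here are illustrative, not the repository's:

```python
import math
from collections import Counter

def best_merge(tokens):
    """Return the adjacent pair maximizing c(s1 s2) * PMI(s1, s2),
    or None once the best score is negative (the stopping criterion)."""
    counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    best, best_score = None, float("-inf")
    for (s1, s2), c12 in pair_counts.items():
        # PMI = log2( p(s1 s2) / (p(s1) p(s2)) ), which with p(s) = c(s)/N
        # simplifies to log2( c(s1 s2) * N / (c(s1) * c(s2)) ).
        pmi = math.log2(c12 * n / (counts[s1] * counts[s2]))
        score = c12 * pmi
        if score > best_score:
            best, best_score = (s1, s2), score
    return best if best_score > 0 else None
```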
Usage:

    bpe_modified.py -s <number of operations> [-orig] < text > codes_file   [1]
    apply_bpe.py -c codes_file < text > out_file   [2]

[1] The -orig flag runs the standard BPE mode from subword_nmt.
[2] apply_bpe.py is adapted from subword_nmt.
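For example, to learn 10,000 merge operations and then apply them (the operation count and file names here are placeholders, not from the repository):

    bpe_modified.py -s 10000 < train.txt > codes_file
    apply_bpe.py -c codes_file < train.txt > train.bpe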