use cmudict for more accurate syllable counting in en_US #201

Sean-Hastings · 2024-12-19T21:11:57Z

Change Summary

Pyphen is currently used for syllable counting in every language. It is a hyphenation library, so its counts are not very accurate, and that inaccuracy carries over into the many textstat metrics that depend on syllable counts.

In this PR I propose using cmudict for en_US because it is far more accurate than Pyphen. Because cmudict is a dictionary, it has "holes" (words it does not list), but falling back to Pyphen whenever a word is missing gives us the best of both worlds.
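The dictionary-first, fallback-second idea can be sketched roughly as below. This is a minimal illustration rather than the code in this PR: `count_syllables`, `pron_dict`, and `fallback` are hypothetical names, and the toy dictionary is hand-written in CMUdict's format. In practice `pron_dict` would be something like `cmudict.dict()` and `fallback` a Pyphen-based counter.

```python
from typing import Callable, Mapping, Sequence


def count_vowel_phones(phones: Sequence[str]) -> int:
    """Count phones ending in a stress digit (0, 1 or 2), i.e. the vowels."""
    return sum(1 for p in phones if p[-1].isdigit())


def count_syllables(
    word: str,
    pron_dict: Mapping[str, Sequence[Sequence[str]]],  # hypothetical parameter
    fallback: Callable[[str], int],                     # hypothetical parameter
) -> int:
    """Use the dictionary when the word is listed, otherwise the fallback."""
    pronunciations = pron_dict.get(word.lower())
    if pronunciations:
        # Assumption: take the first listed pronunciation.
        return count_vowel_phones(pronunciations[0])
    return fallback(word)


# Toy data for illustration only; not loaded from cmudict.
toy_dict = {"hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]]}


def toy_fallback(word: str) -> int:
    """Stand-in for a Pyphen-based counter; returns a sentinel value."""
    return -1


in_dict = count_syllables("Hello", toy_dict, toy_fallback)
missing = count_syllables("qwzx", toy_dict, toy_fallback)
```

Passing the dictionary and the fallback in as arguments keeps the sketch free of third-party imports and makes the fallback path easy to exercise in isolation.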

I also expanded the syllable-counting tests, designing them to track the "true" syllable count and the count we currently expect separately.
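One way to keep "true" and currently expected counts apart is sketched below. It is not the test code from this PR: `SyllableCase`, `check_cases`, and the example rows are hypothetical and only show the shape of the idea, where the assertion runs against the expected count while disagreements with the true count are collected as known misses.

```python
from typing import Callable, List, NamedTuple


class SyllableCase(NamedTuple):
    word: str
    true_count: int      # what a human reader would say
    expected_count: int  # what the counter is currently expected to return


# Hypothetical rows for illustration; the second is an invented known miss.
CASES = [
    SyllableCase("hello", 2, 2),
    SyllableCase("faerie", 2, 3),
]


def check_cases(counter: Callable[[str], int], cases: List[SyllableCase]) -> List[str]:
    """Assert the counter matches expected counts; return the known misses."""
    known_misses = []
    for case in cases:
        got = counter(case.word)
        assert got == case.expected_count, (
            f"{case.word}: got {got}, expected {case.expected_count}"
        )
        if case.expected_count != case.true_count:
            known_misses.append(case.word)
    return known_misses
```

With this layout, a counter that improves on a known miss fails the assertion, which prompts an update to the expected value instead of letting the change pass silently.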

Related conversations from the repo:

#195
#167 (comment)

Justification/Motivation

To decide whether this was worth doing, I used this notebook/script to check a few words from #195 and the existing test texts, which I labeled myself according to my own pronunciation. Pyphen was off by an average of 0.183 syllables per word, while the cmudict approach (using the first listed pronunciation) was off by an average of 0.017 syllables per word, roughly a tenth of the error. This is not scientific, since the labels are my own pronunciations and the sample is so small, but I think it motivates the PR well regardless.

For the sake of completeness, the results (mean absolute error, in syllables per word) were as follows:
pyphen: 0.183
syllables: 0.117
cmudict mean: 0.057
cmudict first: 0.017

syllable_words = {
    "couple": 2,
    "enriched": 2,
    "us": 1,
    "too": 1,
    "monopoly": 4,
    "him": 1,
    "he": 1,
    "without": 2,
    "creative": 3,
    "every": 3,
    "stimulating": 4,
    "our": 1.5,
    "life": 1,
    "cupboards": 2,
    "day's": 1,
    "forgotten": 3,
    "through": 1,
    "marriage": 2,
    "hello": 2,
    "faerie": 2,
    "relive": 2,
    "cool": 1,
    "dogs": 1,
    "wear": 1,
    "da": 1,
    "the": 1,
    "sunglasses": 3,
    "sentences": 3,
    "songwriter": 3,
    "removing": 3
}

from typing import Callable

import numpy as np


def test_func(func: Callable[[str], float]) -> float:
    """Print per-word errors for func and return its mean absolute error."""
    preds = [func(word.lower()) for word in syllable_words]
    # A prediction of 0 means the counter could not score the word; skip it.
    _diffs = [(word, pred - score) for pred, (word, score) in zip(preds, syllable_words.items()) if pred]
    if not _diffs:
        # Avoid unpacking an empty zip when nothing could be scored.
        print("Unable to score any words")
        return float("nan")
    words, diffs = zip(*_diffs)
    mean_ = np.mean(np.abs(diffs))
    print(f"Overall mean: {mean_}")
    print(f"Able to score {len(diffs)} of {len(syllable_words)} words")
    print(f"Failed to score: {[w for w in syllable_words if w not in words]}")
    print()
    for word, diff in _diffs:
        print(f"{word}: {diff}")
    return mean_


# Pyphen: count hyphenation points and add one.
from pyphen import Pyphen

phen = Pyphen(lang="en_US")


def pyphen_get(word: str) -> int:
    pyphen_locs = len(phen.positions(word))
    return pyphen_locs + 1


test_func(pyphen_get)

# syllables: heuristic estimate.
import syllables


def syll_get(word: str) -> int:
    syll_count = syllables.estimate(word)
    return syll_count


test_func(syll_get)

# cmudict: vowel phones end in a stress digit, so counting them gives the syllable count.
import cmudict

cmu_dict = cmudict.dict()


def cmu_get_mean(word: str) -> float:
    """Mean syllable count over all listed pronunciations; 0 if the word is missing."""
    cmu_phones = cmu_dict.get(word)
    if cmu_phones:
        return np.mean([len([p for p in phone if p[-1].isdigit()]) for phone in cmu_phones])
    return 0


test_func(cmu_get_mean)


def cmu_get_first(word: str) -> int:
    """Syllable count of the first listed pronunciation; 0 if the word is missing."""
    cmu_phones = cmu_dict.get(word)
    if cmu_phones:
        return len([p for p in cmu_phones[0] if p[-1].isdigit()])
    return 0


test_func(cmu_get_first)
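
The gap between the "mean" and "first" strategies comes from words with more than one listed pronunciation. The sketch below shows the effect on a single hand-written entry in CMUdict's format. The phone lists are illustrative assumptions and are not loaded from cmudict, so the real entries may differ.

```python
from statistics import mean
from typing import List, Sequence

# Hand-written toy entry: a three-syllable and a two-syllable variant.
toy_entries = {
    "every": [["EH1", "V", "ER0", "IY0"], ["EH1", "V", "R", "IY0"]],
}


def syllables_per_pronunciation(prons: Sequence[Sequence[str]]) -> List[int]:
    """Count stress-marked (vowel) phones in each pronunciation."""
    return [sum(1 for p in phones if p[-1].isdigit()) for phones in prons]


counts = syllables_per_pronunciation(toy_entries["every"])
first_count = counts[0]    # strategy used by cmu_get_first
mean_count = mean(counts)  # strategy used by cmu_get_mean

label = 3  # the label given to "every" in syllable_words above
first_error = abs(first_count - label)
mean_error = abs(mean_count - label)
```

When the first pronunciation matches the labeler's own, averaging in a shorter variant adds fractional error. That is one plausible reason the mean strategy scored worse on this small sample.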

Sean-Hastings added 3 commits December 19, 2024 15:41

use cmudict for more accurate syllable counting in en_US

e8b3ccb

flake line length fix

28e0e68

cmudict requirement support

43d13a1
