
use cmudict for more accurate syllable counting in en_US #201

Open · wants to merge 3 commits into main
Conversation

Sean-Hastings
Change Summary

Currently, Pyphen is used for syllable counting in all languages. It is not very accurate at this task, which affects many of the metrics textstat supports.

In this PR I propose using cmudict for en_US because it is far more accurate than Pyphen. Being a dictionary, it does have "holes", but falling back to Pyphen when a word is missing gives us the best of both worlds.
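As a rough illustration of the dictionary-with-fallback idea (this is a hypothetical sketch, not code from the PR: `MINI_DICT` stands in for `cmudict.dict()` and `pyphen_fallback` stands in for Pyphen, so the example stays self-contained), counting syllables from a CMU-style ARPAbet pronunciation amounts to counting the phones that carry a stress digit:

```python
# Hypothetical sketch: MINI_DICT stands in for cmudict.dict(),
# pyphen_fallback for Pyphen. Not the actual PR implementation.

# ARPAbet pronunciations: vowel phones carry a stress digit (0, 1, or 2).
MINI_DICT = {
    "monopoly": [["M", "AH0", "N", "AA1", "P", "AH0", "L", "IY0"]],
    "creative": [["K", "R", "IY0", "EY1", "T", "IH0", "V"]],
}


def pyphen_fallback(word: str) -> int:
    # Stand-in for Pyphen (the real PR uses len(pyphen.positions(word)) + 1).
    # Here: a crude vowel-group count so the sketch needs no dependencies.
    vowels = "aeiouy"
    groups = 0
    prev_vowel = False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)


def count_syllables(word: str) -> int:
    pronunciations = MINI_DICT.get(word.lower())
    if pronunciations:
        # Use the first pronunciation; syllables == phones ending in a digit.
        return sum(1 for phone in pronunciations[0] if phone[-1].isdigit())
    # Dictionary "hole": fall back to the heuristic counter.
    return pyphen_fallback(word)


print(count_syllables("monopoly"))   # dictionary hit
print(count_syllables("cupboards"))  # dictionary miss -> fallback
```

The same structure applies with the real libraries: try `cmudict.dict().get(word)` first, and only call Pyphen when the lookup returns nothing.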

I also expanded the testing for syllable counting, specifically designing the test change to track the "true" syllable counts and the currently expected syllable counts separately.

Related conversations from the repo:

#195
#167 (comment)

Justification/Motivation

To decide whether this was worth doing, I used this notebook/script to check a handful of words from #195 and the existing test texts (which I labeled myself based on my own pronunciation). Pyphen was off by an average of 0.183 syllables per word, while the cmudict approach was off by an average of 0.017 syllables per word, roughly 10x less. This is of course not scientific, being based on my own pronunciations and so few words, but I think it motivates the PR well regardless.

For the sake of completeness, results (error rate) were as follows:
pyphen: 0.183
syllables: 0.117
cmudict mean: 0.057
cmudict first: 0.017

syllable_words = {
    "couple": 2,
    "enriched": 2,
    "us": 1,
    "too": 1,
    "monopoly": 4,
    "him": 1,
    "he": 1,
    "without": 2,
    "creative": 3,
    "every": 3,
    "stimulating": 4,
    "our": 1.5,
    "life": 1,
    "cupboards": 2,
    "day's": 1,
    "forgotten": 3,
    "through": 1,
    "marriage": 2,
    "hello": 2,
    "faerie": 2,
    "relive": 2,
    "cool": 1,
    "dogs": 1,
    "wear": 1,
    "da": 1,
    "the": 1,
    "sunglasses": 3,
    "sentences": 3,
    "songwriter": 3,
    "removing": 3
}

from typing import Callable

import numpy as np


def test_func(func: Callable[[str], int]) -> float:
    # Score every word; a prediction of 0 means the backend could not score it.
    preds = [func(word.lower()) for word in syllable_words]
    _diffs = [(word, pred - score) for pred, (word, score) in zip(preds, syllable_words.items()) if pred]
    words, diffs = zip(*_diffs)
    mean_ = np.mean(np.abs(diffs))
    print(f"Overall mean: {mean_}")
    print(f"Able to score {len(diffs)} of {len(syllable_words)} words")
    print(f"Failed to score: {[w for w in syllable_words if w not in words]}")
    print()
    for word, diff in _diffs:
        print(f"{word}: {diff}")
    return mean_

from pyphen import Pyphen

phen = Pyphen(lang="en_US")


def pyphen_get(word: str) -> int:
    # Syllable count ~= number of hyphenation points + 1.
    pyphen_locs = len(phen.positions(word))
    return pyphen_locs + 1


test_func(pyphen_get)

import syllables


def syll_get(word: str) -> int:
    # The `syllables` package estimates counts heuristically.
    return syllables.estimate(word)


test_func(syll_get)

import cmudict

cmu_dict = cmudict.dict()


def cmu_get_mean(word: str) -> float:
    # Average syllable count across all listed pronunciations. ARPAbet vowel
    # phones end in a stress digit, so counting digit-suffixed phones counts
    # syllables. Returns 0 for words not in the dictionary.
    cmu_phones = cmu_dict.get(word)
    if cmu_phones:
        return np.mean([len([p for p in phone if p[-1].isdigit()]) for phone in cmu_phones])
    return 0


test_func(cmu_get_mean)


def cmu_get_first(word: str) -> int:
    # Syllable count of the first listed pronunciation only.
    cmu_phones = cmu_dict.get(word)
    if cmu_phones:
        return len([p for p in cmu_phones[0] if p[-1].isdigit()])
    return 0


test_func(cmu_get_first)
