use cmudict for more accurate syllable counting in en_US #201
Change Summary
Currently, Pyphen is used for syllable counting in all languages. It is not very accurate for this task, which affects many of the metrics textstat supports.
In this PR I propose using cmudict for en_US because it is far more accurate than Pyphen. Because it is a dictionary it has "holes" (words it does not contain), but falling back to Pyphen for those words gives us the best of both worlds.
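The lookup-with-fallback idea can be sketched as follows. This is a minimal illustration, not the code in this PR: `SAMPLE_PRONUNCIATIONS` is a hand-written stand-in in the CMUdict format (the real data would come from the cmudict package), `naive_fallback` is a hypothetical placeholder for the Pyphen-based count, and `count_syllables` is a name chosen only for this example.

```python
import re
from typing import Callable, Dict, List

# Hand-written sample in the CMUdict format (ARPAbet phonemes).
# Vowel phonemes end in a stress digit (0, 1 or 2); consonants do not.
# Assumption: the real dictionary would be loaded from the cmudict package.
SAMPLE_PRONUNCIATIONS: Dict[str, List[List[str]]] = {
    "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
    "fire": [["F", "AY1", "ER0"], ["F", "AY1", "R"]],
}


def syllables_in(pronunciation: List[str]) -> int:
    """Count vowel phonemes, i.e. phonemes that carry a stress digit."""
    return sum(1 for phoneme in pronunciation if phoneme[-1].isdigit())


def naive_fallback(word: str) -> int:
    """Hypothetical stand-in for the Pyphen-based count: count vowel groups."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def count_syllables(
    word: str,
    pronunciations: Dict[str, List[List[str]]],
    fallback: Callable[[str], int],
) -> int:
    """Use the dictionary when the word is present, otherwise fall back."""
    entries = pronunciations.get(word.lower())
    if entries:
        # Use the first listed pronunciation.
        return syllables_in(entries[0])
    # The word is a "hole" in the dictionary.
    return fallback(word)


known = count_syllables("hello", SAMPLE_PRONUNCIATIONS, naive_fallback)
unknown = count_syllables("blorptastic", SAMPLE_PRONUNCIATIONS, naive_fallback)
```

Passing the dictionary and the fallback in as arguments keeps the sketch independent of any particular library version and makes the fallback path easy to test.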
I also expanded the tests for syllable counting. The new tests track the "true" syllable counts and the currently expected syllable counts separately.
Related conversations from the repo:
#195
#167 (comment)
Justification/Motivation
To decide whether this was worth doing, I used this notebook/script to check a few different words from #195 and the existing test texts, which I labeled myself based on my own pronunciation. I found that pyphen was off by an average of 0.183 syllables per word, while the cmudict approach was off by an average of 0.017 syllables per word, roughly 10x less. This is of course not scientific at all, since it relies on my own pronunciations and so few words, but I think it motivates the PR well regardless.
For the sake of completeness, the results (average error in syllables per word) were as follows:
pyphen: 0.183
syllables: 0.117
cmudict mean: 0.057
cmudict first: 0.017
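For readers who want to reproduce this kind of comparison, here is a rough sketch of how an average per-word error could be computed. It is not the notebook used above: the function names are hypothetical, the example counts are made up purely for illustration, and reading "cmudict first" as "use the first listed pronunciation" and "cmudict mean" as "average over all listed pronunciations" is my assumption.

```python
from typing import Dict, List


def mean_abs_error(predicted: Dict[str, float], truth: Dict[str, int]) -> float:
    """Average absolute difference in syllables per word over the labeled words."""
    return sum(abs(predicted[w] - truth[w]) for w in truth) / len(truth)


def first_count(counts: List[int]) -> float:
    """Assumed 'first' strategy: take the count of the first pronunciation."""
    return float(counts[0])


def mean_count(counts: List[int]) -> float:
    """Assumed 'mean' strategy: average the counts of all pronunciations."""
    return sum(counts) / len(counts)


# Made-up illustration data: syllable counts per listed pronunciation,
# and hand-assigned "true" labels. These are not results from the PR.
counts_per_word = {"fire": [2, 1], "hello": [2, 2]}
labels = {"fire": 2, "hello": 2}

error_first = mean_abs_error(
    {w: first_count(c) for w, c in counts_per_word.items()}, labels
)
error_mean = mean_abs_error(
    {w: mean_count(c) for w, c in counts_per_word.items()}, labels
)
```

With the made-up data above, the "first" strategy matches the labels exactly, while the "mean" strategy is pulled off by the alternative pronunciation of "fire".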