NPMI scorer does not take into account min_count
#2086
Comments
The default scorer also does not take min_count into account properly. Your function sadly does not correct the problem. The behavior is very strange; I think min_count should be applied in analyze_sentence instead of appearing in the score.
In my tests, the default scorer does take the min_count parameter into account properly. I verified this by changing the parameter many times and checking the length of the Phraser.phrasegrams dictionary. However, with the 'npmi' scorer, the length never changes when min_count is modified; it changes only with the threshold parameter.
Edit: after going through the phrases.py code, I see why min_count is taken into account. The default scorer's formula has (bigram_count - min_count) in the numerator, so bigrams with a count less than min_count always get a negative score. As long as the supplied threshold is positive, these are filtered out. This leads me to think that @kikoaumond's solution should work, so long as the supplied threshold is greater than or equal to -1.
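To illustrate the point about the numerator, here is a minimal sketch of a default-style score, assuming the usual form (bigram_count - min_count) / (worda_count * wordb_count) * len_vocab. The function name `default_style_score` and the example counts are hypothetical and are not taken from phrases.py.

```python
def default_style_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Sketch of a default-style bigram score (hypothetical helper).

    Because min_count is subtracted in the numerator, any bigram seen
    fewer than min_count times gets a negative score.
    """
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


# Bigram seen 3 times with min_count=5: the score is negative,
# so any positive threshold filters it out.
rare = default_style_score(100, 50, bigram_count=3, len_vocab=1000, min_count=5)

# Bigram seen 10 times with min_count=5: the score is positive.
frequent = default_style_score(100, 50, bigram_count=10, len_vocab=1000, min_count=5)

print(rare < 0, frequent > 0)
```

With a positive threshold, the rare bigram can never pass, which matches the observation that min_count is effectively enforced by the default scorer.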
A while ago, I also noticed that problem with NPMI and submitted a pull request on that
min_count
Fixed in #2072
In phrases.py, npmi_scorer does not take min_count into account when scoring bigrams.
I suggest the following alternative for taking min_count into account when using NPMI scoring. This function has yielded good results for me; however, I have not explored other ways to use min_count:
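For readers following along, here is a minimal sketch of one way an NPMI score could respect min_count, assuming the simplest approach: return negative infinity for any bigram seen fewer than min_count times, so it is rejected at every threshold. This is an illustration only and not necessarily the function proposed above; the name `npmi_with_min_count` and its parameters are hypothetical.

```python
from math import log


def npmi_with_min_count(worda_count, wordb_count, bigram_count, min_count, corpus_word_count):
    """Sketch: NPMI that scores bigrams below min_count as -inf (hypothetical helper)."""
    if bigram_count < min_count:
        # Below the cutoff: lower than any threshold in [-1, 1].
        return float("-inf")

    total = float(corpus_word_count)
    pa = worda_count / total
    pb = wordb_count / total
    pab = bigram_count / total
    try:
        # NPMI = ln(p(a,b) / (p(a) * p(b))) / -ln(p(a,b)), in the range [-1, 1].
        return log(pab / (pa * pb)) / -log(pab)
    except (ValueError, ZeroDivisionError):
        # log(0) when a count is zero, or division by zero when p(a,b) == 1.
        return float("-inf")


# Rare bigram: rejected regardless of threshold.
print(npmi_with_min_count(100, 100, bigram_count=3, min_count=5, corpus_word_count=10000))

# Words that only ever occur together: NPMI is at its maximum.
print(npmi_with_min_count(100, 100, bigram_count=100, min_count=5, corpus_word_count=10000))
```

Returning negative infinity rather than a finite value keeps the filter independent of the threshold, since plain NPMI never falls below -1.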