NPMI scorer does not take into account min_count #2086

Closed
kikoaumond opened this issue Jun 7, 2018 · 4 comments
Labels
bug (Issue described a bug), difficulty easy (Easy issue: required small fix)

Comments

@kikoaumond

In phrases.py, npmi_scorer does not take min_count into account when scoring bigrams.

I suggest the following alternative for taking min_count into account when using NPMI scoring. This function has yielded good results for me; however, I have not explored other ways to use min_count:

# Custom scoring function that takes min_count into account when using NPMI
from math import log

def npmi_scorer_with_min_count(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
    # Bigrams seen fewer than min_count times get the lowest possible NPMI score
    if bigram_count < min_count:
        return -1

    # Probability estimates (assumes true division, i.e. Python 3)
    pa = worda_count / corpus_word_count
    pb = wordb_count / corpus_word_count
    pab = bigram_count / corpus_word_count
    return log(pab / (pa * pb)) / -log(pab)
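
As a quick sanity check of the function above, here is a minimal, self-contained sketch. The counts are made up for illustration and do not come from a real corpus. The function is repeated so the snippet runs on its own, and it assumes Python 3 division.

```python
from math import log, isclose


def npmi_scorer_with_min_count(worda_count, wordb_count, bigram_count,
                               len_vocab, min_count, corpus_word_count):
    # Bigrams below min_count get the lowest possible NPMI score.
    if bigram_count < min_count:
        return -1
    pa = worda_count / corpus_word_count
    pb = wordb_count / corpus_word_count
    pab = bigram_count / corpus_word_count
    return log(pab / (pa * pb)) / -log(pab)


# Hypothetical counts: a bigram seen only twice, with min_count=5.
rare = npmi_scorer_with_min_count(10, 10, 2, 1000, 5, 10000)

# Hypothetical counts: two words that always occur together (10 times each).
frequent = npmi_scorer_with_min_count(10, 10, 10, 1000, 5, 10000)

print(rare, frequent)
```

The rare bigram is forced to -1 regardless of its NPMI, while the pair that always co-occurs scores approximately 1, the top of the NPMI range.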

@EstevaoUyra

The default scorer also does not take min_count into account properly. Sadly, your function does not correct the problem. The behavior is very strange; I think min_count should be implemented in analyze_sentence instead of appearing in the score.

@rafabr4

rafabr4 commented Jul 11, 2018

In my tests, the default scorer does take the min_count parameter into account properly. I verified that by changing the parameter many times and looking at the length of the Phraser.phrasegrams dictionary. However, when using the 'npmi' scorer, the length never changes when I modify min_count, only when I modify the threshold parameter.

Edit: after going through the phrases.py code, I see why min_count is taken into account by the default scorer. Its formula has (bigram_count - min_count) in the numerator, so bigrams with a count below min_count always get a negative score. As long as the threshold supplied is positive, these are filtered out. This leads me to think that @kikoaumond's solution should work, as long as the threshold supplied is greater than or equal to -1.
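
To illustrate the reasoning above, here is a rough sketch. It is not the code from phrases.py. The names `default_scorer_sketch` and `passes_threshold` are hypothetical, and the exact form of the formula (dividing by both word counts and scaling by vocabulary size) and the strict `score > threshold` comparison are assumptions. Only the (bigram_count - min_count) numerator is taken from the comment above.

```python
def default_scorer_sketch(worda_count, wordb_count, bigram_count,
                          len_vocab, min_count):
    # Assumed form of the default scorer: the numerator goes negative
    # whenever bigram_count < min_count.
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


def passes_threshold(score, threshold):
    # Assumption: a bigram is kept only if its score strictly exceeds
    # the threshold.
    return score > threshold


# Hypothetical counts, with min_count=5.
below = default_scorer_sketch(100, 80, 3, 5000, 5)   # bigram seen 3 times
above = default_scorer_sketch(100, 80, 20, 5000, 5)  # bigram seen 20 times

# With any positive threshold, the below-min_count bigram is dropped.
print(below, passes_threshold(below, 1.0))
print(above, passes_threshold(above, 1.0))

# An NPMI score forced to -1 is dropped when threshold >= -1.
print(passes_threshold(-1, -1.0))
```

Under these assumptions, the sign of the default score alone is enough to filter low-count bigrams, which is why returning -1 from an NPMI scorer has the same effect for thresholds of -1 or higher.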

@lopusz
Contributor

lopusz commented Jul 13, 2018

A while ago, I also noticed this problem with NPMI and submitted a pull request for it:
#2072

@menshikh-iv added the "bug" and "difficulty easy" labels Jul 30, 2018
@menshikh-iv changed the title from "in Phrases, NPMI scorer does not take into account min_count" to "NPMI scorer does not take into account min_count" Jul 31, 2018
@menshikh-iv
Contributor

Fixed in #2072
