Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in Phrases.export_phrases() #794

Closed
jeradf opened this issue Jul 16, 2016 · 4 comments
Closed

Bug in Phrases.export_phrases() #794

jeradf opened this issue Jul 16, 2016 · 4 comments
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix

Comments

@jeradf
Copy link

jeradf commented Jul 16, 2016

A small bug in bigram.export_phrases(sentences) causes it to return a maximum of one bigram per sentence.
For example:

from gensim.models import Phrases

sentences = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey','human','interface']]

bigram = Phrases(sentences, min_count=1, threshold=2)
for phrase, score in bigram.export_phrases(sentences):
    print(u'{0}   {1}'.format(phrase, score))

Returns:

>>> human interface   3.44444444444
response time   7.75
response time   7.75
graph minors   5.16666666667
graph minors   5.16666666667

The last sentence has two bigrams, but only the first is returned. This happens because this line
https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/phrases.py#L216
prevents further iteration within a sentence.
If it's commented out, it correctly returns:

>>> human interface   3.44444444444
response time   7.75
response time   7.75
graph minors   5.16666666667
graph minors   5.16666666667
human interface   3.44444444444
@gojomo
Copy link
Collaborator

gojomo commented Jul 16, 2016

Good find!

I believe the intent of last_bigram is to prevent a word from being in two overlapping bigrams; the bug may be that last_bigram is not toggled back to False at the end of each non-bigram iteration.

So perhaps best fix is to add continue after the toggle-on, and at bottom of (non-continued/non-bigrammed) loop block, add a last_bigram = False line.

@tmylk tmylk added bug Issue described a bug difficulty easy Easy issue: required small fix labels Sep 18, 2016
@AadityaJ
Copy link
Contributor

Hi everyone!
Today I worked on this. I am sending a pull request.

@gojomo
Copy link
Collaborator

gojomo commented May 14, 2017

As reported on discussion list (https://groups.google.com/forum/#!topic/gensim/N0nMD95N6Iw), the fix as applied wasn't really effective. I think my suggestion is still valid: the line just needs to be at the end of the loop, outside any conditionals. (Specifically, de-indented two levels.) The test-for-two-returned-phrases that was added didn't really probe the behavior of export_phrases(), but __getitem__() (which AFAICT didn't have the problem, so the added test wouldn't have failed even before the applied-change to export_phrases().)

Reopening. Best approach will be to make a valid failing test first, then try the de-indentation as a fix.

@gojomo gojomo reopened this May 14, 2017
tmylk pushed a commit that referenced this issue May 23, 2017
* fixing a single phrase not returning back multiple bigrams, #794

* addressing feedback: removing print statement.
@tmylk
Copy link
Contributor

tmylk commented May 23, 2017

Fixed in #1362

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix
Projects
None yet
Development

No branches or pull requests

4 participants