Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix empty output bug in Phrases. Fix #1401 #1853

Merged
merged 8 commits into from
Feb 15, 2018
2 changes: 1 addition & 1 deletion gensim/models/phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ def _is_single(obj):
return True, obj_iter
else:
# If the first item isn't a string, assume obj is a corpus
return False, obj_iter
return False, list(obj_iter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obj_iter can be very large -> we can receive out-of-memory here.

Copy link
Contributor Author

@sj29-innovate sj29-innovate Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also works if I return just obj instead of list(obj_iter) but it fails if the obj is an iterator such as iter([ [] , [] ]) , so I am thinking of adding a special case to handle such iterators because some of the unit tests were failing and else otherwise just return the obj.
Proposed change ,

    obj_iter = iter(obj)
    temp_iter = obj_iter
    try:
        peek = next(obj_iter)
        obj_iter = it.chain([peek], obj_iter)
    except StopIteration:
        # An empty object is a single document 
        return True, obj
    if isinstance(peek, string_types) :
        # It's a document, return the iterator
        return True, obj_iter
    if temp_iter == obj :
        #Checking for an iterator to the object
    	return False , list(obj_iter)
    else :
        # If the first item isn't a string, assume obj is a corpus
        return False, obj_iter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv What do you think about the above proposed changes ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sj29-innovate anyway, list(...) unacceptable here.

Copy link
Contributor Author

@sj29-innovate sj29-innovate Jan 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv Yes , I agree that is not a very good way to do that . This works for me without using list(...),

    obj_iter = iter(obj)
    temp_iter = obj_iter
    try:
        peek = next(obj_iter)
        obj_iter = it.chain([peek], obj_iter)
    except StopIteration:
        # An empty object is a single document
        return True, obj
    if isinstance(peek, string_types):
        # It's a document, return the iterator
        return True, obj_iter
    if temp_iter == obj :
        #Checking for iterator to the object    
    	return False , obj_iter
    else:
        # If the first item isn't a string, assume obj is a corpus
        return False, obj

I realised there was no use of using the list(...), works well with the above change.



class SentenceAnalyzer(object):
Expand Down
7 changes: 7 additions & 0 deletions gensim/test/test_phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -160,6 +160,13 @@ def setUp(self):
self.bigram_unicode = Phrases(
self.unicode_sentences, min_count=1, threshold=1, common_terms=self.common_terms)

def testEmptyPhrasifiedSentencesIterator(self):
bigram_phrases = Phrases(self.sentences)
bigram_phraser = Phraser(bigram_phrases)
trigram_phrases = Phrases(bigram_phraser[self.sentences])
trigram_phraser = Phraser(trigram_phrases)
self.assertNotEqual(trigram_phraser[bigram_phraser[self.sentences]].__len__(), 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to make it more similar with example in PR

trigrams = trigram_phraser[bigram_phraser[self.sentences]]
fst, snd = list(trigrams), list(trigrams)
self.assertEqual(fst, snd)
self.assertNotEqual(snd, [])


def testEmptyInputsOnBigramConstruction(self):
"""Test that empty inputs don't throw errors and return the expected result."""
# Empty list -> empty list
Expand Down