-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix empty output bug in Phrases
. Fix #1401
#1853
Conversation
…into develop Conflicts: gensim/summarization/bm25.py
…versions , test added for empty iterator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also want to see a concrete example that doesn't work correctly earlier, but now works fine (because we can't reproduce it properly in unittest).
gensim/models/phrases.py
Outdated
@@ -115,7 +115,7 @@ def _is_single(obj): | |||
return True, obj_iter | |||
else: | |||
# If the first item isn't a string, assume obj is a corpus | |||
return False, obj_iter | |||
return False, list(obj_iter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obj_iter
can be very large -> we can receive out-of-memory here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also works if I return just obj
instead of list(obj_iter)
but it fails if the obj is an iterator such as iter([ [] , [] ])
, so I am thinking of adding a special case to handle such iterators because some of the unit tests were failing and else otherwise just return the obj
.
Proposed change ,
obj_iter = iter(obj)
temp_iter = obj_iter
try:
peek = next(obj_iter)
obj_iter = it.chain([peek], obj_iter)
except StopIteration:
# An empty object is a single document
return True, obj
if isinstance(peek, string_types) :
# It's a document, return the iterator
return True, obj_iter
if temp_iter == obj :
#Checking for an iterator to the object
return False , list(obj_iter)
else :
# If the first item isn't a string, assume obj is a corpus
return False, obj_iter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv What do you think about the above proposed changes ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sj29-innovate anyway, list(...)
unacceptable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv Yes , I agree that is not a very good way to do that . This works for me without using list(...)
,
obj_iter = iter(obj)
temp_iter = obj_iter
try:
peek = next(obj_iter)
obj_iter = it.chain([peek], obj_iter)
except StopIteration:
# An empty object is a single document
return True, obj
if isinstance(peek, string_types):
# It's a document, return the iterator
return True, obj_iter
if temp_iter == obj :
#Checking for iterator to the object
return False , obj_iter
else:
# If the first item isn't a string, assume obj is a corpus
return False, obj
I realised there was no use of using the list(...)
, works well with the above change.
Here are some of the results i am getting with my proposed changes . I am using eng_news_2015_30K.tar.gz corpus and the script provided here https://github.com/jeremybmerrill/ineffable_abracadabra_of_cleaning/blob/master/train_in_vain.py for my testing. Comparing the output of " print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])) " before and after the changes on latest version -> Without Changes ->
With the changes ->
|
@sj29-innovate can you make more "focused" example (without word2vec), only with phrases? |
@menshikh-iv I encountered this when i ran two loops over the same object returned by "is_single" one after the another and the second loop didnt work. My test case was as follows ,
Output ->
Expected Output->
|
@sj29-innovate yes, almost what I wanted to see, please make it executable (all imports, fits Phrases, dataset), I want to copy-paste it and run with different gensim version. |
@menshikh-iv Here is the executable code for the above testcase. Added brown corpus for training for a better phrases behaviour. from gensim.models.phrases import Phrases,Phraser
from nltk.corpus import brown
test_sents = [
[ 'I','did','not'],
['I','can','not']
]
bigrams_phrases = Phrases(brown.sents())
bigrams_model = Phraser(bigrams_phrases)
trigrams_phrases = Phrases(bigrams_model[brown.sents()])
trigrams_model = Phraser(trigrams_phrases)
model = trigrams_model[bigrams_model[test_sents]]
print "Printing in First Loop\n"
for sent in model :
print sent
print "\nPrinting in Second Loop\n"
for sent in model :
print sent |
@menshikh-iv I believe we must look at generator type check in word2vec which failed to recognize "itertools.chain" and therefore ended up training on empty corpus. If the changes to "is_single" function seems fine to you , I can commit them. |
@sj29-innovate can you look into this function https://github.com/RaRe-Technologies/gensim/blob/255ce25903c2521ad6c92c00875a7dc3aa88fe7d/gensim/utils.py#L797 looks very similar to the current case. |
@menshikh-iv The problem I figured out was that it was not possible to do multiple iterations of the object we got from the Phraser output as it was a "generator" object , Please look at the following snippet PhraseObject = trigrams_model[bigrams_model[sentences]]
print type(PhraseObject)
print type(PhraseObject.corpus) Output
def __iter__(self):
if self.chunksize:
for chunk in utils.grouper(self.corpus, self.chunksize):
for transformed in self.obj.__getitem__(chunk, chunksize=None):
yield transformed
else:
for doc in self.corpus:
yield self.obj[doc] So basically the self.corpus is an itertools.chain object which is getting emptied when we iterate through it. |
@sj29-innovate I understand what you mean, can you push your last changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like almost done, please make last changes.
gensim/test/test_phrases.py
Outdated
bigram_phraser = Phraser(bigram_phrases) | ||
trigram_phrases = Phrases(bigram_phraser[self.sentences]) | ||
trigram_phraser = Phraser(trigram_phrases) | ||
self.assertNotEqual(trigram_phraser[bigram_phraser[self.sentences]].__len__(), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to make it more similar with example in PR
trigrams = trigram_phraser[bigram_phraser[self.sentences]]
fst, snd = list(trigrams), list(trigrams)
self.assertEqual(fst, snd)
self.assertNotEqual(snd, [])
@sj29-innovate good work: +1: |
@menshikh-iv Thanks , can you suggest some further issues once it's reviewed by Gordon. |
@sj29-innovate of course! |
Phrases
. Fix #1401
@menshikh-iv Can you already suggest me some other issue I can start working on. Thanks. |
@sj29-innovate try to fix 1869 (fresh-investigated bug in MmCorpus) |
@gojomo ping |
I think the preexisting API - using []-indexing to do something different to a single-case or an iterable, and the With regard to a minimal fix for the specific issue in #1401, if the test case works in a suitably-old gensim (0.13.4.1), then breaks in a newer version (1.0.0 through 2.1.0) – matching the #1401 report – then works again after the fix, that's a good sign. It'd be even better if the exact commit that broke the behavior was identified, and the exact person who made that commit (or understood the cases I'm suspicious this may fail too-subtly in the common mistake-case where someone provides an iterator object, and not a restartable iterable object - it'll fail to restore the peeked object, and it will only be usable for a single iteration (possibly resulting in other subtle issues down-the-line, if other training expects re-iteration). |
@gojomo so, you propose to completely remove |
I’d discard the practice of using []-indexing as the interface to stream/generator-phrasification – a compatibility-breaking change. Then there’d be no reason for confusing |
@gojomo, unfortunately, this is almost impossible now (too breaking change now, also we broke very stable API). |
@menshikh-iv What are the further requirements here ? |
Phrases
. Fix #1401Phrases
. Fix #1401
@sj29-innovate for the current state - that's all that we can do here (without any breaking changes), thank you @sj29-innovate! |
* bm25 scoring function updated * Fixes piskvorky#1401 , Phrases behavious now consistent on different versions , test added for empty iterator * Fixes piskvorky#1401 , IS_SINGLE Function updated * Fixes piskvorky#1401 , IS_SINGLE function updated * Fixes piskvorky#1401 , IS_SINGLE function updated * Fixes piskvorky#1401 , tests for phrasified sentences added
The "is_single" function previously returned iterator in case the object was a corpus and was followed by self._apply() which again made an iterator , it works fine now with this fix.
(fixes #1401)