[WIP] Added function "predict_output_word" to predict the output word given the context words. Fixes issue #863. #1209

Merged: 8 commits into piskvorky:develop on Mar 20, 2017

Conversation

chinmayapancholi13
Contributor

This PR adds a function predict_output_word, to the class Word2Vec, which runs the trained model and reports the probability values of the possible output words. This fixes #863.

@tmylk
Contributor

tmylk commented Mar 13, 2017

Please add unit tests and a note in CHANGELOG.md

@chinmayapancholi13
Contributor Author

@tmylk Sure. Also, I wanted to confirm if this function (just like the score function) would only be implemented for the hierarchical softmax scheme. To compute the final probability values, only self.syn1 has been used right now. If we also implement this for negative sampling, then we would have to use self.syn1neg.

Also, if we are only implementing this for the hierarchical softmax scheme, then we should add the check if not self.hs at the start of the function and show an appropriate error message like "We have currently only implemented predict_output_word for the hierarchical softmax scheme, so you need to have run word2vec with hs=1 and negative=0 for this to work." Could you please confirm if this is correct?

@gojomo
Collaborator

gojomo commented Mar 13, 2017

Hierarchical-softmax mode is non-default and, in my experience, less commonly used. Also, this code currently interprets the individual output slots of syn1 as corresponding, one-for-one, to the vocabulary words in index2word order. However, that interpretation is only valid for negative-sampling mode using syn1neg. (In HS mode, predicting a single word involves a variable-length list of syn1 nodes, the word's points, whose outputs must tend towards the word's code values.)

I'd suggest instead that the negative-sampling case be clearly and properly supported – as that has the easier interpretation (a single slot in syn1neg does refer to just one word). Then, work on figuring out a sensible way to report HS probabilities.
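A minimal sketch of the negative-sampling interpretation described above, in plain Python rather than gensim's own code. The names `predict_center_word`, `in_vecs` and `out_vecs` are illustrative stand-ins (roughly the roles of syn0 and syn1neg), and the toy vectors are made up. Because each output row belongs to exactly one vocabulary word, a softmax over the row scores can be read directly as per-word probabilities.

```python
import math


def predict_center_word(context, vocab, in_vecs, out_vecs, use_mean=True, topn=3):
    """Rank candidate center words for a list of context words.

    Row i of in_vecs and out_vecs belongs to vocab[i]. All names here are
    hypothetical and only illustrate the idea; this is not gensim's API.
    """
    index = {word: i for i, word in enumerate(vocab)}
    rows = [in_vecs[index[w]] for w in context if w in index]
    if not rows:
        return None  # no known context words, nothing to predict from
    dim = len(rows[0])
    # Combine the context vectors into one hidden vector (sum, optionally mean).
    hidden = [sum(row[d] for row in rows) for d in range(dim)]
    if use_mean:
        hidden = [h / len(rows) for h in hidden]
    # One output row per vocabulary word, so scores map one-for-one to words.
    scores = [math.exp(sum(h * o for h, o in zip(hidden, out))) for out in out_vecs]
    total = sum(scores)
    probs = [s / total for s in scores]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]


toy_vocab = ["a", "b", "c"]
toy_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_out = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
ranked_cbow = predict_center_word(["a"], toy_vocab, toy_in, toy_out)
print(ranked_cbow[0][0])  # "b" has the largest score for context ["a"]
```

The same one-for-one reading does not carry over to hierarchical softmax, where a word's probability is spread across several inner nodes.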

@chinmayapancholi13
Contributor Author

@gojomo Thanks a lot for clarifying this. I'll change the current implementation of the function to support the negative-sampling scheme first, and then figure out how to report probabilities for the hierarchical softmax case.

@chinmayapancholi13 changed the title from "Added function "predict_output_word" to predict the output word given the context words. Fixes issue #863." to "[WIP] Added function "predict_output_word" to predict the output word given the context words. Fixes issue #863." on Mar 16, 2017
if word2_indices and self.cbow_mean:
    l1 /= len(word2_indices)

if self.negative :
Contributor

Please raise exception

if not self.negative:
    raise RuntimeError("We have currently only implemented for negative sampling")

word2_indices.append(word.index)

l1 = np_sum(self.wv.syn0[word2_indices], axis=0)
if word2_indices and self.cbow_mean:
Contributor

if word_vocabs is empty, then return None with a warning


word2_indices = []
for pos, word in enumerate(word_vocabs):
    word2_indices.append(word.index)
Contributor

please use list comprehension
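The three review suggestions (raise when negative sampling is off, return None with a warning when no context word is known, and build the index list with a comprehension) could be combined roughly as follows. This is a sketch only: `context_indices`, `vocab_index` and the `negative` argument are hypothetical stand-ins for the model state, not the code that was merged.

```python
import warnings


def context_indices(context_words, vocab_index, negative):
    """Return vocabulary row indices for the known context words.

    vocab_index (word -> row index) and negative stand in for model
    attributes; both names are assumptions made for this illustration.
    """
    if not negative:
        # Fail loudly instead of silently skipping the prediction.
        raise RuntimeError("prediction is only sketched here for negative sampling")
    # List comprehension instead of an explicit append loop.
    indices = [vocab_index[w] for w in context_words if w in vocab_index]
    if not indices:
        warnings.warn("all context words are out of vocabulary; cannot predict")
        return None
    return indices
```

Returning None for an empty context keeps the caller from averaging over zero vectors, which would otherwise divide by zero or yield a meaningless uniform result.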

@tmylk
Contributor

tmylk commented Mar 17, 2017

Tests fixed by smart_open update

@tmylk
Contributor

tmylk commented Mar 17, 2017

Please add a unit test and a note in Changelog

@@ -35,35 +37,35 @@ Improvements:
* Phrases and Phraser allow a generator corpus (ELind77 [#1099](https://github.com/RaRe-Technologies/gensim/pull/1099))
* Ignore DocvecsArray.doctag_syn0norm in save. Fix #789 (@accraze,[#1053](https://github.com/RaRe-Technologies/gensim/pull/1053))
* Fix bug in LsiModel that occurs when id2word is a Python 3 dictionary. (@cvangysel,[#1103](https://github.com/RaRe-Technologies/gensim/pull/1103)
* Fix broken link to paper in readme (@bhargavvader,[#1101](https://github.com/RaRe-Technologies/gensim/pull/1101))
* Lazy formatting in evaluate_word_pairs (@akutuzov,[#1084](https://github.com/RaRe-Technologies/gensim/pull/1084))
Owner

@tmylk please check -- or even better, introduce an automated check -- that makes sure there's no trailing whitespace in commits.

Because it then leads to confusing diffs like this one, when someone (correctly!) removes the trailing whitespace later on.

@chinmayapancholi13
Contributor Author

@tmylk I have made changes to CHANGELOG.md and also added a unit test as suggested by you earlier.

@tmylk
Contributor

tmylk commented Mar 20, 2017

Thanks for the new feature. Would be good to add it to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb

@tmylk tmylk merged commit cc86005 into piskvorky:develop Mar 20, 2017
@chinmayapancholi13
Contributor Author

@tmylk Sure. I will update the IPython Notebook as well.

@exoticknight

@chinmayapancholi13
Thanks for your work, but correct me if I'm using it the wrong way (I just replaced the whole word2vec.py).
macOS Sierra 10.12.3
Python 2.7.13

plan1 = ["pick-up-B", "stack-B-A", "pick-up-D", "stack-D-C"]
plan2 = ["unstack-B-A", "put-down-B", "unstack-D-C", "put-down-D"]
plan3 = ["pick-up-B", "stack-B-A", "pick-up-C", "stack-C-B", "pick-up-D", "stack-D-C"]
plan4 = ["unstack-D-C", "put-down-D", "unstack-C-B", "put-down-C", "unstack-B-A", "put-down-B"]

from gensim.models import word2vec

raw_sentences = plan1 + plan2 + plan3 + plan4

sentences = [s.split() for s in raw_sentences]

model = word2vec.Word2Vec(sentences, min_count=1, size=10, workers=4)

# pick-up-B OOO unstack-D-C put-down-D OOO stack-C-B OOO OOO
# pick-up-B stack-B-A unstack-D-C put-down-D pick-up-C stack-C-B pick-up-D stack-D-C
a = model.predict_output_word(['put-down-D', 'stack-C-B'])

print(a)
# weird???
# [('put-down-B', 0.083333336), ('stack-B-A', 0.083333336), ('unstack-C-B', 0.083333336), ('pick-up-C', 0.083333336), ('stack-C-B', 0.083333336), ('unstack-B-A', 0.083333336), ('put-down-D', 0.083333336), ('stack-D-C', 0.083333336), ('pick-up-B', 0.083333336), ('pick-up-D', 0.083333336)]

@chinmayapancholi13
Contributor Author

@exoticknight Here sentences, the input list being fed to the model for training, is of length 20 with each sentence being just one word long. So every word would be equally probable to be the output word for the input provided. And since the size of the vocabulary is 12 in this case, the probability value for each word would be 1/12 = 0.08333333. This should be the expected output, right? Is there something that I missed?
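The arithmetic in this reply can be checked without gensim. The sketch below rebuilds the tokenization from the snippet above and counts sentences and vocabulary. It also assumes that, with one-word sentences, there are no context pairs to train on and the output weights stay at zero, so every score is exp(0) = 1; that assumption about the model state is mine, not something stated in the thread. `fixed_sentences` shows what was presumably intended, with each plan kept as one sentence.

```python
import math
from collections import Counter

plan1 = ["pick-up-B", "stack-B-A", "pick-up-D", "stack-D-C"]
plan2 = ["unstack-B-A", "put-down-B", "unstack-D-C", "put-down-D"]
plan3 = ["pick-up-B", "stack-B-A", "pick-up-C", "stack-C-B", "pick-up-D", "stack-D-C"]
plan4 = ["unstack-D-C", "put-down-D", "unstack-C-B", "put-down-C", "unstack-B-A", "put-down-B"]

# Same preprocessing as in the report: concatenating the plans and then
# splitting each action string turns every action into its own sentence.
raw_sentences = plan1 + plan2 + plan3 + plan4
sentences = [s.split() for s in raw_sentences]

counts = Counter(token for sentence in sentences for token in sentence)
num_sentences = len(sentences)             # 20
longest = max(len(s) for s in sentences)   # 1, so no word ever has a context
vocab_size = len(counts)                   # 12

# Assumed: untrained output weights are all zero, so each score is exp(0).
scores = [math.exp(0.0)] * vocab_size
uniform = [s / sum(scores) for s in scores]  # every entry is 1/12

# Presumably intended input: one sentence per plan, so windows have context.
fixed_sentences = [plan1, plan2, plan3, plan4]
```

Only 10 of the 12 words appear in the reported output, which is consistent with the function returning a top-10 list by default.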

@exoticknight

@chinmayapancholi13
oh clumsy me! 😢
thx for the help man 🎉

@chinmayapancholi13
Contributor Author

@exoticknight No problem! Let me know if you face any other problems. I'd be happy to help. :)

@yzexeter

@chinmayapancholi13 Hi, I checked the code and comments. From my understanding, the implementation is for CBOW. So is it right that, given 'emergency', 'beacon', 'received' (from the tutorial), the output is the center word, either between 'emergency' and 'beacon' or between 'beacon' and 'received'? Because I didn't see your discussion at first, I used the implementation to predict the next word given a list of words.

@chinmayapancholi13
Contributor Author

@yzexeter Hey! Yes, this implementation is for CBOW, as mentioned in the original issue. This means that the list of words (context_words_list) passed to function predict_output_word() is the list of words in the context of the word(s) output by the function (as per our trained model). So the probability distribution output by the function is for the center word.

@yzexeter

@chinmayapancholi13 Thank you for your explanation. Since it is based on CBOW, what if I use skip-gram to train the model (sg=1)? It doesn't output a warning, so I assume it works. Does the prediction under skip-gram have a different meaning? Is it still the center word of the context list?

@chinmayapancholi13
Contributor Author

@yzexeter In CBOW, we train our model to predict the center (target) word correctly given the context words. On the other hand, in the case of skip-gram, we train our model to predict the context words correctly given a particular word as input.
You can (as in you'll get some valid output), but should not, use this function after training your model for the skip-gram model. This is because in such a scenario your training objective would be different than what the function expects while giving the output probabilities. That is, you are training your model to be good at predicting the context words given the center word, but the function is expected to give the output probabilities of the center word given the context words. So, the output values from the function that you get after training your model for skip-gram (rather than CBOW) may not be good.
Hence, the function has been implemented keeping CBOW in mind as it takes a list of context words as input and outputs the probability distribution of the center word. Such a format does not cohere with the skip-gram model.

@yzexeter

@chinmayapancholi13 Thank you. This explains the results I got after using a skip-gram model to predict output. Will this be extended to support skip-gram? I will keep track of future modifications to the implementation.

@chinmayapancholi13
Contributor Author

@yzexeter Hey! Sorry for the late response. As I mentioned above, the input format of this function (i.e. a list of context words) doesn't really cohere with the skip-gram model (which predicts things the other way around, i.e. it predicts context words given the central word). I guess there could be a separate function for the skip-gram model to do this, but I don't have any plans right now to extend the same function to do that. :)

@gojomo
Collaborator

gojomo commented May 29, 2017

@chinmayapancholi13 the difference between "focus word predicts all context window words" or "all context window words are used to predict focus word" ultimately isn't that significant - in the end, all the exact same "input word -> predicted word" pairs are used for training, just in a slightly different order. (IIRC, the word2vec paper describes it one way, but the Google word2vec.c code does it the other way, because they found slightly better CPU cache utilization patterns & thus bulk performance the way the code does it.)
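The claim that both directions yield the same "input word -> predicted word" pairs can be checked on a toy sentence. This sketch assumes a fixed window (no random shrinking) and uses hypothetical helper names. Over a whole sentence, the two pair lists contain the same pairs and differ only in order.

```python
def focus_predicts_context(tokens, window):
    """(input, predicted) pairs when the focus word predicts each context word."""
    pairs = []
    for i, focus in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((focus, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def context_predicts_focus(tokens, window):
    """(input, predicted) pairs when each context word predicts the focus word."""
    pairs = []
    for i, focus in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((tokens[j], focus) for j in range(lo, hi) if j != i)
    return pairs


tokens = ["emergency", "beacon", "received", "today"]
forward = focus_predicts_context(tokens, window=2)
backward = context_predicts_focus(tokens, window=2)

same_pairs = sorted(forward) == sorted(backward)  # True: identical pairs overall
same_order = forward == backward                  # False: only the order differs
```

The symmetry holds because "j is within the window of i" is the same condition as "i is within the window of j" when the window size is fixed.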

A skip-gram predict-word function would need the exact same context-window input - but would presumably calculate the individual predictions for every context word, then average all those predictions – even more expensive, by a factor of 2 * window, than the CBOW approach.
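One way to read this suggestion, as a plain-Python sketch: compute a separate softmax prediction from each context word, then average the distributions. The function names and toy vectors are hypothetical, and no such function exists in the PR. The cost note in the comment reflects the "one prediction per context word" structure described above.

```python
import math


def softmax_scores(in_vec, out_vecs):
    """Softmax over the dot products of one input vector with every output row."""
    scores = [math.exp(sum(a * b for a, b in zip(in_vec, out))) for out in out_vecs]
    total = sum(scores)
    return [s / total for s in scores]


def predict_by_averaging(context, vocab, in_vecs, out_vecs):
    """Average the per-context-word predictions (skip-gram style sketch)."""
    index = {word: i for i, word in enumerate(vocab)}
    # One full pass over the vocabulary per context word, where the
    # CBOW-style approach needs a single pass for the combined vector.
    per_word = [softmax_scores(in_vecs[index[w]], out_vecs) for w in context if w in index]
    if not per_word:
        return None
    averaged = [sum(column) / len(per_word) for column in zip(*per_word)]
    return sorted(zip(vocab, averaged), key=lambda pair: pair[1], reverse=True)


sg_vocab = ["a", "b", "c"]
sg_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sg_out = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
ranked_sg = predict_by_averaging(["a", "b"], sg_vocab, sg_in, sg_out)
```

Averaging probability distributions keeps the result a valid distribution, so the output can be read the same way as the CBOW-style probabilities.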

(Either the CBOW or SG predictions should perhaps also simulate the distance-weighting that occurs during training. During training passes, windows aren't actually of size window, but of some random size from 1 to window. This means nearby words are most often part of training examples, and more distant words are less often part of them. That is effectively a distance-weighting, but because it's accomplished by often leaving things out, rather than by scaling words' effects, it speeds up training instead of incurring the slowdown that extra scaling would require.)
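The implied weighting can be worked out exactly if the effective window size is assumed to be drawn uniformly from 1 to window (the comment only says "some random size", so uniformity is an assumption here). A word at distance d is included only when the sampled size is at least d. The helper name is hypothetical.

```python
from fractions import Fraction


def inclusion_weights(window):
    """Chance that a word at distance 1..window falls inside the sampled window."""
    weights = []
    for distance in range(1, window + 1):
        # Count the sampled sizes b in 1..window that still reach this distance.
        reaching = sum(1 for b in range(1, window + 1) if b >= distance)
        weights.append(Fraction(reaching, window))
    return weights


weights = inclusion_weights(5)
print([str(w) for w in weights])  # ['1', '4/5', '3/5', '2/5', '1/5']
```

A prediction function that wanted to mirror training could scale each context word's contribution by these weights instead of treating all positions equally.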

@yzexeter

@chinmayapancholi13 No problem. Thanks for your help. :)

Successfully merging this pull request may close these issues.

Adding Word Prediction
6 participants