Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Word2vec n_similarity returning numpy matrix instead of float with empty list #743

Closed
smagnan opened this issue Jun 14, 2016 · 4 comments
Closed
Labels
difficulty easy Easy issue: required small fix

Comments

@smagnan
Copy link

smagnan commented Jun 14, 2016

I don't know if this should be seen as a bug or not, but it the case where we give at least an empty list to n_similarity is not handled. (The result is not surprising tho)

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load("frwiki.model")
>>> model.n_similarity(['real','world'],['everyday','life'])
0.46619348966885127
>>> model.n_similarity(['real','world'],[])
/usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py:70: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan])

I would have expected that to raise an exception but maybe it is not what was intended ...

Source code is rather simple:

def n_similarity(self, ws1, ws2):
        """
        Compute cosine similarity between two sets of words.
        Example::
          >>> trained_model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
          0.61540466561049689
          >>> trained_model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])
          1.0000000000000004
          >>> trained_model.n_similarity(['sushi'], ['restaurant']) == trained_model.similarity('sushi', 'restaurant')
          True
        """
        v1 = [self[word] for word in ws1]
        v2 = [self[word] for word in ws2]
        return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))
@piskvorky
Copy link
Owner

Yes, good catch. This is numpy trying to be clever and issuing a warning, instead of a normal exception for division by zero.

I agree that this is unintuitive, and we should detect this and raise an exception ourselves. Is ZeroDivisionError what you would have expected, @smagnan ?

@tmylk
Copy link
Contributor

tmylk commented Jun 21, 2016

@smagnan Ping - is it ok if we implement ZeroDivisionError exception here? Or is there a better way?

@smagnan
Copy link
Author

smagnan commented Jun 22, 2016

Sorry for the delay to reply, I planned to do a patch at first but then forgot about it... Yes I guess a ZeroDivisionError is what I would have expected, or a specific ValueError. I am no python expert so I'm not sure about what's best: a custom error or a generic one

@tmylk tmylk added the difficulty easy Easy issue: required small fix label Jun 27, 2016
pranay360 added a commit to pranay360/gensim that referenced this issue Sep 25, 2016
Fixes Issue piskvorky#743, n_similarity method now raises ZeroDivisionError if atleast one empty list is passed to it.
pranay360 added a commit to pranay360/gensim that referenced this issue Sep 25, 2016
Added new test cases in testSimilarities method which makes sure whether ZeroDivisionError is raised if atleast one empty list is passed to word2vec.n_similarities method
Related to fix for issue piskvorky#743
@tmylk
Copy link
Contributor

tmylk commented Oct 4, 2016

Fixed in #883

@tmylk tmylk closed this as completed Oct 4, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix
Projects
None yet
Development

No branches or pull requests

3 participants