Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

annoy.py conversion of cosine distance to cosine similarity is incorrect #3440

Closed
monash8 opened this issue Feb 7, 2023 · 4 comments · Fixed by #3441
Closed

annoy.py conversion of cosine distance to cosine similarity is incorrect #3440

monash8 opened this issue Feb 7, 2023 · 4 comments · Fixed by #3441
Labels
bug Issue described a bug impact LOW Low impact on affected users reach LOW Affects only niche use-case users
Milestone

Comments

@monash8
Copy link
Contributor

monash8 commented Feb 7, 2023

in this function the code to calculate cosine similarity is incorrect

def most_similar(self, vector, num_neighbors):
    """Find `num_neighbors` most similar items.

    Parameters
    ----------
    vector : numpy.array
        Vector for word/document.
    num_neighbors : int
        Number of most similar items

    Returns
    -------
    list of (str, float)
        List of most similar items in format [(`item`, `cosine_distance`), ... ]

    """
    ids, distances = self.index.get_nns_by_vector(
        vector, num_neighbors, include_distances=True)

    return [(self.labels[ids[i]], 1 - distances[i] / 2) for i in range(len(ids))]

according to annoy documentation get_nns_by_vector with include_distances=True will return the distances and not the square power of the distance (this was changed since aug 2016):

a.get_distance(i, j) returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.

link

also:
Annoy uses Euclidean distance of normalized vectors for its angular distance, which for two vectors u,v is equal to sqrt(2(1-cos(u,v)))
link

so this means that in order to calculate the cosine similarity correctly we should do this:
return [(self.labels[ids[i]], 1 - distances[i]^2 / 2) for i in range(len(ids))]

@gojomo gojomo changed the title cosine simailrity is incorrect annoy.py conversion of cosine distance to cosine similarity is incorrect Feb 7, 2023
@gojomo
Copy link
Collaborator

gojomo commented Feb 7, 2023

As this Gensim code was last touched July 2016, & the Annoy library says it changed its behavior August 2016, I can see how this is out-of-sync.

An effect would be that the max distance returned by get_nns_by_vector() of 2.0 will be turned into a cosine-similarity of 0.0 instead of the correct -1.0 – squashed range of returned values. (However, in no case will the relative ranking of nearest-neighbors be incorrect – which may have contributed to this going unnnoticed for so long.)

The proposed fix looks correct.

@piskvorky piskvorky added bug Issue described a bug impact LOW Low impact on affected users reach LOW Affects only niche use-case users labels Feb 7, 2023
@piskvorky
Copy link
Owner

Thanks @monash8 for the report and investigation. And @gojomo for confirming.

@monash8 are you able to open a PR with the fix?

@piskvorky piskvorky added this to the Next release milestone Feb 7, 2023
@monash8
Copy link
Contributor Author

monash8 commented Feb 8, 2023

Thanks @monash8 for the report and investigation. And @gojomo for confirming.

@monash8 are you able to open a PR with the fix?

sure! @piskvorky can you give me access to create a new branch?

@piskvorky
Copy link
Owner

@monash8 please create a branch in your fork & open a PR again gensim's develop branch from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug impact LOW Low impact on affected users reach LOW Affects only niche use-case users
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants