[MRG] Adding Distance metrics to matutils + Tutorial #656
Conversation
Thanks for the PR. Please add tests and a note in the CHANGELOG; then we will proceed with review.
@bhargavvader I don't think the semantic method (LDA, BoW) should have anything to do with the metrics. In other words, the metrics should operate over vectors (scipy.sparse / numpy.array / gensim sequence of 2-tuples). It doesn't matter what methods were used to create these vectors -- a sparse vector from LDA is the same format as a sparse vector from BoW. So before writing any tests, please fix that. This API is not good.
@piskvorky , I've changed my functions to keep what you said in mind. Could you please have a look again before I write any tests? The functions now take in an LDA model object only to check the number of topics. If the input is in a different format and no model object is passed, it will still work, as long as the lengths of the two passed vectors are the same.
No models (transformations) should be part of the API. Only the end result (vector) matters, not the transformations used to get there. If a function needs the number of features (~vector length), then let users pass that in as a parameter.
Noted. I've changed Hellinger so that it doesn't need to accept any inputs apart from the two vectors. Apologies for the constant revisions and mistakes! edit: There's quite a bit of code duplication; will remove.
Yeah, the interface looks cleaner, thanks @bhargavvader . I think it's reviewable now; I'll add my comments soon. Please add the tests, especially around various corner cases (empty inputs, inputs of different types, zero-length inputs, etc.).
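The vectors-only API described above could be supported by a small coercion helper along these lines. This is a minimal sketch, not gensim's actual code: the name `to_dense` and its exact behaviour are assumptions, but it shows how scipy.sparse matrices, gensim-style (id, value) 2-tuples, and plain dense sequences can all be normalized before any metric runs.

```python
import numpy as np
import scipy.sparse

def to_dense(vec, length=None):
    """Coerce a vector in any supported format into a 1-d numpy array.

    Handles scipy.sparse matrices, gensim-style sparse vectors
    (lists of (id, value) 2-tuples), and plain dense sequences.
    Illustrative sketch only -- not gensim's actual implementation.
    """
    if scipy.sparse.issparse(vec):
        return np.asarray(vec.todense()).ravel()
    if isinstance(vec, list) and vec and isinstance(vec[0], tuple):
        # gensim sparse format: infer the length if the caller didn't pass it
        if length is None:
            length = max(idx for idx, _ in vec) + 1
        dense = np.zeros(length)
        for idx, value in vec:
            dense[idx] = value
        return dense
    return np.asarray(vec, dtype=float)
```

With a helper like this, the distance functions themselves never need to know which model produced the vectors.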
Uses the scipy.stats.entropy method to identify kullback_leibler convergence value.
If the distribution draws from a certain number of docs, that value must be passed.
"""
if scipy.sparse.issparse(vec1) and scipy.sparse.issparse(vec2):
What happens when one input is sparse, the other dense?
As of now, it'll go straight down to the final else condition, which throws an error because matutils.sparse2full doesn't work well with some scipy.sparse matrices. Would you suggest checking for cases when either is sparse, then converting to dense, and going ahead?
Yes, if the performance is good (i.e. not much slowdown, let's say not more than 20%), we can do that. Another option: more code paths for the various input-type combinations.
I've made it so that all input vectors are converted to dense and then to lists before working on them, if they aren't already dense/lists. This solves the problem of input vectors being in different formats. As for performance, since the vectors were already eventually being converted to lists, it doesn't really hurt. It also makes the code look a lot cleaner.
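The "densify everything up front" approach described above can be sketched for the KL case like this. This is an illustrative sketch, not the PR's actual code; it only assumes `scipy.stats.entropy`, which (as the docstring earlier notes) computes the KL divergence when given two distributions.

```python
import numpy as np
from scipy import sparse
from scipy.stats import entropy

def kullback_leibler(vec1, vec2):
    """KL divergence D(vec1 || vec2) between two discrete distributions.

    Both inputs are densified up front, mirroring the
    "convert to dense first" approach discussed above.
    Sketch only -- not gensim's exact implementation.
    """
    if sparse.issparse(vec1):
        vec1 = np.asarray(vec1.todense()).ravel()
    if sparse.issparse(vec2):
        vec2 = np.asarray(vec2.todense()).ravel()
    # With two arguments, scipy.stats.entropy computes sum(p * log(p / q)),
    # normalizing each input to sum to 1 first.
    return entropy(np.asarray(vec1, dtype=float), np.asarray(vec2, dtype=float))
```

Because both inputs pass through the same coercion, mixed sparse/dense calls no longer hit a special-cased branch.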
@tmylk , @piskvorky - could you have a look at the methods once? I've also added an
""" | ||
if scipy.sparse.issparse(vec): | ||
vec = vec.todense().tolist() | ||
for item in vec: |
Enough to check one item (the first one); a linear scan is too slow.
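A first-element check in that spirit might look like the following. The helper name `is_bow` and the exact checks are hypothetical (a sketch of the suggestion, not gensim's code): peek at the first item only, rather than scanning the whole vector.

```python
import scipy.sparse

def is_bow(vec):
    """Cheaply guess whether vec is a gensim bag-of-words vector,
    i.e. a sequence of (int id, numeric value) 2-tuples.

    Per the review suggestion, only the first element is inspected --
    a full linear scan over the vector would be needless O(n) work.
    Hypothetical helper, illustrative only.
    """
    if scipy.sparse.issparse(vec):
        return False
    try:
        first = next(iter(vec))
    except (StopIteration, TypeError):
        return True  # treat an empty sequence as trivially valid BoW
    try:
        token_id, _value = first
    except (TypeError, ValueError):
        return False  # first item isn't a 2-tuple
    return isinstance(token_id, int)
```

The trade-off is that a malformed entry later in the vector goes undetected, which is acceptable for a cheap format sniff.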
@bhargavvader Ping. Are you planning to incorporate the latest suggestions?
@tmylk , yeah, I'll try to wrap this up along with the tests by next week.
@piskvorky , could you look at my
else:
    sim = numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
return sim
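Pulled out as a standalone function, the dense-input branch above is easy to sanity-check: the Hellinger distance of identical distributions is 0, of disjoint ones is 1, and it is symmetric. A minimal sketch (standalone wrapper assumed; the formula itself is the one from the diff):

```python
import numpy

def hellinger(vec1, vec2):
    # Same formula as the dense-input branch in the diff above.
    vec1 = numpy.asarray(vec1, dtype=float)
    vec2 = numpy.asarray(vec2, dtype=float)
    return numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2)) ** 2).sum())

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # identical distributions -> 0.0
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # disjoint distributions -> 1.0
```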
PEP8: two blank lines between module-level methods.
If you addressed my previous comments, then it's good to go! Thanks @bhargavvader . @tmylk please check that the tests are thorough, plus maybe do a few manual sanity checks (outside of gensim) to verify no bug in the "math business logic" slipped through.
@piskvorky , made the PEP8 changes and added comments before the code blocks; should be neater now. :)
@tmylk , @piskvorky , I've added a Python notebook to accompany the methods. Functionality like this sometimes gets overlooked, so I've added some examples so users know where they can use it. Made changes to the CHANGELOG too, so if the math business logic is alright, it's ready to merge from my end. :)
Made a few documentation changes - while Jaccard is a similarity measure, Hellinger and KL are distance metrics. Will make changes in the notebook to reflect the same, and add a small bit on similarities/distances between topic distributions as well as document-topic distributions.
It is best if they are all either distances or similarities.
What's best depends on the range of values. I think most people expect similarity to be
@tmylk , @piskvorky , will shift to using distances for all 3, in the range
@tmylk , @piskvorky , made changes to the notebook to reflect that they are distance metrics and not otherwise. Added an example to find the distance between two topics' word distributions and a few documents' topic distributions. Could you review it to see if there's any ambiguity in the docstrings, comments, and math business logic?
Are these general distance functions, or truly metrics in the technical sense (satisfying the triangle inequality)?
Hellinger and Jaccard are metrics in the technical sense, but KL is not.
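The point about KL is easy to demonstrate numerically: a metric must be symmetric, and KL divergence is not. A small check (the example distributions are made up; `scipy.stats.entropy` with two arguments computes D(p || q)):

```python
from scipy.stats import entropy

p, q = [0.9, 0.1], [0.5, 0.5]
d_pq = entropy(p, q)  # D(p || q)
d_qp = entropy(q, p)  # D(q || p)

# KL is asymmetric, violating a basic metric axiom, while
# Hellinger and Jaccard are symmetric by construction.
print(d_pq, d_qp)
```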
More work on the ipynb is needed, but merging in the code to include it in the 0.13.0 release.
Cool, will update notebook in different PR.
@tmylk , @piskvorky , I've created 3 different metric functions: Hellinger and Kullback-Leibler for probability distributions, and Jaccard similarity for Bag of Words representations. I have a few questions about possible changes, particularly:
The Hellinger and Kullback-Leibler methods right now only take 2-tuple representations (in numpy array, scipy sparse, or gensim vector representations). Is that ok? Most probability distributions passed should look like that. I can improve them to accept dense numpy arrays on input as well, if it is expected that people will pass those into the function.
The Kullback-Leibler uses the scipy.stats.entropy method, which fails if any of the passed probabilities is 0. For most cases we deal with, this should not be a problem, but if we want to allow 0s in the input, I will have to rewrite the Kullback-Leibler calculation myself.
Jaccard similarity is primarily used for bag-of-words representations, so I haven't bothered to include other kinds of input formats.
Is this approach for adding similarity functions ok? If it is, I can make the changes you'd suggest, add test cases, and add more of these functions if needed.
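For the bag-of-words case mentioned above, a set-based Jaccard sketch could look like this. The function name and the set-of-token-ids semantics are assumptions for illustration; gensim's actual implementation may differ (e.g. by weighting with counts).

```python
def jaccard_distance(bow1, bow2):
    """Jaccard distance between two bag-of-words vectors
    (lists of (token_id, count) 2-tuples).

    Set semantics over token ids: 1 - |A & B| / |A | B|.
    Illustrative sketch only.
    """
    ids1 = {token_id for token_id, _count in bow1}
    ids2 = {token_id for token_id, _count in bow2}
    union = ids1 | ids2
    if not union:
        return 0.0  # two empty documents: define distance as 0
    return 1.0 - len(ids1 & ids2) / len(union)
```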
This is related to issue #64.
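On the zero-probability question raised above: scipy.stats.entropy returns infinity when the second distribution has a zero where the first does not. One common workaround (illustrative, not necessarily what the PR should do) is additive smoothing; the `eps` value here is an assumed choice.

```python
import numpy as np
from scipy.stats import entropy

def kl_with_smoothing(p, q, eps=1e-12):
    """KL divergence with additive smoothing, so zero entries in q
    don't produce an infinite result. The eps choice is illustrative."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    # Renormalize so both remain valid probability distributions.
    return entropy(p / p.sum(), q / q.sum())

# Without smoothing, a zero in q where p is nonzero gives infinity:
print(entropy([0.5, 0.5], [1.0, 0.0]))        # inf
print(kl_with_smoothing([0.5, 0.5], [1.0, 0.0]))  # finite
```

Whether smoothing is acceptable depends on the application; it slightly biases the result, so documenting the behaviour for zero inputs may be preferable to silently smoothing.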