
[MRG] Adding Distance metrics to matutils + Tutorial #656

Merged
merged 33 commits into from
Jun 21, 2016

Conversation

bhargavvader
Contributor

@tmylk , @piskvorky , I've created 3 different metric functions: Hellinger and Kullback-Leibler for probability distributions, and Jaccard Similarity for Bag of Words representations.

I have a few questions about possible changes, particularly:

  1. The Hellinger and Kullback-Leibler methods currently accept only 2-tuple representations (numpy array, scipy sparse, or gensim vector representations). Is that OK? Most probability distributions passed in should look like that. I can extend them to accept dense numpy arrays on input as well, if people are expected to pass those in.

  2. The Kullback-Leibler method uses scipy.stats.entropy, which fails if any probability in either distribution is 0. For most cases we deal with, this should not be a problem, but if we want to allow 0s in the input, I will have to implement the Kullback-Leibler calculation myself.

Jaccard similarity is used primarily for bag-of-words, so I haven't bothered to support other input formats.

Is this approach for adding similarity functions OK? If so, I can make the changes you'd suggest, add test cases, and add more of these functions if needed.

This is related to issue #64 .
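For orientation, the Hellinger computation under discussion can be sketched for dense numpy inputs as follows. This is a minimal sketch only; the PR's actual function also handles scipy.sparse and gensim 2-tuple inputs.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions
    given as dense arrays. A minimal sketch; the PR version also handles
    scipy.sparse and gensim 2-tuple inputs."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0 (identical distributions)
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # 1.0 (disjoint support)
```

Unlike KL, this formula is well-defined even when some probabilities are 0, which is why the zero-handling question above applies only to the KL method.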

@tmylk
Contributor

tmylk commented Apr 4, 2016

Thanks for the PR. Please add tests and a note in the CHANGELOG; then we will proceed with the review.

@piskvorky
Owner

@bhargavvader I don't think the semantic method (LDA, BoW) should have anything to do with the metrics.

In other words, the metrics should operate over vectors (scipy.sparse / numpy.array / gensim sequence of 2-tuples). It doesn't matter what methods were used to create these vectors -- a sparse vector from LDA is the same format as sparse vector from BoW.

So before writing any tests, please fix that. This API is not good.
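The point about format-agnostic vectors can be illustrated with a minimal dense conversion, analogous to (but not identical with) gensim's matutils.sparse2full. The helper name here is illustrative.

```python
import numpy as np

def to_dense(vec, length):
    """Expand a gensim-style list of (id, value) 2-tuples into a dense
    numpy vector. A sketch analogous to matutils.sparse2full."""
    out = np.zeros(length)
    for idx, value in vec:
        out[idx] = value
    return out

bow_vec = [(0, 2), (3, 1)]      # term ids and counts (from BoW)
lda_vec = [(1, 0.7), (2, 0.3)]  # topic ids and probabilities (from LDA)
# Both use the same format; a metric never needs to know their origin.
dense_bow = to_dense(bow_vec, 5)  # dense length-5 vector: 2 at id 0, 1 at id 3
dense_lda = to_dense(lda_vec, 4)  # dense length-4 vector: 0.7 at id 1, 0.3 at id 2
```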

@bhargavvader
Contributor Author

@piskvorky , I've changed the functions with your comments in mind. Could you please have another look before I write any tests? The functions now take an LDA model object only to check the number of topics. Without the model object they will still work for other formats, as long as the two passed vectors have the same length.

@piskvorky
Owner

No models (transformations) should be part of the API. Only the end result (the vector) matters, not the transformations that produced it.

If a function needs the number of features (~vector length), then let users pass that in as num_features (for ex: num_features=lda.num_topics). No need for lda itself.
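A sketch of the interface being proposed here: vectors plus an optional num_features, no model object. The densify helper and the plain-numpy KL below are illustrative assumptions, not gensim's merged code.

```python
import numpy as np

def kullback_leibler(vec1, vec2, num_features=None):
    """KL divergence over dense arrays or gensim (id, value) pairs;
    num_features replaces the model object. A sketch of the proposed
    interface only, not gensim's exact implementation."""
    def densify(vec):
        if isinstance(vec, np.ndarray):
            return vec.astype(float)
        dense = np.zeros(num_features)
        for idx, value in vec:
            dense[idx] = value
        return dense
    p, q = densify(vec1), densify(vec2)
    return float(np.sum(p * np.log(p / q)))  # assumes no zero entries

# e.g. num_features=lda.num_topics for topic distributions:
d = kullback_leibler([(0, 0.5), (1, 0.5)], [(0, 0.9), (1, 0.1)], num_features=2)
print(round(d, 3))  # 0.511
```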

@bhargavvader
Contributor Author

Noted. I've changed Hellinger so that it doesn't accept any inputs apart from the two vectors (implementation similar to the cossim function). It works for all the kinds of inputs I've thrown at it, and I can write more comprehensive tests if it looks OK.
As for Kullback-Leibler, I removed the need to pass anything but num_features. Is it looking better?
Jaccard similarity also works for the basic tests I ran on it.

Apologies for the constant revisions and mistakes!

edit: There's quite a bit of code duplication, will remove.

@bhargavvader bhargavvader changed the title Added 3 similarity metrics to matutils. [WIP] Adding similarity/distance metrics to matutils. Apr 5, 2016
@piskvorky
Owner

Yeah, the interface looks cleaner, thanks @bhargavvader .

I think it's reviewable now; I'll add my comments soon. Please add the tests, especially around various corner cases (empty inputs, inputs of different types, zero-length inputs, etc.).

Uses the scipy.stats.entropy method to compute the Kullback-Leibler divergence value.
If the distribution draws from a certain number of docs, that value must be passed.
"""
if scipy.sparse.issparse(vec1) and scipy.sparse.issparse(vec2):
Owner

What happens when one input is sparse, the other dense?

Contributor Author

As of now, it falls through to the final else condition, which throws an error because matutils.sparse2full doesn't work well with some scipy.sparse matrices.

Would you suggest checking for the cases where either is sparse, converting to dense, and going ahead?

Owner

Yes, if the performance is good (= not much slowdown, say no more than 20%), we can do that. Another option: more code paths for the various input-type combinations.

Contributor Author

I've made it so that all input vectors are converted to dense form, and then to lists, before working on them, if they aren't already dense/lists. This solves the problem of input vectors arriving in different formats. As for performance, since the vectors were already eventually being converted to lists, it doesn't really hurt. It also makes the code look a lot cleaner.

@bhargavvader
Contributor Author

bhargavvader commented Apr 16, 2016

@tmylk , @piskvorky - could you have another look at the methods?
I've changed them to take the previous comments into account and removed quite a bit of code duplication; the methods are much cleaner now.

I've also added an isbow method to check for bag-of-words representation. I use it in my functions and figured it might come in handy later. Is that OK?

"""
if scipy.sparse.issparse(vec):
vec = vec.todense().tolist()
for item in vec:
Owner

Enough to check one item (the first one); a linear scan is too slow.
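A constant-time version of the check being suggested might look like this. This is a sketch of the idea only, not the isbow that was merged.

```python
def isbow(vec):
    """Heuristic bag-of-words check: a gensim BoW vector is a sequence of
    (id, value) 2-tuples, so inspecting the first entry is enough.
    A sketch of the suggestion above, not the merged isbow."""
    if len(vec) == 0:  # the empty vector is trivially valid BoW
        return True
    first = vec[0]
    return isinstance(first, (tuple, list)) and len(first) == 2

print(isbow([(0, 2), (5, 1)]))  # True
print(isbow([0.1, 0.9]))        # False
```

Checking only the first entry turns an O(n) scan into O(1), at the cost of missing a malformed entry later in the vector, which is an acceptable trade-off for a cheap format heuristic.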

@tmylk
Contributor

tmylk commented Apr 29, 2016

@bhargavvader Ping. Are you planning to incorporate the latest suggestions?

@bhargavvader
Contributor Author

bhargavvader commented Apr 29, 2016

@tmylk , yeah, I'll try and wrap up this along with the tests for it by next week.

@bhargavvader
Contributor Author

@piskvorky , could you look at my isbow? It works for all the cases I have, and if it's fine in theory, I'll include more tests.

else:
sim = numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
return sim

Owner

PEP8: two blank lines between module-level functions.

@piskvorky
Owner

If you addressed my previous comments then it's good to go! Thanks @bhargavvader .

@tmylk please check the tests are thorough though, plus maybe do a few manual sanity checks (outside of gensim) to verify no bug in "math business logic" slipped through.

@bhargavvader
Contributor Author

bhargavvader commented May 30, 2016

@piskvorky , made PEP8 changes and added comments before each code block; should be neater now. :)
@tmylk , if this looks okay then I can make a small ipython notebook with a few examples of where this would be useful.

@bhargavvader
Contributor Author

@tmylk , @piskvorky , I've added an IPython notebook to accompany the methods. Functionality like this sometimes gets overlooked, so I've included examples showing where users can apply it.

Made changes to the CHANGELOG too, so if the math business logic is alright, it's ready to merge from my end. :)

@bhargavvader bhargavvader changed the title [MRG] Adding similarity/distance metrics to matutils. [MRG] Adding similarity/distance metrics to matutils + Tutorial Jun 3, 2016
@bhargavvader bhargavvader changed the title [MRG] Adding similarity/distance metrics to matutils + Tutorial [MRG] Adding Distance metrics to matutils + Tutorial Jun 9, 2016
@bhargavvader
Contributor Author

Made a few documentation changes - while Jaccard is a similarity measure, Hellinger and KL are distance metrics; 1 - distance = similarity in these cases.

Will make changes in the notebook to reflect the same, and add a small section on similarities/distances between topic distributions as well as document-topic distributions.

@tmylk
Contributor

tmylk commented Jun 9, 2016

It is best if they are all either distances or similarities.
For converting, I suggest similarity = 1 / (1 + distance).

@piskvorky
Owner

What's best depends on the range of distance -- is it <0, inf>, or <0, 1>, or something else?

I think most people expect similarity to be <0, 1> (though cossim is <-1, 1>, for example).
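The two distance-to-similarity conversions under discussion, side by side. The helper names are illustrative; neither exists under these names in gensim.

```python
def sim_bounded(d):
    """1 - d: valid when d is bounded in [0, 1] (Hellinger, Jaccard)."""
    return 1.0 - d

def sim_unbounded(d):
    """1 / (1 + d): also maps unbounded distances like KL into (0, 1]."""
    return 1.0 / (1.0 + d)

print(sim_bounded(0.3))     # 0.7
print(sim_unbounded(0.3))   # ~0.769
print(sim_unbounded(10.0))  # ~0.091, still within (0, 1]
```

Both map identical inputs (d = 0) to similarity 1; they differ in how they treat large distances, which is exactly the range question raised above.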

@bhargavvader
Contributor Author

@tmylk , @piskvorky , will shift to using distances for all 3, in the range <0, 1>, where a smaller distance (i.e., a value closer to 0) means higher similarity.

@bhargavvader
Contributor Author

@tmylk , @piskvorky , made changes to the notebook to reflect that they are distance metrics, not similarities. Added an example finding the distance between two topics' word distributions and between a few documents' topic distributions.

Could you review the docstrings, comments, and math business logic for any ambiguity?

@piskvorky
Owner

Are these general distance functions, or truly metrics in the technical sense (satisfy triangle inequality)?

@bhargavvader
Contributor Author

Hellinger and Jaccard are metrics in the technical sense, but KL is not.
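The "not a metric" point is easy to verify numerically: KL fails even the symmetry requirement, before the triangle inequality is ever considered. The helper below is illustrative (natural-log KL over zero-free dense distributions), not gensim's implementation.

```python
import numpy as np

def kl(p, q):
    """Natural-log KL divergence for dense distributions without zeros
    (illustrative helper, not gensim's implementation)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p, q = [0.9, 0.1], [0.5, 0.5]
print(round(kl(p, q), 3))  # 0.368
print(round(kl(q, p), 3))  # 0.511 -- asymmetric, so KL is not a metric
```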

@tmylk
Contributor

tmylk commented Jun 21, 2016

More work on the ipynb is needed, but merging in the code to include it in the 0.13.0 release.

@tmylk tmylk merged commit c00efcf into piskvorky:develop Jun 21, 2016
@bhargavvader
Contributor Author

Cool, will update notebook in different PR.

@bhargavvader bhargavvader deleted the Similarity branch June 22, 2016 03:56