[MRG] Adding Distance metrics to matutils + Tutorial #656
Conversation
Thanks for the PR. Please add tests and a note in the CHANGELOG; then we will proceed with review.
@bhargavvader I don't think the semantic method (LDA, BoW) should have anything to do with the metrics. In other words, the metrics should operate over vectors (scipy.sparse / numpy.array / gensim sequence of 2-tuples). It doesn't matter what methods were used to create these vectors -- a sparse vector from LDA is the same format as a sparse vector from BoW. So before writing any tests, please fix that. This API is not good.
@piskvorky , I've changed my functions to keep what you said in mind. Could you please have a look again before I write any tests? The functions now take in an LDA model object only to check the number of topics. If the input is in a different format and no model object is passed, it will still work, as long as the lengths of the two passed vectors are the same.
No models (transformations) should be part of the API. Only the end result (vector) matters, not the transformations used to get there. If a function needs the number of features (~vector length), then let users pass that in as a parameter.
Noted. I've changed Hellinger so that it doesn't need to accept any inputs apart from the two vectors. Apologies for the constant revisions and mistakes! edit: There's quite a bit of code duplication; will remove.
Yeah, the interface looks cleaner, thanks @bhargavvader . I think it's reviewable now; I'll add my comments soon. Please add the tests, especially around various corner cases (empty inputs, inputs of different types, zero-length inputs, etc.).
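The vectors-only API described above could be supported by a small coercion helper along these lines. This is a minimal sketch, not gensim's actual code: the name `to_dense` and its exact behaviour are assumptions, but it shows how scipy.sparse matrices, gensim-style (id, value) 2-tuples, and plain dense sequences can all be normalized before any metric runs.

```python
import numpy as np
import scipy.sparse

def to_dense(vec, length=None):
    """Coerce a vector in any supported format into a 1-d numpy array.

    Handles scipy.sparse matrices, gensim-style sparse vectors
    (lists of (id, value) 2-tuples), and plain dense sequences.
    Illustrative sketch only -- not gensim's actual implementation.
    """
    if scipy.sparse.issparse(vec):
        return np.asarray(vec.todense()).ravel()
    if isinstance(vec, list) and vec and isinstance(vec[0], tuple):
        # gensim sparse format: infer the length if the caller didn't pass it
        if length is None:
            length = max(idx for idx, _ in vec) + 1
        dense = np.zeros(length)
        for idx, value in vec:
            dense[idx] = value
        return dense
    return np.asarray(vec, dtype=float)
```

With a helper like this, the distance functions themselves never need to know which model produced the vectors.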
Uses the scipy.stats.entropy method to identify kullback_leibler convergence value.
If the distribution draws from a certain number of docs, that value must be passed.
"""
if scipy.sparse.issparse(vec1) and scipy.sparse.issparse(vec2):
What happens when one input is sparse, the other dense?
As of now, it'll go straight down to the final else condition, which throws an error because matutils.sparse2full doesn't work well with some scipy.sparse matrices. Would you suggest checking for cases when either is sparse, then converting to dense, and going ahead?
Yes, if the performance is good (i.e. not much slowdown, let's say not more than 20%), we can do that. Another option: more code paths for the various input-type combinations.
I've made it so that all input vectors are converted to dense and then to lists before working on them, if they aren't already dense/lists. This solves the problem of input vectors being in different formats. As for performance, since the vectors were already eventually being converted to lists, it doesn't really hurt. It also makes the code look a lot cleaner.
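The "densify everything up front" approach described above can be sketched for the KL case like this. This is an illustrative sketch, not the PR's actual code; it only assumes `scipy.stats.entropy`, which (as the docstring earlier notes) computes the KL divergence when given two distributions.

```python
import numpy as np
from scipy import sparse
from scipy.stats import entropy

def kullback_leibler(vec1, vec2):
    """KL divergence D(vec1 || vec2) between two discrete distributions.

    Both inputs are densified up front, mirroring the
    "convert to dense first" approach discussed above.
    Sketch only -- not gensim's exact implementation.
    """
    if sparse.issparse(vec1):
        vec1 = np.asarray(vec1.todense()).ravel()
    if sparse.issparse(vec2):
        vec2 = np.asarray(vec2.todense()).ravel()
    # With two arguments, scipy.stats.entropy computes sum(p * log(p / q)),
    # normalizing each input to sum to 1 first.
    return entropy(np.asarray(vec1, dtype=float), np.asarray(vec2, dtype=float))
```

Because both inputs pass through the same coercion, mixed sparse/dense calls no longer hit a special-cased branch.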
@tmylk , @piskvorky - could you have a look at the methods once? I've also added an
""" | ||
if scipy.sparse.issparse(vec): | ||
vec = vec.todense().tolist() | ||
for item in vec: |
Enough to check one item (the first one); a linear scan is too slow.
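A first-element check in that spirit might look like the following. The helper name `is_bow` and the exact checks are hypothetical (a sketch of the suggestion, not gensim's code): peek at the first item only, rather than scanning the whole vector.

```python
import scipy.sparse

def is_bow(vec):
    """Cheaply guess whether vec is a gensim bag-of-words vector,
    i.e. a sequence of (int id, numeric value) 2-tuples.

    Per the review suggestion, only the first element is inspected --
    a full linear scan over the vector would be needless O(n) work.
    Hypothetical helper, illustrative only.
    """
    if scipy.sparse.issparse(vec):
        return False
    try:
        first = next(iter(vec))
    except (StopIteration, TypeError):
        return True  # treat an empty sequence as trivially valid BoW
    try:
        token_id, _value = first
    except (TypeError, ValueError):
        return False  # first item isn't a 2-tuple
    return isinstance(token_id, int)
```

The trade-off is that a malformed entry later in the vector goes undetected, which is acceptable for a cheap format sniff.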
@bhargavvader Ping. Are you planning to incorporate the latest suggestions?
@tmylk , yeah, I'll try to wrap this up along with the tests by next week.
@piskvorky , could you look at my
else:
    sim = numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
return sim
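Pulled out as a standalone function, the dense-input branch above is easy to sanity-check: the Hellinger distance of identical distributions is 0, of disjoint ones is 1, and it is symmetric. A minimal sketch (standalone wrapper assumed; the formula itself is the one from the diff):

```python
import numpy

def hellinger(vec1, vec2):
    # Same formula as the dense-input branch in the diff above.
    vec1 = numpy.asarray(vec1, dtype=float)
    vec2 = numpy.asarray(vec2, dtype=float)
    return numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2)) ** 2).sum())

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # identical distributions -> 0.0
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # disjoint distributions -> 1.0
```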
PEP8: two blank lines between module-level methods.
If you addressed my previous comments, then it's good to go! Thanks @bhargavvader . @tmylk please check that the tests are thorough, plus maybe do a few manual sanity checks (outside of gensim) to verify no bug in the "math business logic" slipped through.
@piskvorky , made the PEP8 changes and added comments before the code blocks; should be neater now. :)
@tmylk , @piskvorky , I've added a Python notebook to accompany the methods. Functionality like this sometimes gets overlooked, so I've added some examples so users know where they can use it. Made changes to the CHANGELOG too, so if the math business logic is alright, it's ready to merge from my end. :)
Made a few documentation changes - while Jaccard is a similarity measure, Hellinger and KL are distance metrics. Will make changes in the notebook to reflect the same, and add a small bit on similarities/distances between topic distributions as well as document-topic distributions.
It is best if they are all either distances or similarities.
What's best depends on the range of values. I think most people expect similarity to be
@tmylk , @piskvorky , will shift to using distances for all 3, in the range
@tmylk , @piskvorky , made changes to the notebook to reflect that they are distance metrics and not otherwise. Added an example to find the distance between two topics' word distributions and a few documents' topic distributions. Could you review it to see if there's any ambiguity in the docstrings, comments, and math business logic?
Are these general distance functions, or truly metrics in the technical sense (satisfying the triangle inequality)?
Hellinger and Jaccard are metrics in the technical sense, but KL is not.
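The point about KL is easy to demonstrate numerically: a metric must be symmetric, and KL divergence is not. A small check (the example distributions are made up; `scipy.stats.entropy` with two arguments computes D(p || q)):

```python
from scipy.stats import entropy

p, q = [0.9, 0.1], [0.5, 0.5]
d_pq = entropy(p, q)  # D(p || q)
d_qp = entropy(q, p)  # D(q || p)

# KL is asymmetric, violating a basic metric axiom, while
# Hellinger and Jaccard are symmetric by construction.
print(d_pq, d_qp)
```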
More work on the ipynb is needed, but merging in the code to include it in the 0.13.0 release.
Cool, will update notebook in different PR.
@tmylk , @piskvorky , I've created 3 different metric functions: Hellinger and Kullback-Leibler for probability distributions, and Jaccard similarity for Bag of Words representations. I have a few questions about possible changes, particularly:
The Hellinger and Kullback-Leibler methods right now only take 2-tuple representations (in numpy array, scipy sparse, or gensim vector representations). Is that ok? Most probability distributions passed should look like that. I can improve them to accept dense numpy arrays on input as well, if it is expected that people will pass those into the function.
The Kullback-Leibler uses the scipy.stats.entropy method, which fails if any of the passed probabilities is 0. For most cases we deal with, this should not be a problem, but if we want to allow 0s in the input, I will have to rewrite the Kullback-Leibler calculation myself.
Jaccard similarity is primarily used for bag-of-words representations, so I haven't bothered to include other kinds of input formats.
Is this approach for adding similarity functions ok? If it is, I can make the changes you'd suggest, add test cases, and add more of these functions if needed.
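For the bag-of-words case mentioned above, a set-based Jaccard sketch could look like this. The function name and the set-of-token-ids semantics are assumptions for illustration; gensim's actual implementation may differ (e.g. by weighting with counts).

```python
def jaccard_distance(bow1, bow2):
    """Jaccard distance between two bag-of-words vectors
    (lists of (token_id, count) 2-tuples).

    Set semantics over token ids: 1 - |A & B| / |A | B|.
    Illustrative sketch only.
    """
    ids1 = {token_id for token_id, _count in bow1}
    ids2 = {token_id for token_id, _count in bow2}
    union = ids1 | ids2
    if not union:
        return 0.0  # two empty documents: define distance as 0
    return 1.0 - len(ids1 & ids2) / len(union)
```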
This is related to issue #64.
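On the zero-probability question raised above: scipy.stats.entropy returns infinity when the second distribution has a zero where the first does not. One common workaround (illustrative, not necessarily what the PR should do) is additive smoothing; the `eps` value here is an assumed choice.

```python
import numpy as np
from scipy.stats import entropy

def kl_with_smoothing(p, q, eps=1e-12):
    """KL divergence with additive smoothing, so zero entries in q
    don't produce an infinite result. The eps choice is illustrative."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    # Renormalize so both remain valid probability distributions.
    return entropy(p / p.sum(), q / q.sum())

# Without smoothing, a zero in q where p is nonzero gives infinity:
print(entropy([0.5, 0.5], [1.0, 0.0]))        # inf
print(kl_with_smoothing([0.5, 0.5], [1.0, 0.0]))  # finite
```

Whether smoothing is acceptable depends on the application; it slightly biases the result, so documenting the behaviour for zero inputs may be preferable to silently smoothing.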