[MRG] ENH: Make normalization an explicit transformation. #649
Conversation
@piskvorky @tmylk I've created
I've added initial tests for just checking the logic. Input validation testing and testing for the Similarity class constructor still remain.
```python
if scipy.sparse.issparse(vec):  # convert scipy.sparse to standard numpy array
    vec = vec.tocsr()
    veclen = numpy.sqrt(numpy.sum(vec.data ** 2))
    if norm == 'l0':
        veclen = len(vec.data)
```
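The quoted snippet can be run as a self-contained sketch; assuming NumPy and SciPy are available, and with `vector_length` as a hypothetical helper name (not gensim's actual API):

```python
# Self-contained sketch of the quoted logic; vector_length is a
# hypothetical helper, not part of gensim.
import numpy
import scipy.sparse

def vector_length(vec, norm='l2'):
    """Return the length used for normalization: the L2 norm, or the
    count of explicit non-zeros for the (since removed) 'l0' case."""
    vec = vec.tocsr()
    if norm == 'l0':
        return len(vec.data)  # number of explicitly stored non-zeros
    return numpy.sqrt(numpy.sum(vec.data ** 2))

v = scipy.sparse.csr_matrix([[3.0, 0.0, 4.0]])
print(vector_length(v))        # 5.0
print(vector_length(v, 'l0'))  # 2
```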
The L0 'norm' is not homogeneous, so it is impossible to scale a vector to unit L0 norm.
Yes, true... Could you please elaborate more on the scaling part? Will it affect efficiency, since as the corpus size increases we cannot use the fact that f(ax) = (a^k)f(x)? Should I remove the support for the zero "norm"?
Sorry to confuse you; I meant scaling in the mathematical, not the IT, sense. What number do we need to divide the vector a = (5, 5) by in order to have l0_norm(a) = 1?
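The homogeneity point can be illustrated with a small sketch, assuming plain NumPy arrays rather than gensim's sparse document format:

```python
# Scaling changes the L1/L2 norm proportionally, but leaves the
# L0 "norm" (the non-zero count) untouched, so no scalar can fix it.
import numpy as np

a = np.array([5.0, 5.0])
l2 = np.linalg.norm(a)            # ~7.071
l0 = np.count_nonzero(a)          # 2

# Dividing by the L2 norm yields a unit vector:
print(np.linalg.norm(a / l2))     # 1.0

# Dividing by any non-zero scalar leaves the non-zero count unchanged:
print(np.count_nonzero(a / 3.0))  # still 2
```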
Thanks for the explanation! I think it would be best to remove the L0 norm then and keep normalization just for L1 and L2. Should I proceed with that?
Yep. Thanks.
@tmylk I've removed the L0 norm and added some more tests to
Force-pushed from ef103b5 to 739d868.
@tmylk @piskvorky I've added further tests to cover all input types. I think this PR is ready for review.
Resolved merge conflicts.
@dsquareindia Hey, why did you close this pull request? It is a useful feature.
Sorry for that; I'll open it again.
Removed redundant use of corpus, rebased, and squashed. Could someone please review this? Thanks!
```python
return list(vec)
if norm == 'l1':
    length = float(sum(abs(val) for _, val in vec))
    assert length > 0.0, "Document contains all zero entries"
```
What is the reason for this check?
Sorry, I had just added that to make it symmetric with the L2 case. I should return the vector unchanged in the zero-length case, right? Should I proceed with the change?
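The proposed zero-length behavior could look roughly like this; a minimal sketch with a hypothetical `unitvec_l1` helper, assuming gensim-style `(id, value)` sparse documents:

```python
# Hypothetical sketch: return an all-zero document unchanged instead
# of asserting; normalize otherwise. Not the PR's actual code.
def unitvec_l1(vec):
    length = float(sum(abs(val) for _, val in vec))
    if length == 0.0:
        return list(vec)  # all-zero document: return it unchanged
    return [(termid, val / length) for termid, val in vec]

print(unitvec_l1([(0, 1.0), (3, 3.0)]))  # [(0, 0.25), (3, 0.75)]
print(unitvec_l1([]))                    # []
```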
```python
return ret_normalized_vec(vec, length)
if norm == 'l2':
    length = 1.0 * math.sqrt(sum(val ** 2 for _, val in vec))
    assert length > 0.0, "sparse documents must not contain any explicit zero entries"
```
Should I change this to returning the vector unchanged in case of zero length as we are doing with other input types?
@tmylk I have split the test suite into multiple tests to ease error checking. I have also removed the
@piskvorky could you please review this? I think it's ready for merge.
""" | ||
Calculates the norm by calling matutils.unitvec with the norm parameter. | ||
""" | ||
logger.info("Normalizing...") |
This log message is not very illuminating to users... try being more specific.
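One way to make the message more specific, as a sketch only; the `norm` variable here is a hypothetical stand-in for whatever attribute holds the chosen norm's name:

```python
# Hypothetical sketch of a more informative log line.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

norm = 'l2'  # hypothetical: the norm chosen for this transformation
logger.info("performing %s normalization", norm)
```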
Only did a quick scan, looks nicely done, but I'll leave the thorough review to @tmylk. My only nitpick would be
```python
self.num_nnz = numnnz
self.norms = norms

def normalize(self, bow):
```
@piskvorky should this rather be a `__getitem__()`, since a transformation here is defined as something which acts like `transformation[doc]`?
Sure. We can also let users use either: a named function or `[]`.
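Supporting both entry points could look like this; a minimal sketch in which `NormModel` and its internals are hypothetical names, not the PR's actual code:

```python
# Hypothetical sketch: expose both a named method and []-style access.
class NormModel:
    """Normalize gensim-style sparse documents ([(id, value), ...])."""

    def __init__(self, norm='l2'):
        self.norm = norm

    def normalize(self, bow):
        if self.norm == 'l1':
            length = float(sum(abs(val) for _, val in bow))
        else:
            length = sum(val ** 2 for _, val in bow) ** 0.5
        if length == 0.0:
            return list(bow)  # all-zero document: return unchanged
        return [(termid, val / length) for termid, val in bow]

    def __getitem__(self, bow):
        # delegate, so both model.normalize(doc) and model[doc] work
        return self.normalize(bow)

doc = [(0, 1.0), (1, 3.0)]
model = NormModel('l1')
print(model[doc])  # [(0, 0.25), (1, 0.75)]
```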
@tmylk @piskvorky I've addressed all the comments. Hopefully it's much better now! Could you please review :)
@tmylk @piskvorky thanks a ton for the reviews! 🍻
@piskvorky I was thinking of writing a notebook tutorial for this feature, but I can't think of a practical use case for it. What would be the best way to go about this?
Not sure... @tmylk? Maybe indexing some corpus with/without normalization, showing how the returned "most similar" docs (or words, or embeddings... basically any vectors) differ?
Addresses issue #69. What this PR includes:
TODO:
@piskvorky @tmylk Could you please tell me if I'm on the right track?