Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix return dtype for matutils.unitvec according to input dtype. Fix #1722 #1992

Merged
merged 30 commits into from
Apr 16, 2018
Merged

Conversation

o-P-o
Copy link
Contributor

@o-P-o o-P-o commented Mar 21, 2018

This returns the float32 dtypes for float32 inputs for both normal and sparse arrays. The output unit vectors are also the same as those outputted from doing manual unit vector calculations.

How does this look? Any further suggestions for improvement or why this isn't acceptable?

o-P-o added 5 commits January 31, 2018 20:59
As requested, I have edited the fix to ignore dtype size. I use np.issubtype to check input type and handle appropriately before return to ensure non-integer output.
Tests to ensure float output for both float and integer inputs.
Parameters
----------
vec : {numpy.ndarray, scipy.sparse, list of (int, float)}
Input vector in any format
norm : {'l1', 'l2'}, optional
Normalization that will be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please return empty lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if scipy.sparse.issparse(vec):
vec = vec.tocsr()
if norm == 'l1':
veclen = np.sum(np.abs(vec.data))
if norm == 'l2':
veclen = np.sqrt(np.sum(vec.data ** 2))
if veclen > 0.0:
return vec / veclen
if np.issubdtype(vec.dtype, np.int) == True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to == True

else:
return vec

if isinstance(vec, np.ndarray):
vec = np.asarray(vec, dtype=float)
vec = np.asarray(vec, dtype=vec.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this have no sense (because later you'll cast it again)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - this pretty much seems like a no-op, effectively. Does vec = np.asarray(vec, dtype=vec.dtype) have any effect on vec at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I forgot about this - I will remove it on the next commit.

class UnitvecTestCase(unittest.TestCase):

def manual_unitvec(self, vec):
self.vec = vec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use 4 spaces for indentation

@@ -0,0 +1,27 @@
import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created pull request there

self.vec /= np.sqrt(sum_vec_squared)
return self.vec

def test_unitvec(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's about different vectors (sparse) + different types (more floats + int too)?

@menshikh-iv
Copy link
Contributor

@o-P-o also, please resolve merge-conflict

@menshikh-iv
Copy link
Contributor

CC: @jayantj please have a look

@@ -667,45 +667,54 @@ def ret_log_normalize_vec(vec, axis=1):

def unitvec(vec, norm='l2'):
"""Scale a vector to unit length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't make unrelated changes, this empty line(s) is correct by docstring convention.

else:
return vec

if isinstance(vec, np.ndarray):
vec = np.asarray(vec, dtype=float)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this line is needed especially for blas_* calls, please return it

@@ -0,0 +1,187 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks strange, can you merge fresh develop to current branch please?

msg = "dirichlet_expectation_2d failed for dtype={}".format(dtype)
self.assertTrue(np.allclose(known_good, test_values), msg)

class UnitvecTestCase(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's a name?

return self.vec

def test_inputs(self):
input_dtypes = [np.float32, np.float64, np.int32, np.int64]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add float and int (not numpy, python default)


def test_inputs(self):
input_dtypes = [np.float32, np.float64, np.int32, np.int64]
input_arrtypes = ['sparse', 'normal']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dense better name (instead of normal)

input_vector = np.random.randint(10, size=5).astype(dtype_)
unit_vector = unitvec_with_bug.unitvec(input_vector)
man_unit_vector = self.manual_unitvec(input_vector)
self.assertTrue(np.allclose(unit_vector, man_unit_vector))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should check the dtype for this case too (you know that this will be default)

@menshikh-iv
Copy link
Contributor

@o-P-o please resolve merge-conflict first before you start to change a code according to my review.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -668,7 +668,7 @@ def ret_log_normalize_vec(vec, axis=1):

def unitvec(vec, norm='l2', return_norm=False):
"""Scale a vector to unit length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no leading spaces (here and everywhere)

if scipy.sparse.issparse(vec):
vec = vec.tocsr()
if norm == 'l1':
veclen = np.sum(np.abs(vec.data))
if norm == 'l2':
veclen = np.sqrt(np.sum(vec.data ** 2))
if veclen > 0.0:
if return_norm:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert existing code please (you shouldn't remove it)

o-P-o added 11 commits March 28, 2018 00:18
Removed leading spaces, which is the source of the PEP8/travis errors. Sorry, only just learnt from you what these actually are :)
Also updated the code to include 'if return_norm' statement from the sparse array case. (I can't remember why I actually removed this in the first place.)
Copy link
Contributor

@jayantj jayantj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @o-P-o thanks a lot for submitting the PR! Good work.
I've left a few comments, could you please fix the issues raised so that we can go ahead and merge?

self.vec /= np.sqrt(sum_vec_squared)
return self.vec

def test_inputs(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should split this test into multiple tests (one per combination of (arr_type, dtype) maybe) so the logic is simpler and so that it's possible to look at a failing test log and see exactly what kind of input caused the test to fail.

@@ -141,6 +142,43 @@ def testDirichletExpectation(self):
self.assertTrue(np.allclose(known_good, test_values), msg)


class UnitvecTestCase(unittest.TestCase):
# test unitvec
def manual_unitvec(self, vec):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely should be simplified - why use self at all?
IMO should just be -

vec = vec.astype(np.float)
if sparse.issparse(vec):
    vec_sum_of_squares = vec.multiply(vec)
    unit = 1. / np.sqrt(vec_sum_of_squares.sum())
    return vec.multiply(unit)
else:
    sum_vec_squared = np.sum(vec ** 2)
    vec /= np.sqrt(sum_vec_squared)
    return vec

if norm == 'l1':
veclen = np.sum(np.abs(vec))
if norm == 'l2':
veclen = blas_nrm2(vec)
if veclen > 0.0:
if return_norm:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire construct (and a similar construct above) seems to have some unnecessary redundant code. We could simplify to something like -

if veclen > 0.0:
    if np.issubdtype(vec.dtype, np.int):
        vec = vec.astype(np.float)
    if return_norm:
        return blas_scal(1.0 / veclen, vec).astype(vec.dtype), veclen
    else:
        return blas_scal(1.0 / veclen, vec).astype(vec.dtype)

o-P-o added 4 commits April 12, 2018 19:58
This is what I have done based on Jayanti's suggestion of redundant code. Let me know if I have misunderstood.
Simplified tha manual_unitvec method and created a separate test for each `arrtype, dtype` pair, as suggested.
vec = vec.astype(np.float) / veclen
else:
vec /= veclen
vec = vec.astype(vec.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly confusing. For consistency, I'd prefer -

if np.issubdtype(vec.dtype, np.int):
    vec = vec.astype(np.float) / veclen
else:
    vec = vec.astype(vec.dtype) / veclen

Does that make sense?
In fact, if dividing by veclen cannot change the dtype (is this the case?), even something like -

if np.issubdtype(vec.dtype, np.int):
    vec = vec.astype(np.float)
vec /= veclen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, sorry I'm really nitpicking here :)
But small things like this cause a slow decline in overall code quality over time.

Copy link
Contributor Author

@o-P-o o-P-o Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately dividing by veclen actually changes the dtype, causing the original problem to manifest itself (i.e. float32 inputs outputting float64). Even trying something like this:

if np.issubdtype(vec.dtype, np.int):
    vec = vec.astype(np.float) / veclen
else:
    vec = vec.astype(vec.dtype) / veclen.astype(vec.dtype)

causes the same problem. This is why I divide by veclen and then cast the dtype. Do you have any suggestions to get around this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I really like your second suggestion, and it passes all tests:

if np.issubdtype(vec.dtype, np.int):
    vec = vec.astype(np.float) 
vec /= veclen

That's very neat indeed, I'll commit it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, don't worry about the nitpicking! When I am experienced enough to pick up on such things I'm sure I will be nitpicking like this haha.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix and the explanation! Looks good.

return vec


class UnitvecTestCase(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reorganizing the tests! Looks much better now IMO

@menshikh-iv
Copy link
Contributor

Congratulation @o-P-o with first contribution 🥇

@menshikh-iv menshikh-iv changed the title Bug-fix attempt (matutils.unitvec): input float32 returning different dtype Fix return dtype for matutils.unitvec according to input dtype. Fix #1722 Apr 16, 2018
@menshikh-iv menshikh-iv merged commit 8daace2 into piskvorky:develop Apr 16, 2018
@o-P-o
Copy link
Contributor Author

o-P-o commented Apr 16, 2018

Thanks for assisting my first ever merge guys!

darindf pushed a commit to darindf/gensim that referenced this pull request Apr 23, 2018
…iskvorky#1722 (piskvorky#1992)

* matutils.unitvec bug

As requested, I have edited the fix to ignore dtype size. I use np.issubtype to check input type and handle appropriately before return to ensure non-integer output.

* matutils.unitvec fix tests

Tests to ensure float output for both float and integer inputs.

* unitvec equal input and return types

* Update and rename test_unitvec to test_unitvec.py

* Update matutils.py

* Update matutils.py

* Update test_unitvec.py

* Update and rename gensim/test_unitvec.py to gensim/test/test_matutils.py

* Update matutils.py

* Update test_matutils.py

* Update test_matutils.py

* Update following review

Removed leading spaces, which is the source of the PEP8/travis errors. Sorry, only just learnt from you what these actually are :)
Also updated the code to include 'if return_norm' statement from the sparse array case. (I can't remember why I actually removed this in the first place.)

* Update: attempt to solve Travis errors

* Update test_matutils.py

* Update matutils.py

* Update matutils.py

* Update test_matutils.py

* Addressing travis errors

* Remove unnecessary dtype assignment

* return_norm statements for array instance case

* Update test_matutils.py

* Reduce line repetition

* Reduce repeated lines

* Update test_matutils.py

* Remove some redundant code from unitvec

This is what I have done based on Jayanti's suggestion of redundant code. Let me know if I have misunderstood.

* UnitvecTestCase update

Simplified tha manual_unitvec method and created a separate test for each `arrtype, dtype` pair, as suggested.

* Small typo fix

* Trailing white-space fix for Travis

* Improve code quality and remove no-op
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants