-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse support when num_best not None in interfaces. Fixes #1294 #1321
Conversation
oops, forgot to add a whitespace after comma. |
Thank you @manneshiva, at first, please write unit-test for check that there is no exception (you can rewrite this code from issue as unittest) and write test for |
@menshikh-iv Thanks for your suggestions. I will write the unit-test asap. Regarding this comment, it is handled for dense inputs. For other(non-full) cases, we have two options:
I chose method (1), because I feel that in case of sparse vectors, length of the sparse 2-tuples << num_documents and thus sorting these tuples directly will take lesser time(which isn't the case with full inputs and therefore handled separately). Does this make sense? Would you rather recommend method (2) ? |
gensim/matutils.py
Outdated
if isinstance(vec, np.ndarray): | ||
return full2sparse_clipped(vec, topn, eps) | ||
vec_sparse = any2sparse(vec, eps) | ||
return sorted(vec_sparse, key=lambda x: x[1], reverse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please only return topn
Could you please benchmark the performance of 1 and 2 for, say, 1% of non-zero elements in a 10k long array? |
@menshikh-iv @tmylk In the process of re-writing the code from the issue for the unit test, I realized a fault in my/our understanding of the problem.
As seen from this unit test, 'maintain_sparsity' indicates the type of the output(whether scipy sparse or not) and has nothing to do with the type of the input. In the case when num_best is not None, the output is expected to be a list of reverse sorted tuples(index,similarity). So basically, the problem is not the sparse input but rather asking for a sparse output when num_best=some_n. It does not make sense to return a scipy sparse matrix in the case when num_best is not None and maintain_sparsity is True, which is exactly what was assumed in the present code. The fix for this problem is to simply raise an error when trying to set maintain_sparsity to True and num_best=n. Should have caught this before. Or am I missing something? |
The goal of this issue was to support |
@tmylk Suppose we were to support
The sparse output is the same as the dense output as seen above. But it is not possible to convert this list of sparse rows to a sparse matrix since the shape of each row is different(depends on the highest index in topn indices). Is it ok to return a list of sparse rows? What is the expected output? |
@tmylk Found a way to return a sparse csr_matrix (possible even with variable shape of each row) instead of list of sparse rows as mentioned. But it loses the ordering of the values in each row. |
@tmylk Tried and tested a way to create a sparse scipy matrix which preserves the order. So the current output will be a scipy sparse order preserving matrix. Will code it and push the changes soon. Thanks for your patience. |
@tmylk I have fixed the issue with the following changes:
The output type (for |
@tmylk Let me know if any changes are required. |
gensim/test/test_similarities.py
Outdated
"""Tests that sparsity is correctly maintained when maintain_sparsity=True and num_best is not None""" | ||
num_features = len(dictionary) | ||
|
||
index = self.cls(corpus, num_features=num_features, num_best=3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicit maintain_sparsity=False
will add clarity.
@manneshiva I am concerned about the call to Please preserve sparsity when selecting num_best |
gensim/interfaces.py
Outdated
@@ -222,6 +222,12 @@ def __getitem__(self, query): | |||
if self.num_best is None: | |||
return result | |||
|
|||
# if maintain_sparity is True, result is scipy sparse. Sort, clip the | |||
# topn and return as a scipy sparse matrix. | |||
if hasattr(self, 'maintain_sparsity'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getattr
accepts a default value.
gensim/matutils.py
Outdated
@@ -166,6 +166,39 @@ def any2sparse(vec, eps=1e-9): | |||
return [(int(fid), float(fw)) for fid, fw in vec if np.abs(fw) > eps] | |||
|
|||
|
|||
def scipy2scipy_clipped(matrix, topn, eps=1e-9): | |||
""" | |||
Return a scipy.sparse vector/matrix consisting of 'topn' elements of greatest magnitude(abs). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of the greatest magnitude (absolute value)
.
gensim/matutils.py
Outdated
# Return clipped sparse vector if input is a sparse vector. | ||
if matrix.shape[0] == 1: | ||
# use np.argpartition/argsort and only form tuples that are actually returned. | ||
biggest = argsort(abs(matrix).data, topn, reverse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abs(matrix.data)
should be much faster than abs(matrix).data
(numpy vs scipy).
gensim/matutils.py
Outdated
matrix_indptr = [0] | ||
for v in matrix: | ||
# Sort and clip each row vector first. | ||
biggest = argsort(abs(v).data, topn, reverse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dtto.
Even faster, but slightly hairier, to call abs
just once, on the entire matrix data, rather than once per row.
@tmylk cc @piskvorky Will push the required changes soon.
I am not sure about what you mean. matrix.data returns all nonzero values and not the full length(#docs) array thus still preserves sparsity. Am I missing something? |
Please try run code from issue and fix this problem @manneshiva
|
@menshikh-iv Made required changes and tested on code from issue ( |
Thank you @manneshiva |
…one. Fix piskvorky#1294 (piskvorky#1321) * added any2sparse_clipped() function * changed full2sparse_clipped to any2sparse_clipped in __getitem__ * added missing whitespace * return topn from any2sparse_clipped() * efficient any2sparse_clipped implementation * added unit test for any2sparse_clipped * function call corrected * removed any2sparse_clipped and added scipy2scipy_clipped * added new code path for maintain_sparsity * added unit tests for new function and issue * fixed flake8 errors * fixed matrix_indptr * added requested changes * replaced hasattr with getattr * call abs() once for entire matrix in scipy2scipy_clipped * removed matrix.sort_indices and removed indptr while calling argsort
Fixes #1294. Added a new function ( any2sparse_clipped() ), in matutils by extending any2sparse() to return the topn elements of greatest magnitude. Also, replaced full2sparse_clipped in interfaces.py to any2sparse_clipped to handle sparse cases when num_best isn't None.
I am fairly new to contributing to open source and so do not have much idea about testing my commits. I have only tested/verified against the code to reproduce the issue and it seemed to be working. Let me know in case I need to do anything else.