Default datatype when using CountVectorizer and HashingVectorizer should be sparse COO #952

Open
alejohz opened this issue Nov 15, 2022 · 0 comments

alejohz commented Nov 15, 2022

As mentioned in a related issue, the scipy sparse matrix classes support few, if any, of the functions in the Dask Array module. For example, calling da.sum(sparse_matrix, axis=0) or sparse_matrix.sum() raises TypeError: _cs_matrix.sum() got an unexpected keyword argument 'keepdims'.

CountVectorizer and HashingVectorizer both return Dask arrays whose blocks are of the scipy.sparse.csr_matrix data type.

To interact with those blocks, one has to convert them using the sparse COO module. See https://docs.dask.org/en/latest/array-sparse.html. This fixed a problem I had spent multiple hours trying to correct.

The default datatype should be sparse.COO, even at the cost of adding a dependency, because the result would be more Dask-like and manageable.

@jrbourbeau jrbourbeau transferred this issue from dask/dask Nov 16, 2022