Default datatype when using CountVectorizer and HashingVectorizer should be sparse COO #952

alejohz · 2022-11-15T23:57:43Z

As mentioned in a related issue, the scipy sparse matrix classes support few of the functions in the Dask Array module. For example, both da.sum(sparse_matrix, axis=0) and sparse_matrix.sum() raise TypeError: _cs_matrix.sum() got an unexpected keyword argument 'keepdims'.

CountVectorizer and HashingVectorizer both return blocks of type scipy.sparse._csr.csr_matrix.

To work with those blocks, one has to convert them with the sparse COO module (see https://docs.dask.org/en/latest/array-sparse.html). This fixed a problem I had spent multiple hours trying to correct.

The default datatype should be sparse._coo.core.COO, even at the cost of an added dependency, because the result would be more Dask-like and manageable.


github-actions bot added the needs triage label Nov 15, 2022

jrbourbeau transferred this issue from dask/dask Nov 16, 2022
