
Require the dtype of SparseArray.fill_value and sp_values.dtype to match #23124

Closed
TomAugspurger opened this issue Oct 13, 2018 · 8 comments · Fixed by #53043
Labels: Dtype Conversions, Enhancement, Sparse

@TomAugspurger (Contributor)

This has confused me for a while: we apparently (sometimes?) allow a fill_value whose dtype is not the same as sp_values.dtype.

In [20]: a = pd.SparseArray([1, 2, 3], fill_value=1.0)

In [21]: a
Out[21]:
[1.0, 2, 3]
Fill: 1.0
IntIndex
Indices: array([1, 2], dtype=int32)

In [22]: a.sp_values.dtype
Out[22]: dtype('int64')

This can lead to confusing behavior when doing operations.
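
For illustration, a minimal sketch of where the mismatch bites, using the era's constructor behavior (exactly what this issue proposes to disallow); the last print is left uncommented because the resulting dtype is the point of contention:

import numpy as np
import pandas as pd

a = pd.SparseArray([1, 2, 3], fill_value=1.0)

# The stored values and the fill value disagree on dtype:
print(a.sp_values.dtype)   # int64
print(type(a.fill_value))  # float

# Any operation that must materialize fill_value alongside sp_values
# (densification, ufuncs, comparisons) has to silently pick one of the
# two dtypes -- that choice is where the confusion comes from.
print(np.asarray(a).dtype)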

I suspect a primary motivation was supporting sparse integer values with NaN for a fill value. We should investigate what's tested, part of the API, and useful to users.

TomAugspurger added the Dtype Conversions and Sparse labels on Oct 13, 2018
TomAugspurger added this to the 0.24.0 milestone on Oct 13, 2018
@jorisvandenbossche (Member)

> we apparently (sometimes?) allow a fill_value whose dtype is not the same as sp_values.dtype

It was indeed allowed in some places, but in previous releases I think it was mostly explicitly disallowed:

(examples are using 0.23.4)

In [5]: a = pd.SparseArray([1, 0, 0, 1, 0])

In [6]: a
Out[6]: 
[1, 0, 0, 1, 0]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)

In [7]: a.fill_value = np.nan
...
ValueError: unable to set fill_value nan to int64 dtype
In [8]: a = pd.SparseArray([1, np.nan, np.nan, 1, np.nan])

In [9]: a
Out[9]: 
[1.0, nan, nan, 1.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

In [10]: a.astype(int)
...
ValueError: unable to coerce current fill_value nan to int64 dtype

One place where we did not check it was the constructor. However, I think it is only possible to create an array with an incompatible fill_value that way if the array is fully dense?

In [11]: a = pd.SparseArray([1, 0, 0, 1], fill_value=np.nan)

In [12]: a
Out[12]: 
[1, 0, 0, 1]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

That is because if there is already a np.nan in the list, the dtype of the data will be float64, and np.nan is then not an incompatible fill_value.
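
A minimal illustration of that inference (the same mechanism as In [8] above):

import numpy as np
import pandas as pd

# The nan in the input forces float64 inference for the data,
# so np.nan is a perfectly compatible fill_value here:
a = pd.SparseArray([1, np.nan, 0, 1], fill_value=np.nan)
print(a.sp_values.dtype)  # float64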


On master we now allow this, but this gives rise to inconsistencies:

(examples are using master now)

In [2]: a = pd.SparseArray([1, 0, 0, 1])

In [3]: a.fill_value = np.nan

In [4]: a
Out[4]: 
[1, nan, nan, 1]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

In [5]: a.dtype
Out[5]: Sparse[int64, nan]

In [6]: a.to_dense()
Out[6]: 
array([                   1, -9223372036854775808, -9223372036854775808,
                          1])

In [7]: np.array(a)
Out[7]: array([  1.,  nan,  nan,   1.])

In [8]: np.array(a).dtype
Out[8]: dtype('float64')

So you could of course say that to_dense is incorrect (and should fill with np.nan), but on the other hand I also find it inconsistent that the dtype of the densified array differs from the dtype of the sparse array.
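
For comparison, a manual densification that honors the fill value by upcasting, mirroring what np.array(a) does above; a sketch against the master of that era, where the fill_value setter still allowed the mismatch:

import numpy as np
import pandas as pd

a = pd.SparseArray([1, 0, 0, 1])
a.fill_value = np.nan  # allowed on that era's master, as shown above

dense = np.full(len(a), a.fill_value)    # float64, because the fill is nan
dense[a.sp_index.indices] = a.sp_values  # put the stored ints back
print(dense, dense.dtype)                # [ 1. nan nan  1.] float64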

And #23125 about astype is, I think, a related issue.

@jorisvandenbossche (Member)

> I suspect a primary motivation was supporting sparse integer values with NaN for a fill value. We should investigate what's tested, part of the API, and useful to users.

Do you have an example of how to create such an integer sparse array?

Even when constructing it explicitly by passing integer sp_values and a sp_index, the result has float dtype:

In [32]: a = pd.SparseArray([0, 1, 0, 1])

In [33]: a.sp_values
Out[33]: array([1, 1])

In [34]: a.sp_values.dtype
Out[34]: dtype('int64')

In [35]: new_a = pd.SparseArray(a.sp_values, sparse_index=a.sp_index, fill_value=np.nan)

In [36]: new_a.dtype
Out[36]: dtype('float64')

So my suspicion is that it was actually not meant to be supported (but of course I am only guessing here, as I never used it).


So for me the main question is: what is the use case for allowing this? Why allow the potential confusion, instead of adding a few checks?

And the "integer with NA" use case you mention is indeed one possible answer. But since (I think, see above) this is a new possibility, and not something kept for backwards compatibility, I am personally not sure we should allow it for that reason.

@TomAugspurger (Contributor, Author)

Hmm, thanks for testing that out... I thought that additional tests broke, but I could be misremembering.

I do agree that we shouldn't be adding new functionality. If the only thing that breaks is the constructor, i.e.

In [13]: a = pd.SparseArray([1, 0, 0, 1], fill_value=np.nan)

In [14]: a
Out[14]:
[1, 0, 0, 1]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

then I'm OK with not allowing that, because it isn't actually saving any memory in that case.
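
A quick way to see that, on the behavior of the time (sp_index.npoints counts the stored values):

import numpy as np
import pandas as pd

a = pd.SparseArray([1, 0, 0, 1], fill_value=np.nan)

# No element equals the fill_value, so every element is stored:
# a full-length values array plus a full-length index buffer.
print(a.sp_index.npoints, len(a))  # 4 4 -- nothing is compressed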

@jorisvandenbossche (Member)

> Hmm, thanks for testing that out... I thought that additional tests broke, but I could be misremembering.

I might be wrong of course, I only tried out some obvious things. It might be good to verify that with the tests.

@TomAugspurger (Contributor, Author)

FYI, I think this will require a change in DataFrame.to_sparse. Right now the default fill_value is just np.nan; when you have e.g. an int array, you can't have np.nan as your fill_value.

So I think we need to change the meaning of fill_value=None in to_sparse: instead of always being nan, it will be na_value_for_dtype(dtype) for each column.
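
For reference, a sketch of what that per-dtype resolution gives (na_value_for_dtype is a pandas-internal helper, so treat the import path as illustrative rather than public API):

import numpy as np
from pandas.core.dtypes.missing import na_value_for_dtype

print(na_value_for_dtype(np.dtype("float64")))         # nan
print(na_value_for_dtype(np.dtype("int64")))           # 0 (with the default compat=True)
print(na_value_for_dtype(np.dtype("datetime64[ns]")))  # NaT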

I think we'll also need to update the API to allow fill_value to be a mapping of {column: fill_value}, for when you have a mixed dataframe and want a different fill value per column.
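
And a hypothetical usage sketch of that mapping API (not implemented; the column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"ints": [1, 0, 0], "floats": [1.5, np.nan, np.nan]})

# Hypothetical: per-column fill values for a mixed-dtype frame.
sdf = df.to_sparse(fill_value={"ints": 0, "floats": np.nan})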

@mzeitlin11 (Member)

Somewhat xref #39584

@bdrum (Contributor) commented Jan 20, 2022

take

@bdrum (Contributor) commented Jan 22, 2022

@TomAugspurger

Now we have a situation where SparseDtype keeps both the type of the non-zero elements and the VALUE of fill_value.
Why do we need to keep fill_value in the SparseDtype?
I think there is absolutely no reason to do this.

The crucial thing about a sparse array (as I understand it, but I'm not a specialist in this topic) is only the non-zero values and the indices.
It does not matter what exact fill_value we have, zero, nan, or another, because we keep in memory only the array of non-zero elements and their positions, i.e. the indices.
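
A quick demonstration of that coupling: two arrays with identical stored values but different fill values end up with different dtypes.

import pandas as pd

# SparseDtype encodes both the subtype and the fill value:
print(pd.SparseArray([1, 0, 1]).dtype)                 # Sparse[int64, 0]
print(pd.SparseArray([1, 0, 1], fill_value=-1).dtype)  # Sparse[int64, -1]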

According to the Julia SparseVector docs:

In Julia, sparse vectors have the type SparseVector{Tv,Ti} where Tv is the type of the stored values and Ti the integer type for the indices. The internal representation is as follows:

struct SparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
    n::Int              # Length of the sparse vector
    nzind::Vector{Ti}   # Indices of stored values
    nzval::Vector{Tv}   # Stored values, typically nonzeros
end

It looks more rational to go this way.

I mentioned this issue in #45126, and it looks like an API issue. So I suggest closing this one and concentrating on #45126.

mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022