
Require the dtype of SparseArray.fill_value and sp_values.dtype to match #23124

Closed
TomAugspurger opened this issue Oct 13, 2018 · 8 comments · Fixed by #53043
Labels: Dtype Conversions, Enhancement, Sparse

@TomAugspurger (Contributor)

This has confused me for a while: we apparently (sometimes?) allow a fill_value whose dtype is not the same as sp_values.dtype.

In [20]: a = pd.SparseArray([1, 2, 3], fill_value=1.0)

In [21]: a
Out[21]:
[1.0, 2, 3]
Fill: 1.0
IntIndex
Indices: array([1, 2], dtype=int32)

In [22]: a.sp_values.dtype
Out[22]: dtype('int64')

This can lead to confusing behavior when doing operations.
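
For illustration, a minimal sketch of where the mismatch bites, using the era's constructor behavior (exactly what this issue proposes to disallow); the last print is left uncommented because the resulting dtype is the point of contention:

import numpy as np
import pandas as pd

a = pd.SparseArray([1, 2, 3], fill_value=1.0)

# The stored values and the fill value disagree on dtype:
print(a.sp_values.dtype)   # int64
print(type(a.fill_value))  # float

# Any operation that must materialize fill_value alongside sp_values
# (densification, ufuncs, comparisons) has to silently pick one of the
# two dtypes -- that choice is where the confusion comes from.
print(np.asarray(a).dtype)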

I suspect a primary motivation was supporting sparse integer values with NaN for a fill value. We should investigate what's tested, part of the API, and useful to users.

TomAugspurger added the Dtype Conversions and Sparse labels on Oct 13, 2018
TomAugspurger added this to the 0.24.0 milestone on Oct 13, 2018
@jorisvandenbossche (Member)

> we apparently (sometimes?) allow a fill_value whose dtype is not the same as sp_values.dtype

It was indeed allowed in some places, but in previous releases I think it was mostly explicitly disallowed:

(examples are using 0.23.4)

In [5]: a = pd.SparseArray([1, 0, 0, 1, 0])

In [6]: a
Out[6]: 
[1, 0, 0, 1, 0]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)

In [7]: a.fill_value = np.nan
...
ValueError: unable to set fill_value nan to int64 dtype
In [8]: a = pd.SparseArray([1, np.nan, np.nan, 1, np.nan])

In [9]: a
Out[9]: 
[1.0, nan, nan, 1.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

In [10]: a.astype(int)
...
ValueError: unable to coerce current fill_value nan to int64 dtype

One place where we did not check it was the constructor. However, I think it is only possible to create an array with an incompatible fill_value that way if the array is fully dense?

In [11]: a = pd.SparseArray([1, 0, 0, 1], fill_value=np.nan)

In [12]: a
Out[12]: 
[1, 0, 0, 1]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

That is because if there is already a np.nan in the list, the dtype of the data will be float64, and np.nan is then not an incompatible fill_value.
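
A minimal illustration of that inference (the same mechanism as In [8] above):

import numpy as np
import pandas as pd

# The nan in the input forces float64 inference for the data,
# so np.nan is a perfectly compatible fill_value here:
a = pd.SparseArray([1, np.nan, 0, 1], fill_value=np.nan)
print(a.sp_values.dtype)  # float64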


On master we now allow this, but this gives rise to inconsistencies:

(examples are using master now)

In [2]: a = pd.SparseArray([1, 0, 0, 1])

In [3]: a.fill_value = np.nan

In [4]: a
Out[4]: 
[1, nan, nan, 1]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

In [5]: a.dtype
Out[5]: Sparse[int64, nan]

In [6]: a.to_dense()
Out[6]: 
array([                   1, -9223372036854775808, -9223372036854775808,
                          1])

In [7]: np.array(a)
Out[7]: array([  1.,  nan,  nan,   1.])

In [8]: np.array(a).dtype
Out[8]: dtype('float64')

So you could of course say that to_dense is incorrect (and should fill with np.nan), but on the other hand I also find it inconsistent that the dtype of the densified array differs from the dtype of the sparse array.
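
For comparison, a manual densification that honors the fill value by upcasting, mirroring what np.array(a) does above; a sketch against the master of that era, where the fill_value setter still allowed the mismatch:

import numpy as np
import pandas as pd

a = pd.SparseArray([1, 0, 0, 1])
a.fill_value = np.nan  # allowed on that era's master, as shown above

dense = np.full(len(a), a.fill_value)    # float64, because the fill is nan
dense[a.sp_index.indices] = a.sp_values  # put the stored ints back
print(dense, dense.dtype)                # [ 1. nan nan  1.] float64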

And #23125 about astype is, I think, a related issue.

@jorisvandenbossche (Member)

> I suspect a primary motivation was supporting sparse integer values with NaN for a fill value. We should investigate what's tested, part of the API, and useful to users.

Do you have an example of how to create such an integer sparse array?

Even when constructing it explicitly by passing integer sp_values and a sp_index, the result has float dtype:

In [32]: a = pd.SparseArray([0, 1, 0, 1])

In [33]: a.sp_values
Out[33]: array([1, 1])

In [34]: a.sp_values.dtype
Out[34]: dtype('int64')

In [35]: new_a = pd.SparseArray(a.sp_values, sparse_index=a.sp_index, fill_value=np.nan)

In [36]: new_a.dtype
Out[36]: dtype('float64')

So my suspicion is that it was actually not meant to be supported (but of course I am only guessing here, as I never used it).


So for me the main question is: what is the use case for allowing this? Why allow the potential confusion, instead of adding a few checks?

And the "integer with NA" use case you mention is indeed one possible answer. But since (I think, see above) this is a new possibility, and not something kept for backwards compatibility, I am personally not sure we should allow it for that reason.

@TomAugspurger (Contributor, Author)

Hmm, thanks for testing that out... I thought that additional tests broke, but I could be misremembering.

I do agree that we shouldn't be adding new functionality. If the only thing that breaks is the constructor, i.e.

In [13]: a = pd.SparseArray([1, 0, 0, 1], fill_value=np.nan)

In [14]: a
Out[14]:
[1, 0, 0, 1]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

then I'm OK with not allowing that, because it isn't actually saving any memory in that case.
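
A quick way to see that, on the behavior of the time (sp_index.npoints counts the stored values):

import numpy as np
import pandas as pd

a = pd.SparseArray([1, 0, 0, 1], fill_value=np.nan)

# No element equals the fill_value, so every element is stored:
# a full-length values array plus a full-length index buffer.
print(a.sp_index.npoints, len(a))  # 4 4 -- nothing is compressed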

@jorisvandenbossche (Member)

> Hmm, thanks for testing that out... I thought that additional tests broke, but I could be misremembering.

I might be wrong of course, I only tried out some obvious things. It might be good to verify that with the tests.

@TomAugspurger (Contributor, Author)

FYI, I think this will require a change in DataFrame.to_sparse. Right now the default fill_value is just np.nan; when you have e.g. an int array, you can't have np.nan as your fill_value.

So I think we need to change the meaning of fill_value=None in to_sparse: instead of always being nan, it will be na_value_for_dtype(dtype) for each column.
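
For reference, a sketch of what that per-dtype resolution gives (na_value_for_dtype is a pandas-internal helper, so treat the import path as illustrative rather than public API):

import numpy as np
from pandas.core.dtypes.missing import na_value_for_dtype

print(na_value_for_dtype(np.dtype("float64")))         # nan
print(na_value_for_dtype(np.dtype("int64")))           # 0 (with the default compat=True)
print(na_value_for_dtype(np.dtype("datetime64[ns]")))  # NaT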

I think we'll also need to update the API to allow fill_value to be a mapping of {column: fill_value}, for when you have a mixed dataframe and want a different fill value per column.
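
And a hypothetical usage sketch of that mapping API (not implemented; the column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"ints": [1, 0, 0], "floats": [1.5, np.nan, np.nan]})

# Hypothetical: per-column fill values for a mixed-dtype frame.
sdf = df.to_sparse(fill_value={"ints": 0, "floats": np.nan})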

@mzeitlin11 (Member)

Somewhat xref #39584

@bdrum (Contributor) commented Jan 20, 2022

take

@bdrum (Contributor) commented Jan 22, 2022

@TomAugspurger

Now we have a situation where SparseDtype keeps both the type of the non-zero elements and the VALUE of fill_value.
Why do we need to keep fill_value in the SparseDtype?
I think there is absolutely no reason to do this.

The crucial thing about a sparse array (as I understand it, but I'm not a specialist in this topic) is only the non-zero values and the indices.
It does not matter what exact fill_value we have, zero, nan, or another, because we keep in memory only the array of non-zero elements and their positions, i.e. the indices.
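
A quick demonstration of that coupling: two arrays with identical stored values but different fill values end up with different dtypes.

import pandas as pd

# SparseDtype encodes both the subtype and the fill value:
print(pd.SparseArray([1, 0, 1]).dtype)                 # Sparse[int64, 0]
print(pd.SparseArray([1, 0, 1], fill_value=-1).dtype)  # Sparse[int64, -1]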

According to the Julia SparseVector docs:

In Julia, sparse vectors have the type SparseVector{Tv,Ti} where Tv is the type of the stored values and Ti the integer type for the indices. The internal representation is as follows:

struct SparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
    n::Int              # Length of the sparse vector
    nzind::Vector{Ti}   # Indices of stored values
    nzval::Vector{Tv}   # Stored values, typically nonzeros
end

It looks more rational to go this way.

I mentioned this issue in #45126, and it looks like an API issue. So I suggest closing this one and concentrating on #45126.

mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022