Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Made SparseDataFrame.fillna() fill all NaNs #16178

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 6 additions & 8 deletions pandas/core/sparse/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -595,14 +595,12 @@ def fillna(self, value, downcast=None):
if issubclass(self.dtype.type, np.floating):
value = float(value)

if self._null_fill_value:
return self._simple_new(self.sp_values, self.sp_index,
fill_value=value)
else:
new_values = self.sp_values.copy()
new_values[isnull(new_values)] = value
return self._simple_new(new_values, self.sp_index,
fill_value=self.fill_value)
new_values = self.sp_values.copy()
new_values[isnull(new_values)] = value
fill_value = value if isnull(self.fill_value) else self.fill_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value if self._null_fill_value else self.fill_value


return self._simple_new(new_values, self.sp_index,
fill_value=fill_value)

def sum(self, axis=0, *args, **kwargs):
"""
Expand Down
28 changes: 28 additions & 0 deletions pandas/tests/sparse/test_frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -1252,6 +1252,7 @@ def test_from_scipy_correct_ordering(spmatrix):
tm.skip_if_no_package('scipy')

arr = np.arange(1, 5).reshape(2, 2)

try:
spm = spmatrix(arr)
assert spm.dtype == arr.dtype
Expand All @@ -1267,6 +1268,33 @@ def test_from_scipy_correct_ordering(spmatrix):
tm.assert_frame_equal(sdf.to_dense(), expected.to_dense())


def test_from_scipy_object_fillna(spmatrix):
# GH 16112
tm.skip_if_no_package('scipy', max_version='0.19.0')

arr = np.eye(3)
arr[1:, 0] = np.nan

try:
spm = spmatrix(arr)
assert spm.dtype == arr.dtype
except (TypeError, AssertionError):
# If conversion to sparse fails for this spmatrix type and arr.dtype,
# then the combination is not currently supported in NumPy, so we
# can just skip testing it thoroughly
return

sdf = pd.SparseDataFrame(spm).fillna(-1.0)

# Returning frame should fill all nan values with -1.0
expected = pd.SparseDataFrame({0: {0: 1.0, 1: np.nan, 2: np.nan},
1: {0: np.nan, 1: 1.0, 2: np.nan},
2: {0: np.nan, 1: np.nan, 2: 1.0}}
).fillna(-1.0)

tm.assert_frame_equal(sdf.to_dense(), expected.to_dense())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to compare the sparse version here (and dense also compared is fine too)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use .fillna on the expected!. directly construct it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be an inconsistency in the way the BlockIndexes are calculated in the initialization from the sparse matrix and from a dict resulting in different block lengths. Should I file this as another issue and xfail the sparse version comparisons for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But when you compare the two dense frames, does the discrepancy still matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the comparison between the dense frames works fine. As far as I observed, the only difference is the block lengths attribute of the BlockIndexes in the SparseDataFrames.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps because in the dict case, the column 0 of sparse frame is inited with three values: 1, nan, nan. Whereas in case of spmatrix, the block index starts at 0, has length 1, and points to the one 1.0. Which one is correct, I wouldn't dare say. Does passing SparseDataFrame(..., default_fill_value=...) make a difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, even after passing a default_fill_value, the same error still persists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, does it work if you instead init the expected frame like so:

expected = pd.SparseDataFrame(
    {0: {0: 1.0, 1: np.nan, 2: np.nan},
     1: {1: 1.0},
     2: {2: 1.0}}).fillna(-1.0)

I.e. without explicit NaNs. I guess NaNs as values are distinct from no values in sparse matrices. The alternative which would also work (but is obviously worse) is to init spmatrix like:

arr = np.eye(3)
arr[1:, 0] = np.nan

arr[arr == 0] = np.nan
spm = spmatrix(arr)

In either case, the positions of defined values vs. non-defined must match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keitakurita I can confirm above works. Do you intend to continue with this or do you allow me to take over, amending your commit, retaining your authorship info? I'd just really like to see this fixed. 😃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kernc
Thanks, you can take over :)



class TestSparseDataFrameArithmetic(object):

def test_numeric_op_scalar(self):
Expand Down