BUG: `Series.combine_first` replaces `None` with `NaN` at index not provided by `other` #58977

michaelpradel · 2024-06-11T07:25:14Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s1 = pd.Series([None, None, None], index=['a', 'b', 'c'])
s2 = pd.Series([None, None, None], index=['b', 'c', 'd'])

result = s1.combine_first(s2)
print(result)

Issue Description

The documentation of Series.combine_first states:

Update null elements with value in the same location in ‘other’.

In the above example, s2 doesn't provide any value at index 'a', so combine_first should not affect this index.
However, the result with pandas 2.2.2 is the following, which unexpectedly changes the value at 'a' from None to NaN:

a     NaN
b    None
c    None
d    None
dtype: object

Pandas 2.1.4 behaves as expected, so this looks like a regression.
The behavior was changed with this PR: #57034

Expected Behavior

The expected output is:

a    None
b    None
c    None
d    None
dtype: object

Installed Versions

INSTALLED VERSIONS ------------------ commit : 6dbeeb4 python : 3.10.8.final.0 python-bits : 64 OS : Linux OS-release : 6.5.0-35-generic Version : #35-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 26 11:23:57 UTC 2024 machine : x86_64 processor : byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8

pandas : 2.2.0rc0+28.g6dbeeb4009
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 63.2.0
pip : 24.0
Cython : 3.0.10
pytest : 8.2.0
hypothesis : 6.100.2
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.24.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 16.0.0
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.13.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

The text was updated successfully, but these errors were encountered:

pandyah5 · 2024-06-12T01:23:09Z

I would like to take this up. Working on a fix now

pandyah5 · 2024-06-12T02:15:19Z

I can reproduce the issue locally and the problem seems to originate from:

pandas/pandas/core/series.py

Line 3149 in bbe0e53

return this.mask(this.isna(), other)

The root cause of the issue:

After the outer join s1 turns into : None, None, None, NaN and s2 turns into: NaN, None, None, None.
combine_first replaces all values in s1 that satisfy isna(). Since all of them do, it replaces every value with that of s2.
After the join the default value is NaN and not None, as seen in point 1. So when s2 is copied over s1 we get NaN, None, None, None

My fix is going to be:
Do not replace None with NaN. Or for that matter NaN with None. They should be treated one and the same.

pandyah5 · 2024-06-12T05:45:34Z

@mroeschke I am tagging you here because you have previously reviewed the code for the combine_first function and I assume you might be maintaining it. I read quite a bit about Pandas convention of representing NA values for this issue and I stumbled upon this StackOverflow which is inspired by the official docs.

The problem arises because Pandas considers None and NaN the same, take for example the isna() or isnull() utility. Based on the resources I read above, Pandas has decided to use NaN as its default NA representation. Based on these points, I feel the best course of action is to convert the None values to NaN and then invoke the combine_first function. This will add O(N) computational overhead given the fillna() operation.

@michaelpradel If you have a better alternative please let me know!

michaelpradel · 2024-06-18T21:17:51Z

@pandyah5: Your proposed fix sounds good to me, at least for series with (inferred) dtype "object". Note that I'm not a pandas developers, just a user; so please take my opinion on this with a grain of salt.

pandyah5 · 2024-07-04T04:20:54Z

Hi @michaelpradel thanks for your feedback! Its been a while since this issue has been opened so I'll bring this up in the community and get some maintainers attention here to have their final verdict :)

Update: I have notified the pandas-dev slack channel for some help in reviewing this PR

michaelpradel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 11, 2024

pandyah5 mentioned this issue Jun 12, 2024

Handling None values as NaN in combine_first #58982

Closed

5 tasks

Lavishgangwani linked a pull request Oct 7, 2024 that will close this issue

Fix: Change None values to NaN in combine_first method for better handling of missing data #59987

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: `Series.combine_first` replaces `None` with `NaN` at index not provided by `other` #58977

BUG: `Series.combine_first` replaces `None` with `NaN` at index not provided by `other` #58977

michaelpradel commented Jun 11, 2024

pandyah5 commented Jun 12, 2024

pandyah5 commented Jun 12, 2024 •

edited

Loading

pandyah5 commented Jun 12, 2024 •

edited

Loading

michaelpradel commented Jun 18, 2024

pandyah5 commented Jul 4, 2024 •

edited

Loading

BUG: Series.combine_first replaces None with NaN at index not provided by other #58977

BUG: Series.combine_first replaces None with NaN at index not provided by other #58977

Comments

michaelpradel commented Jun 11, 2024

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

pandyah5 commented Jun 12, 2024

pandyah5 commented Jun 12, 2024 • edited Loading

pandyah5 commented Jun 12, 2024 • edited Loading

michaelpradel commented Jun 18, 2024

pandyah5 commented Jul 4, 2024 • edited Loading

BUG: `Series.combine_first` replaces `None` with `NaN` at index not provided by `other` #58977

BUG: `Series.combine_first` replaces `None` with `NaN` at index not provided by `other` #58977

pandyah5 commented Jun 12, 2024 •

edited

Loading

pandyah5 commented Jun 12, 2024 •

edited

Loading

pandyah5 commented Jul 4, 2024 •

edited

Loading