Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Series.combine_first replaces None with NaN at index not provided by other #58977

Open
3 tasks done
michaelpradel opened this issue Jun 11, 2024 · 5 comments · May be fixed by #59987
Open
3 tasks done

BUG: Series.combine_first replaces None with NaN at index not provided by other #58977

michaelpradel opened this issue Jun 11, 2024 · 5 comments · May be fixed by #59987
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@michaelpradel
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s1 = pd.Series([None, None, None], index=['a', 'b', 'c'])
s2 = pd.Series([None, None, None], index=['b', 'c', 'd'])

result = s1.combine_first(s2)
print(result)

Issue Description

The documentation of Series.combine_first states:

Update null elements with value in the same location in ‘other’.

In the above example, s2 doesn't provide any value at index 'a', so combine_first should not affect this index.
However, the result with pandas 2.2.2 is the following, which unexpectedly changes the value at 'a' from None to NaN:

a     NaN
b    None
c    None
d    None
dtype: object

Pandas 2.1.4 behaves as expected, so this looks like a regression.
The behavior was changed with this PR: #57034

Expected Behavior

The expected output is:

a    None
b    None
c    None
d    None
dtype: object

Installed Versions

INSTALLED VERSIONS ------------------ commit : 6dbeeb4 python : 3.10.8.final.0 python-bits : 64 OS : Linux OS-release : 6.5.0-35-generic Version : #35-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 26 11:23:57 UTC 2024 machine : x86_64 processor : byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8

pandas : 2.2.0rc0+28.g6dbeeb4009
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 63.2.0
pip : 24.0
Cython : 3.0.10
pytest : 8.2.0
hypothesis : 6.100.2
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.24.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 16.0.0
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.13.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@michaelpradel michaelpradel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 11, 2024
@pandyah5
Copy link

I would like to take this up. Working on a fix now

@pandyah5
Copy link

pandyah5 commented Jun 12, 2024

I can reproduce the issue locally and the problem seems to originate from:

return this.mask(this.isna(), other)

The root cause of the issue:

  • After the outer join s1 turns into : None, None, None, NaN and s2 turns into: NaN, None, None, None.
  • combine_first replaces all values in s1 that satisfy isna(). Since all of them do, it replaces every value with that of s2.
  • After the join the default value is NaN and not None, as seen in point 1. So when s2 is copied over s1 we get NaN, None, None, None

My fix is going to be:
Do not replace None with NaN. Or for that matter NaN with None. They should be treated one and the same.

@pandyah5
Copy link

pandyah5 commented Jun 12, 2024

@mroeschke I am tagging you here because you have previously reviewed the code for the combine_first function and I assume you might be maintaining it. I read quite a bit about Pandas convention of representing NA values for this issue and I stumbled upon this StackOverflow which is inspired by the official docs.

The problem arises because Pandas considers None and NaN the same, take for example the isna() or isnull() utility. Based on the resources I read above, Pandas has decided to use NaN as its default NA representation. Based on these points, I feel the best course of action is to convert the None values to NaN and then invoke the combine_first function. This will add O(N) computational overhead given the fillna() operation.

@michaelpradel If you have a better alternative please let me know!

@michaelpradel
Copy link
Author

@pandyah5: Your proposed fix sounds good to me, at least for series with (inferred) dtype "object". Note that I'm not a pandas developers, just a user; so please take my opinion on this with a grain of salt.

@pandyah5
Copy link

pandyah5 commented Jul 4, 2024

Hi @michaelpradel thanks for your feedback! Its been a while since this issue has been opened so I'll bring this up in the community and get some maintainers attention here to have their final verdict :)

Update: I have notified the pandas-dev slack channel for some help in reviewing this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
2 participants