BUG: np.nan is converted to pd.NA in nullable column. #56836

Closed
3 tasks done
soerenwolfers opened this issue Jan 11, 2024 · 4 comments
Labels
Bug · Needs Triage (Issue that has not been reviewed by a pandas team member) · PDEP missing values (Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint)

Comments

@soerenwolfers

soerenwolfers commented Jan 11, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
pd.Series([np.nan], dtype=pd.Float64Dtype())

Issue Description

Above returns

0    <NA>
dtype: Float64

but shouldn't because
np.nan != pd.NA, or IEEE-NotANumber != missing data (that was the whole point of having nullable columns).
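To check which of the two values actually ends up in the column on a given pandas version, something like the following could be used. This is only a sketch: `describe_scalar` is a hypothetical helper, not part of pandas, and what the first element reports depends on the installed version (on the 2.1.4 setup reported here it comes back as NA).

```python
import numpy as np
import pandas as pd


def describe_scalar(value):
    """Hypothetical helper: classify a scalar as 'NA', 'NaN', or 'value'."""
    if value is pd.NA:
        return "NA"
    if isinstance(value, float) and np.isnan(value):
        return "NaN"
    return "value"


# np.nan is a float result, None is a genuinely missing entry.
ser = pd.Series([np.nan, None, 1.5], dtype=pd.Float64Dtype())
kinds = [describe_scalar(v) for v in ser]
print(kinds)  # the first entry is the one this issue is about
```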

Expected Behavior

0    np.nan
dtype: Float64

polars does it right:

import polars as pl
pl.Series([np.nan, None])
shape: (2,)
Series: '' [f64]
[
	NaN
	null
]

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-76-generic
Version : #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_GB.utf8
LANG : C.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.4
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.57.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.21.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

@soerenwolfers soerenwolfers added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 11, 2024
@mroeschke
Member

Thanks for the issue. This issue is being actively discussed in #32265 so happy to have your input there.

@soerenwolfers
Author

soerenwolfers commented Jan 11, 2024

@mroeschke I don't think it makes sense to conflate an obvious bug with a four-year-old discussion. As to why this is obviously a bug:

  1. The opinion that IEEE-NaN shouldn't exist in a nullable data type is completely singular. I don't think anybody has ever done this in the forty years since IEEE-754.
  2. The opinion that NaN shouldn't exist in a nullable column is clearly not represented in the current codebase, since my example shows both NA and NaN do exist in Float32Dtype columns.
  3. Let's say the bug I describe here is fixed, and let's humor the idea that the opinion "NaN shouldn't exist in nullable columns" becomes popular and gets implemented. There is no way the bugfix I describe here would make that future work harder, or that it could be overlooked: if NaN were literally made a forbidden value, the bugfix could not possibly be kept alive by accident. Hence, independent of the discussion over at "API: distinguish NA vs NaN in floating dtypes" (#32265), we have a bug here that is worth fixing now.

Frankly, I think the discussion you linked is too tolerant of fringe opinions. Saying "having both NA and NaN in a nullable column could be confusing for users" (quote from there) is no different from saying "having both NULL and empty strings in a nullable string column is confusing for the user". It's right there in the name "nullable float": you take your floats (whose definition everyone has agreed on since IEEE-754, and which do include NaN), and you add a NULL, which pandas calls NA.
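The "floats plus one extra NULL" reading can be sketched without pandas at all. The snippet below is only an illustration in plain Python, using `Optional[float]` as a stand-in for a nullable float and a hypothetical `classify` helper; it makes no claim about how pandas stores these values internally.

```python
import math
from typing import Optional


def classify(x: Optional[float]) -> str:
    """Hypothetical helper: tell a missing entry apart from an IEEE NaN."""
    if x is None:
        return "missing"  # no value was ever recorded
    if math.isnan(x):
        return "nan"      # a value exists: the result of an invalid operation
    return "number"


# inf - inf is an invalid operation under IEEE-754 and evaluates to NaN.
column = [1.0, None, float("inf") - float("inf")]
labels = [classify(x) for x in column]
print(labels)  # ['number', 'missing', 'nan']
```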

Anyone who doesn't want both NA and NaN in the same column could always just not use a nullable float column.

If IEEE had called it "invalid" instead of "NaN", the discussion in #32265 wouldn't exist.

Imagine Microsoft decided that a double? in C# couldn't contain a NaN anymore! Imagine your database started returning NULL when you compute 0/0!

I feel like all serious discussion in #32265 is about how to treat legacy concerns, like pd.isnull(np.nan) == True, but those can be legitimately discussed without blocking bugfixes that couldn't even be formulated in the pre-NA legacy era.

@soerenwolfers
Author

soerenwolfers commented Jan 11, 2024

Actually, having just read the entire discussion in #32265, I think I'll take to heart the advice given by @jbrockmendel: "Honestly, not really. I'm planning to make a push on this for 3.0. In the medium-term my advice would be to not use nullable-float dtypes".
There are just too many opinions and too much legacy around for it to be worth it right now. Thanks for your response, and sorry for my opinionated reply to it.

I still think it'd make sense to fix this issue independently of #32265 but I have no hope it would make the nullable float experience in pandas noticeably saner in the short term. I'll add my few cents at #32265 nonetheless.

@mroeschke
Member

mroeschke commented Jan 11, 2024

FWIW I would also hope pandas would one day have np.nan be retained independently from pd.NA, but as you discovered there are a lot of legacy arguments for np.nan meaning "NA".

As an alternative, you can use pyarrow to have NaN distinguished from NA:

In [9]: pd.Series(pd.arrays.ArrowExtensionArray(pa.array([float("nan"), None])))
Out[9]: 
0     NaN
1    <NA>
dtype: double[pyarrow]

@jbrockmendel jbrockmendel added the PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint label Jan 12, 2024