BUG: `np.nan` is converted to `pd.NA` in nullable column. #56836

soerenwolfers · 2024-01-11T14:12:16Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
pd.Series([np.nan], dtype=pd.Float64Dtype())

Issue Description

Above returns

0    <NA>
dtype: Float64

but it shouldn't, because
np.nan != pd.NA: IEEE NaN (not a number) is not the same thing as missing data (distinguishing the two was the whole point of having nullable columns).
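To make the distinction concrete, here is a minimal sketch of how the two values behave on their own, followed by the construction from the report. The variable names are illustrative only, and how the Series prints depends on the pandas version (the report is against 2.1.4).

```python
import numpy as np
import pandas as pd

# IEEE-754 NaN is an ordinary float value that compares unequal to itself.
nan_equals_itself = np.nan == np.nan   # False

# pd.NA is a missing-value marker: comparing against it yields pd.NA,
# not True or False.
na_comparison = pd.NA == 1.0           # pd.NA

# The construction from the report. On pandas 2.1.4 (the reported version)
# the first element prints as <NA>; other versions may differ.
s = pd.Series([np.nan, 1.5], dtype=pd.Float64Dtype())
print(nan_equals_itself, na_comparison)
print(s)
```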

Expected Behavior

0    np.nan
dtype: Float64

polars does it right:

import polars as pl
pl.Series([np.nan, None])

shape: (2,)
Series: '' [f64]
[
	NaN
	null
]

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-76-generic
Version : #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_GB.utf8
LANG : C.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.4
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.57.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.21.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None


mroeschke · 2024-01-11T16:28:17Z

Thanks for the issue. This issue is being actively discussed in #32265 so happy to have your input there.

soerenwolfers · 2024-01-11T19:37:58Z

@mroeschke I don't think it makes sense to conflate an obvious bug with a four-year discussion. As to why this is obviously a bug:

The opinion that IEEE-NaN shouldn't exist in a nullable data type is completely singular. I don't think anybody has ever done this in the forty years since IEEE-754.
The opinion that NaN shouldn't exist in a nullable column is clearly not represented in the current codebase, since my example shows both NA and NaN do exist in Float32Dtype columns.
Let's say the bug I describe here is fixed, and let's humor the idea that the opinion "NaN shouldn't exist in nullable columns" becomes popular and is implemented. The bugfix I describe here could not make that future work harder, nor could it be overlooked: if NaN were literally made a forbidden value, the bugfix could not possibly be kept alive by accident. Hence, independent of the discussion over at "API: distinguish NA vs NaN in floating dtypes" (#32265), we have a bug here that is worth fixing now.

Frankly, the discussion you linked is too tolerant of fringe opinions. Saying "having both NA and NaN in a nullable column could be confusing for users" (quote from there) is no different from saying "having both NULL and empty strings in a nullable string column is confusing for the user". It's right there in the name "nullable float": you take your floats (whose definition everyone has agreed on since IEEE-754, and which do include NaN), and you add a NULL, which pandas calls NA.
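As a rough illustration of the "floats plus a NULL" reading, here is a sketch in plain Python, where `Optional[float]` plays the role of a nullable float. The `describe` helper and the sample column are hypothetical and not part of pandas.

```python
import math
from typing import Optional


def describe(value: Optional[float]) -> str:
    """Classify one cell of a nullable float column."""
    if value is None:
        return "null"      # the extra value added by nullability
    if math.isnan(value):
        return "NaN"       # a regular IEEE-754 float value
    return "number"


column = [1.5, float("nan"), None]
labels = [describe(v) for v in column]
print(labels)  # ['number', 'NaN', 'null']
```

All three states coexist without ambiguity because the null check happens before any float-level inspection.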

Everybody that doesn't want both NA and NaN in the same column could always not use a nullable float column.

If IEEE had called it "invalid" instead of "NaN", the discussion in #32265 wouldn't exist.

Imagine Microsoft decided that a double? in C# couldn't contain a NaN anymore! Imagine your database started returning NULL when you compute 0/0!

I feel like all serious discussion in #32265 is about how to treat legacy concerns, like pd.isnull(np.nan)=True, but those can be legitimately discussed without blocking bugfixes that couldn't even be formulated in the pre-NA legacy era.
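For reference, the legacy behavior mentioned here can be checked on scalars. This is only a sketch of the status quo; `legacy_flags` is an illustrative name.

```python
import numpy as np
import pandas as pd

# pd.isna (pd.isnull is an alias) reports NaN, None, and pd.NA alike as
# missing, which is the legacy convention of NaN meaning "NA".
legacy_flags = {
    "np.nan": bool(pd.isna(np.nan)),
    "None": bool(pd.isna(None)),
    "pd.NA": bool(pd.isna(pd.NA)),
    "1.0": bool(pd.isna(1.0)),
}
print(legacy_flags)
```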

soerenwolfers · 2024-01-11T20:42:27Z

Actually, after just reading the entire discussion in #32265, I think I'll take to heart the advice given by @jbrockmendel: "Honestly, not really. I'm planning to make a push on this for 3.0. In the medium-term my advice would be to not use nullable-float dtypes".
There are just too many opinions and too much legacy around to be worth it right now. Thanks for your response and sorry for my opinionated response to it.

I still think it'd make sense to fix this issue independently of #32265, but I have no hope it would make the nullable float experience in pandas noticeably saner in the short term. I'll add my two cents at #32265 nonetheless.

mroeschke · 2024-01-11T20:50:59Z

FWIW, I would also hope pandas would one day retain np.nan independently from pd.NA, but as you discovered, there are a lot of legacy arguments for np.nan meaning "NA".

As an alternative, you can use pyarrow to have NaN distinguished from NA:

In [9]: pd.Series(pd.arrays.ArrowExtensionArray(pa.array([float("nan"), None])))
Out[9]: 
0     NaN
1    <NA>
dtype: double[pyarrow]

soerenwolfers added the Bug and Needs Triage labels Jan 11, 2024

mroeschke closed this as completed Jan 11, 2024

soerenwolfers mentioned this issue Jan 12, 2024

API: distinguish NA vs NaN in floating dtypes #32265

Open

jbrockmendel added the "PDEP missing values" label (issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint) Jan 12, 2024
