
API: value-dependent behaviour in concat with all-NA data #40893

Closed
jorisvandenbossche opened this issue Apr 12, 2021 · 0 comments · Fixed by #52613
Labels
API Design, Dtype Conversions, Needs Discussion, Reshaping

Comments

@jorisvandenbossche
Member

In general, we want to get rid of value-dependent behaviour in concat operations: the resulting dtype of a concat operation should depend only on the input dtypes, and not on the exact content (the actual values) of the inputs.

This has been discussed in the past in general terms, e.g. in #33607 when adding the general EA interface for concat (there is still one value-dependent special case for Categorical involving integer categories / missing values, encoded in core/dtypes/concat.py::cast_to_common_type), and in #39122 about this issue as it concerns empty series/dataframes.

One other case (which came up recently in e.g. #39574 and #39612) is related to all-NA/NaN objects.

For DataFrames, when there is an all-missing column, its dtype gets ignored when determining the result dtype (which, however, requires inspecting the values of the column). Small example:

>>> import numpy as np
>>> import pandas as pd
>>> df_missing = pd.DataFrame({'a': [np.nan]})
>>> df_dt64 = pd.DataFrame({'a': [pd.Timestamp("2021-01-01")]}, dtype="datetime64[ns]")

>>> pd.concat([df_missing, df_dt64])
           a
0        NaT
0 2021-01-01

>>> pd.concat([df_missing, df_dt64]).dtypes
a    datetime64[ns]
dtype: object

This can be useful, since you can end up with such object/float dtype columns depending on how those "empty" all-NaN DataFrames are created (e.g. when constructing a DataFrame with a given index/columns but without data, when reindexing the rows of an actually empty DataFrame, or when reindexing the columns of a non-empty DataFrame).
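To illustrate how such all-NaN columns with differing dtypes arise, here is a small sketch of two of the creation paths mentioned above (the resulting dtypes shown are what pandas typically produces, but may vary by version):

```python
import pandas as pd

# Constructing a DataFrame with an index/columns but without data
# gives an all-NaN column of object dtype.
df_no_data = pd.DataFrame(index=[0], columns=["a"])

# Reindexing the rows of an actually empty int64 DataFrame introduces
# missing values, which upcasts the column to float64.
df_reindexed = pd.DataFrame({"a": pd.Series(dtype="int64")}).reindex([0])

print(df_no_data.dtypes["a"])    # typically object
print(df_reindexed.dtypes["a"])  # typically float64
```

Both frames hold only missing values in column "a", yet their dtypes differ, which is exactly what the all-NA special case in concat papers over.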

However, it does introduce annoying value-dependent behaviour, and it is also not applied consistently throughout pandas. For example, Series concatenation does not check for this, and actually results in object dtype:

>>> pd.concat([df_missing['a'], df_dt64['a']])
0                    NaN
0    2021-01-01 00:00:00
Name: a, dtype: object

Further, this is also not consistent across data types. For example, we don't check for all-NA for the new nullable dtypes.
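One way for users to sidestep this value-dependent behaviour today (a workaround sketch, not the API change discussed in this issue) is to cast the all-NA column to the intended dtype explicitly before concatenating, so the result dtype depends only on the input dtypes:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({"a": [np.nan]})
df_dt64 = pd.DataFrame({"a": [pd.Timestamp("2021-01-01")]}, dtype="datetime64[ns]")

# Casting up front makes both inputs datetime64[ns], so the result
# dtype no longer depends on whether the first frame happens to be all-NA.
result = pd.concat([df_missing.astype({"a": "datetime64[ns]"}), df_dt64])
print(result.dtypes["a"])  # datetime64[ns]
```

The same explicit cast also removes the DataFrame/Series inconsistency shown above, since both paths then concatenate two datetime64[ns] inputs.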

For ArrayManager, I didn't implement any special-cased value-dependent behaviour yet (#39612; so on this aspect it diverges from the BlockManager behaviour), as it would be good to first decide on the desired long-term behaviour.
