BUG: concat with empty dataframe with columns passed and nonempty dataframe coerces dtype to object #45637

aamster · 2022-01-26T15:15:17Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.concat([pd.DataFrame(columns=['foo', 'bar']), pd.DataFrame({'foo': [1.0, 2.0], 'bar': [1.0, 2.0]})]).dtypes

Out[3]: 
foo    object
bar    object
dtype: object

Issue Description

When concat is given an empty DataFrame created with only the columns argument, together with a nonempty DataFrame, the dtypes of the resulting DataFrame are coerced to object.

Expected Behavior

I would expect the dtypes to be taken from the nonempty dataframe, as was the behavior in previous versions of pandas.
The issue can be avoided by explicitly passing dtypes for the empty DataFrame, which may be intentional, but the behavior is still unexpected.
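A minimal sketch of that workaround, assuming the empty frame is meant to hold floats. The astype call and the variable names are illustrative and not part of the original report:

```python
import pandas as pd

# Give the empty frame an explicit float64 dtype instead of the default object.
empty = pd.DataFrame(columns=["foo", "bar"]).astype("float64")
data = pd.DataFrame({"foo": [1.0, 2.0], "bar": [1.0, 2.0]})

# Both inputs now share float64, so no common-dtype fallback to object is needed.
result = pd.concat([empty, data])
print(result.dtypes)
```

Because both frames already agree on the dtype, the result should not depend on how a given pandas version treats empty inputs.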

Pandas 1.3.5 behavior:

In [120]: pd.concat([pd.DataFrame(columns=['foo', 'bar']), pd.DataFrame({'foo': [1.0, 2.0], 'bar': [1.0, 2.0]})]).dtypes
     ...: 
Out[120]: 
foo    float64
bar    float64
dtype: object

Installed Versions

INSTALLED VERSIONS
------------------
commit           : bb1f651536508cdfef8550f93ace7849b00046ee
python           : 3.8.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-957.27.2.el7.x86_64
Version          : #1 SMP Mon Jul 29 17:46:05 UTC 2019
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8
pandas           : 1.4.0
numpy            : 1.21.5
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 60.5.0
Cython           : None
pytest           : 5.4.3
hypothesis       : None
sphinx           : 4.4.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.3
jinja2           : 2.11.3
IPython          : 8.0.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.4.2
numba            : None
numexpr          : 2.8.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : 1.4.31
tables           : 3.6.1
tabulate         : None
xarray           : 0.15.1
xlrd             : None
xlwt             : None
zstandard        : None


phofl · 2022-01-26T18:29:26Z

This was an intended change, see https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#ignoring-dtypes-in-concat-with-empty-or-all-na-columns

Your empty DataFrame has dtype object, and the common dtype between float and object is object.
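The two facts in that sentence can be checked roughly as follows. This is only a sketch: np.result_type is used here as a stand-in for pandas' internal common-dtype logic, which is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# An empty frame built from column names alone gets object-dtype columns.
empty = pd.DataFrame(columns=["foo", "bar"])
print(empty.dtypes)

# NumPy's promotion rules (used here only as an analogy) resolve
# float64 and object to object.
common = np.result_type(np.dtype("float64"), np.dtype("object"))
print(common)
```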

cc @jbrockmendel to be sure

jbrockmendel · 2022-01-29T21:16:57Z

Yes this was intended.

jorisvandenbossche · 2022-02-09T17:38:50Z

This was intentionally changed, but as I also commented before on the PR (#43507 (comment)) and as illustrated by this report: it is a backwards-incompatible change (one that removes functionality for preserving the dtype that was coded intentionally), and IMO we should do it with a deprecation warning instead.

simonjayhawkins · 2022-02-11T19:27:52Z

moved to 1.4.2

mosalx · 2022-02-26T16:22:27Z

Not ignoring dtypes of a dataframe with all missing values makes sense to me. But may I ask for an example where this behavior is helpful with empty dataframes?

The issue with the current behavior is that if an empty dataframe is present, information about dtypes of all other non-empty dataframes participating in concatenation is erased. We are forced to filter the input to prevent this loss of dtype information. I can't imagine why anyone would not want to do that.
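A sketch of the kind of input filtering described above. The helper name concat_skip_empty is hypothetical, and it assumes the list contains at least one frame:

```python
import pandas as pd

def concat_skip_empty(frames):
    """Concatenate frames after dropping empty ones (hypothetical helper)."""
    non_empty = [df for df in frames if not df.empty]
    if not non_empty:
        # Only empty frames were passed: return a copy of the first one
        # so that its column labels are kept.
        return frames[0].copy()
    return pd.concat(non_empty)

frames = [
    pd.DataFrame(columns=["foo", "bar"]),
    pd.DataFrame({"foo": [1.0, 2.0], "bar": [1.0, 2.0]}),
]
result = concat_skip_empty(frames)
print(result.dtypes)
```

The empty object-dtype frame never reaches concat, so the float64 dtypes of the nonempty frame are kept.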

jbrockmendel · 2022-02-28T20:42:00Z

But may I ask for an example where this behavior is helpful with empty dataframes?

The general principle here is that the resulting dtype should not depend on whether any of the frames are empty or not (i.e. "values-dependent behavior")

Looks like there are still some inconsistencies in the Series vs DataFrame behavior that I had previously thought were eliminated:

df1 = pd.DataFrame({"A": []}).astype("datetime64[ns]")
df2 = pd.DataFrame({"A": ["foo"]}).astype("string")

>>> pd.concat([df1, df2]).dtypes[0]
dtype('O')

>>> pd.concat([df1["A"], df2["A"]]).dtype
string[python]

jorisvandenbossche · 2022-03-07T07:22:15Z

I would propose to revert that original change, and restore the special case for empty / all-NaN dataframes.

Yes, this keeps some values-dependent behaviour, while ideally we strictly look at dtypes in this case. But as long as we don't have a better way to deal with empty columns without specific dtype information (apart from using object dtype), I think practicality beats purity and it is worth it to keep the special case and avoid such a breaking change.

jbrockmendel · 2022-03-08T03:45:52Z

@jorisvandenbossche above you suggested deprecating. IIUC the idea was that 1.5 would have the old behavior and 2.0 would have the current behavior. Is that a) a correct understanding of what you were suggesting here and b) are you now suggesting something different?

jorisvandenbossche · 2022-03-08T23:49:54Z

I am not fully sure, but in any case both options (also a deprecation) require it to be reverted.
I indeed proposed to deprecate it instead above, but the examples here show for me that even after a deprecation period, it's still not ideal behaviour. IMO we should only deprecate it if we come up with a better alternative.

jbrockmendel · 2022-03-09T00:07:40Z

empty columns without specific dtype information

I'm not clear on what this means. An empty column is effectively a Series, which has a dtype.

xref #39122, #40893

jorisvandenbossche · 2022-03-09T09:41:21Z

an empty column is effectively a Series, which has a dtype.

Yes, and we use object dtype for empty or float64 dtype for all NaN by default. So those indeed have a specific dtype by default, but that doesn't mean that this dtype conveys the correct information about that column (eg such an all-NaN column can be introduced in a reindex operation (see my original example at #43507 (comment)), but nothing about this operation says that it should be a float64 column). And so ignoring the dtype or not in those cases matters for the result.
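The reindex case mentioned here can be illustrated with a small sketch (the frame and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Reindexing to a column that does not exist introduces an all-NaN column.
reindexed = df.reindex(columns=["a", "b"])
print(reindexed.dtypes)
```

The new column "b" comes out as float64 only because NaN is the fill value, not because anything in the operation says the column holds floats.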

simonjayhawkins · 2022-03-09T12:58:29Z

I am not fully sure, but in any case both options (also a deprecation) require it to be reverted.
I indeed proposed to deprecate it instead above, but the examples here show for me that even after a deprecation period, it's still not ideal behaviour. IMO we should only deprecate it if we come up with a better alternative.

In general according to the version policy https://pandas.pydata.org/pandas-docs/dev/development/policies.html

We will not introduce new deprecations in patch releases.

So I guess that discussion can be kept independent of reverting #43507 for 1.4.x?

jbrockmendel · 2022-03-09T15:24:33Z

see my original example at #43507 (comment)

As I mentioned there, the reindex in that example (pd.concat([df1, df2.reindex(columns=df1.columns)])) is unnecessary, and without it you get the desired dtypes.

Yes, and we use object dtype for empty or float64 dtype for all NaN by default. So those indeed have a specific dtype by default, but that doesn't mean that this dtype conveys the correct information about that column

Trying to whittle down the issue: would you only ignore the empty/all-NaN cases when they are object/float64, respectively?

So suppose a user has:

left_A = pd.Series([], name="object_dtype_inferred").to_frame()
left_B = pd.Series([], dtype=object, name="really_specifically_object_dtype").to_frame()
right = pd.Series(["foo"], dtype="string").to_frame()

res_A = pd.concat([left_A, right], ignore_index=True)
res_B = pd.concat([left_B, right], ignore_index=True)

For res_B we unambiguously want object. For res_A the argument is that we want string. Are there ways to "fix" res_A without "breaking" res_B? Off the top of my head:

a keyword for pd.concat (might be fragile if there is a mix of left_A and left_B?)
track whether a Series/column had its dtype determined by default logic (might use np.void instead of object/float64?)

If we can't have both, then I think res_B should take priority since the user was explicit about what they wanted.

jorisvandenbossche · 2022-03-09T17:33:05Z

As I mentioned there, the reindex in that example (pd.concat([df1, df2.reindex(columns=df1.columns)])) is unnecessary, and without it you get the desired dtypes.

Yes, but this is only a dummy example, and back then I answered to that with the following (#43507 (comment)):

There can be good reasons for doing a reindex. For example, to determine the exact output columns: in the past, concat did have a join_axes keyword for this, but this was deprecated, pointing the user to use reindex instead.

So we have deprecated a feature for this, explicitly saying to users they can use reindex instead. But now we also have broken this reindex based workflow.
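A sketch of the reindex-based workflow in question, with made-up frames. The dtype that column "b" ends up with is exactly what differs between pandas versions in this discussion, so the sketch prints it rather than stating an expected value; newer versions may also emit a FutureWarning here.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [1, 2], "b": pd.to_datetime(["2022-01-01", "2022-01-02"])}
)
df2 = pd.DataFrame({"a": [3]})

# Align df2 to df1's columns so the output columns are fixed up front.
# The missing column "b" is filled with NaN.
aligned = df2.reindex(columns=df1.columns)

result = pd.concat([df1, aligned], ignore_index=True)
print(result.dtypes)  # dtype of "b" depends on the pandas version
```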

would you only ignore the empty/all-NaN cases when they are object/float64, respectively?

Yes, and that's also what we did in the past more or less (eg we didn't ignore all-NaT datetime64). I would maybe leave out the "respectively", so for now consider both empty and all-NaN for both dtypes to keep things simpler (although we should check what we did exactly before).
We could try to restrict it to only object dtype, eventually, if we would use object dtype as the default when creating an all-missing column (such as in reindex / alignment).

track whether a Series/column had its dtype determined by default logic (might use np.void instead of object/float64?)

Long term, I think this is the way to go (something like this is what I meant with the "a better way to deal with empty columns without specific dtype information" above).
I think I mentioned before (maybe only in a call) the option of having some kind of "null dtype", but np.void could serve the same purpose (probably the main discussion here would be whether we want to make it an ExtensionDtype, but let's have that in another issue).

simonjayhawkins · 2022-03-20T14:30:32Z

So I guess that discussion can be kept independent of reverting #43507 for 1.4.x?

The revert is not straightforward, since there have been some changes to the concat code since #43507, e.g. removing code in #43577 and changing signatures in #43626 and #43606.

After a couple of attempts at reverting these in different orders to reduce the number of conflicts to resolve manually, I will revisit this soon.

Are we agreed that we want to revert #43507 for 1.4.x and, in a separate PR targeted to main/1.5, add a deprecation warning instead?

simonjayhawkins · 2022-04-01T18:25:57Z

moving to 1.4.3

aamster added the Bug and Needs Triage labels Jan 26, 2022

rhshadrach added the Reshaping and Usage Question labels and removed the Bug and Needs Triage labels Jan 29, 2022

rhshadrach closed this as completed Jan 29, 2022

jorisvandenbossche reopened this Feb 9, 2022

jorisvandenbossche added this to the 1.4.1 milestone Feb 9, 2022

jorisvandenbossche removed the Usage Question label Feb 9, 2022

simonjayhawkins modified the milestones: 1.4.1, 1.4.2 Feb 11, 2022

phofl mentioned this issue Feb 21, 2022

PERF: dataframe.resample is very slowly in ver 1.4 and 1.4.1 #46066

Closed

3 tasks

simonjayhawkins added the Regression label Mar 9, 2022

simonjayhawkins modified the milestones: 1.4.2, 1.4.3 Apr 1, 2022

simonjayhawkins added the Blocker label Apr 13, 2022

simonjayhawkins mentioned this issue May 3, 2022

BUG: Concatenation of None with numerical values no longer converts None to Nan. #46922

Closed

3 tasks

simonjayhawkins mentioned this issue Jun 8, 2022

RLS: 1.4.3 #46610

Closed

jorisvandenbossche mentioned this issue Jun 15, 2022

REGR: revert behaviour change for concat with empty/all-NaN data #47372

Merged

simonjayhawkins closed this as completed in #47372 Jun 22, 2022

lisphilar mentioned this issue Sep 3, 2023

[Bug] FutureWarning with the behavior of DataFrame concatenation with empty or all-NA entries lisphilar/covid19-sir#1511

Closed

3 tasks
