BUG: concat with empty dataframe with columns passed and nonempty dataframe coerces dtype to object #45637

Closed
2 of 3 tasks
aamster opened this issue Jan 26, 2022 · 16 comments · Fixed by #47372
Labels
Blocker Blocking issue or pull request for an upcoming release Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Comments

@aamster

aamster commented Jan 26, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.concat([pd.DataFrame(columns=['foo', 'bar']), pd.DataFrame({'foo': [1.0, 2.0], 'bar': [1.0, 2.0]})]).dtypes

Out[3]: 
foo    object
bar    object
dtype: object

Issue Description

When using concat with empty dataframe with columns argument passed, and a nonempty dataframe, the dtypes of the resulting dataframe are coerced to object.

Expected Behavior

I would expect the dtypes to be taken from the nonempty dataframe, as was the behavior in previous versions of pandas.
This issue can be avoided if dtypes are explicitly passed, which maybe is intentional, but still it is unexpected.

Pandas 1.3.5 behavior:

In [120]: pd.concat([pd.DataFrame(columns=['foo', 'bar']), pd.DataFrame({'foo': [1.0, 2.0], 'bar': [1.0, 2.0]})]).dtypes
     ...: 
Out[120]: 
foo    float64
bar    float64
dtype: object
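The workaround mentioned above (passing dtypes explicitly) can be sketched as follows. This is a minimal illustration, assuming float64 is the dtype you want for every column of the empty frame; the variable names are made up for the example.

```python
import pandas as pd

# Give the empty frame an explicit dtype instead of the default object dtype.
empty = pd.DataFrame(columns=["foo", "bar"], dtype="float64")
data = pd.DataFrame({"foo": [1.0, 2.0], "bar": [1.0, 2.0]})

# Both inputs are float64, so there is nothing to coerce.
result = pd.concat([empty, data])
print(result.dtypes)
```

Because both frames already agree on the dtype, the result should not depend on how a given pandas version treats empty inputs.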

Installed Versions

INSTALLED VERSIONS
------------------
commit           : bb1f651536508cdfef8550f93ace7849b00046ee
python           : 3.8.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-957.27.2.el7.x86_64
Version          : #1 SMP Mon Jul 29 17:46:05 UTC 2019
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8
pandas           : 1.4.0
numpy            : 1.21.5
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 60.5.0
Cython           : None
pytest           : 5.4.3
hypothesis       : None
sphinx           : 4.4.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.3
jinja2           : 2.11.3
IPython          : 8.0.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.4.2
numba            : None
numexpr          : 2.8.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : 1.4.31
tables           : 3.6.1
tabulate         : None
xarray           : 0.15.1
xlrd             : None
xlwt             : None
zstandard        : None
@aamster aamster added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2022
@phofl
Member

phofl commented Jan 26, 2022

This was an intended change, see https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#ignoring-dtypes-in-concat-with-empty-or-all-na-columns

Your empty DataFrame has dtype object, a common dtype between float and object is object.
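To make the explanation above concrete, here is a small sketch. It only checks the default dtype of an empty frame built from column names and uses NumPy's promotion rules as a stand-in for "common dtype"; it does not call the internal pandas logic.

```python
import numpy as np
import pandas as pd

empty = pd.DataFrame(columns=["foo", "bar"])
# With no data and no dtype argument, each column defaults to object dtype.
empty_dtypes = empty.dtypes

# NumPy promotes float64 and object to object.
common = np.promote_types(np.float64, np.object_)
print(empty_dtypes.tolist(), common)
```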

cc @jbrockmendel to be sure

@jbrockmendel
Member

Yes this was intended.

@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 29, 2022
@jorisvandenbossche
Member

This was intentionally changed, but as I also commented before on the PR (#43507 (comment)) and as illustrated by this report: it is a backwards incompatible change (and removing functionality to preserve the dtype that was coded intentionally), and IMO we should do it with a deprecation warning instead.

@jorisvandenbossche jorisvandenbossche added this to the 1.4.1 milestone Feb 9, 2022
@simonjayhawkins simonjayhawkins modified the milestones: 1.4.1, 1.4.2 Feb 11, 2022
@simonjayhawkins
Member

moved to 1.4.2

@mosalx

mosalx commented Feb 26, 2022

Not ignoring dtypes of a dataframe with all missing values makes sense to me. But may I ask for an example where this behavior is helpful with empty dataframes?

The issue with the current behavior is that if an empty dataframe is present, information about dtypes of all other non-empty dataframes participating in concatenation is erased. We are forced to filter the input to prevent this loss of dtype information. I can't imagine why anyone would not want to do that.
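The filtering described above might look like the sketch below. `concat_nonempty` is a hypothetical helper name, not a pandas function, and it treats a frame as empty whenever `DataFrame.empty` is true.

```python
import pandas as pd


def concat_nonempty(frames):
    """Concatenate frames, skipping empty ones so they cannot affect dtypes."""
    frames = list(frames)
    nonempty = [f for f in frames if not f.empty]
    # If every frame is empty, fall back to the full list to keep the columns.
    return pd.concat(nonempty if nonempty else frames)


frames = [
    pd.DataFrame(columns=["foo", "bar"]),
    pd.DataFrame({"foo": [1.0, 2.0], "bar": [1.0, 2.0]}),
]
result = concat_nonempty(frames)
print(result.dtypes)
```

Note that dropping an empty frame also drops any columns that exist only in that frame, so this is only equivalent to the old behavior when the columns match.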

@jbrockmendel
Member

But may I ask for an example where this behavior is helpful with empty dataframes?

The general principle here is that the resulting dtype should not depend on whether any of the frames are empty or not (i.e. "values-dependent behavior")

Looks like there are still some inconsistencies in the Series vs DataFrame behavior I had previously thought eliminated:

df1 = pd.DataFrame({"A": []}).astype("datetime64[ns]")
df2 = pd.DataFrame({"A": ["foo"]}).astype("string")

>>> pd.concat([df1, df2]).dtypes[0]
dtype('O')

>>> pd.concat([df1["A"], df2["A"]]).dtype
string[python]

@jorisvandenbossche
Member

I would propose to revert that original change, and restore the special case for empty / all-NaN dataframes.

Yes, this keeps some values-dependent behaviour, while ideally we strictly look at dtypes in this case. But as long as we don't have a better way to deal with empty columns without specific dtype information (apart from using object dtype), I think practicality beats purity and it is worth it to keep the special case and avoid such a breaking change.

@jbrockmendel
Member

@jorisvandenbossche above you suggested deprecating. IIUC the idea was that 1.5 would have the old behavior and 2.0 would have the current behavior. Is that a) a correct understanding of what you were suggesting here and b) are you now suggesting something different?

@jorisvandenbossche
Member

I am not fully sure, but in any case both options (also a deprecation) require it to be reverted.
I indeed proposed to deprecate it instead above, but the examples here show for me that even after a deprecation period, it's still not ideal behaviour. IMO we should only deprecate it if we come up with a better alternative.

@jbrockmendel
Member

empty columns without specific dtype information

I'm not clear on what this means. An empty column is effectively a Series, which has a dtype.

xref #39122, #40893

@jorisvandenbossche
Member

an empty column is effectively a Series, which has a dtype.

Yes, and we use object dtype for empty or float64 dtype for all NaN by default. So those indeed have a specific dtype by default, but that doesn't mean that this dtype conveys the correct information about that column (eg such an all-NaN column can be introduced in a reindex operation (see my original example at #43507 (comment)), but nothing about this operation says that it should be a float64 column). And so ignoring the dtype or not in those cases matters for the result.
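As a small illustration of the reindex point, the sketch below shows an all-NaN column appearing with float64 dtype even though nothing in the operation says the column holds floats. The frame and column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y"]})

# "b" does not exist in df, so reindex fills it with NaN.
reindexed = df.reindex(columns=["a", "b"])
print(reindexed.dtypes)
```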

@simonjayhawkins
Member

I am not fully sure, but in any case both options (also a deprecation) require it to be reverted.
I indeed proposed to deprecate it instead above, but the examples here show for me that even after a deprecation period, it's still not ideal behaviour. IMO we should only deprecate it if we come up with a better alternative.

In general according to the version policy https://pandas.pydata.org/pandas-docs/dev/development/policies.html

We will not introduce new deprecations in patch releases.

So I guess that discussion can be kept independent of reverting #43507 for 1.4.x?

@jbrockmendel
Member

see my original example at #43507 (comment)

As I mentioned there, the reindex in that example (pd.concat([df1, df2.reindex(columns=df1.columns)])) is unnecessary, and without it you get the desired dtypes.

Yes, and we use object dtype for empty or float64 dtype for all NaN by default. So those indeed have a specific dtype by default, but that doesn't mean that this dtype conveys the correct information about that column

Trying to whittle down the issue: would you only ignore the empty/all-NaN cases when they are object/float64, respectively?

So suppose a user has:

left_A = pd.Series([], name="object_dtype_inferred").to_frame()
left_B = pd.Series([], dtype=object, name="really_specifically_object_dtype").to_frame()
right = pd.Series(["foo"], dtype="string").to_frame()

res_A = pd.concat([left_A, right], ignore_index=True)
res_B = pd.concat([left_B, right], ignore_index=True)

For res_B we unambiguously want object. For res_A the argument is that we want string. Are there ways to "fix" res_A without "breaking" res_B? Off the top of my head:

  • a keyword for pd.concat (might be fragile if there is a mix of left_A and left_B?)
  • track whether a Series/column had its dtype determined by default logic (might use np.void instead of object/float64?)

If we can't have both, then I think res_B should take priority since the user was explicit about what they wanted.

@jorisvandenbossche
Member

As I mentioned there, the reindex in that example (pd.concat([df1, df2.reindex(columns=df1.columns)])) is unnecessary, and without it you get the desired dtypes.

Yes, but this is only a dummy example, and back then I answered to that with the following (#43507 (comment)):

There can be good reasons for doing a reindex. For example, to determine the exact output columns: in the past, concat did have a join_axes keyword for this, but this was deprecated, pointing the user to use reindex instead.

So we have deprecated a feature for this, explicitly saying to users they can use reindex instead. But now we also have broken this reindex based workflow.
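A rough sketch of the reindex-based workflow referred to here, using made-up frames `df1` and `df2`. It only demonstrates that the output columns are pinned to those of `df1`; the resulting dtypes are exactly what this issue is about and vary by pandas version, so they are not shown.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = pd.DataFrame({"a": [3], "c": [0.5]})

# Align df2 to df1's columns before concatenating: "c" is dropped and
# "b" is added as an all-NaN column, so the output has exactly df1's columns.
result = pd.concat([df1, df2.reindex(columns=df1.columns)], ignore_index=True)
print(list(result.columns))
```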

would you only ignore the empty/all-NaN cases when they are object/float64, respectively?

Yes, and that's also what we did in the past more or less (eg we didn't ignore all-NaT datetime64). I would maybe leave out the "respectively", so for now consider both empty and all-NaN for both dtypes to keep things simpler (although we should check what we did exactly before).
We could try to restrict it to only object dtype, eventually, if we would use object dtype as the default when creating an all-missing column (such as in reindex / alignment).

track whether a Series/column had its dtype determined by default logic (might use np.void instead of object/float64?)

Long term, I think this is the way to go (something like this is what I meant with the "a better way to deal with empty columns without specific dtype information" above).
I think I mentioned it before (maybe only in a call) the option to have some kind of "null dtype", but np.void could serve the same purpose (probably the main discussion here would be if we want to make it an ExtensionDtype, but let's have that in another issue).

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Mar 9, 2022
@simonjayhawkins
Member

So I guess that discussion can be kept independent of reverting #43507 for 1.4.x?

The revert is not straightforward since there have been some changes to concat code since #43507 e.g. removing code in #43577, and changing signatures in #43626 and #43606

after a couple of attempts of reverting these in different orders to reduce the number of conflicts to manually resolve, will revisit again soon

We are agreed that we want to revert #43507 for 1.4.x and in a separate PR targeted to main/1.5 add a deprecation warning instead?

@simonjayhawkins
Member

moving to 1.4.3


7 participants