BUG: Mixed DataFrame with Extension Array incorrect aggregation #34520

Closed
WillAyd opened this issue Jun 1, 2020 · 3 comments · Fixed by #35254
Labels: Bug; ExtensionArray (Extending pandas with custom dtypes or arrays); Regression (Functionality that used to work in a prior pandas version)

WillAyd commented Jun 1, 2020

I am surprised not to get an aggregation for column "b" when it has the Int64 dtype:

>>> df = pd.DataFrame([["a", 1]], columns=list("ab"))
>>> df.sum()
a    a
b    1
dtype: object
>>> df.astype({"b": "Int64"}).sum()
a    a
dtype: object
WillAyd added the Bug and ExtensionArray labels Jun 1, 2020
WillAyd changed the title from "BUG: Mixed DataFrame with Extension Array Produces aggregations incorrectly" to "BUG: Mixed DataFrame with Extension Array incorrect aggregation" Jun 1, 2020
@jorisvandenbossche

This is actually a regression on master, as this "worked" until recently:

In [6]: df.astype({"b": "Int64"}).sum() 
Out[6]: 
a    a
b    1
dtype: object

In [7]: pd.__version__ 
Out[7]: '1.0.3'

Many things still do not work correctly for DataFrame reductions with EA columns; I am trying to fix those in #32867.

The specific regression here is related to nansum, and possibly to the fact that we recently added an IntegerArray.sum method:

In [8]: pd.core.nanops.nansum(pd.Series(pd.array([1])))
...
ValueError: the 'dtype' parameter is not supported in the pandas implementation of sum()

This works in pandas 1.0.3, because we just convert the IntegerArray to a numpy array there, but apparently we stopped doing that.
That may be something to fix on its own, but for the DataFrame reduction the actual fix is to never call nansum and instead call the IntegerArray-specific reduction (which is what #32867 is trying to do).
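
To make the two code paths concrete, here is a minimal sketch, not the pandas implementation. `sum_via_numpy` and `sum_columns` are hypothetical helpers written for illustration: the first approximates the "convert to a NumPy array, then reduce" route, and the second reduces each column through its own array, so the Int64 column uses the masked array's `sum` rather than a generic nan-aware function. Column "a" is built with an explicit object dtype here (an assumption) so the sketch behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd


def sum_via_numpy(arr):
    """Hypothetical: convert a masked integer array to float NumPy, then nansum."""
    values = arr.to_numpy(dtype="float64", na_value=np.nan)
    return np.nansum(values)


def sum_columns(df):
    """Hypothetical: reduce each column with its own array's sum method."""
    out = {}
    for name, col in df.items():
        if pd.api.types.is_extension_array_dtype(col.dtype):
            # Extension-array columns (e.g. Int64) use their dedicated reduction.
            out[name] = col.array.sum()
        else:
            # Plain NumPy-backed columns are reduced by NumPy directly.
            out[name] = col.to_numpy().sum()
    return pd.Series(out, dtype=object)


arr = pd.array([1, 2, None], dtype="Int64")
numpy_total = sum_via_numpy(arr)  # float result; NA was converted to NaN and skipped
masked_total = arr.sum()          # integer result; NA is skipped by default

df = pd.DataFrame(
    {"a": pd.Series(["a"], dtype=object), "b": pd.array([1], dtype="Int64")}
)
result = sum_columns(df)  # both "a" and "b" are present in the result
```

The point of the per-column dispatch is that no column is silently dropped because a generic reduction failed on its array type.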

@simonjayhawkins

This bug only occurs for numeric_only=None, the default.

>>> df.astype({"b": "Int64"}).sum(numeric_only=True)
b    1
dtype: int64
>>> df.astype({"b": "Int64"}).sum(numeric_only=False)
a    a
b    1
dtype: object
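
Based on the observation above, passing `numeric_only` explicitly may serve as a workaround on affected versions. The following is a small sketch under that assumption; the variable names are illustrative, and column "a" is given an explicit object dtype so the example does not depend on version-specific string inference.

```python
import pandas as pd

df = pd.DataFrame({"a": pd.Series(["a"], dtype=object), "b": [1]})
df = df.astype({"b": "Int64"})

# Only numeric columns are reduced, so "b" is kept and "a" is excluded.
numeric_total = df.sum(numeric_only=True)

# All columns are reduced; nothing is dropped on failure.
all_total = df.sum(numeric_only=False)
```

Either call states the intent explicitly instead of relying on the `numeric_only=None` fallback that triggers the bug.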

@simonjayhawkins

> This works in pandas 1.0.3, because we just convert the IntegerArray to a numpy array there, but apparently we stopped doing that.

It appears #32950 caused the regression:

b838508 is the first bad commit
commit b838508
Author: jbrockmendel <jbrockmendel@gmail.com>
Date: Mon Mar 30 15:43:04 2020 -0700

REF: move mixed-dtype frame_apply check outside of _reduce try/except (#32950)
