Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Use correct ExtensionArray reductions in DataFrame reductions #35254

Merged
merged 9 commits into from
Jul 15, 2020

Conversation

jbrockmendel
Copy link
Member

def func(values):
if is_extension_array_dtype(values.dtype):
return extract_array(values)._reduce(name, skipna=skipna, **kwds)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not use the same exact path for non extension arrays

these if/then for extension arrays are terrible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd have to put the analogous check in each of the nanops.nanfoo functions, which I know @jorisvandenbossche wouldn't like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is func now effectively the same as blk_func on L8575?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, doing it here is IMO much cleaner than in each nanops function.

is func now effectively the same as blk_func on L8575?

Not exactly, the axis handling is different (also blk_func knows it gets an array, so the extract_array stuff is not needed).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is ok as a short term check, but seeing these constant if is_extension_array do one thing, else do something else basically means the api is missing lots of things. I would much rather fix it than keep doing this kind of thing. This makes code extremly hard to read and understand.

I remember being very excited when we removed the is_ndarray check here and this because much simpler, now we are going backwards.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this might be a path to a short-term compromise ;)

One problem is that frame_apply path is only taken for if not self._is_homogeneous_type and filter_type is None: ....
So that means eg a DataFrame with only columns of the same EAdtype will still take the incorrect path. So maybe we can edit that condition to include something like or self._any_extension_type.

I am also not sure why the filter_type="bool" functions (any/all) cannot take this path. That might give problems for the "boolean" dtype.

@simonjayhawkins
Copy link
Member

So maybe we can edit that condition to include something like or self._any_extension_type.

using self._mgr.any_extension_types also fixes #32651 (but not checked to see if anything breaks)

@jbrockmendel
Copy link
Member Author

One problem is that frame_apply path is only taken for if not self._is_homogeneous_type and filter_type is None: ....

Yah, the longer-term solution I have in mind is #34714 that all cases will go through. Is tinkering with the is_homogeneous_type check necessary for this PR, or can we punt on that until after 1.1?

@jorisvandenbossche jorisvandenbossche changed the title BUG: df.sum with Int64 dtype BUG: Use correct ExtensionArray reductions in DataFrame reductions Jul 15, 2020
@jorisvandenbossche
Copy link
Member

@jbrockmendel hope you don't mind, but I cherry-picked some commits with tests from me and Simon from the other PR onto this branch. Depending on whether we do the any_extension_type thing, we might need to xfail them, though

Is tinkering with the is_homogeneous_type check necessary for this PR, or can we punt on that until after 1.1?

I would personally prefer to do this here as well, as otherwise you get a rather surprising inconsistency which underlying operation is actually ran for DataFrames with EAs depending on wether other dtypes are present or not.

I pushed the tiny change for that as well in the last commit, to see what CI says, but we can easily undo that again.

@jorisvandenbossche
Copy link
Member

The one failure on the py37_np18 build seems to be unrelated .. Have we seen that elsewhere as well?

@simonjayhawkins
Copy link
Member

The one failure on the py37_np18 build seems to be unrelated .. Have we seen that elsewhere as well?

yep, although AFAIK we don't have an issue for this. see #34906 (comment) and #35086 (comment)

if i see this failure on a PR, I normally just re-run.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@TomAugspurger TomAugspurger added Numeric Operations Arithmetic, Comparison, and Logical operations ExtensionArray Extending pandas with custom dtypes or arrays. labels Jul 15, 2020
@simonjayhawkins
Copy link
Member

needs whatsnew for #32651 (can copy from #34210)

@jorisvandenbossche
Copy link
Member

@simonjayhawkins I think the general whatsnew note added by Brock already covers that as well (can maybe just add the issue number)

@TomAugspurger
Copy link
Contributor

Merging in a few hours if no objections (cc @jreback, I think you're the only one without an explicit +1)

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche your commits look good, thanks for the heads up

@jreback jreback merged commit 595208b into pandas-dev:master Jul 15, 2020
@jreback
Copy link
Contributor

jreback commented Jul 15, 2020

thanks @jbrockmendel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Mixed DataFrame with Extension Array incorrect aggregation
5 participants