Various methods don't call __finalize__ #28283
Comments
How In #27108, I'm using
    def reconcile_concat(others: List[DataFrame]) -> bool:
        """
        Allow duplicates only if all the inputs allowed them.
        If any disallow them, we disallow them.
        """
        return all(x.allows_duplicates for x in others)
However, that reconciliation strategy isn't valid / proper for arbitrary metadata. Which I think argues for some kind of dispatch system for reconciling metadata, where the attribute gets to determine how things are handled.
    allows_duplicate_meta = PandasMetadata("allows_duplicates")  # the attribute name

    @allows_duplicate_meta.register(pd.concat)  # the method
    def reconcile_concat():
        ...
Then we always pass cc @jbrockmendel, since I think you looked into metadata propagation in another issue. |
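The dispatch idea above could be sketched roughly as follows. This is only an illustration: `MetadataReconciler` is a hypothetical stand-in for the `PandasMetadata` name used in the comment, nothing like it exists in pandas, and handlers are keyed by a plain method-name string (an assumption made here so the sketch runs without pandas) rather than by `pd.concat` itself.

```python
from typing import Any, Callable, Dict, List


class MetadataReconciler:
    """Hypothetical per-attribute registry; not part of pandas."""

    def __init__(self, name: str) -> None:
        self.name = name  # the metadata attribute being reconciled
        self._handlers: Dict[str, Callable[[List[Any]], Any]] = {}

    def register(self, method: str) -> Callable:
        """Register a reconciliation function for one method name."""

        def decorator(func: Callable[[List[Any]], Any]) -> Callable[[List[Any]], Any]:
            self._handlers[method] = func
            return func

        return decorator

    def reconcile(self, method: str, values: List[Any]) -> Any:
        """Combine the inputs' values using the handler registered for `method`."""
        if method not in self._handlers:
            raise KeyError(f"no handler for {self.name!r} in {method!r}")
        return self._handlers[method](values)


allows_duplicates_meta = MetadataReconciler("allows_duplicates")


@allows_duplicates_meta.register("concat")
def reconcile_concat(values: List[bool]) -> bool:
    # Allow duplicates only if every input allowed them.
    return all(values)
```

With this shape, each attribute owns its own rules, so a different piece of metadata could register a different strategy for the same method without touching `concat`.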
IIRC I was looking at _metadata to try to implement units (this predated EAs). One of the biggest problems I had was that metadata on a Series didn't behave well when that Series is inserted into a DataFrame. Do we have an idea of how often _metadata is used in the wild? i.e. could we deprecate it and make an EA-based implementation? |
It’s essentially undocumented, so I’m OK with being aggressive here. What would an EA-based implementation look like? For something like units, metadata may not be appropriate. I think an EA dtype makes more sense. I’ll add this to the agenda for next week’s call. |
It's a bit ambiguous what this question is referring to, but big picture something like
|
Right. Does the current EA interface suffice for that use case, or are there additional hooks needed? |
Not a blocker for 1.0. |
Progress towards pandas-dev#28283. This calls `finalize` for all the public series methods where I think it makes sense.
Do people think that accessor methods like |
But would you then handle |
Name propagation isn't (currently) handled in My two motivating use-cases here are
|
It actually is, not? At least in some cases? (eg for new Series originating from other Series, where |
Apologies, I forgot that |
When should we call finalize? A high-level list: Yes
Unsure
These are somewhat arbitrary. I can't really come up with a rule why a reduction like |
Progress towards pandas-dev#28283. This adds tests that ensure `NDFrame.__finalize__` is called in more places. Thus far I've added tests for anything that meets the following rule:
> Pandas calls `NDFrame.__finalize__` on any NDFrame method that returns another NDFrame.
I think that given the generality of `__finalize__`, making any kind of list of which methods should call it is going to be somewhat arbitrary. That rule errs on the side of calling it too often, which I think is my preference.
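As a rough illustration of that rule, here is a toy check written without pandas. `ToyFrame`, its `doubled` method, and `propagates_attrs` are hypothetical names made up for this sketch; the real tests live in the pandas test suite and are not reproduced here.

```python
from typing import Any, Callable, Dict, List


class ToyFrame:
    """Stand-in for an NDFrame; hypothetical, not the pandas class."""

    def __init__(self, data: List[float]) -> None:
        self.data = data
        self.attrs: Dict[str, Any] = {}

    def __finalize__(self, other: "ToyFrame", method: str = "") -> "ToyFrame":
        # Copy metadata from the source object onto the result.
        self.attrs = dict(other.attrs)
        return self

    def abs(self) -> "ToyFrame":
        # Returns a new frame and propagates metadata.
        return ToyFrame([abs(x) for x in self.data]).__finalize__(self, method="abs")

    def doubled(self) -> "ToyFrame":
        # Returns a new frame but forgets to call __finalize__.
        return ToyFrame([2 * x for x in self.data])


def propagates_attrs(method: Callable[[ToyFrame], ToyFrame]) -> bool:
    """Check the rule: a method that returns a frame keeps the input's attrs."""
    frame = ToyFrame([-1.0, 2.0])
    frame.attrs = {"a": 1}
    return method(frame).attrs == {"a": 1}
```

Here `propagates_attrs(ToyFrame.abs)` passes the check while `propagates_attrs(ToyFrame.doubled)` does not, which is the kind of gap the xfailing tests are meant to expose.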
I've updated the original post. Hopefully we can find some contributors interested in working on getting finalize called in more places. |
@TomAugspurger I would be interested in contributing to pandas and start by helping to tackle some of these methods. Which methods might be good places to start? |
|
Hi, I'd be happy to help tick some of the boxes off here. Would love to see |
Hi, I want to take my first issue and I'm a new contributor to pandas. I'd like to get |
take |
Starting with DataFrame.idxmax() and DataFrame.idxmin() |
Hi, I'm new to open source and I'm interested in contributing to this issue. I'd like to start with |
Worked on |
Hello, |
@bobzhang-stack it looks like
    In [74]: df = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
    In [75]: df.attrs['a'] = 1
    In [76]: df.pop("B").attrs
    Out[76]: {'a': 1}
    In [77]: df.attrs
    Out[77]: {'a': 1}
It seems the majority of the remaining ones are related to operations between multiple objects with attrs (#49916). Aside from that, there's |
Hi I'm starting to work on |
Would it be possible for me to be assigned to df.merge() to attempt a fix? |
Improve coverage of NDFrame.__finalize__
Pandas uses NDFrame.__finalize__ to propagate metadata from one NDFrame to another. This ensures that things like self.attrs and self.flags are not lost. In general we would like that any operation that accepts one or more NDFrames and returns an NDFrame should propagate metadata by calling __finalize__.
The test file at https://github.com/pandas-dev/pandas/blob/master/pandas/tests/generic/test_finalize.py attempts to be an exhaustive suite of tests for all these cases. However there are many tests currently xfailing, and there are likely many APIs not covered.
This is a meta-issue to improve the use of __finalize__. Here's a hopefully accurate list of methods that don't currently call finalize.
Some general comments around finalize
It is not yet clear what should happen to attrs when there are multiple NDFrames involved with differing attrs (e.g. in concat). The safest approach is probably to drop the attrs when they don't match, but this will need some thought.
__finalize__ can be somewhat expensive, so we'd like to call it exactly once per user-facing method. This can be tricky for things like DataFrame.apply, which is sometimes used internally. We may need to refactor some methods to have a user-facing DataFrame.apply that calls an internal DataFrame._apply. The internal method would not call __finalize__; only the user-facing DataFrame.apply would.
If you're interested in working on this, please post a comment indicating which method you're working on. Un-xfail the test, then update the method to pass the test. Some of these will be much more difficult to work on than others (e.g. groupby is going to be difficult). If you're unsure whether a particular method is likely to be difficult, ask first.
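The public/internal split described in the general comments might look something like the toy sketch below. `ToyFrame`, `scaled_and_shifted`, and the `finalize_calls` counter are hypothetical and exist only to make the pattern visible; this is not how pandas is actually structured.

```python
from typing import Any, Callable, Dict, List


class ToyFrame:
    """Stand-in for a DataFrame; hypothetical, not the pandas class."""

    finalize_calls = 0  # counts __finalize__ calls, for illustration only

    def __init__(self, data: List[float]) -> None:
        self.data = data
        self.attrs: Dict[str, Any] = {}

    def __finalize__(self, other: "ToyFrame", method: str = "") -> "ToyFrame":
        ToyFrame.finalize_calls += 1
        self.attrs = dict(other.attrs)
        return self

    def _apply(self, func: Callable[[float], float]) -> "ToyFrame":
        # Internal helper: does the work, never touches metadata.
        return ToyFrame([func(x) for x in self.data])

    def apply(self, func: Callable[[float], float]) -> "ToyFrame":
        # User-facing method: delegates, then finalizes exactly once.
        return self._apply(func).__finalize__(self, method="apply")

    def scaled_and_shifted(self, scale: float, shift: float) -> "ToyFrame":
        # Reuses the internal helper twice, but still finalizes only once.
        step = self._apply(lambda x: x * scale)
        result = step._apply(lambda x: x + shift)
        return result.__finalize__(self, method="scaled_and_shifted")
```

Because internal callers go through `_apply`, a method built from several internal steps pays the cost of `__finalize__` once, at the user-facing boundary.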
DataFrame.__getitem__ with a scalar
DataFrame.eval with engine="numexpr"
DataFrame.duplicated
DataFrame.add, mul, etc. (at least for most things; some work to do on conflicts / overlapping attrs in binops)
DataFrame.combine, DataFrame.combine_first
DataFrame.update
DataFrame.pivot, pivot_table
DataFrame.stack
DataFrame.unstack
DataFrame.explode (BUG: added finalize to explode, GH28283 #46629)
DataFrame.melt (BUG/TST/DOC: added finalize to melt, GH28283 #46648)
DataFrame.diff
DataFrame.applymap
DataFrame.append
DataFrame.merge
DataFrame.cov
DataFrame.corrwith
DataFrame.count
DataFrame.nunique
DataFrame.idxmax, idxmin
DataFrame.mode
DataFrame.quantile (scalar and list of quantiles)
DataFrame.isin
DataFrame.pop
DataFrame.squeeze
Series.abs
DataFrame.get
DataFrame.round
DataFrame.convert_dtypes
DataFrame.pct_change
DataFrame.transform
DataFrame.apply
DataFrame.any, sum, std, mean, etc.
Series.str. operations returning a Series / DataFrame
Series.dt. operations returning a Series / DataFrame
Series.cat. operations returning a Series / DataFrame
.iloc / .loc (Add missing __finalize__ calls in indexers/iterators #46101)
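For the multi-object cases (concat, binops with overlapping attrs), the conservative "drop attrs when they don't match" idea from the general comments could be sketched as below. `reconcile_attrs` is a hypothetical helper operating on plain dicts; it is not a pandas function and does not claim to match whatever pandas ends up doing.

```python
from typing import Any, Dict, List


def reconcile_attrs(all_attrs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Keep attrs only when every input carries identical attrs."""
    if not all_attrs:
        return {}
    first = all_attrs[0]
    if all(attrs == first for attrs in all_attrs[1:]):
        return dict(first)  # copy so the result does not alias an input
    return {}  # inputs disagree: drop rather than guess
```

Dropping on any mismatch loses information, but it avoids silently attaching one input's metadata to a result that also depends on others.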