Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/DEPR: Deprecate inplace parameter #16529

Open
gfyoung opened this issue May 29, 2017 · 60 comments · May be fixed by #51466
Open

API/DEPR: Deprecate inplace parameter #16529

gfyoung opened this issue May 29, 2017 · 60 comments · May be fixed by #51466
Labels
API Design Closing Candidate May be closeable, needs more eyeballs Deprecate Functionality to remove in pandas inplace Relating to inplace parameter or equivalent Needs Discussion Requires discussion from core team before further action

Comments

@gfyoung
Copy link
Member

gfyoung commented May 29, 2017

The parameter inplace=False should be deprecated across the board in preparation for pandas 2, which will not support that input (we will always return a copy). That would give people time to stop using it.

Thoughts?

Methods using inplace:

Deprecation non controvertial (a copy will be made anyway, and inplace=True does not add value):

  • (Series/DataFrame).drop
  • (Series/DataFrame).drop_duplicates
  • (Series/DataFrame).dropna
  • DataFrame.set_index (with drop=False wouldn't change the data, but that doesn't seem the main use case)
  • DataFrame.query
  • DataFrame.eval

Not sure:

  • (Series/DataFrame).sort_values
  • (Series/DataFrame).sort_index

Should be able to not copy memory (under discussion on what to do):

  • (Series/DataFrame).clip
  • (Series/DataFrame).where
  • (Series/DataFrame).fillna
  • (Series/DataFrame).rename_axis
  • (Series/DataFrame).reset_index
  • (Series/DataFrame).replace
  • (Series/DataFrame).set_axis
  • (Series/DataFrame).mask
  • (Series/DataFrame).interpolate
  • DataFrame.rename
  • Index.rename
  • Index.set_names
  • MultiIndex.set_levels
  • MultiIndex.set_labels
  • pandas.core.resample.Resampler.interpolate

Special cases:

  • pandas.eval (with inplace=False the value is not returned but set to an argument target)
@gfyoung gfyoung changed the title API/DEPR: Deprecate inplace operations API/DEPR: Deprecate inplace parameter May 29, 2017
@jreback jreback added API Design Deprecate Functionality to remove in pandas labels May 30, 2017
@jreback jreback modified the milestones: 0.21.0, 1.0 May 30, 2017
@drafter250
Copy link

I thought The main usage of this feature was to mitigate memory usage in case of large DataFrames?

@gfyoung
Copy link
Member Author

gfyoung commented Jul 26, 2017

I thought The main usage of this feature was to mitigate memory usage in case of large DataFrames?

I think @jreback might know more about the initial usage, but generally from previous discussion, this parameter is misused and is more prone to introducing bugs.

@jreback
Copy link
Contributor

jreback commented Aug 22, 2017

inplace does not generally do anything inplace
but makes a copy and reassigns the pointer

mutating code is much harder to debug (not to mentiin more complicated to support actual inplace ops)

so except for inplace indexing generally these operations are simply more verbose and serve just to provide corner cases for testing

@datapythonista
Copy link
Member

@pandas-dev/pandas-core I think this was agreed during the sprint at Scipy, but I'm not sure if it was discussed when to deprecate the inplace parameters. Is it something we want to do before 1.0?

Personally I think the sooner, the better, since the decision is made.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

this is going to cause major downstream complaints, but I agree this should be sooner rather than later.

Would accept a FutureWarning for this for 0.24.0

@TomAugspurger
Copy link
Contributor

A few questions:

  1. There are some places (.where comes to mind, fillna? Maybe others?) where the operation can truly be done inplace. I don't think those should be deprecated.
  2. Following 1. is it possible to detect when we're doing an "inplace" operation on a copy? Could we only warn in those cases?

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

There are some places (.where comes to mind, fillna? Maybe others?) where the operation can truly be done inplace. I don't think those should be deprecated.

this is a very limited number, and though possible to detect (e.g. a Series of a single dtype, or a DataFrame of a single dtype), IMHO is not worth leaving this option. just adding code complexity w/o any real perf benefit. So -1 on doing this.

@datapythonista
Copy link
Member

In my opinion, ideally we should always do the operation in place if possible, but still return a new object.

So, df = df.fillna(0) wouldn't copy memory (assuming no type change), and if the user does want to modify a copy, they should do df2 = df.copy().fillna(0).

Not sure if in practice can be complex to implement in some cases. But if that makes sense, we should deprecate all inplace arguments.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

So, df = df.fillna(0) wouldn't copy memory (assuming no type change), and if the user does want to modify a copy, they should do df2 = df.copy().fillna(0).

this is too complex. You either do it inplace or you don't. The keyword controls it. If we remove the keyword then it should never modify the original.

@TomAugspurger
Copy link
Contributor

just adding code complexity w/o any real perf benefit.

It depends on the application, but not having to copy can be a pretty big win, right? :)

Still, I agree that this is tough (impossible) to detect ahead of time. Is it feasible to detect it after the fact, and raise a PerformanceWarning in those cases (possibly elevating the an error later, if that makes sense)? In a simplified BlockManager world (one block per column) we could maybe keep a weakref from the column to the actual data (ndarray or the EA's data).

def fillna(self, **kwargs):
    # in BlockManager.fillna
    ref_to_data = self._get_ref_to_data()
    result = self.apply('fillna', **kwargs)
    new_data = self._get_ref_to_data()
    if ref_to_data != new_data and inplace:
         warnings.warn(PerformanceWarning)

If so, I would prefer to go that route, rather than having users change code.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

I am really -1 on this in any users code. So while this may have to be an extended deprecation cycle I think its worth it.

@datapythonista
Copy link
Member

datapythonista commented Dec 3, 2018

This is the list of methods functions in the public API with a parameter named inplace.

EDITED: Moved the lists to the description

Those are not really in place, and should be deprecated:

[EDIT: moved to the issue description]

If that sounds good, I'll create a PR to raise FutureWarning for the ones that we all agree should be deprecated (the first list). Then we can follow up with the rest.

Let me know if there are more from the first list that you want to postpone to the second phase.

@jorisvandenbossche may be you also want to take a look here.

@h-vetinari
Copy link
Contributor

h-vetinari commented Dec 3, 2018

@datapythonista @jreback
Another method that should be part of this discussion is .update, which is AFAICT the only method that's inplace by default, and does not even have an inplace-kwarg. Adding that has met resistance in #22286, but one could deprecate the whole method and replace it with df.coalesce: #22812

@TomAugspurger
Copy link
Contributor

@datapythonista how'd you define that list of methods that are not really inplace? I haven't looked closely, but things like Series.rename_axis should be doable without copying data. I believe clip / clip_lower as well.


I am really -1 on this in any users code.

So that argument is independent of whether or not any operation can be inplace, and we should discuss that. i.e. that it's the "opinion of pandas" that inplace is an anti-pattern to be avoided at all time.

Personally, I'm not sure about that.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

my point is this adds a lot of complexity
sure it’s opionated but so what

complexity is killing the ability of most folks to make modifications

this simplifies the model a great deal

@datapythonista
Copy link
Member

I just started the list of the actual inplace methods with the ones you said. There are some I can guess they should be able to be in place, like replace, but thought you or Jeff would know from the top of your head, so didn't want to guess.

Will move the lists to the description, and try to get it closer to reality, feel free to edit afterwards.

@MarcoGorelli
Copy link
Member

Hi @jorisvandenbossche

It's been 1.5 years since you wrote

but based on my initial experiments with Copy-on-Write (#36195, no PR yet), I am no longer fully convinced that completely deprecating the inplace keyword is the way to go.

Just wondering if your stance has changed at all

@jorisvandenbossche
Copy link
Member

A formal proposal to deprecate most occurrences of the inplace keyword (and the copy keyword as well) has been opened at #51466

@jondo
Copy link

jondo commented Feb 18, 2023

Here's the direct link to the proposal.

@jorisvandenbossche
Copy link
Member

FWIW, the proposal to remove the inplace keyword in certain methods is being finalized (#51466)

@Rinfore
Copy link

Rinfore commented Dec 6, 2023

Is there any way to keep the illusion of a mutating object with a pandas extension? https://pandas.pydata.org/docs/dev/development/extending.html

@jorisvandenbossche
Copy link
Member

@Rinfore can you clarify your question a bit? (pandas objects are still mutable, removing the inplace keyword in methods that didn't actually work inplace won't change that)

If you want some syntactic sugar to avoid having to re-assign to the same variable (like df = df.reset_index() instead of df.reset_index(inplace=True)), you can't do achieve that with an extension type, but I think you could in theory do that with an accessor (it would look something like df.inplace.reset_index(), if the accessor API allows to modify the calling object).
I am not sure if I would recommend doing that, though ;)

@Rinfore
Copy link

Rinfore commented Dec 6, 2023

Thanks so much for the insight @jorisvandenbossche!

I've been working on creating some custom accessors in a domain-specific library that allow to add abstractions with embedded business logic to data frames (e.g. fictitious example: based on presence of columns: [EmployeeId, Time, Event], I can classify the data-frame as a EmployeeRecords data frame) and use extension types on it (df.employeerecords.<method>). This allows users to treat them almost like instances of EmployeeRecords classes, but retain the flexibility to dive into the Pandas API.

In my custom accessors, I may do things such as filter out problematic rows via higher-level functions (e.g. df.employeerecords.drop_inconsistent_records()). Previously, by dropping rows in-place, I could avoid having my users do something like df = df.employeerecords.drop_inconsistent_records() so I was wondering if it would be possible to continue avoiding this (exactly as you mention, syntactic sugar).

I am also wondering if mutating a data frame to add columns (e.g. df['inconsistent'] = ...), or other similar operations, would become impossible in the future, which would be of significant concern to me, as my custom accessors rely heavily on things such as mutating the data frame to add columns or mutate other state.

In summary:

  1. I was wondering whether I could use custom accessors to achieve the syntactic sugar you mentioned, which seemed to be 'maybe' based on your reply above? I can't see how this is possible though, as assigning a new data frame to self._obj in the accessor does not replace the reference to the original object.

e.g.

@pd.api.extensions.register_dataframe_accessor('test')
class TestAccessor:

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def test_assign(self):
        copied_df = self._df.copy()
        copied_df.columns = ['new_df']
        self._df = copied_df
        
>>> test_df = pd.DataFrame({'old_df': [0]})
>>> test_df.test.test_assign()
>>> test_df.columns
Index(['old_df'], dtype='object') # not Index(['new_df'], dtype='object')

It appears as if the internal assignment in the custom accessor does nothing to modify the reference in the enclosing environment. Thus, I would likely need to return the new reference and have users update their variable references.

  1. Is it likely that mutating pandas objects via operations [], .index = , etc. would be disallowed in the future? (I'm hoping not...)

I apologise if this is not the correct forum to pose these queries, or if I have made some other error or slight in my post.

@kapoor1992
Copy link

7 years since this was opened, is this still something we want to do?

@phofl
Copy link
Member

phofl commented Mar 19, 2024

This is actually covered by PDEP8, I guess we can close here

@phofl phofl added the Closing Candidate May be closeable, needs more eyeballs label Mar 19, 2024
mferrera added a commit to mferrera/xtgeo that referenced this issue Jul 9, 2024
This resolves a few warnings related to accessing things. Removing the
`inplace=True` qualifier seems undesirable despite the warning being
emitted for it, but it turns out that generally when this is given
pandas is creating a copy internally anyway. pandas seem to have a
general desire to remove the `inplace=` options everywhere so that code
is not mutating, but remains more immutable.

See the following comment from years ago.
pandas-dev/pandas#16529 (comment)
mferrera added a commit to mferrera/xtgeo that referenced this issue Jul 9, 2024
This resolves a few warnings related to accessing things. Removing the
`inplace=True` qualifier seems undesirable despite the warning being
emitted for it, but it turns out that generally when this is given
pandas is creating a copy internally anyway. pandas seem to have a
general desire to remove the `inplace=` options everywhere so that code
is not mutating, but remains more immutable.

See the following comment from years ago.
pandas-dev/pandas#16529 (comment)
mferrera added a commit to equinor/xtgeo that referenced this issue Jul 10, 2024
This resolves a few warnings related to accessing things. Removing the
`inplace=True` qualifier seems undesirable despite the warning being
emitted for it, but it turns out that generally when this is given
pandas is creating a copy internally anyway. pandas seem to have a
general desire to remove the `inplace=` options everywhere so that code
is not mutating, but remains more immutable.

See the following comment from years ago.
pandas-dev/pandas#16529 (comment)
zargot added a commit to Big-Life-Lab/PHES-ODM-sharing that referenced this issue Jul 18, 2024
the inplace version of pandas.fillna apparently shouldn't be used

see pandas-dev/pandas#16529
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Closing Candidate May be closeable, needs more eyeballs Deprecate Functionality to remove in pandas inplace Relating to inplace parameter or equivalent Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.