Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/PERF: copy=True keyword in methods where copy=False can improve perf #48141

Closed
jbrockmendel opened this issue Aug 18, 2022 · 13 comments
Closed
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Aug 18, 2022

xref #48117, #47993, #48043, #47934, #47932

We have a number of methods that currently make copies but don't have to, mainly methods that revolve around index/columns while leaving the data untouched. e.g. DataFrame.droplevel. By adding a keyword copy=True to these methods, we give users the choice to avoid making a copy by setting copy=False, thereby improving performance.

@jorisvandenbossche makes a reasonable point #48117 (review) that adding this may interact with potential copy/view changes in upcoming versions.

Besides droplevel and drop, for which PRs are already open, the other methods I'm looking at are reset_index, replace, and fillna (there may be others I haven't found). In all three of those cases, the idea would be to add copy and deprecate inplace (xref #16529).

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 18, 2022
@mroeschke
Copy link
Member

Should we revert #48043, #47934, #47932 in the meantime?

@jorisvandenbossche
Copy link
Member

On the PR I said:

And that change of behaviour is quite subtle (it's not something that we can easily deprecate about for those specific cases). I would really like to avoid that more people start to rely on the mutability of dataframes returned in such a copy=False case.

On second thought, this is not fully true. We could deprecate the copy=False behaviour, by just deprecating the full keyword (or specific option). But even then it is still the case that we are adding a new keyword that we would deprecate again shortly, which IMO also wouldn't be ideal.

@jreback
Copy link
Contributor

jreback commented Aug 20, 2022

ok i agree that this could be a can of worms and we have one change to do this right so will propose

  • let's revert the changes adding copy= for 1.5
  • let's inventory (create an issue) all methods and see where we have copy= and/or inplace= keywords
  • after 2.0 i think we should deprecate inplace= and add copy=

@lithomas1 lithomas1 added Enhancement Performance Memory or execution speed performance Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 20, 2022
@jbrockmendel
Copy link
Member Author

We could deprecate the copy=False behaviour, by just deprecating the full keyword (or specific option).

I've been assuming that, since we have existing methods with copy=True keywords, that we will end up making a decorator to deprecate those if/when CoW makes them redundant. If so, the marginal cost (on the dev side) to deprecating extra methods would be low.

But even then it is still the case that we are adding a new keyword that we would deprecate again shortly, which IMO also wouldn't be ideal.

That's fair. My thought going in was a) we an offer improved performance now, while such a deprecation is both future and not-assured, b) all else equal, an explicit copy is a better API than an implicit one (CoW).

If we do add the keyword now(ish), we could also document that there is a decent chance of an upcoming change.

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche what if we did the following if/when CoW becomes the default: keep a copy keyword but have the default be lib.no_default, so copy=True makes a copy eagerly, not-specifying-copy defaults to CoW, and copy=False does something like result._mgr.refs = None before returning result? Would result be an actual view in that case? Would that Break The World for CoW?

@jorisvandenbossche
Copy link
Member

what if we did the following if/when CoW becomes the default: keep a copy keyword but have the default be lib.no_default, so copy=True makes a copy eagerly, not-specifying-copy defaults to CoW, and copy=False does something like result._mgr.refs = None before returning result? Would result be an actual view in that case? Would that Break The World for CoW?

Yeah, based on my current understanding, that would break the CoW logic. Or at least it would introduce hard to predict behaviour. For example, you can set the refs of the resulting dataframe itself to None, but there might still be other objects referencing the same underlying data. So it is not a guarantee that no CoW happens (since we also check if the data is being referenced), or if we avoid CoW, it can break the guarantee of another object not getting mutated.
See also my comment here #36195 (comment) about the shallow copy concept (i.e. copy=False or copy(deep=False)) not being possible anymore in its current meaning when using CoW.

We might want to add an advanced feature to actually do something like this, at some point if there is demand for that. But if we do so, I think it should be 1) new API (so you don't accidentally get this because you already did copy=False before), and 2) it might be better to expose this in methods that do the actual mutation (so you can say: I know what I am doing, for this mutation, do not do any CoW ref checks)

In general for the copy keyword, we still need to discuss what we actually want to do with this keyword. My current thinking was that we can indeed keep this, and have copy=True do an actual (eager) copy, while copy=False (the default) uses delayed CoW (that's what #46958 did for the few methods where I already updated this, see #46958 (comment)).
There can be some value in allowing users to do copy=True (although limited, since you can also use the copy() method afterwards if you want this). Alternatively, we could also use copy=<no_default> instead of copy=False as the default. Or we could also simply remove the copy keyword eventually.

If so, the marginal cost (on the dev side) to deprecating extra methods would be low.

On the dev side yes, but on the user side we would still be advising people to use copy=False now (eg from an inplace deprecation), which they potentially need to change again then, so in which case users would have to update their code twice.

all else equal, an explicit copy is a better API than an implicit one (CoW).

Yes, but that's for copy=True. If we are adding a new copy keyword now in certain methods, it is for users to do copy=False? (since the True case is already the current behaviour)

@mroeschke
Copy link
Member

@jbrockmendel would you have time to revert #48043, #47934, #47932? I was attempting to revert but got lost in all the merge conflicts as I think inplace might or might not supposed to be deprecated somewhere as well

@jbrockmendel
Copy link
Member Author

I'll give it a go this afternoon

@jbrockmendel
Copy link
Member Author

Yes, but that's for copy=True. If we are adding a new copy keyword now in certain methods, it is for users to do copy=False? (since the True case is already the current behaviour)

Correct, its to give users the option to get the more performant behavior (including internally, see a couple usages in #48117). In particular I have in mind use cases with method chaining where you might do something like df.foo(copy=False).bar(copy=False).baz(copy=False).[...].zoom(copy=True)

In general for the copy keyword, we still need to discuss what we actually want to do with this keyword. My current thinking was that we can indeed keep this, and have copy=True do an actual (eager) copy, while copy=False (the default) uses delayed CoW (that's what #46958 did for the few methods where I already updated this, see #46958 (comment)).

Yah this is a tough one. If/when CoW is enabled by default, do you expect there will still be an option to disable it via pd.options? If so, I expect we'd want to keep the copy kwd available for this mode.

@jreback
Copy link
Contributor

jreback commented Aug 22, 2022

i think we should wait to avoid confusion with CoW

then make a single change for 2.0 (or even wait until cow is default)

@jbrockmendel
Copy link
Member Author

Opened #48201 to revert #48934. For the other two, we've also deprecated inplace, would need to revert those deprecations if reverting the copy kwd. I'm not wild about this, but will do it if there's consensus.

@jorisvandenbossche
Copy link
Member

Sorry for the slow follow-up here.

For the other two, we've also deprecated inplace, would need to revert those deprecations if reverting the copy kwd. I'm not wild about this, but will do it if there's consensus.

I would personally also prefer to revert those, but we would need to indeed revert also the deprecation of inplace. The same arguments as discussed above apply to those methods.

I opened #48417 for set_index, in case we agree on this path forward.

@simonjayhawkins simonjayhawkins added the Closing Candidate May be closeable, needs more eyeballs label Mar 10, 2023
@jbrockmendel
Copy link
Member Author

CoW becoming the default/only behavior seems increasingly likely. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

7 participants