Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for future copy / view semantics in indexing operations #36195

Closed
TomAugspurger opened this issue Sep 7, 2020 · 80 comments
Closed

Proposal for future copy / view semantics in indexing operations #36195

TomAugspurger opened this issue Sep 7, 2020 · 80 comments
Labels

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 7, 2020

Update: concrete proposal for Copy-on-Write: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing


[original post]

Hi all,

During the recent sprint we discussed copy / view semantics in indexing operations. I've tried to summarize a proposal that came out of it.

The tldr is

  1. Indexing operations always return a view where possible (always when indexing columns, if possible when indexing the rows).
  2. For consistent behavior, we implement either Copy on Write or Error on Write (explained in the proposal)
  3. For readability / understandability, we give users the tools to control in code whether mutations to a "child" dataframe (result of an indexing operation) propagate to the parent, while avoiding unnecessary "defensive" copies.

The full proposal is available at https://docs.google.com/document/d/1csGE4qigPR2vzmU2--jwURn3sK5k5eVewinxd8OUPk0/edit. I'd like to collect comments there and then propose properly by adding it to the roadmap.

cc @pandas-dev/pandas-core

@jbrockmendel
Copy link
Member

jbrockmendel commented Sep 7, 2020

asv from a branch that changes column-based indexing to always return a view:

       before           after         ratio
     [42289d0d]       [f99abd57]
     <master>         <myway>   
-     9.24±0.06ms      8.77±0.02ms     0.95  algorithms.Hashing.time_series_string
-        57.3±2ms       54.3±0.2ms     0.95  join_merge.MergeAsof.time_by_int('backward', 5)
-      76.1±0.3ms       72.0±0.3ms     0.95  frame_methods.Describe.time_dataframe_describe
-        57.3±1ms       54.0±0.2ms     0.94  join_merge.MergeAsof.time_by_int('backward', None)
-     14.0±0.07ms      13.2±0.08ms     0.94  groupby.MultiColumn.time_cython_sum
-        1.29±0ms      1.20±0.01ms     0.94  timeseries.AsOf.time_asof_nan_single('DataFrame')
-     1.50±0.01ms      1.38±0.01ms     0.92  timeseries.AsOf.time_asof_single('DataFrame')
-     3.62±0.01ms      3.31±0.01ms     0.91  groupby.CountMultiDtype.time_multi_count
-      67.0±0.4ms       60.7±0.4ms     0.91  reshape.Crosstab.time_crosstab_normalize_margins
-      14.8±0.4ms      13.2±0.08ms     0.89  join_merge.MergeAsof.time_on_int32('nearest', 5)
-     14.4±0.09ms      12.8±0.04ms     0.89  join_merge.MergeAsof.time_on_int32('nearest', None)
-      14.7±0.1ms      13.0±0.06ms     0.89  join_merge.MergeAsof.time_on_uint64('nearest', 5)
-     14.3±0.05ms      12.6±0.03ms     0.88  join_merge.MergeAsof.time_on_uint64('nearest', None)
-      69.8±0.5ms       61.4±0.7ms     0.88  reshape.PivotTable.time_pivot_table_margins
-     14.3±0.06ms       12.6±0.1ms     0.88  join_merge.MergeAsof.time_on_int('nearest', 5)
-     12.7±0.05ms      11.1±0.04ms     0.87  join_merge.MergeAsof.time_on_int32('forward', 5)
-     14.0±0.06ms      12.3±0.05ms     0.87  join_merge.MergeAsof.time_on_int('nearest', None)
-     12.5±0.05ms      10.9±0.05ms     0.87  join_merge.MergeAsof.time_on_int32('forward', None)
-     12.1±0.05ms      10.5±0.02ms     0.87  join_merge.MergeAsof.time_on_int32('backward', 5)
-     12.0±0.05ms      10.3±0.04ms     0.86  join_merge.MergeAsof.time_on_int32('backward', None)
-     12.7±0.09ms      10.9±0.04ms     0.86  join_merge.MergeAsof.time_on_uint64('forward', 5)
-     12.2±0.06ms      10.5±0.07ms     0.85  join_merge.MergeAsof.time_on_int('forward', 5)
-     12.4±0.03ms      10.6±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('forward', None)
-     12.0±0.06ms      10.2±0.04ms     0.85  join_merge.MergeAsof.time_on_int('forward', None)
-     11.9±0.05ms      10.1±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('backward', 5)
-     11.8±0.04ms      9.98±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('backward', None)
-      11.4±0.4ms      9.61±0.02ms     0.84  join_merge.MergeAsof.time_on_int('backward', None)
-     11.5±0.06ms      9.71±0.04ms     0.84  join_merge.MergeAsof.time_on_int('backward', 5)
-         407±3μs          291±2μs     0.72  reindex.Reindex.time_reindex_columns
-      3.99±0.1ms       2.72±0.2ms     0.68  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ge')
-     3.56±0.02ms      2.36±0.03ms     0.66  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ne')
-      3.60±0.2ms      2.35±0.02ms     0.65  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('lt')
-     47.8±0.08ms       28.4±0.1ms     0.59  frame_methods.Dropna.time_dropna_axis_mixed_dtypes('all', 1)
-      4.81±0.9ms      2.81±0.03ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('truediv')
-     4.13±0.04ms       2.41±0.2ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('eq')
-        11.9±1ms       6.88±0.5ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('pow')
-     4.12±0.04ms      2.34±0.05ms     0.57  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('le')
-        4.67±1ms       2.55±0.2ms     0.55  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('mul')
-        4.90±1ms       2.55±0.2ms     0.52  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('sub')
-      40.3±0.2ms       20.5±0.2ms     0.51  frame_methods.Dropna.time_dropna('all', 1)

Methodological note: i did a full asv run, then for everything that came back as changed re-ran those and changes that were not reproducible (e.g. tslibs benchmarks which have to be noise)

@TomAugspurger
Copy link
Contributor Author

I'd like to move forward with this, including a version of the proposal on https://pandas.pydata.org/pandas-docs/dev/development/roadmap.html.

There hasn't been much activity on the google doc, https://docs.google.com/document/d/1csGE4qigPR2vzmU2--jwURn3sK5k5eVewinxd8OUPk0/edit. The biggest outstanding item is a decision / discussion of "Copy on Write" vs. "Error on Write". It'd be nice if the proposal was explicit on this point, but I worry we won't be able to decide without actually trying it out.

@jorisvandenbossche
Copy link
Member

Some assorted thoughts I gathered the last weeks (not necessarily all needing an answer in an initial discussion / proposal, though):

  • The copy/view in indexing discussion also impacts methods that rely on indexing internally (eg drop, dropna). I think that is also shown from the fact that dropna performance improves in the benchmark output above.
    So the indexing vs methods (what you put in the "extended proposal" section) are certainly somewhat intertwined, although it is of course possible to first only target indexing by ensuring we do additional copies internally in methods like drop that expect to return a copy.

  • What to do with duplicate column selection? (df['a', 'a']]) Should that also still be a view? (if we would go for views) Mutating one column would then also mutate the other.

  • What kind of mutation operations do we consider in this discussion? We clearly talked about "changing values in-place (setitem)", but there are also other kinds of mutation like changing metadata, flags, ..

  • What to do with the constructors? (eg DataFrame(df)) Also copy-on-write? Or if we keep this as non-copy, what with constructors that have potentially a "reindexing" behaviour? (eg DataFrame(df, columns=[...]))

  • What do we do with accessing a single column as a Series? Do we keep this as a pure view, or do we also want to (eventually) introduce copy-on-write/error for this? (after deprecation cycle)

  • I think the crucial question that deserves more discussion is: are we fine with ditching the SettingWithCopyWarning for those cases it was introduced for?
    Because there is a reason that it was introduced: to warn uses that were doing a setitem operation that might have no effect silently, and thus be confusing / easy to miss for users.

    I think the classical example is something like df[df['B'] > 4]['B'] = 10 - a chained indexing assignment that doesn't work. The warning was introduced to prevent people doing that and silently having no effect.
    Are we fine with it that this will indeed have silently no effect? (which would be the consequence of copy-on-write)

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 1, 2020 via email

@TomAugspurger
Copy link
Contributor Author

Agreed that the classic chained indexing example is a compelling argument for Error on Write rather than Copy on Write.

What to do with duplicate column selection? (df[['a', 'a']]) Should that also still be a view? (if we would go for views) Mutating one column would then also mutate the other.

IMO, yes that should create two views to the same data.

What kind of mutation operations do we consider in this discussion? We clearly talked about "changing values in-place (setitem)", but there are also other kinds of mutation like changing metadata, flags, ..

I'd like to keep this focused just on setting data / values. I'm hopeful that the metadata / flags issue can be solved by doing shallow copies by default and providing an option to copy the data if desired, like we did for set_flags.

What do we do with accessing a single column as a Series?

If possible, treating it the same as a DataFrame seems best to me.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2020

is there any system that does error-on-write? I do prefer this, but am a little concerned that we are treading new group vs. copy-on-write (which I think has a large body of systems / literature).

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 12, 2020

Putting it here because most of the discussion on this topic is still here (but can also move to #36988).

While the classic chained indexing example is indeed an argument for Error-on-Write (for this case, it's basically what we have now but with an error instead of warning), I think there are also arguments/examples to give in favor of Copy-on-Write:

  • I would personally love to get the (IMO more common case) of subset = df[mask] or subset[list_of_columns] and then continuing with subset without worrying about the original df working out of the box, without the user needing to do .copy()/.copy_if_needed() (as is the case now if you don't want to risk getting the warning, even though those are always already copies)

  • There are some cases where we currently don't raise a warning (eg adding a new column to a subset created like above), that woud start raising an error.

  • We basically need Copy-on-Write anyway, because I think this still is the behaviour we want for methods, even if we do Error-on-Write for indexing? (drop, reindex, ...)

Also, I've seen the following behavior with people I've
worked with: ...

In this specific case, having consistent Copy-on-Write could also "solve" it? If is provides consistent behaviour (always copy, never mutate the passed data), not depending on the actual data passed to the function.
(of course, it depends on a lot on the actual case, and you were only giving a generic flow. But making things consistent in itself (always copy-on-write or always error-on-write) could also help avoiding such issues, I think? (while now you can actually mutate or not, when you get (and ignore) warning)

@jorisvandenbossche
Copy link
Member

Assume that we could actually detect chained indexing? In this hypothetical situation, would people then be more in favor of Copy-on-Write over Error-on-Write?

The assumption would be that we can detect the difference in __setitem__ between

df[df['B'] > 4]['B'] = 10

vs

temp = df[df['B'] > 4]
temp['B'] = 10

I am wondering if we shouldn't be able to actually detect this based on reference counting. In the first case, there is no additional reference to df[df['B'] > 4] than the calling dataframe (self) of the second __setitem__ operation.
(I am not sure how robust such a detection would be though)

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 12, 2020

In this specific case, having consistent Copy-on-Write could also "solve" it? If is provides consistent behaviour (always copy, never mutate the passed data), not depending on the actual data passed to the function.
(of course, it depends on a lot on the actual case, and you were only giving a generic flow. But making things consistent in itself (always copy-on-write or always error-on-write) could also help avoiding such issues, I think? (while now you can actually mutate or not, when you get (and ignore) warning)

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through. The other issue for me is that copy-on-write silently does something, which might have side effects that a user would like to be made aware of, whereas error-on-write makes the user very aware of what he is doing.

@jorisvandenbossche
Copy link
Member

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through.

Isn't the same true for Error-on-Write? We also need to detect all cases where we need to error instead of allow the mutation.

@jreback
Copy link
Contributor

jreback commented Oct 12, 2020

@jorisvandenbossche you have just described the current implementation of SettingCopyWarning and it's entire purpose

it is tracking refs

but i don't actually think this is that robust - and thats the same reason i am not enthusiastic about copy on write

yes if we can guarantee that we can always do it great

but that's the problem i am not sure we really can

though maybe now that we have full control over PandasArrays it might be fully possible (way back when when i did SettingWithCopy) some ops just indexes directly to the numpy arrays and we had to be very defensive about this

today we have much more robust handling

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 12, 2020

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through.

Isn't the same true for Error-on-Write? We also need to detect all cases where we need to error instead of allow the mutation.

Yes, I agree. But if we miss a case, we fix it by raising an exception. With copy-on-write, we fix it by doing the copy, but I'm guessing that there might be cases where we say "Given this case, we should have chosen error-on-write".

My sense is that it is easier to go from the current state of SettingWithCopyWarning to error-on-write and then to copy-on-write, than to go from SettingWithCopyWarning to copy-on-write and then discover we have to do error-on-write because we simply missed something. At least as a first step, we can change all the SettingWithCopyWarning stuff to just raise an exception and we have a first implementation. (That's my sense - I could be totally wrong)

@TomAugspurger
Copy link
Contributor Author

Just to clarify the choice, is it fair to say that the only real choice is between

  1. Error on Write
  2. Copy on Write with SettingWithCopyWarning for chained indexing?

The initial proposal (and my PR) left out the warning specifically for chained indexing part. If we can reliably detect chained indexing (a big if?) then do people prefer option 2?

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche you have just described the current implementation of SettingCopyWarning and it's entire purpose

@jreback It might be achieved using the same mechanism (I would need to look more into the code of this, it's not a part I am very familiar with), but at least the goal I described is different: the current SettingWithCopyWarning warns in both example cases, while I want only the first (actual chained indexing) to raise, and not the second (assigning the subset to a variable).

Yes, I agree. But if we miss a case, we fix it by raising an exception. With copy-on-write, we fix it by doing the copy, but I'm guessing that there might be cases where we say "Given this case, we should have chosen error-on-write".

@Dr-Irv I don't fully understand your last sentence.
It's true that, in case of a bug in the detection code when to copy/error, the fix is more explicit with error-on-write ("loud", since people will start seeing an error if they relied on wrong behaviour, instead of simply no longer getting the wrong behaviour).
So do you mean that even if we choose copy-on-write, we need to fix cases by raising an error instead of copying because it would otherwise be a "silent" fix?

@jorisvandenbossche
Copy link
Member

Just to clarify the choice, is it fair to say that the only real choice is between

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

@jbrockmendel
Copy link
Member

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

I prefer this over the status quo.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 12, 2020

@Dr-Irv I don't fully understand your last sentence.
It's true that, in case of a bug in the detection code when to copy/error, the fix is more explicit with error-on-write ("loud", since people will start seeing an error if they relied on wrong behaviour, instead of simply no longer getting the wrong behaviour).
So do you mean that even if we choose copy-on-write, we need to fix cases by raising an error instead of copying because it would otherwise be a "silent" fix?

No, I mean let's say that we choose copy-on-write. We think we get it right. We release the code. Someone discovers a place where we didn't get it right, and then we discuss it, and realize we should have chosen "error-on-write" in the first place. I'm just erring on the side of caution by preferring error-on-write.

@jorisvandenbossche
Copy link
Member

let's say that we choose copy-on-write. We think we get it right. We release the code. Someone discovers a place where we didn't get it right, and then we discuss it, and realize we should have chosen "error-on-write" in the first place. I'm just erring on the side of caution by preferring error-on-write.

OK, understand it now ;)
But, I personally think we should discuss / choose what we actually prefer. Let's assume that after more discussions we think we prefer copy-on-write over error-on-write, then I would say we should go with that, even if that makes it more difficult to still change mind afterwards.
This also touches upon implementation / adoption questions which we didn't really discuss yet. I think we will need some way to opt-in to this new behaviour initially / have a flag to enable it / .. as an experimental stage in which we can still make breaking changes (like going from copy-on-write to error-on-write, which in principle could also still be done with deprecations first). This might be quite complex to first have both old and new behaviour side by side, but not sure there is a good alternative ..

(and to be clear: while I am mainly arguing here for Copy-on-Write to give some counter-arguments, I am not yet sure myself which of the two is "best". I mainly think it's something to futher explore/experiment/discuss)

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

I prefer this over the status quo.

I also prefer that over status-quo. But, @jbrockmendel, do you also prefer that over the copy-on-write/error-on-write alternatives discussed? (personally, I think that I am currently in favor of one of those two other options)
Note that also in this case of "clear rules", the discussion around what to do with the "typical chained indexing example" is still needed (eg if we have the rule that subsetting rows is always a copy, do we stop raising SettingWithCopyWarning for cases like subset = df[mask] with subsequent changes ?)

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 14, 2020

Here's something that I brought up in the October, 2020, dev meeting. Consider:

>>> df = pd.DataFrame({"x":[1,2,3], "y":[4,5,6]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
>>> sx=df['x']
>>> sx.iloc[1] = 10
>>> sx
0     1
1    10
2     3
Name: x, dtype: int64
>>> df
    x  y
0   1  4
1  10  5
2   3  6
>>> sdf = df[['x']]
>>> type(sdf)
<class 'pandas.core.frame.DataFrame'>
>>> sdf
    x
0   1
1  10
2   3
>>> sdf.loc[1,'x'] = 20
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> sdf
    x
0   1
1  20
2   3
>>> df
    x  y
0   1  4
1  10  5
2   3  6

Right now, the behavior of picking a single column using the syntax df['x'], which creates a Series versus df[['x']] which creates a DataFrame is different, in that when you get a Series, modification of that series silently modifies the underlying DataFrame while when you get a DataFrame, you get the SettingWithCopyWarning. If we choose copy-on-write, then someone who used the df['x'] syntax with the dependence that the underlying DataFrame was modified will no longer have working code, and will have a hard time figuring out that a new version of pandas is what caused the issue.

On the other hand, if we use error-on-write, then in BOTH cases, we raise an error, which makes it clear to the user that they need to make their intent clear.

@jbrockmendel
Copy link
Member

Do either the CoW or EoW allow for the possibility that when i edit a view I am doing so intentionally?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 14, 2020

Do either the CoW or EoW allow for the possibility that when i edit a view I am doing so intentionally?

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

@jbrockmendel
Copy link
Member

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

Yes. A case where I want to modify both.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 14, 2020

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

Yes. A case where I want to modify both.

Yes, the proposal has a as_mutable_view() method to allow the user to indicate they want to allow mutation. Works whether we do COW or EOW.

@toobaz
Copy link
Member

toobaz commented Oct 14, 2020

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

Sorry, I'm missing the difference between these "clear rules" and the "whenever possible", could you clarify?

I'm also missing one thing in the proposal: how can something like df[["A"]].as_mutable_view() be implemented? Either df[["A"]] already provides a mutable view - hence the method call is noop - or it provides a copy of data - and it will be painfully hard to keep track and edit the original data (right?).

A syntax I would understand would be something like df.v[["A"]] - where .v is lazy and only operates when indexed. But even leaving aside the syntax, are we thinking this should work for all indexing operations?! For instance df.loc[['a', 'b'], slice(1, 10, 4)]?! Not that it's impossible, but it does looks difficult to me.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 14, 2020

I'm also missing one thing in the proposal: how can something like df[["A"]].as_mutable_view() be implemented?

Indexing columns would always give views. The difference comes when you try to mutate the view

>>> df2 = df1[["A']]  # df2 is a view on df1
>>> df2.iloc[:, :] = 0  # Copies (with Copy on Write) or raises (with Error on Write).

And for the case of "I want to mutate df1, you'd use df2 = df1[["A"].as_mutable_view().

are we thinking this should work for all indexing operations?

I think this discussion is limited to

  • Indexing just the columns
  • (maybe) Indexing the rows in a way where NumPy can return a view when slicing a 1D array. So df.loc[:5, ["A"]] can be a view. But your example of ``df.loc[['a', 'b'], ["A"]]` cannot be a view, and would copy immediately.

If that sounds right, I'll clarify it in the proposal document.

@justmarkham
Copy link
Contributor

Hi all. I'm a long-time pandas user, and I've been teaching pandas to beginner/intermediate data science students since 2015. Much of that teaching is online, and as a result I've answered hundreds (or perhaps thousands) of pandas questions from newer users. That is all to say that I have somewhat of a pulse on what newer pandas users ask about.

I read through the copy-on-write proposal, and I wanted to share a few reactions to the proposal:

  1. As a pandas user, I don't want to have to memorize complicated rules in order to properly use the library, regardless of whether the rules are well-documented or not. The proposal gives me a simple rule to memorize, and so I'm in favor of that.

  2. I don't know what most pandas users expect, but personally, I never intend to mutate the "original" object when changing a "derived" object. Thus the proposal would make things actually work like I would hope they would work.

  3. I learned pandas before NumPy, and in fact I still don't know NumPy that well. As such, I'm not of the belief that "pandas always needs to behave consistently with NumPy". I still teach pandas with the assumption that the student knows nothing about NumPy, and I don't believe that every pandas user needs to also learn NumPy at some point. However, I do understand that many pandas users do know NumPy, and those users might be thrown off by differences in behavior between pandas and NumPy.

  4. If I understood correctly, chained assignment would stop working under this proposal. I would be in favor of chained assignment no longer being allowed, even for cases in which it currently works, since you can always just use loc instead (right?). In that case, I would want an error (or at least a warning) to be raised if you try to do a chained assignment.

I hope some of this is helpful to the discussion!

@chendaniely
Copy link
Contributor

Hello maintainers. Thank you for putting out a thoughtful proposal and getting more community feedback in the process.

I echo all of justmarkham's comments above, and left my comments in the google doc.

the tl;dr:

  • Proposal is great for an instructor who teaches novices. It seems to be most aligned with the learner's mental model when I see their pandas code.
  • I'm do not know about any of the performance implications of this change. And what use cases people have implemented that rely on the copy/view behaviors in pandas.
  • Documentation-wise I think all of the examples in the google document would fit well in the indexing and selecting data chapter in the user guide, and let me know how I can help on that end.

Here's an example of what things look like from the R world of copy-on-modify: https://gist.github.com/chendaniely/53c2ea975a61bd2f05ef2aca46f66062

@chendaniely
Copy link
Contributor

Sorry for the double post, but I updated the above gist for what "chained indexing" looks like from the R world (if that helps any)

The direct file is here: https://gist.github.com/chendaniely/53c2ea975a61bd2f05ef2aca46f66062#file-chained_indexing-base_tidyverse-r

@jorisvandenbossche
Copy link
Member

@justmarkham @chendaniely Thanks a lot for the feedback, that's really valuable.

I am personally happy to hear that you both think the current proposal would be more intuitive for (novice) users than the current situation (since this was one of my goals with the proposal). I also think that we should not assume our users to know (numpy) concepts of copy vs view in indexing.

[@justmarkham] If I understood correctly, chained assignment would stop working under this proposal. I would be in favor of chained assignment no longer being allowed, even for cases in which it currently works, since you can always just use loc instead (right?).
[@chendaniely] I updated the above gist for what "chained indexing" looks like from the R world

So chained assignment actually works in R!
With my current proposal, chained assignment would indeed never work.

There is one aspect related to this brought up by @shoyer on the mailing list: we could make one exception on the copy-on-write rule, i.e. accessing a single column of a DataFrame as a series would still be a view, any other indexing operation would follow copy-on-write as proposed. I recently commented about this on the mailing list at https://mail.python.org/pipermail/pandas-dev/2021-October/001395.html

This would ensure that one type of chained assignment (and possibly the most common one) keeps working: df[col][...] = .... This would make the transition smoother.
But at the same time, exactly because it is only one type of chained assignment that would work, the user needs to understand that selecting the column first works, but something like df[row_selection][col] = ... does not work. Although it seems a simple exception to the general rule, my feeling is that it will give rise to several potentially confusing corner cases.

[@chendaniely] I'm do not know about any of the performance implications of this change. And what use cases people have implemented that rely on the copy/view behaviors in pandas.

Yes, that's a good point and something I should explore in more detail (and add a section about in the document). I think in general for the majority of workflows, the performance will either stay the same or improve (because of avoiding unnecessary copies).

There will be specific cases that might be impacted though:

  • The relatively frequent case of mutating a dataframe that has a reference to another object: this can potentially become slower (if it's a case where we don't copy right now), or see a "delayed" impact (overall performance might be the same but the copy is now delayed at the moment when actually mutating)
  • A potential corner case that is heavily impacted is where you mutate a dataframe repeatedly in a for loop, and where you do some operation in the for loop that triggers the copy-on-write mechanism repeatedly (I should try to come up with some example code, but can't think of a sensible example right now. For example, @shoyer showed one example in an old discussion at COMPAT: remove SettingWithCopy warning, and use copy-on-write, #10954 #10973 (comment), but that would actually not trigger copy-on-write in my proof-of-concept implementation since I keep track of views per column instead of per full dataframe).

@jorisvandenbossche
Copy link
Member

As a further update on this thread: I have an initial implementation at #46958 (now for the default BlockManager, not for the opt-in ArrayManager). I also opened a separate PR that tries to add the examples from the proposal (and more) as unit tests: #46979

@nickeubank
Copy link
Contributor

A little cross promotion: https://youtu.be/aBeEN2klZQE

@mattharrison
Copy link

mattharrison commented May 20, 2022

This can't come soon enough. (Background: I teach 50-100+ folks per month on Pandas and wrote Learning the Pandas Library, Pandas Cookbook 2nd Ed, and Effective Pandas. I also consult on Pandas.)

Happy to give a prototype a run through on some of my Pandas material. However, I never hit our friend SettingWithCopyWarning with the code I write. (I am interested in seeing speed/memory improvements though.)

As far as some code breaking, IMO you should release the next minor version before this gets pushed updating the deprecation warning to state that certain behavior will be removed, and then push a minor (major?) version with the fix.

WRT "the breaking away from NumPy behavior" argument preventing this. No one I teach (or have consulted with) cares about that. Most of them don't use NumPy (and don't need to). Seems like a weird implementation detail to prevent such a useful update.

Again, this can't come soon enough for those who haven't embraced chaining.

@nickeubank
Copy link
Contributor

@jorisvandenbossche Out of curiosity, while I recognize that moving to copy on view is technically a breaking change, are you aware of any real-world examples where people are relying on pandas returning views in their code?

The rules around when pandas returns views are so idiosyncratic and sensitive to apparently-unrelated code changes that I've never felt remotely safe using views, and am curious if anyone else ever does...

@matthew-brett
Copy link
Contributor

I just wanted to add a huge - yes please - how can I help - when can we have this! And @nickeubank - I agree - I never feel safe deliberately using views. I guess the problem is that a lot of code doesn't deliberately use them, but uses them by accident. I did offer to do a survey to find out if that happened in real life, in #36195 (comment) - but I would in general love to know what it would take to get this in.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 23, 2022

Out of curiosity, while I recognize that moving to copy on view is technically a breaking change, are you aware of any real-world examples where people are relying on pandas returning views in their code?

I think that any case of chained assignment that currently works (intentionally, or somewhat by accident) is an example using a view that for sure people do rely on.

Apart from that, I fully agree that it's probably not common that people create an actual subset dataframe that is a view, and mutate that with the intention to also mutate the original dataframe (which is for me a good reason to not do this, because this is not what people expect, and in a large majority of cases also not what people want).

And to be clear, the chained assignment comprises both the typical case of doing something like df["C"][df["A"] > 1] = 10 (example from my slides), as the non-typical cases like df[1:3]["C"] = 10.
But I think the first one (i.e. selecting a column, and then directly mutating that column in a chained setitem) is certainly a common pattern (although we will typically recommend to use .loc instead), and the fact that this will no longer work will certainly affect quite some people. (I do think that we can (first warn and then) error for this, to make this very explicit that it doesn't work, and to avoid silent failures because of this change)

@nickeubank
Copy link
Contributor

But I think the first one (i.e. selecting a column, and then directly mutating that column in a chained setitem) is certainly a common pattern (although we will typically recommend to use .loc instead), and the fact that this will no longer work will certainly affect quite some people

Yeah, ok, I can see that! The compactness of the pattern means all the weird ways a view can "break" don't come into play, and so I can see how it would be frequently used naively. Thanks!

@jorisvandenbossche
Copy link
Member

About the "what would it take to get this in / how can I help":

First, it would actually be good to get some more explicit support from other pandas core devs (it's hard to estimate to what extent there is consensus around the latest proposal).

But apart from that:

  • We need an implementation:
    • API: New copy / view semantics using Copy-on-Write #46958 implements the core of the proposal, and mainly needs people with time to review this.

    • Apart from that core implementation, there will be other work needed to fully implement the proposal. Some additional work items:

      • Implement the new behaviour for constructors (eg constructing a DataFrame/Series from an existing DataFrame/Series should follow the same rules as indexing -> behaves as a copy through CoW)
      • Use the lazy copy (with CoW) mechanism in more methods where it can be used (currently I only explicitly did reset_index and rename as POC)
      • Explore / update the APIs that return numpy arrays (.values, to_numpy()). Potential idea is to make the returned array read-only by default
      • ...

      (I should create a tracker issues with a task list for those items)

    • For now I only implemented the new behaviour (toggled with an option), but in addition we will also need to implement all the deprecation warnings for cases that will change in the current behaviour (initially also behind an option).

  • Once the (core) implementation lands in master / a release, then it will be super useful to have people give this a try, run their code or test suites with it, etc, so we can iron out bugs / missing warnings / or discover unexpected consequences that need to be addressed/discussed.

So on the short term helping with with reviewing or some extra implementation tasks is most needed, but certainly also not super approachable to help with.
Once there is a basic implementation it will be easier to help with testing / giving feedback (although it's certainly already welcome to give the PR branch a try as well).

@davidleon123 davidleon123 mentioned this issue Dec 20, 2022
2 tasks
@mroeschke
Copy link
Member

I think this approached has been solidified with #51463 and any future discussion should occur in #48998 or in individual issues so closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests