Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API / CoW: copy/view behaviour when constructing DataFrame/Series from a numpy array #50776

Closed
Tracked by #48998
jorisvandenbossche opened this issue Jan 16, 2023 · 3 comments · Fixed by #52022
Closed
Tracked by #48998
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors Copy / view semantics
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Context: with the Copy-on-Write implementation (see #36195 / detailed proposal doc, and overview follow up issue #48998), we generally want to avoid that updating (setting) values in one DataFrame/Series also changes another DataFrame/Series.
When constructing a DataFrame/Series from existing DataFrame/Series objects this can be handled correctly (the newly constructed object keeps track of the original ones, ensuring to correctly trigger the Copy-on-Write mechanism when needed). However, when constructing a DataFrame/Series from a numpy array, this doesn't work (the CoW mechanism works on the level of BlockManager and blocks).

Currently the DataFrame(..)/Series(..) constructors work "zero-copy" for a numpy array input (i.e. viewing the original ndarray). While it would be nice to keep behaviour for performance reasons, this also creates some problems for maintaining the copy/view rules under Copy-on-Write.

There are several aspects to this behaviour (I am showing the examples for Series, but the same is true for constructing DataFrame from a 2D array):

1) Updating the Series/DataFrame also updates the numpy array:

>>> arr = np.array([1, 2, 3])
>>> s = pd.Series(arr)
>>> s.iloc[0] = 100
>>> arr
array([100,   2,   3])

2) Updating the numpy array also updates the Series/DataFrame:

>>> arr = np.array([1, 2, 3])
>>> s = pd.Series(arr)
>>> arr[0] = 100
>>> s
0    100
1      2
2      3
dtype: int64

3) When creating multiple pandas objects viewing the same numpy array, updating the one will also update the other:

>>> arr = np.array([1, 2, 3])
>>> s1 = pd.Series(arr)
>>> s2 = pd.Series(arr)
>>> s1.iloc[0] = 100
>>> s2
Out[13]: 
0    100
1      2
2      3
dtype: int64

The first two cases could maybe be seen as corner cases (and documented as such, assuming changing the numpy array after DataFrame/Series creation is an "advanced" use case), although IMO ideally we should avoid that users can accidentally mutate DataFrame/Series in this way. But the third case clearly violates the principles of the copy/view rules under the CoW proposal.

I can think of two ways to fix this:

  1. We always copy input arrays in the constructors (i.e. effectively changing the default to copy=True).
    • We will still need some way to create the Series/DataFrame without a copy (for example internally, if we know the array was just created and isn't passed by the user), and we could maybe use the copy=False option for this? (but we will have to clearly document that you then circumvent CoW and we don't give guarantees about the resulting copy/view behaviour when the array or dataframe/series gets mutated.
    • Ideally we would avoid this for performance reasons, I think, at least for Series (and DataFrame in case the input array has the correct memory layout). But for typical 2D numpy arrays, making a copy by default could actually improve performance: PERF: Inefficient data representation when building dataframe from 2D NumPy array #50756
  2. We keep using a view by default (like we can do for DataFrame/Series input to the constructors), but use some other mechanism to enforce correct CoW behaviour.
    • One idea would be to change the input array to read-only (arr.flags.writeable = False). This would avoid the user to accidentally mutate the array and that way also update the pandas object (you can still change the flags, but that should give a hint to the user about what it is doing), and at the same time that would trigger pandas to make a copy of the array when the user tries to mutate the DataFrame/Series (triggering CoW, but using a different mechanism).
@phofl
Copy link
Member

phofl commented Feb 1, 2023

I'd be in favour of 1. Changing the default to copy=True but still allowing copy=False (and clearly documenting the caveats). I am also very much in favour of changing the memory layout in the process of making a copy.

@MarcoGorelli
Copy link
Member

@jorisvandenbossche any reason to move this back to the 2.0 milestone? Is it a blocker?

@phofl
Copy link
Member

phofl commented Mar 29, 2023

this will be finished by #50776

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors Copy / view semantics
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants