Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API / CoW: Copy NumPy arrays by default in DataFrame constructor #51731

Merged
merged 29 commits into from
Mar 17, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Mar 1, 2023

@phofl phofl marked this pull request as draft March 1, 2023 23:47
and (dtype is None or is_dtype_equal(values.dtype, dtype))
and copy_on_sanitize
):
values = np.array(values, order="F", copy=copy_on_sanitize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the order="F" thing something we should be doing in general in cases with copy=True?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are operating on the transposed array when copying a DataFrame object, so not needed in that case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #44871 took a look at preserving order when doing copy. there are some tradeoffs here

@@ -685,6 +685,8 @@ def __init__(
# INFO(ArrayManager) by default copy the 2D input array to get
# contiguous 1D arrays
copy = True
elif using_copy_on_write() and isinstance(data, np.ndarray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential alternative is to express this in the inverse: .. and not isinstance(data, (Series, DataFrame)).

That would ensure that also other "array-likes" that would coerce zero-copy into a numpy array but are not exactly np.ndarray would get copied by default (not fully sure how our constructor currently handles such array like objects, though).
I think in the end only for pandas objects we can keep track of references, so only in that case the default should be a shallow copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try and see what breaks

@@ -762,6 +764,9 @@ def _try_cast(

subarr = maybe_cast_to_integer_array(arr, dtype)
else:
subarr = np.array(arr, dtype=dtype, copy=copy)
if using_copy_on_write():
subarr = np.array(arr, dtype=dtype, copy=copy, order="F")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the order="F" specifically needed for CoW? (and can you add a comment about it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about #50756

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that it's related to that, but I don't understand why we would only do it for CoW? The default is not to copy arrays (without CoW) at the moment, which creates this inefficient layout; but so if the user manually specifies copy=True in the constructor, why not directly copy it to the desired layout in all cases without the if/else check for CoW?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah did not think about this, will add

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on @jbrockmendel s comment above I think we should leave it out for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on @jbrockmendel s comment above I think we should leave it out for now

Leave it out in general (from this PR), or you mean not do it for the default mode for now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely default mode and maybe also split the copy and the layout change into two PRs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reiterating preference for this to be separate. The two of you are much more familiar with the CoW logic than most of the rest of us; i get antsy when i see using_copy_on_write checks appearing in new places

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed all order relevant changes, is more straightforward now

np.random.randn(10, 3),
index=index,
columns=Index(["A", "B", "C"], name="exp"),
copy=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the False "needed" here (did it otherwise give failures), or just for efficiency since this is an example case where we know the array is not owned by anyone else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying here causes one test to fail, which is very weird(the failure). Haven't looked closer yet, but the test is useless as soon as your read_only pr is merged.

Want to understand what's off there nevertheless though

@@ -57,7 +57,10 @@ def test_fillna_on_column_view(self, using_copy_on_write):

# i.e. we didn't create a new 49-column block
assert len(df._mgr.arrays) == 1
assert np.shares_memory(df.values, arr)
if using_copy_on_write:
assert not np.shares_memory(df.values, arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or pass copy=False to the constructor instead?

Because now the check above about assert np.isnan(arr[:, 0]).all() is kind of useless because arr was copied and so of course will not be modified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since this is inplace=True and there are no other references to df, shouldn't arr be modified also with CoW before this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we are doing df[0] which is a chained assignment case. Updated the test to set copy=False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, missed the [0] ..

@jbrockmendel
Copy link
Member

I think many of the users/cases passing a single ndarray to DataFrame expect that to be no-copy and for pd.DataFrame(arr).values to round-trip without making a copy. Asking them to pass copy=False isn't a huge burden, but it does add some friction to asking the to try out CoW, which im increasingly excited about.

it would help if we could disentangle "this makes CoW behave better" from "this makes reductions more performant" as the motivation here

@phofl
Copy link
Member Author

phofl commented Mar 3, 2023

This is just a solve two things with one pr thing here, happy to remove the order if you think this causes problems.

the actual problem I am trying to solve is the following:

arr = np.array([[1, 2], [3, 4]])
df = DataFrame(arr)
# Do some operations that don’t copy
arr[0, 0] = 100

If we don’t do a copy in the constructor, updating the array will update all dataframes as well

Edit: my motivation was that if we do a copy anyway, we can also change the order as well

@phofl phofl marked this pull request as ready for review March 3, 2023 23:45
# Conflicts:
#	pandas/tests/frame/methods/test_transpose.py
@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Mar 13, 2023
@jorisvandenbossche jorisvandenbossche added the Constructors Series/DataFrame/Index/pd.array Constructors label Mar 13, 2023
@jorisvandenbossche jorisvandenbossche changed the title CoW: Implement copy by default for df construction with NumPy array API / CoW: Copy NumPy arrays by default in DataFrame constructor Mar 13, 2023

# TODO(CoW): This should raise a chained assignment error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to the list in the overview issue #48998

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx

assert df.to_numpy(copy=False).base is arr
if using_copy_on_write:
assert df.values.base is not arr
assert df.to_numpy(copy=False).base is not arr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add something like df.to_numpy(copy=False).base is df.values.base (because I think this was part of the intention of the test to verify that to_numpy(copy=False) didn't make a copy, and not so much that DataFrame(arr) doesn't make a copy)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -129,7 +129,10 @@ def test_transpose_get_view_dt64tzget_view(self):
assert result._mgr.nblocks == 1

rtrip = result._mgr.blocks[0].values
assert np.shares_memory(arr._ndarray, rtrip._ndarray)
if using_copy_on_write:
assert not np.shares_memory(arr._ndarray, rtrip._ndarray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, for the intent of the test, I think we should still try to verify that df.T shares the memory with df?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good! (I don't have a strong opinion about splitting off the order="F" changes. Doing it separately might make it easier to do some extra perf tests on the examples that were identified below as slower, but personally I think it's also fine to just keep as is here in the PR).
Only added some comments on the tests.

Should we add some explicit tests to copy_view/test_constructors.py?

@@ -306,18 +306,24 @@ def test_constructor_dtype_nocast_view_2d_array(
assert df2._mgr.arrays[0].flags.c_contiguous

@td.skip_array_manager_invalid_test
def test_1d_object_array_does_not_copy(self):
def test_1d_object_array_does_not_copy(self, using_copy_on_write):
# https://github.com/pandas-dev/pandas/issues/39272
arr = np.array(["a", "b"], dtype="object")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or add copy=False here to keep the test as is? (and to keep coverage of that case)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, done

# https://github.com/pandas-dev/pandas/issues/39272
arr = np.array([["a", "b"], ["c", "d"]], dtype="object")
df = DataFrame(arr)
assert np.shares_memory(df.values, arr)
if using_copy_on_write:
assert not np.shares_memory(df.values, arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 296 to 307
elif (
using_copy_on_write()
and isinstance(values, np.ndarray)
and (dtype is None or is_dtype_equal(values.dtype, dtype))
and copy_on_sanitize
):
values = np.array(values, copy=copy_on_sanitize)
values = _ensure_2d(values)

elif isinstance(values, (np.ndarray, ExtensionArray, ABCSeries, Index)):
# drop subclass info
values = np.array(values, copy=copy_on_sanitize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, don't know if this changed from the original version or I missed it before, but what is this additional check exactly doing? Because it seems to do exactly the same as the next elif block, and whenever you fullfill the new elif check, you would also fulfill the existing one (since the existing elif will pass whenever values is an ndarray, but without the extra checks)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this is a leftover from the order change, removed it

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a whatsnew note?

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Copy link
Member

One other question, but not for this PR: this does it for the DataFrame constructor; so we should still do a follow-up for the Series constructor as well?

@phofl
Copy link
Member Author

phofl commented Mar 15, 2023

Good point, added

@phofl
Copy link
Member Author

phofl commented Mar 15, 2023

Yes we should, will put up a pr when this is merged

@@ -190,6 +190,10 @@ Copy-on-Write improvements
of Series objects and specifying ``copy=False``, will now use a lazy copy
of those Series objects for the columns of the DataFrame (:issue:`50777`)

- The :class:`DataFrame` constructor, when constructing from a NumPy array,
will now copy the array by default to avoid mutating the :class:`DataFrame`
when mutating the array. Specify ``copy=False`` to get the old behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a warning like (about copy=False): "in that case pandas does not guarantee correct Copy-on-Write behaviour in case the numpy array would get modified after creating the DataFrame"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@jorisvandenbossche jorisvandenbossche merged commit d534007 into pandas-dev:main Mar 17, 2023
@jorisvandenbossche
Copy link
Member

Thanks!

@lumberbot-app
Copy link

lumberbot-app bot commented Mar 17, 2023

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.0.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 d534007e4cf07b8a8070f0ff9fe8875e5566f2d8
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #51731: API / CoW: Copy NumPy arrays by default in DataFrame constructor'
  1. Push to a named branch:
git push YOURFORK 2.0.x:auto-backport-of-pr-51731-on-2.0.x
  1. Create a PR against branch 2.0.x, I would have named this PR:

"Backport PR #51731 on branch 2.0.x (API / CoW: Copy NumPy arrays by default in DataFrame constructor)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@phofl
Copy link
Member Author

phofl commented Mar 17, 2023

I'll open a backport pr

@phofl phofl deleted the cow_copy_numpy_array branch March 17, 2023 14:14
phofl added a commit to phofl/pandas that referenced this pull request Mar 17, 2023
…das-dev#51731)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
phofl added a commit that referenced this pull request Mar 17, 2023
…efault in DataFrame constructor) (#52047)

* API / CoW: Copy NumPy arrays by default in DataFrame constructor (#51731)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* Fix test

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Copy / view semantics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants