
API: Public data for Series and Index: .array and .to_numpy() #23623

Merged · 28 commits · Nov 29, 2018

Conversation

TomAugspurger
Contributor

Closes #19954.

TODO:

  • update references to .values in the docs to use .array or .to_numpy()
  • cross ref between .values and the rest

This adds two new methods for working with EA-backed Series / Index.

- `.array -> Union[ExtensionArray, ndarray]`: the actual backing array
- `.to_numpy() -> ndarray`: A NumPy representation of the data

`.array` is always a reference to the actual data stored in the container.
Updating it in place (not recommended) will be reflected in the Series (or
Index, for that matter, so really not recommended).

`to_numpy()` may (or may not) require data copying / coercion.
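A minimal sketch of the intended usage (the return types and copy behaviour noted in the comments illustrate this proposal, not a final guarantee):

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.Categorical(["a", "b", "a"]))

ser.array      # the actual backing array: here the Categorical itself, no copy
ser.to_numpy() # a NumPy representation: an object ndarray, may require coercion / copying

num = pd.Series([1, 2, 3])
num.array      # for NumPy-backed data, the underlying values
num.to_numpy() # a plain ndarray of the same values
```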

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 11, 2018
@pep8speaks

Hello @TomAugspurger! Thanks for submitting the PR.

@jbrockmendel
Member

Not having read the diff, first thought is that I’d like to wait for DatetimeArray to see what we can simplify vis-a-vis values/_values

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 11, 2018 via email

@jbrockmendel
Member

Not at all a well thought out opinion, just gut reaction.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Nov 11, 2018
Split from pandas-dev#23623, where it was
causing issues with infer_dtype.
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Nov 11, 2018
BUG: Ensure that Index._data is an ndarray

Split from pandas-dev#23623, where it was
causing issues with infer_dtype.
commit e4b21f6
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 16:09:58 2018 -0600

    TST: Change rops tests

commit e903550
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 09:31:38 2018 -0600

    Add note

    [ci skip]

    ***NO CI***

commit fa8934a
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 06:16:53 2018 -0600

    update errors

commit 505970e
Merge: a30bc02 3592a46
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 05:55:31 2018 -0600

    Merge remote-tracking branch 'upstream/master' into index-ndarray-data

commit a30bc02
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Sun Nov 11 15:14:46 2018 -0600

    remove assert

commit 1f23ebc
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Sun Nov 11 15:01:13 2018 -0600

    BUG: Ensure that Index._data is an ndarray

    BUG: Ensure that Index._data is an ndarray

    Split from pandas-dev#23623, where it was
    causing issues with infer_dtype.
@codecov

codecov bot commented Nov 13, 2018

Codecov Report

Merging #23623 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23623      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         161      161              
  Lines       51513    51526      +13     
==========================================
+ Hits        47554    47567      +13     
  Misses       3959     3959
Flag        Coverage Δ
#multiple   90.71% <100%> (ø) ⬆️
#single     42.48% <50%> (+0.05%) ⬆️

Impacted Files                 Coverage Δ
pandas/core/series.py          93.68% <ø> (ø) ⬆️
pandas/core/generic.py         96.84% <ø> (ø) ⬆️
pandas/core/indexes/multi.py   95.51% <100%> (+0.01%) ⬆️
pandas/core/base.py            97.64% <100%> (+0.03%) ⬆️
pandas/core/frame.py           97.03% <100%> (ø) ⬆️
pandas/core/indexes/base.py    96.48% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

doc/source/basics.rst
doc/source/dsintro.rst
doc/source/dsintro.rst
doc/source/whatsnew/v0.24.0.rst
pandas/core/indexes/base.py
@@ -1269,3 +1269,54 @@ def test_ndarray_values(array, expected):
r_values = pd.Index(array)._ndarray_values
tm.assert_numpy_array_equal(l_values, r_values)
tm.assert_numpy_array_equal(l_values, expected)


@pytest.mark.parametrize("array, attr", [
Contributor

maybe put in pandas/tests/arrays/test_arrays.py?

Contributor Author

Doesn't exist yet :), though my pd.array PR is creating it.

This seemed a bit more appropriate since it's next to our tests for ndarray_values.

@shoyer
Member

shoyer commented Nov 27, 2018

Do time-zone naive datetimes use extension arrays?

I'd love to have a clear rule for .array:

  • It's an ndarray for builtin types supported by numpy.
  • It's an extension array for pandas extension arrays.
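A small illustration of what that rule would imply (the commented return types follow the proposed rule, not necessarily the behaviour ultimately shipped):

```python
import pandas as pd

# builtin type supported by NumPy -> plain ndarray (under the proposed rule)
pd.Series([1.0, 2.0]).array

# pandas extension dtype -> the ExtensionArray itself
pd.Series(pd.Categorical(["a", "b"])).array    # Categorical
pd.Series(pd.interval_range(0, 3)).array       # IntervalArray
```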

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 27, 2018

Do time-zone naive datetimes use extension arrays?

That's being worked on right now. It's unclear to me how things will end up, but it's possible there will be an ExtensionArray between Series and the actual ndarray for tz-naive data. This will certainly be the case for DatetimeIndex.

However, I don't think this fact needs to be exposed to the user. It's primarily for code reuse between datetime-tz, datetime, and timedelta. (We haven't really talked about Series[timedelta64[ns]].array, but I presume it should follow the behavior of datetime64[ns]).

I'd love to have a clear rule for .array:

  • It's an ndarray for builtin types supported by numpy.
  • It's an extension array for pandas extension arrays.

I like this rule.

@shoyer
Member

shoyer commented Nov 27, 2018

However, I don't think this fact needs to be exposed to the user. It's primarily for code reuse between datetime-tz, datetime, and timedelta. (We haven't really talked about Series[timedelta64[ns]].array, but I presume it should follow the behavior of datetime64[ns]).

In that case I suppose it mostly comes down to:

  1. which choice is most useful to users -- will they want to manipulate these extension array objects directly or would they rather have base numpy arrays?
  2. which choice is most future proof -- are we going to be happy sticking with this choice for the long term? We definitely don't want upheaval with .array like we've had with .values.

@jorisvandenbossche
Member

which choice is most useful to users -- will they want to manipulate these extension array objects directly or would they rather have base numpy arrays?

"Ideally", if a user wants a numpy array, they use to_numpy or np.array(..), and they should only use .array if they don't really care about the distinction between both, but want to do a certain operation on the values (eg an operation without alignment and then put it back in a Series).
But of course, it is difficult to control what users use it for .. (there will be people starting to use .array to get a numpy array, the only way to avoid this is to raise an error in such a case and only return ExtensionArrays, but that of course defeats the general usecase of the property).
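A short sketch of that usage split, using only the two new methods plus np.asarray:

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.Categorical(["a", "b", "a"]))

np_values = ser.to_numpy()   # explicit request for a NumPy array
np_values = np.asarray(ser)  # equivalent route via the array protocol
backing = ser.array          # when you want the stored values themselves,
                             # e.g. to operate without alignment and rewrap
new = pd.Series(backing, index=ser.index)
```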

which choice is most future proof -- are we going to be happy sticking with this choice for the long term? We definitely don't want upheaval with .array like we've had with .values

I suppose that returning DatetimeArray instead of ndarray[datetime64[ns]] will be more future proof. But that also relates to what I commented above (#23623 (comment)) regarding the back-compat guarantees we make. If we want to keep following the rule in the future, I have the feeling that we will need to change the return value at some point.

I'd love to have a clear rule for .array:

  • It's an ndarray for builtin types supported by numpy.
  • It's an extension array for pandas extension arrays.

Some other possibilities that would go further than the above rule:

  • It's an extension array if the round-trip (eg Series -> array -> Series) through an ndarray would not be completely information-preserving and completely cheap.

    • In practice this might be the same as the above rule at the moment, but eg a possible future StringArray (wrapping an object array of strings but guaranteeing every element is a string) could be converted rather faithfully to an object array, while the round trip through ndarray would lose some information (the fact that every element is a string). For DatetimeArray you could say it loses the freq attribute, but since this is an attribute on the Array, and not metadata of the dtype, I would say this is less of a problem. (A concrete sketch of this criterion follows after this list.)
  • It's an extension array if an ndarray would not support the same operations as a Series holding the data.

    • E.g. for DatetimeArray vs ndarray[datetime64], not all arithmetic and other operations will yield the same results, and one could expect Series[datetime] and Series[datetime].array to behave the same.
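A concrete sketch of the round-trip criterion, using an existing dtype (Categorical) rather than the hypothetical StringArray:

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

arr = np.asarray(ser)         # object ndarray: the dtype and the unused category "c" are gone
roundtrip = pd.Series(arr)    # dtype is now object, not category
roundtrip.dtype == ser.dtype  # False -> the ndarray round trip lost information
```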

The rule might also depend on whether it is implemented under the hood as an ExtensionArray or not. But this is of course not necessarily clear to the user (it might be hidden somewhat), and @shoyer I assume that your proposal for a rule is meant to give the user a clear expectation? (so all the above rules that I mention would be more complicated).

@shoyer
Member

shoyer commented Nov 27, 2018

Would it make sense to consider having .array always return an ExtensionArray object?

I imagine we could pretty quickly whip up an ExtensionArray for each NumPy dtype that simply defers to the underlying NumPy array for every operation. Internally, we could recognize these NumpyExtensionArray objects and use base numpy operations.

I guess we could also just clearly document: .array means you are going into pandas' internals and is not guaranteed to be stable. We may change the return value from .array in the future.
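To make the wrapping idea concrete, here is a very rough conceptual sketch; a real version would subclass pandas.api.extensions.ExtensionArray and implement its full interface, so take the class below as illustration only:

```python
import numpy as np

class NumpyWrappedArray:
    """Conceptual sketch of an extension array that defers to an ndarray."""

    def __init__(self, values):
        self._ndarray = np.asarray(values)

    def __len__(self):
        return len(self._ndarray)

    def __getitem__(self, item):
        result = self._ndarray[item]
        # slices stay wrapped; scalars pass through unchanged
        return type(self)(result) if isinstance(result, np.ndarray) else result

    def to_numpy(self):
        # internals that recognize the wrapper can unwrap it and
        # fall back to plain NumPy operations
        return self._ndarray


arr = NumpyWrappedArray([1, 2, 3])
len(arr), arr[0], arr[1:].to_numpy()
```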

Member

@jorisvandenbossche left a comment

OK, had the time to go through the full diff, and added some comments (and thanks a lot for the PR Tom!)

Additional comments:

  • to what extent do we also want to mention np.(as)array(..) as an alternative to .to_numpy() in the docs?
  • I think we need to keep the explanation about values in some places, since it is not yet going away and users will encounter it in code (or at least mention that it exists for historical reasons and that DataFrame.values is equivalent to to_numpy, but not recommended anymore)
  • I would update the Series.values docstring as well, to add a note about its recommendation status, and to refer to array/to_numpy (similar to what you did for DataFrame/Index.values)

casting every value to a Python object.

For ``df``, our :class:`DataFrame` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.
Member

Reading this, should we have a copy keyword to be able to force a copy? (can be added later)

Contributor Author

This is a good idea. Don't care whether we do it here or later.

I think we'll also want (type-specific?) keywords for controlling how the conversion is done (ndarray of Timestamps vs. datetime64[ns] for example). I'm not sure what the eventual signature should be.

Member

Yeah, if we decide to go for an object array of Timestamps as the default for datetimetz, it would be good to have the option to return datetime64.

Regarding copy, would it actually make sense to have copy=True the default? Then you have at least a consistent default (it is never a view on the data)

Contributor Author

Yes, I think copy=True is a good default since it's the only one that can be ensured for all cases.
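A rough sketch of the signature being discussed here (copy=True default per the comments above; the dtype keyword and the use of the internal ._values attribute are assumptions for illustration only):

```python
import numpy as np

def to_numpy(self, dtype=None, copy=True):
    """Hypothetical Series.to_numpy signature sketched from this discussion."""
    result = np.asarray(self._values, dtype=dtype)
    # naive copy guard, sufficient for the NumPy-backed case sketched here:
    # only copy when asarray handed back the stored array itself
    if copy and result is self._values:
        result = result.copy()
    return result
```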

doc/source/basics.rst
doc/source/basics.rst
doc/source/basics.rst
period (time spans) :class:`PeriodDtype` :class:`Period` :class:`arrays.PeriodArray` :ref:`timeseries.periods`
sparse :class:`SparseDtype` (none) :class:`arrays.SparseArray` :ref:`sparse`
intervals :class:`IntervalDtype` :class:`Interval` :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer :class:`Int64Dtype`, ... (none) :class:`arrays.IntegerArray` :ref:`integer_na`
Member

where does this 'integer_na' point to? (I don't seem to find it in the docs)

Contributor Author

#23617. I'm aiming for eventual consistency on the docs :)

pandas/core/frame.py
pandas/core/generic.py
pandas/core/indexes/base.py
pandas/tests/test_base.py
pytest.skip("No index type for {}".format(array.dtype))

result = thing.to_numpy()
tm.assert_numpy_array_equal(result, expected)
Member

Should we also test for the case where it is not a copy?

Contributor Author

What do you mean here? (in case you missed it, the first case is a regular ndarray, so that won't be a copy. Though perhaps you're saying I should assert this for that case?)

Member

Yes, that's what I meant. If we return a view, and people can rely on it, we should test it.
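A sketch of the kind of assertion being discussed, assuming the plain-ndarray case does return the data without copying (np.shares_memory is one way to pin that down; the test name is illustrative):

```python
import numpy as np
import pandas as pd


def test_to_numpy_no_copy_for_plain_ndarray():
    ser = pd.Series(np.arange(3))
    result = ser.to_numpy()
    # if users may rely on getting a view, assert that the memory is shared
    assert np.shares_memory(result, ser.values)
```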

@jorisvandenbossche
Member

@shoyer I think the main problem is that this would basically already get rid of the block-based internals? And I think that was more a change we were contemplating for 2.0 instead of 1.0 (apart from the additional work it would mean in the short term).
(unless we use such a NumpyExtensionArray only for wrapping the array when returning it in .array)

@shoyer
Member

shoyer commented Nov 27, 2018

(unless we use such a NumpyExtensionArray only for wrapping the array when returning it in .array)

This is all I was thinking of.

@TomAugspurger
Contributor Author

Hmm, having .array always be an ExtensionArray is an interesting proposal... It kind of makes ".array is the actual array stored in the Series" a lie, but maybe users don't care about that? I assume they care more about things like zero-copy and inplace modification, than they do about how pandas chooses to handle a particular dtype.

@TomAugspurger
Contributor Author

What do people think about doing the remaining items as followup?

  1. Determine the signature for .to_numpy() (copy=True is uncontroversial; the rest I'm not sure about, but we could figure it out and do it here.)
  2. Finalize DatetimeArray vs. ndarray for tz aware and naive
  3. Explore .array always being an Array.

I think 2 and 3 will be easier to think about once we have a DatetimeArray.

@jreback
Contributor

jreback commented Nov 29, 2018

#23623 (comment)

Certainly fine as a followup. I am not sure 3 is actually a blocker for 0.24.0 (though 2 is), and 1 as-is is fine for now.

@jreback
Contributor

jreback commented Nov 29, 2018

ping on green.

@TomAugspurger
Contributor Author

All green.

@jreback jreback merged commit 0a4f40c into pandas-dev:master Nov 29, 2018
@jreback
Contributor

jreback commented Nov 29, 2018

thanks @TomAugspurger very nice!
