
API: Public data for Series and Index: .array and .to_numpy() #23623

Merged · 28 commits · Nov 29, 2018

Conversation

TomAugspurger
Contributor

Closes #19954.

TODO:

  • update references to .values in the docs to use .array or .to_numpy()
  • cross ref between .values and the rest

This adds two new methods for working with EA-backed Series / Index.

- `.array -> Union[ExtensionArray, ndarray]`: the actual backing array
- `.to_numpy() -> ndarray`: A NumPy representation of the data

`.array` is always a reference to the actual data stored in the container.
Updating it in place (not recommended) will be reflected in the Series (or
Index, for that matter, so really not recommended).

`to_numpy()` may (or may not) require data copying / coercion.
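A minimal sketch of the intended usage (the return types and copy behaviour noted in the comments illustrate this proposal, not a final guarantee):

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.Categorical(["a", "b", "a"]))

ser.array      # the actual backing array: here the Categorical itself, no copy
ser.to_numpy() # a NumPy representation: an object ndarray, may require coercion / copying

num = pd.Series([1, 2, 3])
num.array      # for NumPy-backed data, the underlying values
num.to_numpy() # a plain ndarray of the same values
```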

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 11, 2018
@pep8speaks

Hello @TomAugspurger! Thanks for submitting the PR.

@jbrockmendel
Member

Not having read the diff, first thought is that I’d like to wait for DatetimeArray to see what we can simplify vis-a-vis values/_values

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 11, 2018 via email

@jbrockmendel
Member

Not at all a well thought out opinion, just gut reaction.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Nov 11, 2018
Split from pandas-dev#23623, where it was
causing issues with infer_dtype.
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Nov 11, 2018
BUG: Ensure that Index._data is an ndarray

Split from pandas-dev#23623, where it was
causing issues with infer_dtype.
commit e4b21f6
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 16:09:58 2018 -0600

    TST: Change rops tests

commit e903550
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 09:31:38 2018 -0600

    Add note

    [ci skip]

    ***NO CI***

commit fa8934a
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 06:16:53 2018 -0600

    update errors

commit 505970e
Merge: a30bc02 3592a46
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 05:55:31 2018 -0600

    Merge remote-tracking branch 'upstream/master' into index-ndarray-data

commit a30bc02
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Sun Nov 11 15:14:46 2018 -0600

    remove assert

commit 1f23ebc
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Sun Nov 11 15:01:13 2018 -0600

    BUG: Ensure that Index._data is an ndarray

    BUG: Ensure that Index._data is an ndarray

    Split from pandas-dev#23623, where it was
    causing issues with infer_dtype.
@codecov

codecov bot commented Nov 13, 2018

Codecov Report

Merging #23623 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23623      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         161      161              
  Lines       51513    51526      +13     
==========================================
+ Hits        47554    47567      +13     
  Misses       3959     3959
Flag        Coverage Δ
#multiple   90.71% <100%> (ø) ⬆️
#single     42.48% <50%> (+0.05%) ⬆️

Impacted Files                 Coverage Δ
pandas/core/series.py          93.68% <ø> (ø) ⬆️
pandas/core/generic.py         96.84% <ø> (ø) ⬆️
pandas/core/indexes/multi.py   95.51% <100%> (+0.01%) ⬆️
pandas/core/base.py            97.64% <100%> (+0.03%) ⬆️
pandas/core/frame.py           97.03% <100%> (ø) ⬆️
pandas/core/indexes/base.py    96.48% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

doc/source/basics.rst
doc/source/dsintro.rst
doc/source/dsintro.rst
doc/source/whatsnew/v0.24.0.rst
pandas/core/indexes/base.py
@@ -1269,3 +1269,54 @@ def test_ndarray_values(array, expected):
r_values = pd.Index(array)._ndarray_values
tm.assert_numpy_array_equal(l_values, r_values)
tm.assert_numpy_array_equal(l_values, expected)


@pytest.mark.parametrize("array, attr", [
Contributor

maybe put in pandas/tests/arrays/test_arrays.py?

Contributor Author

Doesn't exist yet :), though my pd.array PR is creating it.

This seemed a bit more appropriate since it's next to our tests for ndarray_values.

@shoyer
Member

shoyer commented Nov 27, 2018

Do time-zone naive datetimes use extension arrays?

I'd love to have a clear rule for .array:

  • It's an ndarray for builtin types supported by numpy.
  • It's an extension array for pandas extension arrays.
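A small illustration of what that rule would imply (the commented return types follow the proposed rule, not necessarily the behaviour ultimately shipped):

```python
import pandas as pd

# builtin type supported by NumPy -> plain ndarray (under the proposed rule)
pd.Series([1.0, 2.0]).array

# pandas extension dtype -> the ExtensionArray itself
pd.Series(pd.Categorical(["a", "b"])).array    # Categorical
pd.Series(pd.interval_range(0, 3)).array       # IntervalArray
```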

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 27, 2018

Do time-zone naive datetimes use extension arrays?

That's being worked on right now. It's unclear to me how things will end up, but it's possible there will be an ExtensionArray between Series and the actual ndarray for tz-naive data. This will certainly be the case for DatetimeIndex.

However, I don't think this fact needs to be exposed to the user. It's primarily for code reuse between datetime-tz, datetime, and timedelta. (We haven't really talked about Series[timedelta64[ns]].array, but I presume it should follow the behavior of datetime64[ns]).

I'd love to have a clear rule for .array:

  • It's an ndarray for builtin types supported by numpy.
  • It's an extension array for pandas extension arrays.

I like this rule.

@shoyer
Member

shoyer commented Nov 27, 2018

However, I don't think this fact needs to be exposed to the user. It's primarily for code reuse between datetime-tz, datetime, and timedelta. (We haven't really talked about Series[timedelta64[ns]].array, but I presume it should follow the behavior of datetime64[ns]).

In that case I suppose it mostly comes down to:

  1. which choice is most useful to users -- will they want to manipulate these extension array objects directly or would they rather have base numpy arrays?
  2. which choice is most future proof -- are we going to be happy sticking with this choice for the long term? We definitely don't want upheaval with .array like we've had with .values.

@jorisvandenbossche
Member

which choice is most useful to users -- will they want to manipulate these extension array objects directly or would they rather have base numpy arrays?

"Ideally", if a user wants a numpy array, they use to_numpy or np.array(..), and they should only use .array if they don't really care about the distinction between both, but want to do a certain operation on the values (eg an operation without alignment and then put it back in a Series).
But of course, it is difficult to control what users use it for .. (there will be people starting to use .array to get a numpy array, the only way to avoid this is to raise an error in such a case and only return ExtensionArrays, but that of course defeats the general usecase of the property).
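A short sketch of that usage split, using only the two new methods plus np.asarray:

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.Categorical(["a", "b", "a"]))

np_values = ser.to_numpy()   # explicit request for a NumPy array
np_values = np.asarray(ser)  # equivalent route via the array protocol
backing = ser.array          # when you want the stored values themselves,
                             # e.g. to operate without alignment and rewrap
new = pd.Series(backing, index=ser.index)
```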

which choice is most future proof -- are we going to be happy sticking with this choice for the long term? We definitely don't want upheaval with .array like we've had with .values

I suppose that returning DatetimeArray instead of ndarray[datetime64[ns]] will be more future proof. But that also relates to what I commented above (#23623 (comment)) regarding the back-compat guarantees we make. If we want to keep following the rule in the future, I have the feeling that we will need to change the return value at some point.

I'd love to have a clear rule for .array:

  • It's an ndarray for builtin types supported by numpy.
  • It's an extension array for pandas extension arrays.

Some other possibilities that would go further than the above rule:

  • It's an extension array if the round-trip (eg Series -> array -> Series) through an ndarray would not be completely information-preserving and completely cheap.

    • In practice this might be the same as the above rule at the moment, but eg a possible future StringArray (wrapping an object array of strings but guaranteeing every element is a string) could be converted rather faithfully to an object array, while the round trip through ndarray would lose some information (the fact that every element is a string). For DatetimeArray you could say it loses the freq attribute, but since this is an attribute on the Array, and not metadata of the dtype, I would say this is less of a problem. (A concrete sketch of this criterion follows after this list.)
  • It's an extension array if an ndarray would not support the same operations as a Series holding the data.

    • E.g. for DatetimeArray vs ndarray[datetime64], not all arithmetic and other operations will yield the same results, and one could expect Series[datetime] and Series[datetime].array to behave the same.
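A concrete sketch of the round-trip criterion, using an existing dtype (Categorical) rather than the hypothetical StringArray:

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

arr = np.asarray(ser)         # object ndarray: the dtype and the unused category "c" are gone
roundtrip = pd.Series(arr)    # dtype is now object, not category
roundtrip.dtype == ser.dtype  # False -> the ndarray round trip lost information
```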

The rule might also depend on whether it is implemented under the hood as an ExtensionArray or not. But this is of course not necessarily clear to the user (it might be hidden somewhat), and @shoyer I assume that your proposal for a rule is meant to give the user a clear expectation? (so all the above rules that I mention would be more complicated).

@shoyer
Member

shoyer commented Nov 27, 2018

Would it make sense to consider having .array always return an ExtensionArray object?

I imagine we could pretty quickly whip up an ExtensionArray for each NumPy dtype that simply defers to the underlying NumPy array for every operation. Internally, we could recognize these NumpyExtensionArray objects and use base numpy operations.

I guess we could also just clearly document: .array means you are going into pandas' internals and is not guaranteed to be stable. We may change the return value from .array in the future.
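To make the wrapping idea concrete, here is a very rough conceptual sketch; a real version would subclass pandas.api.extensions.ExtensionArray and implement its full interface, so take the class below as illustration only:

```python
import numpy as np

class NumpyWrappedArray:
    """Conceptual sketch of an extension array that defers to an ndarray."""

    def __init__(self, values):
        self._ndarray = np.asarray(values)

    def __len__(self):
        return len(self._ndarray)

    def __getitem__(self, item):
        result = self._ndarray[item]
        # slices stay wrapped; scalars pass through unchanged
        return type(self)(result) if isinstance(result, np.ndarray) else result

    def to_numpy(self):
        # internals that recognize the wrapper can unwrap it and
        # fall back to plain NumPy operations
        return self._ndarray


arr = NumpyWrappedArray([1, 2, 3])
len(arr), arr[0], arr[1:].to_numpy()
```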

Member

@jorisvandenbossche left a comment

OK, had the time to go through the full diff, and added some comments (and thanks a lot for the PR Tom!)

Additional comments:

  • to what extent do we also want to mention np.(as)array(..) as an alternative to .to_numpy() in the docs?
  • I think we need to keep the explanation about values in some places, since it is not yet going away and users will encounter it in code (or at least mention that it exists for historical reasons and that DataFrame.values is equivalent to to_numpy, but not recommended anymore)
  • I would update the Series.values docstring as well, to add a note about its recommendation status, and to refer to array/to_numpy (similar to what you did for DataFrame/Index.values)

casting every value to a Python object.

For ``df``, our :class:`DataFrame` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.
Member

Reading this, should we have a copy keyword to be able to force a copy? (can be added later)

Contributor Author

This is a good idea. Don't care whether we do it here or later.

I think we'll also want (type-specific?) keywords for controlling how the conversion is done (ndarray of Timestamps vs. datetime64[ns] for example). I'm not sure what the eventual signature should be.

Member

Yeah, if we decide to go for an object array of Timestamps as the default for datetimetz, it would be good to have the option to return datetime64.

Regarding copy, would it actually make sense to have copy=True the default? Then you have at least a consistent default (it is never a view on the data)

Contributor Author

Yes, I think copy=True is a good default since it's the only one that can be ensured for all cases.
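A rough sketch of the signature being discussed here (copy=True default per the comments above; the dtype keyword and the use of the internal ._values attribute are assumptions for illustration only):

```python
import numpy as np

def to_numpy(self, dtype=None, copy=True):
    """Hypothetical Series.to_numpy signature sketched from this discussion."""
    result = np.asarray(self._values, dtype=dtype)
    # naive copy guard, sufficient for the NumPy-backed case sketched here:
    # only copy when asarray handed back the stored array itself
    if copy and result is self._values:
        result = result.copy()
    return result
```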

doc/source/basics.rst
doc/source/basics.rst
doc/source/basics.rst
period (time spans) :class:`PeriodDtype` :class:`Period` :class:`arrays.PeriodArray` :ref:`timeseries.periods`
sparse :class:`SparseDtype` (none) :class:`arrays.SparseArray` :ref:`sparse`
intervals :class:`IntervalDtype` :class:`Interval` :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer :class:`Int64Dtype`, ... (none) :class:`arrays.IntegerArray` :ref:`integer_na`
Member

where does this 'integer_na' point to? (I don't seem to find it in the docs)

Contributor Author

#23617. I'm aiming for eventual consistency on the docs :)

pandas/core/frame.py
pandas/core/generic.py
pandas/core/indexes/base.py
pandas/tests/test_base.py
pytest.skip("No index type for {}".format(array.dtype))

result = thing.to_numpy()
tm.assert_numpy_array_equal(result, expected)
Member

Should we also test for the case where it is not a copy?

Contributor Author

What do you mean here? (in case you missed it, the first case is a regular ndarray, so that won't be a copy. Though perhaps you're saying I should assert this for that case?)

Member

Yes, that's what I meant. If we return a view, and people can rely on it, we should test it.
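A sketch of the kind of assertion being discussed, assuming the plain-ndarray case does return the data without copying (np.shares_memory is one way to pin that down; the test name is illustrative):

```python
import numpy as np
import pandas as pd


def test_to_numpy_no_copy_for_plain_ndarray():
    ser = pd.Series(np.arange(3))
    result = ser.to_numpy()
    # if users may rely on getting a view, assert that the memory is shared
    assert np.shares_memory(result, ser.values)
```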

@jorisvandenbossche
Member

@shoyer I think the main problem is that this would basically already get rid of the block-based internals? And I think that was more a change we were contemplating for 2.0 instead of 1.0 (apart from the additional work it would mean in the short term).
(unless we use such a NumpyExtensionArray only for wrapping the array when returning it in .array)

@shoyer
Member

shoyer commented Nov 27, 2018

(unless we use such a NumpyExtensionArray only for wrapping the array when returning it in .array)

This is all I was thinking of.

@TomAugspurger
Contributor Author

Hmm, having .array always be an ExtensionArray is an interesting proposal... It kind of makes ".array is the actual array stored in the Series" a lie, but maybe users don't care about that? I assume they care more about things like zero-copy and inplace modification, than they do about how pandas chooses to handle a particular dtype.

@TomAugspurger
Contributor Author

What do people think about doing the remaining items as followup?

  1. Determine the signature for .to_numpy() (copy=True is uncontroversial; the rest I'm not sure about, but we could figure it out and do it here.)
  2. Finalize DatetimeArray vs. ndarray for tz aware and naive
  3. Explore .array always being an Array.

I think 2 and 3 will be easier to think about once we have a DatetimeArray.

@jreback
Contributor

jreback commented Nov 29, 2018

#23623 (comment)

Certainly fine as a followup. I am not sure 3 is actually a blocker for 0.24.0 (though 2 is), and 1 as-is is fine for now.

@jreback
Contributor

jreback commented Nov 29, 2018

ping on green.

@TomAugspurger
Contributor Author

All green.

@jreback jreback merged commit 0a4f40c into pandas-dev:master Nov 29, 2018
@jreback
Contributor

jreback commented Nov 29, 2018

thanks @TomAugspurger very nice!
