API / internals: exact semantics of _ndarray_values #23565

Closed
jorisvandenbossche opened this issue Nov 8, 2018 · 4 comments · Fixed by #32768
Labels
API Design Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action

@jorisvandenbossche
Member

We need to better describe the exact semantics of _ndarray_values: what is it expected to return and how it is used.

Currently it is defined on the ExtensionArray, but its docstring notes that it is not part of the "official" interface:

@property
def _ndarray_values(self):
    # type: () -> np.ndarray
    """Internal pandas method for lossy conversion to a NumPy ndarray.
    This method is not part of the pandas interface.
    The expectation is that this is cheap to compute, and is primarily
    used for interacting with our indexers.
    """
    return np.array(self)

On Series/Index, the property will either give you what EA._ndarray_values gives, or the underlying ndarray:

pandas/pandas/core/base.py

Lines 768 to 780 in 712fa94

@property
def _ndarray_values(self):
    # type: () -> np.ndarray
    """The data as an ndarray, possibly losing information.
    The expectation is that this is cheap to compute, and is primarily
    used for interacting with our indexers.
    - categorical -> codes
    """
    if is_extension_array_dtype(self):
        return self.values._ndarray_values
    return self.values
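
To make the dispatch above concrete, here is a minimal pure-Python sketch. It is not pandas code: `ToyCategorical` and `ToyContainer` are hypothetical stand-ins for an extension array and for Series/Index, and a `hasattr` check is used in place of `is_extension_array_dtype` as a simplifying assumption.

```python
class ToyCategorical:
    """Hypothetical stand-in for an extension array (not pandas code)."""

    def __init__(self, values):
        self.categories = sorted(set(values))
        self.codes = [self.categories.index(v) for v in values]

    @property
    def _ndarray_values(self):
        # Lossy form: the integer codes alone do not carry the categories.
        return self.codes


class ToyContainer:
    """Hypothetical stand-in for Series/Index."""

    def __init__(self, values):
        self.values = values

    @property
    def _ndarray_values(self):
        # Same branch shape as the base.py snippet: defer to the array when
        # it defines its own lossy form, otherwise return the stored values.
        # Assumption: hasattr() replaces is_extension_array_dtype() here.
        if hasattr(self.values, "_ndarray_values"):
            return self.values._ndarray_values
        return self.values


plain = ToyContainer([10, 20, 30])
cat = ToyContainer(ToyCategorical(["b", "a", "b"]))
plain_result = plain._ndarray_values
cat_result = cat._ndarray_values
```

In this sketch the plain container hands back its values unchanged, while the categorical-backed one hands back only the codes.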


What it currently is for the EAs:

  • Categorical: integer codes
  • IntegerArray: the integer _data, thus losing any information about missing values
  • PeriodArray: the integer ordinals
  • IntervalIndex: object array of Interval objects
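
As a rough illustration of what "losing information about missing values" means in the list above, the following toy sketch stores data and a mask separately, as a masked integer array might. `ToyMaskedIntegers` and its `to_objects` helper are hypothetical names for illustration only, not the pandas implementation.

```python
class ToyMaskedIntegers:
    """Hypothetical masked integer array: values plus a boolean mask."""

    def __init__(self, data, mask):
        self._data = list(data)
        self._mask = list(mask)  # True marks a missing entry

    @property
    def _ndarray_values(self):
        # Cheap and lossy: return the raw integers and drop the mask.
        return self._data

    def to_objects(self):
        # Lossless view: missing entries become None.
        return [None if m else d for d, m in zip(self._data, self._mask)]


arr = ToyMaskedIntegers([1, 0, 3], [False, True, False])
lossy = arr._ndarray_values
lossless = arr.to_objects()
```

Here the second element of `lossy` is an ordinary integer, so a consumer that only sees `_ndarray_values` cannot tell that the entry was missing.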

What it is currently used for (this needs a closer look; copied from #19954 (comment), quoting Tom here):

  • Index.itemsize (deprecated)
  • Index.strides (deprecated)
  • Index._engine
  • Index set ops
  • Index.insert
  • DatetimeIndex.unique
  • MultiIndex.equals
  • pytables._convert_index (shared across integer and period)

There are a few other uses (mostly datetime / timedelta / period) that could maybe use asi8 instead. I'm not familiar enough with indexing to know whether that can operate on something other than ndarrays. In theory, EAs can implement the buffer protocol, which would get the data to cython. But I don't know what ops would be required when we're down there.

@jorisvandenbossche jorisvandenbossche added API Design Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action labels Nov 8, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Nov 8, 2018
@jorisvandenbossche
Member Author

I think a highly related topic is my exploration of a generic ExtensionIndex in #23223, since we are indeed using _ndarray_values in many places for indexing.

I need to take a more detailed look at what the actual requirements are for indexing values (regarding missing values, round-trip-ability, ...).

Another question is to what extent _ndarray_values is actually different from _values_for_factorize in the ExtensionArray interface.
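
One way to frame that comparison: `_values_for_factorize` is documented as returning a `(values, na_value)` pair, so the sentinel travels with the data, whereas `_ndarray_values` returns the values alone. The sketch below is a toy under that assumption; `ToyArray` and `rebuild` are hypothetical, and the sentinel `-1` is chosen only for illustration.

```python
class ToyArray:
    """Hypothetical array holding integers, with None for missing entries."""

    def __init__(self, items):
        self._items = list(items)

    def _values_for_factorize(self):
        # Returns (values, na_value): the sentinel is reported to the caller.
        na_value = -1  # illustrative sentinel, assumed absent from real data
        values = [na_value if x is None else x for x in self._items]
        return values, na_value

    @property
    def _ndarray_values(self):
        # Same plain values, but the caller is not told which one is missing.
        return [-1 if x is None else x for x in self._items]


def rebuild(values, na_value):
    """Hypothetical round trip: restore None wherever the sentinel appears."""
    return [None if v == na_value else v for v in values]


arr = ToyArray([4, None, 6])
values, na_value = arr._values_for_factorize()
round_tripped = rebuild(values, na_value)
plain_values = arr._ndarray_values
```

With the sentinel in hand the original entries can be recovered; with only `plain_values` that is not possible without extra knowledge.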

@jbrockmendel
Member

For the itemsize/strides/etc. listed above, I think it's worth checking for how many of those we could just use _values, values, or _data.

+1 on looking at round-trip and values_for_factorize

@TomAugspurger
Contributor

Is there anything to do here for 0.24.0?

@TomAugspurger
Contributor

Pushing to 0.25

@TomAugspurger TomAugspurger modified the milestones: 0.24.0, 0.25.0 Dec 13, 2018
@jreback jreback modified the milestones: 0.25.0, 1.0 May 12, 2019
@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.1 Mar 17, 2020
4 participants