Array Interface and Categorical internals Refactor #19268

TomAugspurger · 2018-01-16T15:25:09Z

(edit post categorical-move)

Rebased on master. Summary of the changes from master:

1. Added the ExtensionArray class
2. Categorical subclasses ExtensionArray. It implements the new methods for
   the interface (all private; no public API changes).
3. Adapted the ExtensionDtype class to be the public ABC
   a. Subclassed that with PandasExtensionDtype, which does non-interface
      things like reprs, caching, etc.
   b. All our custom dtypes inherit from PandasExtensionDtype, so they
      implement the interface.
4. Internals changes:
   a. Added an ExtensionBlock. This will be a parent for our current custom
      blocks, and the block type for all 3rd-party extension arrays.
5. Added a new is_extension_array_dtype method. I think this is necessary
   for now, until we've handled DatetimeTZ.

This isn't really a test of whether extension arrays work yet, since we're
still using Categorical for everything. I have a followup PR that implements
an IntervalArray that requires additional changes to, e.g., the constructors
so that things work. But all the changes from core/internals.py required to
make that work are present here.

New class hierarchy in internals

Old:

class CategoricalBlock(NonConsolidatableMixIn, ObjectBlock):
    pass

New:

class ExtensionBlock(NonConsolidatableMixIn, Block):
    pass

class CategoricalBlock(ExtensionBlock):
    pass

Figuring out which methods of ObjectBlock were required on CategoricalBlock
wasn't trivial for me. I probably messed some up.

I think that eventually we can remove NonConsolidatableMixIn, with the idea
that all non-consolidatable blocks are blocks for extension dtypes? That's true
today anyway.

Followup PRs:

Making core/arrays/period.py and refactoring PeriodIndex
Making core/arrays/interval.py and refactoring IntervalIndex
Adding docs and generic tests like https://github.com/pandas-dev/pandas/pull/19174/files#diff-e448fe09dbe8aed468d89a4c90e65cff for our interface (once it's stabilized a bit).

pep8speaks · 2018-01-16T15:25:27Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 01, 2018 at 20:55 Hours UTC

TomAugspurger · 2018-01-16T15:31:39Z

pandas/core/arrays/base.py

+            Should return an ExtensionArray, even if ``self[slicer]``
+            would return a scalar.
+        """
+        # XXX: We could get rid of this *if* we require that


We can almost get rid of _slice with a default implementation noted in the comments, but I think dimensionality reduction from array -> scalar could break things.

Assuming slicer is a slice, in what cases can it give a scalar?
In principle this could just be handled in __getitem__?

slice(0) or slice(0, 1) hit this.

Ahh, but I didn't realize that NumPy always returned arrays from slices:

In [5]: np.array([0])[slice(0, 1)]
Out[5]: array([0])

So if we assume that as well, then we could get rid of _slice.

Yes, I think we can assume that if __getitem__ gets a slice object, it should return an instance of itself.
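
For illustration only, here is a minimal sketch of the contract being proposed: a slice key always yields an instance of the container, while an integer key yields a scalar. `MiniArray` is a hypothetical stand-in, not the ExtensionArray from this PR, and a plain list stands in for the real storage.

```python
class MiniArray:
    """Hypothetical 1-D container; a plain list stands in for real storage."""

    def __init__(self, values):
        self._data = list(values)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slices always return the container type, even for length 0 or 1.
            return type(self)(self._data[key])
        # Integer keys return a scalar element.
        return self._data[key]


arr = MiniArray([10, 20, 30])
one = arr[slice(0, 1)]   # length-1 MiniArray, not a scalar
empty = arr[slice(0)]    # length-0 MiniArray
scalar = arr[0]          # plain element
```

With that guarantee in place, a separate `_slice` hook adds nothing over `__getitem__`.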

TomAugspurger · 2018-01-16T15:32:35Z

pandas/core/dtypes/base.py

+    @property
+    def type(self):
+        """Typically a metaclass inheriting from 'type' with no methods."""
+        return type(self.name, (), {})


I'm not really sure what array.dtype.type is used for. This passes the test suite, but may break things like

array1.dtype.type is array2.dtype.type

since the object IDs will be different (I think).

In NumPy, dtype.type is the corresponding scalar type, e.g.,

>>> np.dtype(np.float64).type
numpy.float64

I don't know where "Typically a metaclass inheriting from 'type' with no methods." comes from.

Thanks, that makes sense. Would you say that object is a good default here? I'll work through a test array that uses something meaningful like numbers.Real or int. We do use values.dtype.type in a few places, like figuring out which block type to use.

I don't think there's a good default value here. Generally the right choice is to return the corresponding scalar type, e.g., Interval for IntervalDtype.
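
A small sketch of the identity concern raised above, using hypothetical dtype classes rather than the ones in this PR. Building a class with `type(name, (), {})` on every access gives a different object each time, whereas pointing `type` at a fixed scalar class keeps identity checks stable.

```python
class FreshTypeDtype:
    """Hypothetical dtype whose ``type`` builds a new class on every access."""
    name = "example"

    @property
    def type(self):
        return type(self.name, (), {})


class Interval:
    """Hypothetical scalar type, standing in for a real scalar class."""


class ScalarTypeDtype:
    """Hypothetical dtype whose ``type`` is the corresponding scalar class."""
    name = "interval"
    type = Interval


fresh_a, fresh_b = FreshTypeDtype(), FreshTypeDtype()
fresh_is_stable = fresh_a.type is fresh_b.type  # a new class each time
fixed_is_stable = ScalarTypeDtype().type is ScalarTypeDtype().type
```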

TomAugspurger · 2018-01-16T15:35:02Z

pandas/core/internals.py

@@ -108,14 +109,15 @@ class Block(PandasObject):
    def __init__(self, values, placement, ndim=None, fastpath=False):
        if ndim is None:
            ndim = values.ndim
-        elif values.ndim != ndim:
+        elif self._validate_ndim and values.ndim != ndim:


@jreback would be curious to hear your thoughts here. Needed this so that ExtensionBlock could inherit from Block and call super().__init__. We could avoid this by setting .values, .mgr_locs, .ndim directly in ExtensionBlock.__init__, but I think it's best practice to always call your parent's init.

yes, see my above comment
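
As a rough sketch of the pattern under discussion (the classes below are simplified, hypothetical stand-ins and not the pandas implementations), a class-level flag lets a subclass skip the dimensionality check while still calling the parent's `__init__`:

```python
class Block:
    """Hypothetical stand-in for a block base class (not the pandas class)."""
    _validate_ndim = True  # subclasses may switch the check off

    def __init__(self, values, ndim=None):
        values_ndim = getattr(values, "ndim", 1)
        if ndim is None:
            ndim = values_ndim
        elif self._validate_ndim and values_ndim != ndim:
            raise ValueError("Wrong number of dimensions")
        self.values = values
        self.ndim = ndim


class ExtensionBlock(Block):
    """1-D values stored in a 2-D container, so the check is skipped."""
    _validate_ndim = False

    def __init__(self, values, ndim=None):
        super().__init__(values, ndim=ndim)  # still goes through the parent


ext = ExtensionBlock([1, 2, 3], ndim=2)

try:
    Block([1, 2, 3], ndim=2)
    base_raised = False
except ValueError:
    base_raised = True
```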

TomAugspurger · 2018-01-16T15:36:11Z

pandas/core/internals.py

+        # Placement must be converted to BlockPlacement so that we can check
+        # its length
+        if not isinstance(placement, BlockPlacement):
+            placement = BlockPlacement(placement)


I found it much clearer to just copy these two lines rather than calling mgr_locs.setter, but can revert to that if desired.

TomAugspurger · 2018-01-16T15:36:44Z

pandas/core/internals.py

@@ -2360,23 +2443,13 @@ def is_view(self):
    def to_dense(self):
        return self.values.to_dense().view()

-    def convert(self, copy=True, **kwargs):


In the ExtensionBlock now.

TomAugspurger · 2018-01-16T15:36:51Z

pandas/core/internals.py

    @property
    def array_dtype(self):
        """ the dtype to return if I want to construct this block as an
        array
        """
        return np.object_

-    def _slice(self, slicer):


In ExtensionBlock now.

TomAugspurger · 2018-01-16T15:37:29Z

pandas/core/internals.py

@@ -2468,7 +2541,8 @@ class DatetimeBlock(DatetimeLikeBlockMixin, Block):
    _can_hold_na = True

    def __init__(self, values, placement, fastpath=False, **kwargs):
-        if values.dtype != _NS_DTYPE:
+        if values.dtype != _NS_DTYPE and values.dtype.base != _NS_DTYPE:
+            # not datetime64 or datetime64tz


This was so NonConsolidatableMixin could call __init__. I think it's harmless.

TomAugspurger · 2018-01-16T15:38:36Z

pandas/core/internals.py

+
+    # Methods we can (probably) ignore and just use Block's:
+
+    # * replace / replace_single


Any thoughts on this, @jreback? CategoricalBlock.replace used to be ObjectBlock.replace, but now it's just Block.replace, which is much simpler.

TomAugspurger · 2018-01-16T15:41:10Z

pandas/core/internals.py

+            values = values.reshape((1,) + values.shape)
+        return values
+
+    def _can_hold_element(self, element):


And @jreback I'll defer to you on this one, since you know more about when this is called. I don't think we can just do return isinstance(element, self._holder), since element may need to be coerced. For an IP address, we want

ser = pd.Series(ip.IPAddress(['192.168.1.1']))
ser[0] = '192.168.1.0'

to work, and (I think) element would just be the str there.
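
One possible shape for such a check, sketched with the standard-library `ipaddress` module rather than the `ip.IPAddress` array mentioned above: accept anything that is already the scalar type or that can be coerced to it. The function name `can_hold_element` is hypothetical and this is not the method added in this PR.

```python
import ipaddress


def can_hold_element(element):
    """Return True if ``element`` is, or can be coerced to, an IPv4 address."""
    if isinstance(element, ipaddress.IPv4Address):
        return True
    try:
        ipaddress.IPv4Address(element)
    except (ipaddress.AddressValueError, TypeError, ValueError):
        return False
    return True


accepts_str = can_hold_element("192.168.1.0")
accepts_scalar = can_hold_element(ipaddress.IPv4Address("10.0.0.1"))
rejects_garbage = not can_hold_element("not an address")
```

A plain `isinstance` check would reject the string case, which is the concern described above.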

jorisvandenbossche · 2018-01-16T15:45:14Z

@TomAugspurger for reviewing, did you need to change a lot in the Categorical implementation itself? (Because you moved it, that's hard to see.)

jreback · 2018-01-16T15:46:50Z

i would much prefer that moves occur separately from changes

TomAugspurger · 2018-01-16T15:49:12Z

I did keep the move in 4b06ae4

All the non-move changes to categorical are in a9e0972#diff-f3b2ea15ba728b55cab4a1acd97d996d

TomAugspurger · 2018-01-16T15:50:16Z

But I can split the move into its own PR as well.

One thing: do we have code for pickle compat on module changes? The old pickle files expect a pandas.core.categorical.Categorical. I threw a shim in there, but it would be nice to remove it.
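
For context, one generic way to keep old pickles loadable after a class moves is to remap the module path at load time. The sketch below uses only the standard library; the module names, the `Thing` class, and `RenamingUnpickler` are all hypothetical, and it is not a description of what pandas actually does.

```python
import io
import pickle
import sys
import types

OLD_MODULE = "demo_old_location"  # hypothetical pre-move module path
NEW_MODULE = "demo_new_location"  # hypothetical post-move module path


class Thing:
    """Hypothetical class that is moved between modules."""

    def __init__(self, value):
        self.value = value


def _install(module_name):
    """Expose ``Thing`` under ``module_name`` as if it were defined there."""
    module = types.ModuleType(module_name)
    module.Thing = Thing
    Thing.__module__ = module_name
    sys.modules[module_name] = module


class RenamingUnpickler(pickle.Unpickler):
    """Redirect lookups of moved classes to their new module."""
    _moved = {(OLD_MODULE, "Thing"): (NEW_MODULE, "Thing")}

    def find_class(self, module, name):
        module, name = self._moved.get((module, name), (module, name))
        return super().find_class(module, name)


# Pickle while the class still lives at the old location.
_install(OLD_MODULE)
payload = pickle.dumps(Thing(42))

# Simulate the move: the old module disappears, the new one appears.
del sys.modules[OLD_MODULE]
_install(NEW_MODULE)

restored = RenamingUnpickler(io.BytesIO(payload)).load()
```

A lookup table like `_moved` keeps the compatibility logic in one place instead of leaving a shim module at the old path.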

TomAugspurger · 2018-01-16T16:01:32Z

Splitting these PRs now (I'll make a new PR for the move and then rebase this one)

TomAugspurger · 2018-01-16T16:08:31Z

#19269 for the move.

codecov · 2018-01-18T18:24:15Z

Codecov Report

Merging #19268 into master will increase coverage by <.01%.
The diff coverage is 76%.

@@            Coverage Diff             @@
##           master   #19268      +/-   ##
==========================================
+ Coverage   91.62%   91.62%   +<.01%     
==========================================
  Files         150      150              
  Lines       48726    48672      -54     
==========================================
- Hits        44643    44598      -45     
+ Misses       4083     4074       -9

Flag	Coverage Δ
#multiple	`89.99% <76%> (ø)`	⬆️
#single	`41.74% <62.85%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/common.py	`93.41% <ø> (+0.49%)`	⬆️
pandas/core/dtypes/dtypes.py	`96.08% <100%> (-0.04%)`	⬇️
pandas/core/arrays/__init__.py	`100% <100%> (ø)`	⬆️
pandas/core/dtypes/common.py	`95.37% <100%> (+0.05%)`	⬆️
pandas/core/dtypes/base.py	`47.61% <47.61%> (ø)`
pandas/core/arrays/base.py	`56.66% <56.66%> (ø)`
pandas/core/arrays/categorical.py	`94.74% <66.66%> (-1.04%)`	⬇️
pandas/core/internals.py	`95.05% <85.39%> (-0.42%)`	⬇️
pandas/errors/__init__.py	`92.3% <85.71%> (-7.7%)`	⬇️
... and 19 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb3b237...34134f2. Read the comment docs.

TomAugspurger · 2018-01-18T18:25:28Z

Rebased on master. Summary of the changes from master:

1. Added the ExtensionArray class
2. Categorical subclasses ExtensionArray. It implements the new methods for
   the interface (all private; no public API changes).
3. Adapted the ExtensionDtype class to be the public ABC
   a. Subclassed that with PandasExtensionDtype, which does non-interface
      things like reprs, caching, etc.
   b. All our custom dtypes inherit from PandasExtensionDtype, so they
      implement the interface.
4. Internals changes:
   a. Added an ExtensionBlock. This will be a parent for our current custom
      blocks, and the block type for all 3rd-party extension arrays.
5. Added a new is_extension_array_dtype method. I think this is necessary
   for now, until we've handled DatetimeTZ.

This isn't really a test of whether extension arrays work yet, since we're
still using CategoricalBlock for everything. I have a followup PR that implements
an IntervalArray that requires additional changes to, e.g., the constructors
so that things work. But all the changes from core/internals.py required to
make that work are present here.

TomAugspurger · 2018-01-18T18:31:40Z

@shoyer, @chris-b1, @wesm if you're interested in the interface, this is the PR it's being defined in, though we may have to make refinements later.

I have followup PRs ready for IntervalArray and PeriodArray that build on this. Docs and such will come in the IntervalArray PR, once we have publicly visible changes.

shoyer · 2018-01-18T18:44:22Z

pandas/core/arrays/base.py

+    # ------------------------------------------------------------------------
+    @property
+    def base(self):
+        """The base array I am a view of. None by default."""


Can you give an example here?

Perhaps it would also help to explain how is this used by pandas?

It's used in Block.is_view, which AFAICT is only used for chained assignment?

If that's correct, then I think we're OK with saying this is purely for compatibility with NumPy arrays, and has no effect. I've currently defined ExtensionArray.is_view to always be False, so I don't even make use of it in the changes so far.

If that is the case, I would remove this for now (we can always later extend the interface if it turns out to be needed for something).

However, was just wondering: your ExtensionArray could be a view on another ExtensionArray (eg by slicing). Is this something we need to consider?

I would also remove this. NumPy doesn't always maintain this properly, so it can't actually be essential.

shoyer · 2018-01-18T18:45:27Z

pandas/core/arrays/base.py

+    @abc.abstractmethod
+    def take(self, indexer, allow_fill=True, fill_value=None):
+        # type: (Sequence, bool, Optional[Any]) -> ExtensionArray
+        """For slicing"""


This should clarify what valid values of indexer are. Does -1 indicate a fill value?

shoyer · 2018-01-18T18:45:33Z

pandas/core/arrays/base.py

+
+    def take_nd(self, indexer, allow_fill=True, fill_value=None):
+        """For slicing"""
+        # TODO: this isn't really nescessary for 1-D


I would remove this

shoyer · 2018-01-18T18:46:21Z

pandas/core/arrays/base.py

+
+    @property
+    def is_sparse(self):
+        """Whether your array is sparse. True by default."""


Correction: False by default :)

This should clarify what it means to be a sparse array. How does pandas treat sparse arrays differently?

I would consider dropping this if it isn't strictly necessary.

I think it's unnecessary.

shoyer · 2018-01-18T18:48:33Z

pandas/core/arrays/base.py

+            Should return an ExtensionArray, even if ``self[slicer]``
+            would return a scalar.
+        """
+        return type(self)(self[slicer])


This default implementation is likely to fail for some obvious implementations. Perhaps we can have a constructor method _from_scalar() instead that converts a scalar into a length 1 array?

Let me see if I can verify that this is always called with a slice object. In that case, __getitem__ will return an ExtensionArray, and we don't have to worry about the scalar case. Unless I'm missing something.

Yes, I would try to get rid of this if possible, and just ask that __getitem__ can deal with this (of course, alternative is to add separate methods for different __getitem__ functionalities like _slice, but then also _mask, but I don't really see the advantage of this).

shoyer · 2018-01-18T18:50:20Z

pandas/core/arrays/base.py

+
+
+@add_metaclass(abc.ABCMeta)
+class ExtensionArray(object):


Are there any expected requirements for the constructor __init__?

Yeah, we should figure out what those are and document them. At the very least, we expect ExtensionArray(extension_array) to work correctly. I'll look for other assumptions we make. Or that could be pushed to another classmethod.

We also expect that ExtensionArray(), with no arguments, works so that subclasses don't have to implement construct_from_string.

Rather than imposing that on subclasses, we could require some kind of .empty alternative constructor.

shoyer · 2018-01-18T18:51:28Z

pandas/core/arrays/base.py

+    def dtype(self):
+        """An instance of 'ExtensionDtype'."""
+        # type: () -> ExtensionDtype
+        pass


Please drop pass from all these methods. It's not needed (docstrings alone suffice).

shoyer · 2018-01-18T18:52:01Z

pandas/core/arrays/base.py

+    @abc.abstractmethod
+    def nbytes(self):
+        """The number of bytes needed to store this object in memory."""
+        # type: () -> int


Type comments come before the docstring: http://mypy.readthedocs.io/en/latest/python2.html
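
A minimal illustration of the placement being requested, using a hypothetical class: with Python 2 compatible type comments, the signature comment sits directly after the `def` line, and the docstring follows it.

```python
class Example:
    """Hypothetical class used only to show comment placement."""

    def nbytes(self):
        # type: () -> int
        """The number of bytes needed to store this object in memory."""
        return 0


doc = Example.nbytes.__doc__
```

The string is still picked up as the docstring, because a comment is not a statement.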

shoyer · 2018-01-18T18:55:05Z

pandas/core/dtypes/base.py

+    @property
+    def type(self):
+        """Typically a metaclass inheriting from 'type' with no methods."""
+        return type(self.name, (), {})


In NumPy, dtype.type is the corresponding scalar type, e.g.,

>>> np.dtype(np.float64).type
numpy.float64

I don't know where "Typically a metaclass inheriting from 'type' with no methods." comes from.

shoyer · 2018-01-18T18:56:55Z

pandas/core/dtypes/base.py

+
+    @property
+    def kind(self):
+        """A character code (one of 'biufcmMOSUV'), default 'O'


This should clarify how it's used. How is this useful?

Perhaps "This should match dtype.kind when arrays with this dtype are cast to numpy arrays"?

* Removed take_nd
* Changed to_dense to return get_values
* Fixed docstrings, types
* Removed is_sparse

jreback · 2018-01-18T23:32:06Z

pandas/core/dtypes/dtypes.py


-class ExtensionDtype(object):
+
+class PandasExtensionDtype(ExtensionDtype):


why would this not be just ExtensionDtype

We define methods like __repr__, caching, etc. that are not part of the interface.

repr is reasonable to be part of the interface, caching I suppose is ok here

jreback · 2018-01-18T23:32:38Z

pandas/core/internals.py

@@ -95,6 +96,7 @@ class Block(PandasObject):
    is_object = False
    is_categorical = False
    is_sparse = False
+    is_extension = False


you wouldn't do it like this. rather you would inherit from an ExtensionBlock

I haven't looked at what these is_ properties are used for, but all the other blocks had them so I added it for consistency.

let's not add things like this until / unless necessary

jreback · 2018-01-18T23:33:00Z

pandas/core/internals.py

@@ -108,14 +109,15 @@ class Block(PandasObject):
    def __init__(self, values, placement, ndim=None, fastpath=False):
        if ndim is None:
            ndim = values.ndim
-        elif values.ndim != ndim:
+        elif self._validate_ndim and values.ndim != ndim:


yes, see my above comment

jreback · 2018-01-18T23:34:02Z

pandas/core/internals.py

@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns):
        return blocks, mask


+class ExtensionBlock(NonConsolidatableMixIn, Block):


this should NOT be in this file, make a dedicated namespace. pandas.core.internals.block or something

Why would it not be in this file? All the other blocks are. Note that ExtensionBlock is not ever going to be public, only ExtensionArray and ExtensionDtype.

I agree this should be in this file (if we find internals.py too long we can split it in multiple files, but let's do that in a separate PR to make reviewing here not harder).
@jreback ExtensionBlock will just be the Block used internally for our own extension types (a bit like NonConsolidatableBlock is not the base class for those)

this is exactly the point, this diff is already way too big. let's do a precursor PR to split Block and manager (IOW internals into 2 files).

By splitting the file first, the diff here will not be smaller

jorisvandenbossche

I will try to test this out with GeoPandas one of the coming days, to give some more feedback

jorisvandenbossche · 2018-01-19T09:18:34Z

pandas/core/arrays/base.py

+          i.e. ``ExtensionArray()`` returns an instance.
+          TODO: See comment in ``ExtensionDtype.construct_from_string``
+        * Your class should be able to be constructed with instances of
+          our class, i.e. ``ExtensionArray(extension_array)`` should returns


Should "our class" be "your class" ? Or should it be able to handle any ExtensionArray subclass (the first would be better IMO)

jorisvandenbossche · 2018-01-19T09:19:01Z

pandas/core/arrays/base.py

+
+        Notes
+        -----
+        As a sequence, __getitem__ should expect integer or slice ``key``.


also boolean mask?

jorisvandenbossche · 2018-01-19T09:20:16Z

pandas/core/arrays/base.py

+        if the slice is length 0 or 1.
+
+        For scalar ``key``, you may return a scalar suitable for your type.
+        The scalar need not be an instance or subclass of your array type.


Is "need not be" enough? (compared to "should not be")
I mean, we won't run into problems in the internals in pandas by seeing arrays where we expect scalars?

I'll clarify this to say

For scalar ``item``, you should return a scalar value suitable for your type. This should be an instance of ``self.dtype.type``.

My earlier phrasing was to explain that the return value for scalars needn't be the type of item that's actually stored in your array. E.g. for my IPAddress example, the array holds two uint64s, but a scalar slice returns an ipaddress.IPv4Address instance.
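
To make the distinction concrete, here is a simplified sketch assuming IPv4 only and a plain list of integers as storage (the real example above stores two uint64s per address). `IPv4Array` is hypothetical. What is stored is an integer, but scalar access hands back an `ipaddress.IPv4Address`.

```python
import ipaddress


class IPv4Array:
    """Hypothetical array that stores addresses as plain integers."""

    def __init__(self, addresses):
        # Store the packed integer form, not the address objects themselves.
        self._data = [int(ipaddress.IPv4Address(a)) for a in addresses]

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        if isinstance(item, slice):
            new = type(self)([])
            new._data = self._data[item]
            return new
        # Scalar access converts the stored integer back to the scalar type.
        return ipaddress.IPv4Address(self._data[item])


arr = IPv4Array(["192.168.1.1", "10.0.0.1"])
first = arr[0]   # an ipaddress.IPv4Address, not an int
head = arr[:1]   # still an IPv4Array
```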

jorisvandenbossche · 2018-01-19T09:22:37Z

pandas/core/arrays/base.py

+    # ------------------------------------------------------------------------
+    @property
+    def base(self):
+        """The base array I am a view of. None by default."""


If that is the case, I would remove this for now (we can always later extend the interface if it turns out to be needed for something).

However, was just wondering: your ExtensionArray could be a view on another ExtensionArray (eg by slicing). Is this something we need to consider?

jorisvandenbossche · 2018-01-19T09:42:25Z

pandas/core/arrays/base.py

+            Should return an ExtensionArray, even if ``self[slicer]``
+            would return a scalar.
+        """
+        return type(self)(self[slicer])


Yes, I would try to get rid of this if possible, and just ask that __getitem__ can deal with this (of course, alternative is to add separate methods for different __getitem__ functionalities like _slice, but then also _mask, but I don't really see the advantage of this).

jorisvandenbossche · 2018-01-19T09:49:18Z

pandas/core/dtypes/base.py

+        -----
+        The default implementation is True if
+
+        1. 'dtype' is a string that returns true for


"returns true" -> "does not raise" ?

jorisvandenbossche · 2018-01-19T09:52:24Z

pandas/core/dtypes/common.py

+
+    # we want to unpack series, anything else?
+    if isinstance(arr_or_dtype, ABCSeries):
+        arr_or_dtype = arr_or_dtype.values


This will only work if .values will return such a PeriodArray or IntervalArray, and I am not sure we already decided on that?

This is Series.values, what else would it return? An object-typed NumPy array? I think the ship has sailed on Series.values always being a NumPy array.

Let's use Series._values for now? That gets the values of the block, and is certainly an ExtensionArray in case the series holds one.
Then we can postpone the decision on what .values returns?

jorisvandenbossche · 2018-01-19T09:54:44Z

pandas/core/internals.py

+
+        Returns
+        -------
+        IntervalArray


jorisvandenbossche · 2018-01-19T09:57:24Z

pandas/core/internals.py

@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns):
        return blocks, mask


+class ExtensionBlock(NonConsolidatableMixIn, Block):


I agree this should be in this file (if we find internals.py too long we can split it in multiple files, but let's do that in a separate PR to make reviewing here not harder).
@jreback ExtensionBlock will just be the Block used internally for our own extension types (a bit like NonConsolidatableBlock is not the base class for those)

jorisvandenbossche · 2018-01-19T10:09:11Z

pandas/core/internals.py

+        # ExtensionArrays must be iterable, so this works.
+        values = np.asarray(self.values)
+        if values.ndim == self.ndim - 1:
+            values = values.reshape((1,) + values.shape)


Is this needed? I know it is currently like that in NonConsolidatableBlock, but do we ever expect the result to be a 2D array if this is held in a DataFrame?
E.g. a datetimetz block returns a DatetimeIndex here for both .values and .get_values(). On the other hand, a categorical block does this reshaping and returns Categorical vs 2d object numpy array.

Further, do we need to do something with dtype arg?

this is for sparse. which is a step-child ATM. has to be dealt with

jreback · 2018-01-19T11:17:48Z

I think that eventually we can remove NonConsolidatableMixIn, with the idea
that all non-consolidatable blocks are blocks for extension dtypes? That's true
today anyway.

yes you should fold NonConsolidatedBlock into the ExtensionBlock infrastructure. may be slightly tricky right now as Sparse has some special casing.

jreback · 2018-01-19T11:20:19Z

pandas/core/arrays/categorical.py

@@ -149,7 +151,7 @@ def _maybe_to_categorical(array):
 """


-class Categorical(PandasObject):
+class Categorical(ExtensionArray, PandasObject):


yeah, the methods in PandasObject need to be ABC in the ExtensionArray

jreback · 2018-01-19T11:23:07Z

pandas/core/dtypes/dtypes.py


-class ExtensionDtype(object):
+
+class PandasExtensionDtype(ExtensionDtype):


repr is reasonable to be part of the interface, caching I suppose is ok here

jreback · 2018-01-19T11:25:24Z

pandas/core/internals.py

@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns):
        return blocks, mask


+class ExtensionBlock(NonConsolidatableMixIn, Block):


this is exactly the point, this diff is already way too big. let's do a precursor PR to split Block and manager (IOW internals into 2 files).

jreback · 2018-01-19T11:26:25Z

pandas/core/internals.py

+        # ExtensionArrays must be iterable, so this works.
+        values = np.asarray(self.values)
+        if values.ndim == self.ndim - 1:
+            values = values.reshape((1,) + values.shape)


this is for sparse. which is a step-child ATM. has to be dealt with

jreback · 2018-01-19T11:27:19Z

pandas/core/internals.py

@@ -2410,29 +2525,6 @@ def shift(self, periods, axis=0, mgr=None):
        return self.make_block_same_class(values=self.values.shift(periods),
                                          placement=self.mgr_locs)

-    def take_nd(self, indexer, axis=0, new_mgr_locs=None, fill_tuple=None):
-        """


so as I said above, this PR is doing way way too much. pls just create the ExtensionBlock and just move things. Then in another PR you can do the changes.

This method is just moved a bit up in the file (to the parent class) ?

This really is doing the bare minimum changes to core/internals.py. This won't even allow, e.g. pd.Series(MyExtensionArray()) to work yet, though I have a follow-up PR ready to go that does that (with IntervalArray as a test case).

jbrockmendel · 2018-01-19T16:40:43Z

this is exactly the point, this diff is already way too big. let's do a precursor PR to split Block and manager (IOW internals into 2 files).

Some overnight-rebasing notwithstanding, I've got a branch ready that splits internals into internals.blocks, internals.managers, internals.joins (and except for import updates is pure cut/paste). Would it be helpful to push that now? I had planned to wait because it will make a ton of rebasing necessary in other PRs.

jorisvandenbossche · 2018-02-01T08:38:36Z

The practical consequence of no longer using abc.ABCMeta, is it only that you now cannot register a class but need to actually subclass? (which doesn't seem a big problem?)

TomAugspurger · 2018-02-01T11:45:03Z

That, and library author-wise, it's now a bit harder to validate that your extension array implements everything. Now it's possible to instantiate an ExtensionArray that doesn't implement the interface. With an ABC that would raise an exception.
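
The trade-off can be seen in a small standard-library sketch. The class names are hypothetical, and `NotImplementedError` stands in for whatever error the non-ABC base raises. With `abc`, an incomplete subclass fails at instantiation; without it, the failure is deferred until the missing method is called.

```python
import abc


class AbstractBase(abc.ABC):
    """Hypothetical interface enforced through ``abc``."""

    @abc.abstractmethod
    def take(self, indexer):
        """Select elements by position."""


class PlainBase:
    """Hypothetical interface that only raises when the method is called."""

    def take(self, indexer):
        raise NotImplementedError("take must be implemented by subclasses")


class IncompleteAbstract(AbstractBase):
    pass


class IncompletePlain(PlainBase):
    pass


try:
    IncompleteAbstract()
    abc_blocks_instantiation = False
except TypeError:
    abc_blocks_instantiation = True

plain_instance = IncompletePlain()  # succeeds despite the missing method

try:
    plain_instance.take([0])
    plain_fails_on_call = False
except NotImplementedError:
    plain_fails_on_call = True
```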

jreback · 2018-02-01T12:21:50Z

pandas/core/dtypes/common.py

+    # we want to unpack series, anything else?
+    if isinstance(arr_or_dtype, ABCSeries):
+        arr_or_dtype = arr_or_dtype._values
+    return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))


then _get_dtype_or_type needs adjustment. This is the point of compatibility, there shouldn't be the need to have special cases.

jreback · 2018-02-01T12:22:07Z

pandas/core/dtypes/dtypes.py

    """
    A np.dtype duck-typed class, suitable for holding a custom dtype.

    THIS IS NOT A REAL NUMPY DTYPE
    """
-    name = None
-    names = None


can you add some docs about these attributes

They're documented in ExtensionDtype.

jreback · 2018-02-01T12:22:23Z

pandas/core/internals.py

            raise ValueError(
                'Wrong number of items passed {val}, placement implies '
                '{mgr}'.format(val=len(self.values), mgr=len(self.mgr_locs)))

+    def _maybe_validate_ndim(self, values, ndim):


thanks. much nicer.

jreback · 2018-02-01T12:23:59Z

pandas/core/internals.py

+    ExtensionArrays are limited to 1-D.
+    """
+    def __init__(self, values, placement, ndim=None):
+        self._holder = type(values)


I think _holder can/should be a property of the Block itself (it can be a cached property)

Block defines _holder as a class attribute. Why would caching it be helpful? It's not reused among multiple ExtensionBlocks.

it's just 1 more thing to do in the init which is not needed
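
A sketch of the property-based alternative being suggested, using a hypothetical, simplified `Block` rather than the pandas class: the holder type is derived from the stored values on access, so `__init__` needs no extra assignment.

```python
class Block:
    """Hypothetical stand-in for a block class (not the pandas class)."""

    def __init__(self, values):
        self.values = values

    @property
    def _holder(self):
        # Derived from the stored values, so __init__ needs no extra step.
        return type(self.values)


block = Block((1, 2, 3))
holder = block._holder
```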

jreback · 2018-02-01T12:24:34Z

pandas/core/internals.py

+        else:
+            fill_value = fill_tuple[0]
+
+        # axis doesn't matter; we are really a single-dim object


a doc-string would help here

jreback · 2018-02-01T12:25:16Z

pandas/core/internals.py

@@ -2437,7 +2507,8 @@ class DatetimeBlock(DatetimeLikeBlockMixin, Block):
    _can_hold_na = True

    def __init__(self, values, placement, ndim=None):
-        if values.dtype != _NS_DTYPE:
+        if values.dtype != _NS_DTYPE and values.dtype.base != _NS_DTYPE:


something is wrong if you have to change this.

Removed ExtensionBlock.__init__

TomAugspurger · 2018-02-01T14:29:53Z

Done in cd0997e if you could take a look. I did it for every block, which seemed better than a mix of attributes and properties.


TomAugspurger · 2018-02-01T16:48:57Z

pandas/core/internals.py

@@ -108,12 +108,7 @@ class Block(PandasObject):
    _concatenator = staticmethod(np.concatenate)

    def __init__(self, values, placement, ndim=None):
-        if ndim is None:


cc @jbrockmendel. Pretty sure this is what you had in mind. Agreed it's cleaner.

(needed a followup commit to use self.ndim a few lines down.)

TomAugspurger · 2018-02-02T03:12:44Z

CI is all green.

TomAugspurger · 2018-02-02T19:13:02Z

Just to make sure, @jorisvandenbossche, @shoyer are you +1 on the changes here?

jorisvandenbossche

Yes, +1 to merge this now, to be able to proceed with the other parts. We will probably need to make minor tweaks to the interface, but that will only become clear by doing next PRs to actually use this.

Added some minor doc comments. And two comments on the interface (take fill_value kwarg and formatting_values), but that can also go in a follow-up.

jorisvandenbossche · 2018-02-02T19:21:35Z

pandas/core/arrays/base.py

+    * take
+    * copy
+    * _formatting_values
+    * _concat_same_type


The two lines above can be removed here and mentioned in the list below (actually only formatting_values, as concat_same_type is already there)

jorisvandenbossche · 2018-02-02T19:23:12Z

pandas/core/arrays/base.py

+    an instance, not error.
+
+    Additionally, certain methods and interfaces are required for proper
+    this array to be properly stored inside a ``DataFrame`` or ``Series``.


this is repetitive with above (the list of methods that are required), and there is also a typo in "for ~~proper~~ this array to be properly"

jorisvandenbossche · 2018-02-02T19:25:19Z

pandas/core/arrays/base.py

+
+        Examples
+        --------
+        Suppose the extension array somehow backed by a NumPy structured array


let's make this just "NumPy array", as this is not specific to structured arrays

jorisvandenbossche · 2018-02-02T19:28:05Z

pandas/core/arrays/base.py

+           def take(self, indexer, allow_fill=True, fill_value=None):
+               mask = indexer == -1
+               result = self.data.take(indexer)
+               result[mask] = self._fill_value


One question here is: should the keyword argument fill_value actually be honored? (like if fill_value is None: fill_value = self._fill_value)
Are there cases where pandas will actually pass a certain value?

In any case, some clarification would be helpful, even if it is only in the signature for compatibility and may be ignored (maybe in a follow-up).

re fill_value, I based this off Categorical.take. It does assert isna(fill_value), but otherwise ignores it.

I think that since ExtensionArray.take returns an ExtensionArray, most implementations will just ignore fill_value. I'll clarify it in the docs.
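For reference, a minimal sketch of the "honor fill_value" variant asked about above. ToyArray, its float64 backing store, and the NaN sentinel are all hypothetical stand-ins for illustration; this is not the code in pandas/core/arrays/base.py, and real implementations may well ignore fill_value as described.

```python
import numpy as np


class ToyArray:
    """Hypothetical array wrapper, used only to illustrate the take semantics."""

    # Assumed missing-value sentinel for this toy type.
    _fill_value = np.nan

    def __init__(self, data):
        self.data = np.asarray(data, dtype="float64")

    def take(self, indexer, allow_fill=True, fill_value=None):
        indexer = np.asarray(indexer, dtype="int64")
        # Fall back to the array's own sentinel when the caller passes nothing.
        if fill_value is None:
            fill_value = self._fill_value
        # ndarray.take returns a new array, so writing into it is safe.
        result = self.data.take(indexer)
        if allow_fill:
            # Positions marked with -1 are treated as missing.
            result[indexer == -1] = fill_value
        return ToyArray(result)
```

With this sketch, `ToyArray([1.0, 2.0, 3.0]).take([0, -1])` fills the second slot with the sentinel, while `allow_fill=False` keeps NumPy's usual "last element" meaning for -1.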

jorisvandenbossche · 2018-02-02T19:29:53Z

pandas/core/arrays/base.py

+        # type: () -> np.ndarray
+        # At the moment, this has to be an array since we use result.dtype
+        """An array of values to be printed in, e.g. the Series repr"""
+        raise AbstractMethodError(self)


Maybe we can provide a default implementation of return np.asarray(self)? (so a densified object array). That is what I do in geopandas, and I suppose it would also work for the IP addresses?

Yes, I suppose that'll be OK for many implementations.
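A rough sketch of that default, assuming the subclass knows how to densify itself through `__array__`. ToyBase and ToyIPArray are hypothetical names for illustration only; they are not the pandas or geopandas classes.

```python
import numpy as np


class ToyBase:
    """Hypothetical base class showing the proposed default."""

    def _formatting_values(self):
        # Default: densify to a NumPy array via the subclass's __array__.
        return np.asarray(self)


class ToyIPArray(ToyBase):
    """Hypothetical array of address strings."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # Densified object array; dtype and copy are accepted for NumPy
        # compatibility but ignored in this sketch.
        return np.array(self._values, dtype=object)
```

A subclass that needs cheaper or prettier output could still override `_formatting_values`; the default only covers the case where a dense object array is good enough for the repr.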

TomAugspurger · 2018-02-02T19:56:28Z

Thanks @jorisvandenbossche, I've fixed up the docs in my followup branch.

And I think you're right about a default implementation for _formatting_values, writing tests for that as well.

shoyer · 2018-02-02T19:57:00Z

Yes, I'm pretty sure I'm +1 on the current state, though right now GitHub is crashing when I try to load the page with this PR :).

TomAugspurger · 2018-02-02T19:58:12Z

Yes, I'm pretty sure I'm +1 on the current state, though right now GitHub
is crashing when I try to load the page with this PR :).

This is partly why I'm hoping to push further changes into a followup :)

jorisvandenbossche · 2018-02-02T19:59:05Z

Yes, for me the same. I think the page is only loading in 1 out of 5 tries or so. Another reason to merge :-)

jreback · 2018-02-02T21:00:10Z

i haven’t had a look
but i guess it’s ok

TomAugspurger · 2018-02-02T21:34:40Z

Alrighty, thanks! Followup PRs inbound :)

jorisvandenbossche · 2018-02-02T21:38:06Z

i haven’t had a look
but i guess it’s ok

Don't hesitate to still add comments here; they can then always be dealt with in follow-up PRs.

…9268)

* REF: Define extension base classes
* Updated for comments
* removed take_nd
* Changed to_dense to return get_values
* Fixed docstrings, types
* Removed is_sparse
* Remove metaclasses from PeriodDtype and IntervalDtype
* Fixup form_blocks rebase
* Restore concat casting cat -> object
* Remove _slice, clarify semantics around __getitem__
* Document and use take.
* Clarify type, kind, init
* Remove base
* API: Remove unused __iter__ and get_values
* API: Implement repr and str
* Remove default value_counts for now
* Fixed merge conflicts
* Remove implementation of construct_from_string
* Example implementation of take
* Cleanup ExtensionBlock
* Pass through ndim
* Use series._values
* Removed repr, updated take doc
* Various cleanups
* Handle get_values, to_dense, is_view
* Docs
* Remove is_extension, is_bool
  Remove inherited convert
* Sparse formatter
* Revert "Sparse formatter"
  This reverts commit ab2f045.
* Unbox SparseSeries
* Added test for sparse consolidation
* Docs
* Moved to errors
* Handle classmethods, properties
* Use our AbstractMethodError
* Lint
* Cleanup
* Move ndim validation to a method.
* Try this
* Make ExtensionBlock._holder a property
  Removed ExtensionBlock.__init__
* Make _holder a property for all
* Refactored validate_ndim
* fixup! Refactored validate_ndim
* lint

TomAugspurger added the Enhancement and Internals (Related to non-user accessible pandas implementation) labels Jan 16, 2018

TomAugspurger added this to the 0.23.0 milestone Jan 16, 2018

TomAugspurger changed the title ~~Pandas array interface 3~~ Array Interface and Categorical internals Refactor Jan 16, 2018

TomAugspurger mentioned this pull request Jan 16, 2018

ENH: Extending Pandas with custom types #19174

Closed

TomAugspurger commented Jan 16, 2018

TomAugspurger mentioned this pull request Jan 16, 2018

REF: Move pandas.core.categorical #19269

Merged

REF: Define extension base classes

2ef5216

TomAugspurger force-pushed the pandas-array-interface-3 branch from c4ff28f to 2ef5216 Compare January 18, 2018 18:24

shoyer reviewed Jan 18, 2018

TomAugspurger added 4 commits January 18, 2018 14:12

Updated for comments

57e8b0f

* removed take_nd
* Changed to_dense to return get_values
* Fixed docstrings, types
* Removed is_sparse

Remove metaclasses from PeriodDtype and IntervalDtype

01bd42f

Fixup form_blocks rebase

ce81706

Restore concat casting cat -> object

87a70e3

jreback requested changes Jan 18, 2018

jorisvandenbossche reviewed Jan 19, 2018

jreback requested changes Jan 19, 2018

Remove _slice, clarify semantics around __getitem__

8c61886

Move ndim validation to a method.

9c06b13

jreback requested changes Feb 1, 2018

Try this

7d2cf9c

TomAugspurger mentioned this pull request Feb 1, 2018

CLN: Standardize values coercion during Block initialization #19492

Closed

TomAugspurger added 2 commits February 1, 2018 08:06

Make ExtensionBlock._holder a property

afae8ae

Removed ExtensionBlock.__init__

Make _holder a property for all

cd0997e

Refactored validate_ndim

1d6eb04

TomAugspurger commented Feb 1, 2018

fixup! Refactored validate_ndim

92aed49

TomAugspurger mentioned this pull request Feb 1, 2018

New DataFrame feature: listify() and unlistify() #10511

Closed

lint

34134f2

jorisvandenbossche approved these changes Feb 2, 2018

TomAugspurger merged commit e8620ab into pandas-dev:master Feb 2, 2018

TomAugspurger deleted the pandas-array-interface-3 branch February 2, 2018 21:34

TomAugspurger mentioned this pull request Feb 2, 2018

ENH: Allow storing ExtensionArrays in containers #19520

Merged

TomAugspurger mentioned this pull request Feb 14, 2018

ExtensionArray meta-issue #19696

Closed

15 tasks

twoertwein mentioned this pull request Oct 1, 2022

Mark methods raising AbstractMethodError as abstractmethods? #48909

Open

