Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add NDArrayBackedExtensionArray to public API #45544

Closed
wants to merge 31 commits into from

Conversation

tswast
Copy link
Contributor

@tswast tswast commented Jan 21, 2022

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

@tswast
Copy link
Contributor Author

tswast commented Jan 21, 2022

In the db-dtypes package (time and date data types at the moment), we found NDArrayBackedExtensionArray to be quite useful. Unfortunately, this is a private API at the moment. Per googleapis/python-db-dtypes-pandas#28 (comment), sending a PR to make this public.

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 23, 2022
@jreback
Copy link
Contributor

jreback commented Jan 23, 2022

no objections

cc @jbrockmendel @jorisvandenbossche

@@ -19,6 +19,7 @@
ExtensionArray,
ExtensionScalarOpsMixin,
)
from pandas.core.arrays._mixins import NDArrayBackedExtensionArray
Copy link
Contributor

@jreback jreback Jan 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am shocked we don't have a tests that asserts this api e.g. test_api.py can you add one ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably import this into pandas.core.arrays.__init__ and here do the import from pandas.core.arrays

@jbrockmendel
Copy link
Member

I've been meaning to get around to this for a while. Thanks @tswast for beating me to it!

One other thing we should ideally do is add a paragraph in the doc/source/development/extending.rst about this. something to the effect of "if your EA is a thin wrapper around an ndarray, you can save some effort by using ..."

@tswast tswast marked this pull request as ready for review January 24, 2022 17:08
@tswast
Copy link
Contributor Author

tswast commented Jan 24, 2022

Added some tests and docs. Should be ready for review now.

@jbrockmendel jbrockmendel mentioned this pull request Jan 24, 2022
4 tasks
@jreback jreback added this to the 1.5 milestone Jan 28, 2022
@github-actions github-actions bot added the Stale label Aug 3, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 22, 2022
@tswast
Copy link
Contributor Author

tswast commented Aug 25, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Just pulled in the latest changes. Would love to get this documented, as it has been helpful in building the dbtime and dbdate dtypes in https://github.com/googleapis/python-db-dtypes-pandas

Convert a value or values for use in setting a value or values in the backing
NumPy array.

``_validate_searchsorted_value``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 2.0 i think this is going away and we'll re-use _validate_setitem_value for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified that most implementations will be identical to _validate_setitem_value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_validate_searchsorted_value is gone now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove _validate_searchsorted_value here



To support 2D arrays, use the ``_from_backing_data`` helper function when a
method is called on multi-dimensional data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specify the data should be of the same dtype as self._ndarray?

@jbrockmendel
Copy link
Member

couple of comments, otherwise lgtm. thanks for your patience

@jbrockmendel
Copy link
Member

@tswast can you merge main. some of the deprecations discussed above have been enforced

@tswast
Copy link
Contributor Author

tswast commented Nov 23, 2022

@jbrockmendel I've synced to main and addressed a remaining docs build warning.

_internal_fill_value = numpy.datetime64("NaT")

def __init__(self, values):
backing_array_dtype = "<M8[ns]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a np.dtype object instead of a string


def _validate_setitem_value(self, value):
if pandas.api.types.is_list_like(value):
return [self._validate_scalar(v) for v in value]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be an ndarray of the same dtype as self._ndarray

@@ -65,6 +65,7 @@ Other enhancements
- :func:`timedelta_range` now supports a ``unit`` keyword ("s", "ms", "us", or "ns") to specify the desired resolution of the output index (:issue:`49824`)
- :meth:`DataFrame.to_json` now supports a ``mode`` keyword with supported inputs 'w' and 'a'. Defaulting to 'w', 'a' can be used when lines=True and orient='records' to append record oriented json lines to an existing json file. (:issue:`35849`)
- Added ``name`` parameter to :meth:`IntervalIndex.from_breaks`, :meth:`IntervalIndex.from_arrays` and :meth:`IntervalIndex.from_tuples` (:issue:`48911`)
- :class:`NDArrayBackedExtensionArray` now exposed in the public API. (:issue:`45544`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no trailing period

def test_api(self):
checkthese = self.classes + self.funcs + self.misc

self.check(namespace=extensions, expected=checkthese)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not that familiar with this test file. what is being tested here?

@simonjayhawkins
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@MichaelTiemannOSC
Copy link
Contributor

@tswast now that 2.2 is out the door, and now that Pandas 2.x in general has made huge strides in its ExtensionArray implementation(s), could we get this PR back to active status by merging in main and resubmitting? If you don't have time for this, perhaps I can take a crack at it (with help from other Pint-Pandas people). In that case, what is the git-friendly way for me to pick up where you left off? We have an active use case that's ready to put this to the test.

CC: @andrewgsavage

@andrewgsavage
Copy link

@MichaelTiemannOSC I had a crack at it in my PR #56755. It's still active, just waiting on some help from pandas devs.

@tswast
Copy link
Contributor Author

tswast commented Jan 22, 2024

Thanks @MichaelTiemannOSC and @andrewgsavage , I don't have time to revive this, so I appreciate your efforts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants