API/DOC: an ExtensionDtype.__from_arrow__ method to convert pyarrow.Array into ExtensionArray #29229

Merged


jorisvandenbossche
Member

xref the discussion in #20612, and a companion to my PR in Arrow: apache/arrow#5512

Summary: to support ExtensionArrays in the conversion of an arrow table into a pandas DataFrame, we need some way to convert pyarrow Arrays into pandas ExtensionArrays, given a certain pandas dtype (in pyarrow, we can for example know the resulting dtype from the stored metadata).

For that, I propose to add the ExtensionDtype.__from_arrow__ method, with the following signature:

class ExtensionDtype:

    def __from_arrow__(self, array: pyarrow.Array) -> pandas.ExtensionArray:
        ...

Note: so far I have only added documentation for it (which should still be expanded), and not a method on the base class (e.g. one raising NotImplementedError), because pyarrow uses hasattr to check whether this is supported (see the linked Arrow PR).
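To illustrate why the base class deliberately lacks the method, here is a minimal sketch of hasattr-based dispatch. Every class and the `convert_column` helper are hypothetical stand-ins written for this example; they are not the pandas or pyarrow implementations, and a plain Python list stands in for a `pyarrow.Array`.

```python
# Toy stand-ins only: none of these are real pandas or pyarrow classes.

class ToyArray:
    """Stand-in for a pandas ExtensionArray."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype


class ToyDtype:
    """Stand-in for an ExtensionDtype that opts in to the protocol."""

    name = "toy"

    def __from_arrow__(self, array):
        # `array` stands in for a pyarrow.Array; any iterable works here.
        return ToyArray(array, dtype=self)


class PlainDtype:
    """Stand-in for a dtype that does not implement the protocol."""

    name = "plain"


def convert_column(dtype, arrow_like):
    """Hypothetical consumer-side helper mimicking a hasattr check."""
    if hasattr(dtype, "__from_arrow__"):
        return dtype.__from_arrow__(arrow_like)
    # No protocol support: fall back to a default conversion (a list here).
    return list(arrow_like)
```

If the base class defined `__from_arrow__` and raised NotImplementedError, the `hasattr` check would succeed for every dtype, and the consumer would have to catch the exception instead.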

@jorisvandenbossche jorisvandenbossche added Compat pandas objects compatability with Numpy or Python functions ExtensionArray Extending pandas with custom dtypes or arrays. labels Oct 25, 2019
@WillAyd
Member

WillAyd commented Oct 25, 2019

Would this make more sense to be defined on the ExtensionArray itself rather than the Dtype?

@jorisvandenbossche
Member Author

@TomAugspurger asked the same question on the issue: #20612 (comment), where I gave some arguments.

I am certainly not tied to it, but reasons I prefer the dtype:

  • It's the dtype object that pyarrow has access to (and not the array class). So if this is defined on the array class, there needs to be another step of indirection: pandas_dtype(dtype).construct_array_type().__from_arrow__(..) instead of pandas_dtype(dtype).__from_arrow__(..).
  • We don't require a 1-to-1 mapping between dtype and array class, so you can have multiple dtypes using the same array class. We in fact do this ourselves for IntegerArray and Int64Dtype/Int32Dtype/..., and once the array class is constructed, it does not know by which dtype it was constructed. Now in practice for this example it is not a problem, as the IntegerArray will know from the incoming arrow array type which pandas dtype to use, but in theory this is not guaranteed to always be the case.
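The two points above can be sketched with toy classes. The names below (`ToyIntegerArray`, `ToyInt32Dtype`, `ToyInt64Dtype`) are hypothetical stand-ins that only mirror the shape of the IntegerArray and Int64Dtype/Int32Dtype relationship; they are not the pandas classes, and a plain list stands in for a `pyarrow.Array`.

```python
# Toy stand-ins only: these mirror the shape of the design, not pandas itself.

class ToyIntegerArray:
    """One array class shared by several dtypes."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype


class ToyIntegerDtype:
    bits = None  # set by the concrete subclasses below

    def construct_array_type(self):
        # Both concrete dtypes return the same array class.
        return ToyIntegerArray

    def __from_arrow__(self, array):
        # The dtype instance is available as `self`, so the resulting
        # array knows exactly which dtype it was built for.
        return self.construct_array_type()(array, dtype=self)


class ToyInt32Dtype(ToyIntegerDtype):
    bits = 32


class ToyInt64Dtype(ToyIntegerDtype):
    bits = 64


arr32 = ToyInt32Dtype().__from_arrow__([1, 2, 3])
arr64 = ToyInt64Dtype().__from_arrow__([1, 2, 3])
```

Here `arr32` and `arr64` are instances of the same array class but carry different dtypes. A method on the array class alone would have to infer the dtype from the incoming data, which works for this integer example but is not guaranteed in general.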

@jreback
Contributor

jreback commented Oct 26, 2019

lgtm - we already construct from the dtype internally so this is appropriate

@jorisvandenbossche
Member Author

Note: this PR is ready on my side, but let's leave it open until the feature is actually merged in Arrow (based on feedback there, details might still change).

@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Oct 29, 2019
@jorisvandenbossche
Member Author

OK, this is merged in arrow now. So merging here as well then.

@jorisvandenbossche jorisvandenbossche merged commit 39777cb into pandas-dev:master Nov 8, 2019
@jorisvandenbossche jorisvandenbossche deleted the EA-from-arrow branch November 8, 2019 14:05
Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019