
ARROW-2428: [Python] Support pandas ExtensionArray in Table.to_pandas conversion #5512

Conversation

@jorisvandenbossche (Member) commented Sep 26, 2019

Prototype for https://issues.apache.org/jira/browse/ARROW-2428

What does this PR do?

  • Based on the `pandas_metadata` (stored when creating a Table from a pandas DataFrame), we infer which columns originally had a pandas extension dtype, and we support a custom conversion for those (based on a `__from_arrow__` method defined on the pandas extension dtype).
  • The user can also specify explicitly, with the `extension_columns` keyword, which columns should be converted to an extension dtype.

This only covers use case 1 discussed in the issue: the automatic roundtrip for pandas DataFrames that have extension dtypes.
So, for example, it does not yet provide a way to do this if the Arrow Table has no pandas metadata (i.e. it did not originate from a pandas DataFrame).
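To illustrate the covered case, a minimal roundtrip sketch (assuming a pandas version whose nullable `Int64Dtype` implements the `__from_arrow__` protocol):

```python
import pandas as pd
import pyarrow as pa

# DataFrame with a pandas extension dtype (nullable integers)
df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})

table = pa.Table.from_pandas(df)  # pandas metadata records the original dtype
result = table.to_pandas()        # dispatches to Int64Dtype.__from_arrow__
assert result["a"].dtype == pd.Int64Dtype()
```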

@jorisvandenbossche force-pushed the ARROW-2428-arrow-pandas-conversion branch 2 times, most recently from 3712104 to e499b44 on September 26, 2019 13:56
@wesm (Member) left a comment:

The implementation is somewhat gnarly but I don't see an easy way to achieve the desired outcome otherwise.

```python
name = columns[placement[0]]
pandas_dtype = extension_columns[name]
if not hasattr(pandas_dtype, '__from_arrow__'):
    raise ValueError("This column does not support")
```
@wesm (Member):

Incomplete?


```python
try:
    # patch pandas Int64Dtype to have the protocol method
    pd.Int64Dtype.__from_arrow__ = _Int64Dtype__from_arrow__
```
@wesm (Member):

Aside: is there a monkey-patch context manager somewhere?

@jorisvandenbossche (Member, Author):

pytest has a `monkeypatch` fixture that we can use. This would look like:

```python
def test_convert_to_extension_array(monkeypatch):
    ...
    monkeypatch.setattr(pd.Int64Dtype, "__from_arrow__",
                        _Int64Dtype__from_arrow__, raising=False)
    ...
```

without the need for the try/except to ensure clean-up: `monkeypatch` automatically undoes the attribute change at test teardown, and `raising=False` allows setting an attribute that does not exist yet.

@wesm (Member) commented Oct 3, 2019

I also don't know whether exposing the `extension_columns` parameter publicly makes sense.

@jorisvandenbossche (Member, Author) commented:

> I also don't know whether exposing the `extension_columns` parameter publicly makes sense.

Yes, I agree; I am also not yet fully settled on that part of the API. For now, it was mainly there to make this easier to test.

Currently, I provide an "automatic detection" based on the stored `pandas_metadata` (which enables the automatic roundtrip).

But I think we need other mechanisms as well:

  • Arrow extension types need to be able to specify a custom pandas dtype to convert to (one option is that they point to a pandas ExtensionDtype object, and we then reuse the same `ExtensionDtype.__from_arrow__` logic).
  • It would be nice to have some way to override all of this logic. For example, consider the use case where you have a Parquet file (or Arrow memory in general) that does not originate from pandas (and thus has no pandas metadata), and you want to ensure that all integer columns are converted to the pandas nullable integer type instead of the default numpy int64 dtype. One option could be to specify, somehow and somewhere, a type map like `{pa.int64(): pd.IntegerDtype}` (but how would that work for parametrized types?).

@wesm (Member) commented Oct 8, 2019

Seems like a little more work is needed here, at least some docstrings. Let me know when you want me to take another look.

@jorisvandenbossche (Member, Author) commented:
I think #5162 could already be merged (based on your comment there #5162 (comment))? Then I will update this PR after that.

@jorisvandenbossche marked this pull request as ready for review on October 30, 2019 13:07
@jorisvandenbossche (Member, Author) commented:

OK, this should be ready for review now.

I have now added support for two of the cases in https://issues.apache.org/jira/browse/ARROW-2428:

  • If there is `pandas_metadata`, we check which columns originated from pandas extension dtypes.
  • If you defined a `pyarrow.ExtensionType`, we check whether its `ExtensionType.to_pandas_dtype` method returns a pandas extension dtype.

In both cases, the actual conversion of the pyarrow array to the pandas ExtensionArray is dispatched to `ExtensionDtype.__from_arrow__`, a method on the pandas extension dtype that knows how to convert an Arrow array into an extension array of its own dtype.
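In pseudocode, this dispatch amounts to something like the following (a simplified sketch; `convert_column` is a hypothetical name, not the actual helper in the patch):

```python
import pyarrow as pa

def convert_column(arr: pa.ChunkedArray, pandas_dtype=None):
    # If a pandas extension dtype was detected (via the pandas metadata or via
    # ExtensionType.to_pandas_dtype), let the dtype do the conversion itself.
    if pandas_dtype is not None and hasattr(pandas_dtype, "__from_arrow__"):
        return pandas_dtype.__from_arrow__(arr)
    # Otherwise fall back to the default numpy-based conversion.
    return arr.to_pandas()
```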

@jorisvandenbossche (Member, Author) commented:

I did some performance profiling to see the impact of this change, as checking dtypes can be quite expensive, and for dataframes with only basic numpy types this check is pure overhead without any benefit.

For a relatively normal sized dataframe (100_000 rows x 100 columns) of only floats, this gives a small overhead (from 26ms to 29ms for Table.to_pandas()).
But since this check scales linearly with the number of columns, for a "wide" dataframe (100 rows x 1000 columns) there is a big slowdown: 10ms to 22ms.
See https://gist.github.com/jorisvandenbossche/d36bceb82fd2dda38ee419ba51dff5ed

Now, the most expensive part (~90% of the overhead) of the code I added is the `pandas_dtype(..)` function (a pandas function that converts a string to either a registered ExtensionDtype or a numpy dtype). Since we only need it in case it would return an ExtensionDtype, we can avoid calling it when we know in advance that we have the string representation of a numpy dtype. And since numpy supports only a limited set of dtypes, we can hardcode those "known" strings to avoid calling `pandas_dtype`. For the simple floats dataframe above, this would remove most of the overhead.
→ Since it is an easy way to avoid overhead, I will implement this.
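A minimal sketch of that fast path (identifiers hypothetical, not the actual names in the patch):

```python
import numpy as np
from pandas.api.types import pandas_dtype

# Strings known to denote plain numpy dtypes; for these we can skip the
# expensive pandas_dtype() call entirely.
_KNOWN_NUMPY_DTYPE_STRINGS = {
    "bool", "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64", "object",
}

def maybe_extension_dtype(dtype_string):
    """Return the pandas ExtensionDtype for dtype_string, or None."""
    if dtype_string in _KNOWN_NUMPY_DTYPE_STRINGS:
        return None  # plain numpy dtype, no extension conversion needed
    dtype = pandas_dtype(dtype_string)
    return None if isinstance(dtype, np.dtype) else dtype
```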

Another (additional) option could be to add a keyword to `to_pandas` to disable the extension type checks, so a user who wants the best performance and knows the data only contains basic types can skip them.

@wesm (Member) left a comment:

+1. This looks OK here, thanks @jorisvandenbossche!

@wesm closed this in 7f4165c on Nov 5, 2019
@jorisvandenbossche deleted the ARROW-2428-arrow-pandas-conversion branch on November 5, 2019 13:08
nealrichardson pushed a commit that referenced this pull request on Jan 23, 2020:

ARROW-7569: Add API to map Arrow types to pandas ExtensionDtypes in to_pandas conversions

See https://issues.apache.org/jira/browse/ARROW-7569 and https://issues.apache.org/jira/browse/ARROW-2428 for context. #5512 only covered the first two cases described in ARROW-2428; this also tries to cover the third case.

This PR adds a `types_mapping` keyword to `Table.to_pandas` to specify pandas ExtensionDtypes to use for built-in Arrow types in the conversion.
One specific use case for this ability is converting Arrow integer types to pandas' nullable integer dtype instead of a numpy integer dtype (and similarly for the other nullable dtypes in pandas). For example:

```python
table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})
```

This avoids first converting the int columns to a numpy dtype (possibly float, when nulls are present) and instead directly constructs the pandas nullable dtype.

More tests still need to be added, and one important concern is that using a pyarrow type instance as the dict key might not easily work for parametrized types (e.g. timestamp with a resolution / timezone).
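The squashed commits below indicate the dict was later replaced by a function (`b61b1f5 dict -> function`), which sidesteps that concern: a callable can inspect a type instead of matching it by hash. A hedged sketch of the function-based variant (keyword name `types_mapper` assumed; it differs from the `types_mapping` shown above):

```python
import pandas as pd
import pyarrow as pa

def types_mapper(arrow_type):
    # A function can match whole families of (parametrized) types.
    if pa.types.is_integer(arrow_type):
        return pd.Int64Dtype()
    if pa.types.is_timestamp(arrow_type) and arrow_type.tz is not None:
        return pd.DatetimeTZDtype(unit="ns", tz=arrow_type.tz)
    return None  # None falls back to the default conversion

df = table.to_pandas(types_mapper=types_mapper)
```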

Closes #6189 from jorisvandenbossche/ARROW-7569-to-pandas-types-mapping and squashes the following commits:

cb82f5c <Joris Van den Bossche> expand tests
1d9c37c <Joris Van den Bossche> simplify (remove unused extension_columns arg)
b61b1f5 <Joris Van den Bossche> dict -> function
f3464b1 <Joris Van den Bossche> ARROW-7569:  Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>