
ARROW-7569: [Python] Add API to map Arrow types to pandas ExtensionDtypes in to_pandas conversions #6189

Conversation

@jorisvandenbossche (Member) commented Jan 14, 2020

See https://issues.apache.org/jira/browse/ARROW-7569 and https://issues.apache.org/jira/browse/ARROW-2428 for context. #5512 only covered the first two cases described in ARROW-2428; this PR also tries to cover the third case.

This PR adds a types_mapping keyword to Table.to_pandas to specify which pandas ExtensionDtypes to use for built-in Arrow types in the conversion.
One specific use case for this is converting Arrow integer types to pandas' nullable integer dtype instead of a numpy integer dtype (or to one of the other nullable extension dtypes in pandas). For example:

table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})

will avoid first converting the int columns to a numpy dtype (possibly float), by constructing the pandas nullable dtype directly.

I still need to add more tests, and one important concern is that using a pyarrow type instance as the dict key might not easily work for parametrized types (e.g. timestamp with resolution / timezone).
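
To make the intended behaviour concrete, here is a minimal sketch of the use case (written against the keyword as it was eventually merged, types_mapper, which takes a function; see the discussion below):

import pandas as pd
import pyarrow as pa

# An int64 column containing nulls.
table = pa.table({"a": pa.array([1, 2, None], type=pa.int64())})

# Default conversion: the nulls force the column to a float64 numpy dtype.
table.to_pandas().dtypes    # a    float64

# With a mapping to pandas' nullable integer dtype, the column stays integer.
df = table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
df.dtypes                   # a    Int64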

@jorisvandenbossche (Member Author)

So I think the main problem with my approach here is that I use pyarrow type instances in the mapping to compare against.
That would be quite cumbersome for parametrized types, e.g. timestamp (although the possible options there are limited) or dictionary types, since a dictionary type with different index/value types will compare unequal.

Are there alternatives to the type instance? I think a "name" would be most ergonomic, but pyarrow types don't have a general name (only a str representation, which includes the parametrization). Another option would be the type.id, but that seems less user-friendly (and it would also be identical for all extension types).
Or a fully different API altogether?

cc @wesm

@wesm (Member) commented Jan 22, 2020

Another option is to have the mapper be a function.

The case you have here is that the function is always types_mapping.__getitem__. Allowing any function f(pyarrow.DataType) -> ExtensionDtype would require a bit more mental gymnastics, but it would be more flexible. Thoughts?
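
For illustration, such a callable could look like the sketch below (made-up logic; it assumes a pandas version where Int64Dtype and StringDtype implement __from_arrow__). Using type predicates instead of exact type instances sidesteps the parametrization problem:

import pandas as pd
import pyarrow as pa

def my_types_mapper(arrow_type):
    # Predicates match whole families of types, so parametrization
    # (bit width, timestamp unit/timezone, ...) is not an issue.
    if pa.types.is_signed_integer(arrow_type):
        return pd.Int64Dtype()
    if pa.types.is_string(arrow_type):
        return pd.StringDtype()
    # None means: no extension dtype, fall back to the default conversion.
    return None

# usage (with the keyword name as eventually merged):
# table.to_pandas(types_mapper=my_types_mapper)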

@jorisvandenbossche (Member Author)

Another option is to have the mapper be a function.

Ah, that sounds like a good idea! It's more flexible for the harder cases (like dictionary types with different index types, or extension types), and not much more difficult for the easy cases where you already have a dict with the mapping.
I will maybe go with types_mapping.get instead of types_mapping.__getitem__ as the way the function should behave for this case (i.e. a function that returns None, rather than raising, when no pandas ExtensionDtype should be used).
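
Concretely, a small sketch of the difference (reusing the table from the example in the description):

mapping = {pa.int64(): pd.Int64Dtype()}

# mapping.__getitem__ would raise KeyError for any type not in the dict;
# mapping.get returns None instead, so unmapped columns simply fall back
# to the default numpy-based conversion.
table.to_pandas(types_mapper=mapping.get)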

if not hasattr(pandas_dtype, "__from_arrow__"):
    raise ValueError("this column does not support to be "
                     "converted to extension dtype")
ext_columns[name] = pandas_dtype
@jorisvandenbossche (Member Author) commented on this diff on Jan 23, 2020
I removed this "else" branch for when extension_columns was specified, as it is no longer used (I added it initially to be able to specify which columns to convert, for testing purposes, before inference from the metadata was implemented).

(This makes the diff a bit harder to read, but in the whole _get_extension_dtypes function basically only the if types_mapper: block is added; the rest is only dedented.)
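
For context on that hasattr check (an illustrative sketch, not part of this diff): __from_arrow__ is the hook pandas extension dtypes provide to construct an ExtensionArray from a pyarrow Array or ChunkedArray, and it is what the conversion calls for the mapped columns:

import pandas as pd
import pyarrow as pa

chunked = pa.chunked_array([[1, 2, None]], type=pa.int64())
dtype = pd.Int64Dtype()

# Dtypes that support the conversion expose __from_arrow__, which turns
# the Arrow data into a pandas ExtensionArray of that dtype.
ext_array = dtype.__from_arrow__(chunked)
ext_array.dtype    # Int64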

@jorisvandenbossche (Member Author)

OK, I changed types_mapping to types_mapper, which is now a function, and expanded the tests a bit. I think this should be good now.

@wesm (Member) left a review:

+1
