
ARROW-7569: [Python] Add API to map Arrow types to pandas ExtensionDtypes in to_pandas conversions #6189

Conversation

@jorisvandenbossche (Member) commented Jan 14, 2020

See https://issues.apache.org/jira/browse/ARROW-7569 and https://issues.apache.org/jira/browse/ARROW-2428 for context. #5512 only covered the first two cases described in ARROW-2428; this PR also tries to cover the third case.

This PR adds a types_mapping keyword to Table.to_pandas to specify which pandas ExtensionDtypes to use for built-in Arrow types in the conversion.
One specific use case for this is converting Arrow integer types to pandas' nullable integer dtype instead of a numpy integer dtype (or to one of the other nullable extension dtypes in pandas). For example:

table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})

will avoid first converting the int columns to a numpy dtype (possibly float), by constructing the pandas nullable dtype directly.

I still need to add more tests, and one important concern is that using a pyarrow type instance as the dict key might not easily work for parametrized types (e.g. timestamp with resolution / timezone).
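
To make the intended behaviour concrete, here is a minimal sketch of the use case (written against the keyword as it was eventually merged, types_mapper, which takes a function; see the discussion below):

import pandas as pd
import pyarrow as pa

# An int64 column containing nulls.
table = pa.table({"a": pa.array([1, 2, None], type=pa.int64())})

# Default conversion: the nulls force the column to a float64 numpy dtype.
table.to_pandas().dtypes    # a    float64

# With a mapping to pandas' nullable integer dtype, the column stays integer.
df = table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
df.dtypes                   # a    Int64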

@jorisvandenbossche (Member Author)

So I think the main problem with my approach here is that I use pyarrow type instances in the mapping to compare against.
That would be quite cumbersome for parametrized types, e.g. timestamp (although the possible options there are limited) or dictionary types, since a dictionary type with different index/value types will compare unequal.

Are there alternatives to the type instance? I think a "name" would be most ergonomic, but pyarrow types don't have a general name (only a str representation, which includes the parametrization). Another option would be the type.id, but that seems less user-friendly (and it would also be identical for all extension types).
Or a fully different API altogether?

cc @wesm

@wesm (Member) commented Jan 22, 2020

Another option is to have the mapper be a function.

The case you have here is that the function is always types_mapping.__getitem__. Allowing any function f(pyarrow.DataType) -> ExtensionDtype would require a bit more mental gymnastics, but it would be more flexible. Thoughts?
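
For illustration, such a callable could look like the sketch below (made-up logic; it assumes a pandas version where Int64Dtype and StringDtype implement __from_arrow__). Using type predicates instead of exact type instances sidesteps the parametrization problem:

import pandas as pd
import pyarrow as pa

def my_types_mapper(arrow_type):
    # Predicates match whole families of types, so parametrization
    # (bit width, timestamp unit/timezone, ...) is not an issue.
    if pa.types.is_signed_integer(arrow_type):
        return pd.Int64Dtype()
    if pa.types.is_string(arrow_type):
        return pd.StringDtype()
    # None means: no extension dtype, fall back to the default conversion.
    return None

# usage (with the keyword name as eventually merged):
# table.to_pandas(types_mapper=my_types_mapper)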

@jorisvandenbossche (Member Author)

Another option is to have the mapper be a function.

Ah, that sounds like a good idea! It's more flexible for the harder cases (like dictionary types with different index types, or extension types), and not much more difficult for the easy cases where you already have a dict with the mapping.
I will maybe go with types_mapping.get instead of types_mapping.__getitem__ as the way the function should behave for this case (i.e. a function that returns None, rather than raising, when no pandas ExtensionDtype should be used).
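
Concretely, a small sketch of the difference (reusing the table from the example in the description):

mapping = {pa.int64(): pd.Int64Dtype()}

# mapping.__getitem__ would raise KeyError for any type not in the dict;
# mapping.get returns None instead, so unmapped columns simply fall back
# to the default numpy-based conversion.
table.to_pandas(types_mapper=mapping.get)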

if not hasattr(pandas_dtype, "__from_arrow__"):
    raise ValueError("this column does not support to be "
                     "converted to extension dtype")
ext_columns[name] = pandas_dtype
@jorisvandenbossche (Member Author) commented on this diff on Jan 23, 2020
I removed this "else" branch for when extension_columns was specified, as it is no longer used (I added it initially to be able to specify which columns to convert, for testing purposes, before inference from the metadata was implemented).

(This makes the diff a bit harder to read, but in the whole _get_extension_dtypes function basically only the if types_mapper: block is added; the rest is only dedented.)
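
For context on that hasattr check (an illustrative sketch, not part of this diff): __from_arrow__ is the hook pandas extension dtypes provide to construct an ExtensionArray from a pyarrow Array or ChunkedArray, and it is what the conversion calls for the mapped columns:

import pandas as pd
import pyarrow as pa

chunked = pa.chunked_array([[1, 2, None]], type=pa.int64())
dtype = pd.Int64Dtype()

# Dtypes that support the conversion expose __from_arrow__, which turns
# the Arrow data into a pandas ExtensionArray of that dtype.
ext_array = dtype.__from_arrow__(chunked)
ext_array.dtype    # Int64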

@jorisvandenbossche (Member Author)

OK, I changed types_mapping to types_mapper, which is now a function, and expanded the tests a bit. I think this should be good now.

@wesm (Member) left a review:

+1
