ARROW-7569: [Python] Add API to map Arrow types to pandas ExtensionDtypes in to_pandas conversions #6189
Conversation
So I think a main problem with my approach here is that I use pyarrow type instances as the keys in the mapping to compare with. Are there alternatives to the type instance? I think a "name" would be most ergonomic, but pyarrow types don't have a general name (only a str representation, and that includes the parametrization). Another option would be the

cc @wesm
Another option is to have the mapper be a function. So the case you have here is that the function is always
Ah, that sounds like a good idea! It's more flexible for the harder cases (like dictionary types with different index types, or extension types), and not much more difficult for the easy cases if you have a dict with the mapping.
if not hasattr(pandas_dtype, "__from_arrow__"):
    raise ValueError("this column does not support to be "
                     "converted to extension dtype")
ext_columns[name] = pandas_dtype
I removed this "else" branch for when extension_columns was specified, as it is no longer used (I added it initially to be able to specify which columns to convert, to test this when inference from the metadata was not yet implemented).

(This makes the diff a bit harder to read, but in this whole _get_extension_dtypes function basically only the if types_mapper: block is added; the rest is only dedented.)
OK, I changed
+1
See https://issues.apache.org/jira/browse/ARROW-7569 and https://issues.apache.org/jira/browse/ARROW-2428 for context. #5512 only covered the first 2 cases described in ARROW-2428, this also tries to cover the third case.
This PR adds a types_mapper argument to Table.to_pandas to specify pandas ExtensionDtypes to use for built-in arrow types in the conversion. One specific example use case for this ability is to convert arrow integer types to pandas' nullable integer dtype instead of to a numpy integer dtype (or to one of the other custom nullable dtypes in pandas). This avoids first converting the int columns to a numpy dtype (possibly float, in the presence of nulls) by directly constructing the pandas nullable dtype.

Need to add more tests, and one important concern is that using a pyarrow type instance as the dict key might not easily work for parametrized types (e.g. timestamp with resolution / timezone).