ARROW-2428: [Python] Support pandas ExtensionArray in Table.to_pandas conversion #5512
Conversation
The implementation is somewhat gnarly but I don't see an easy way to achieve the desired outcome otherwise.
python/pyarrow/pandas_compat.py (Outdated)

```python
name = columns[placement[0]]
pandas_dtype = extension_columns[name]
if not hasattr(pandas_dtype, '__from_arrow__'):
    raise ValueError("This column does not support")
```
Incomplete?
python/pyarrow/tests/test_pandas.py (Outdated)

```python
try:
    # patch pandas Int64Dtype to have the protocol method
    pd.Int64Dtype.__from_arrow__ = _Int64Dtype__from_arrow__
```
Aside: is there a monkey-patch context manager somewhere?
pytest has a `monkeypatch` fixture that we can use. This would look like

```python
def test_convert_to_extension_array(monkeypatch):
    ...
    monkeypatch.setattr(pd.Int64Dtype, "__from_arrow__",
                        _Int64Dtype__from_arrow__, raising=False)
    ...
```

without the need for the try/except to ensure clean-up.
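For the aside above: outside of pytest, a small monkey-patch context manager is easy to hand-roll with `contextlib`. This is a generic sketch, not code from the PR; `FakeDtype` is a hypothetical stand-in for `pd.Int64Dtype` so the example has no pandas dependency:

```python
from contextlib import contextmanager

_MISSING = object()  # sentinel: distinguishes "attribute absent" from "attribute is None"

@contextmanager
def patched(obj, name, value):
    """Temporarily set obj.name, restoring (or deleting) it on exit."""
    original = getattr(obj, name, _MISSING)
    setattr(obj, name, value)
    try:
        yield
    finally:
        if original is _MISSING:
            delattr(obj, name)  # attribute did not exist before: remove it again
        else:
            setattr(obj, name, original)

class FakeDtype:
    """Hypothetical stand-in for pd.Int64Dtype, just for illustration."""

with patched(FakeDtype, "__from_arrow__", lambda arr: list(arr)):
    assert hasattr(FakeDtype, "__from_arrow__")
# on exit the attribute is gone again, even if the body raised
assert not hasattr(FakeDtype, "__from_arrow__")
```

pytest's `monkeypatch` fixture does essentially this bookkeeping (plus `raising=False` to allow setting attributes that do not exist yet), with the restore tied to test teardown.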
I also don't know whether exposing the …
Yes, I agree, I am also not yet fully clear about that part of the API. For now, it was mainly there to be able to test this a bit more. Currently, I provide an "automatic detection" based on the stored pandas metadata (which enables automatic roundtrip). But I think we need other mechanisms as well:
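The "automatic detection" idea can be sketched in plain Python. The metadata layout below is a simplified, hypothetical mock of the per-column entries pandas stores in an Arrow table's schema metadata (field names like `pandas_type`/`numpy_type` follow the pandas metadata convention, but this is not the PR's actual detection code):

```python
# Simplified mock of the pandas metadata stored on an Arrow table.
# For pandas' nullable integer dtype, the stored dtype string is "Int64"
# (capitalized), while a plain numpy column stores "int64".
pandas_metadata = {
    "columns": [
        {"name": "a", "pandas_type": "int64", "numpy_type": "Int64"},
        {"name": "b", "pandas_type": "float64", "numpy_type": "float64"},
    ]
}

def detect_extension_columns(metadata):
    """Return {column name: stored dtype string} for columns whose stored
    pandas dtype is not a plain numpy dtype string (sketch heuristic)."""
    plain_numpy = {"int64", "float64", "bool", "object"}  # illustrative subset
    return {
        col["name"]: col["numpy_type"]
        for col in metadata["columns"]
        if col["numpy_type"] not in plain_numpy
    }

print(detect_extension_columns(pandas_metadata))  # column "a" is flagged
```

With such a mapping in hand, the conversion code can then look up the corresponding pandas dtype and check it for the `__from_arrow__` protocol, as in the `pandas_compat.py` snippet above.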
Seems like a little more work is needed here, at least some docstrings. Let me know when you want me to take another look.
I think #5162 could already be merged (based on your comment there: #5162 (comment))? Then I will update this PR after that.
OK, this should be ready for review now. I now added support for 2 of the cases of https://issues.apache.org/jira/browse/ARROW-2428:
In both cases, the actual conversion of the pyarrow array to the pandas ExtensionArray is dispatched to the pandas `__from_arrow__` protocol method.
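That dispatch mechanism can be illustrated in isolation. `DummyDtype` and `convert_column` below are hypothetical stand-ins (the real protocol receives a `pyarrow.ChunkedArray` and returns a pandas `ExtensionArray`); the shape of the check mirrors the `pandas_compat.py` snippet earlier in the thread:

```python
class DummyDtype:
    """Hypothetical stand-in for a pandas ExtensionDtype implementing the protocol."""
    def __from_arrow__(self, arrow_array):
        # A real implementation would build a pandas ExtensionArray from the
        # pyarrow (chunked) array; here we just materialize a list.
        return list(arrow_array)

def convert_column(arrow_array, pandas_dtype):
    """Dispatch conversion to the dtype's __from_arrow__, if it has one."""
    if not hasattr(pandas_dtype, "__from_arrow__"):
        raise ValueError(
            "This column does not support conversion to an extension array: "
            "the dtype defines no __from_arrow__ method"
        )
    return pandas_dtype.__from_arrow__(arrow_array)

result = convert_column([1, 2, None], DummyDtype())
print(result)
```

Putting the conversion on the dtype keeps pyarrow agnostic of individual extension array implementations: any third-party dtype that defines `__from_arrow__` participates without pyarrow knowing about it.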
I did some performance profiling to see the impact of this change, as checking dtypes can be quite expensive, and given that for dataframes with basic numpy types this will be only overhead without any benefit. For a relatively normal sized dataframe (100_000 rows x 100 columns) of only floats, this gives a small overhead (from 26ms to 29ms for …).

Now, the most expensive part (~90% of the overhead) of the code I added is the …

Another (additional) option could also be to add a keyword to …
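The kind of per-column overhead being weighed here can be timed in isolation with the standard library. This is a generic sketch, not the profiling actually done in the PR; `PlainDtype`/`ExtDtype` are hypothetical stand-ins for numpy and extension dtypes:

```python
import timeit

class PlainDtype:
    """Stand-in for a plain numpy dtype: no __from_arrow__."""

class ExtDtype:
    """Stand-in for an extension dtype implementing the protocol."""
    def __from_arrow__(self, arr):
        return arr

# 100 columns of plain dtypes, matching the all-floats benchmark shape above
cols = [PlainDtype()] * 100

def check_all(columns):
    # the per-column dtype check whose cost is under discussion
    return [hasattr(c, "__from_arrow__") for c in columns]

elapsed = timeit.timeit(lambda: check_all(cols), number=1000)
print(f"1000 runs of 100 hasattr checks: {elapsed:.4f}s")
```

For all-numpy frames every check returns `False`, so the entire cost is overhead; that is the trade-off motivating an opt-out.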
+1. This looks OK here, thanks @jorisvandenbossche!
…ypes in to_pandas conversions

See https://issues.apache.org/jira/browse/ARROW-7569 and https://issues.apache.org/jira/browse/ARROW-2428 for context. #5512 only covered the first 2 cases described in ARROW-2428; this also tries to cover the third case.

This PR adds a `types_mapping` to `Table.to_pandas` to specify pandas ExtensionDtypes for built-in arrow types to use in the conversion. One specific example use case for this ability is to convert arrow integer types to pandas' nullable integer dtype instead of to numpy integer dtype (or to one of the other custom nullable dtypes in pandas). For example:

```
table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})
```

will avoid converting the int columns first to a numpy dtype (possibly float) by directly constructing the pandas nullable dtype. More tests still need to be added, and one important concern is that using a pyarrow type instance as the dict key might not easily work for parametrized types (e.g. timestamp with resolution / timezone).

Closes #6189 from jorisvandenbossche/ARROW-7569-to-pandas-types-mapping and squashes the following commits:

cb82f5c <Joris Van den Bossche> expand tests
1d9c37c <Joris Van den Bossche> simplify (remove unused extension_columns arg)
b61b1f5 <Joris Van den Bossche> dict -> function
f3464b1 <Joris Van den Bossche> ARROW-7569: Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
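The parametrized-type concern (and the "dict -> function" commit above) comes down to letting the mapping argument be a callable instead of a dict: a function can match, say, any timestamp type regardless of timezone, where a dict keyed on type instances cannot. A sketch of accepting both forms, with `ArrowTypeStub` as a hypothetical stand-in for a pyarrow `DataType`:

```python
class ArrowTypeStub:
    """Hypothetical stand-in for a pyarrow DataType (hashable, equality by name)."""
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return hash(self.name)
    def __eq__(self, other):
        return isinstance(other, ArrowTypeStub) and self.name == other.name

def resolve_dtype(arrow_type, types_mapper):
    """Look up the target pandas dtype for an arrow type.

    types_mapper may be a dict {arrow type: dtype} or a callable returning a
    dtype (or None to fall back to the default conversion)."""
    if callable(types_mapper):
        return types_mapper(arrow_type)
    return types_mapper.get(arrow_type)

int64 = ArrowTypeStub("int64")

# dict form: exact-match lookup on the type instance
print(resolve_dtype(int64, {int64: "Int64Dtype"}))

# function form: arbitrary matching logic, e.g. by type family
mapper = lambda t: "Int64Dtype" if t.name == "int64" else None
print(resolve_dtype(int64, mapper))
```

Returning `None` from the callable for unmatched types gives the "no opinion, use the default conversion" escape hatch a dict lookup provides via a missing key.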
Prototype for https://issues.apache.org/jira/browse/ARROW-2428
What does this PR do?

- It adds conversion to pandas extension dtypes in `Table.to_pandas` (dispatching to a `__from_arrow__` method defined on the pandas extension dtype)
- It adds an `extension_column` keyword to specify which columns should be converted to an extension dtype

This only covers use case 1 discussed in the issue: automatic roundtrip for pandas DataFrames that have extension dtypes. So it e.g. does not yet provide a way to do this if the arrow Table has no pandas metadata (did not originate from a pandas DataFrame).