
ARROW-2428: [Python] Support pandas ExtensionArray in Table.to_pandas conversion #5512

Conversation

@jorisvandenbossche (Member) commented Sep 26, 2019

Prototype for https://issues.apache.org/jira/browse/ARROW-2428

What does this PR do?

  • Based on the `pandas_metadata` (stored when creating a Table from a pandas DataFrame), we infer which columns originally had a pandas extension dtype, and we support a custom conversion for those (based on a `__from_arrow__` method defined on the pandas extension dtype).
  • The user can also specify explicitly, with the `extension_columns` keyword, which columns should be converted to an extension dtype.

This only covers use case 1 discussed in the issue: the automatic roundtrip for pandas DataFrames that have extension dtypes.
So, for example, it does not yet provide a way to do this if the Arrow Table has no pandas metadata (i.e. it did not originate from a pandas DataFrame).
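To illustrate the covered case, a minimal roundtrip sketch (assuming a pandas version whose nullable `Int64Dtype` implements the `__from_arrow__` protocol):

```python
import pandas as pd
import pyarrow as pa

# DataFrame with a pandas extension dtype (nullable integers)
df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})

table = pa.Table.from_pandas(df)  # pandas metadata records the original dtype
result = table.to_pandas()        # dispatches to Int64Dtype.__from_arrow__
assert result["a"].dtype == pd.Int64Dtype()
```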

@jorisvandenbossche force-pushed the ARROW-2428-arrow-pandas-conversion branch 2 times, most recently from 3712104 to e499b44 on September 26, 2019 13:56
@wesm (Member) left a comment:

The implementation is somewhat gnarly but I don't see an easy way to achieve the desired outcome otherwise.

```python
name = columns[placement[0]]
pandas_dtype = extension_columns[name]
if not hasattr(pandas_dtype, '__from_arrow__'):
    raise ValueError("This column does not support")
```
@wesm (Member):

Incomplete?


```python
try:
    # patch pandas Int64Dtype to have the protocol method
    pd.Int64Dtype.__from_arrow__ = _Int64Dtype__from_arrow__
```
@wesm (Member):

Aside: is there a monkey-patch context manager somewhere?

@jorisvandenbossche (Member, Author):

pytest has a `monkeypatch` fixture that we can use. This would look like:

```python
def test_convert_to_extension_array(monkeypatch):
    ...
    monkeypatch.setattr(pd.Int64Dtype, "__from_arrow__",
                        _Int64Dtype__from_arrow__, raising=False)
    ...
```

without the need for the try/except to ensure clean-up: `monkeypatch` automatically undoes the attribute change at test teardown, and `raising=False` allows setting an attribute that does not exist yet.

@wesm (Member) commented Oct 3, 2019

I also don't know whether exposing the `extension_columns` parameter publicly makes sense.

@jorisvandenbossche (Member, Author) commented:

> I also don't know whether exposing the `extension_columns` parameter publicly makes sense.

Yes, I agree; I am also not yet fully settled on that part of the API. For now, it was mainly there to make this easier to test.

Currently, I provide an "automatic detection" based on the stored `pandas_metadata` (which enables the automatic roundtrip).

But I think we need other mechanisms as well:

  • Arrow extension types need to be able to specify a custom pandas dtype to convert to (one option is that they point to a pandas ExtensionDtype object, and we then reuse the same `ExtensionDtype.__from_arrow__` logic).
  • It would be nice to have some way to override all of this logic. For example, consider the use case where you have a Parquet file (or Arrow memory in general) that does not originate from pandas (and thus has no pandas metadata), and you want to ensure that all integer columns are converted to the pandas nullable integer type instead of the default numpy int64 dtype. One option could be to specify, somehow and somewhere, a type map like `{pa.int64(): pd.IntegerDtype}` (but how would that work for parametrized types?).

@wesm (Member) commented Oct 8, 2019

Seems like a little more work is needed here, at least some docstrings. Let me know when you want me to take another look.

@jorisvandenbossche (Member, Author) commented:
I think #5162 could already be merged (based on your comment there #5162 (comment))? Then I will update this PR after that.

@jorisvandenbossche marked this pull request as ready for review on October 30, 2019 13:07
@jorisvandenbossche (Member, Author) commented:

OK, this should be ready for review now.

I have now added support for two of the cases in https://issues.apache.org/jira/browse/ARROW-2428:

  • If there is `pandas_metadata`, we check which columns originated from pandas extension dtypes.
  • If you defined a `pyarrow.ExtensionType`, we check whether its `ExtensionType.to_pandas_dtype` method returns a pandas extension dtype.

In both cases, the actual conversion of the pyarrow array to the pandas ExtensionArray is dispatched to `ExtensionDtype.__from_arrow__`, a method on the pandas extension dtype that knows how to convert an Arrow array into an extension array of its own dtype.
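In pseudocode, this dispatch amounts to something like the following (a simplified sketch; `convert_column` is a hypothetical name, not the actual helper in the patch):

```python
import pyarrow as pa

def convert_column(arr: pa.ChunkedArray, pandas_dtype=None):
    # If a pandas extension dtype was detected (via the pandas metadata or via
    # ExtensionType.to_pandas_dtype), let the dtype do the conversion itself.
    if pandas_dtype is not None and hasattr(pandas_dtype, "__from_arrow__"):
        return pandas_dtype.__from_arrow__(arr)
    # Otherwise fall back to the default numpy-based conversion.
    return arr.to_pandas()
```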

@jorisvandenbossche (Member, Author) commented:

I did some performance profiling to see the impact of this change, as checking dtypes can be quite expensive, and for dataframes with only basic numpy types this check is pure overhead without any benefit.

For a relatively normal sized dataframe (100_000 rows x 100 columns) of only floats, this gives a small overhead (from 26ms to 29ms for Table.to_pandas()).
But since this check scales linearly with the number of columns, for a "wide" dataframe (100 rows x 1000 columns) there is a big slowdown: 10ms to 22ms.
See https://gist.github.com/jorisvandenbossche/d36bceb82fd2dda38ee419ba51dff5ed

Now, the most expensive part (~90% of the overhead) of the code I added is the `pandas_dtype(..)` function (a pandas function that converts a string to either a registered ExtensionDtype or a numpy dtype). Since we only need it in case it would return an ExtensionDtype, we can avoid calling it when we know in advance that we have the string representation of a numpy dtype. And since numpy supports only a limited set of dtypes, we can hardcode those "known" strings to avoid calling `pandas_dtype`. For the simple floats dataframe above, this would remove most of the overhead.
→ Since it is an easy way to avoid overhead, I will implement this.
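A minimal sketch of that fast path (identifiers hypothetical, not the actual names in the patch):

```python
import numpy as np
from pandas.api.types import pandas_dtype

# Strings known to denote plain numpy dtypes; for these we can skip the
# expensive pandas_dtype() call entirely.
_KNOWN_NUMPY_DTYPE_STRINGS = {
    "bool", "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64", "object",
}

def maybe_extension_dtype(dtype_string):
    """Return the pandas ExtensionDtype for dtype_string, or None."""
    if dtype_string in _KNOWN_NUMPY_DTYPE_STRINGS:
        return None  # plain numpy dtype, no extension conversion needed
    dtype = pandas_dtype(dtype_string)
    return None if isinstance(dtype, np.dtype) else dtype
```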

Another (additional) option could be to add a keyword to `to_pandas` to disable the extension type checks, so a user who wants the best performance and knows the data only contains basic types can skip them.

@wesm (Member) left a comment:

+1. This looks OK here, thanks @jorisvandenbossche!

@wesm closed this in 7f4165c on Nov 5, 2019
@jorisvandenbossche deleted the ARROW-2428-arrow-pandas-conversion branch on November 5, 2019 13:08
nealrichardson pushed a commit that referenced this pull request on Jan 23, 2020:

ARROW-7569: Add API to map Arrow types to pandas ExtensionDtypes in to_pandas conversions

See https://issues.apache.org/jira/browse/ARROW-7569 and https://issues.apache.org/jira/browse/ARROW-2428 for context. #5512 only covered the first two cases described in ARROW-2428; this also tries to cover the third case.

This PR adds a `types_mapping` keyword to `Table.to_pandas` to specify pandas ExtensionDtypes to use for built-in Arrow types in the conversion.
One specific use case for this ability is converting Arrow integer types to pandas' nullable integer dtype instead of a numpy integer dtype (and similarly for the other nullable dtypes in pandas). For example:

```python
table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})
```

This avoids first converting the int columns to a numpy dtype (possibly float, when nulls are present) and instead directly constructs the pandas nullable dtype.

More tests still need to be added, and one important concern is that using a pyarrow type instance as the dict key might not easily work for parametrized types (e.g. timestamp with a resolution / timezone).
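The squashed commits below indicate the dict was later replaced by a function (`b61b1f5 dict -> function`), which sidesteps that concern: a callable can inspect a type instead of matching it by hash. A hedged sketch of the function-based variant (keyword name `types_mapper` assumed; it differs from the `types_mapping` shown above):

```python
import pandas as pd
import pyarrow as pa

def types_mapper(arrow_type):
    # A function can match whole families of (parametrized) types.
    if pa.types.is_integer(arrow_type):
        return pd.Int64Dtype()
    if pa.types.is_timestamp(arrow_type) and arrow_type.tz is not None:
        return pd.DatetimeTZDtype(unit="ns", tz=arrow_type.tz)
    return None  # None falls back to the default conversion

df = table.to_pandas(types_mapper=types_mapper)
```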

Closes #6189 from jorisvandenbossche/ARROW-7569-to-pandas-types-mapping and squashes the following commits:

cb82f5c <Joris Van den Bossche> expand tests
1d9c37c <Joris Van den Bossche> simplify (remove unused extension_columns arg)
b61b1f5 <Joris Van den Bossche> dict -> function
f3464b1 <Joris Van den Bossche> ARROW-7569:  Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>