
ARROW-3829: [Python] add __arrow_array__ protocol to support third-party array classes in conversion to Arrow #5106

Conversation

@jorisvandenbossche (Member) commented Aug 16, 2019

https://issues.apache.org/jira/browse/ARROW-3829 and https://issues.apache.org/jira/browse/ARROW-5271. This also serves as an illustration for the mailing list discussion (will post there in a bit).

result = pa.array(df['a'].values, type=pa.float64())
assert result.equals(expected2)

del pd.arrays.IntegerArray.__arrow_array__
Maybe this could go in a try-finally to avoid a failure here possibly contaminating other tests.
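The try/finally suggestion can be sketched in plain Python (illustrative only; `FakeIntegerArray` and `test_arrow_array_protocol` are made-up names standing in for `pd.arrays.IntegerArray` and the actual test):

```python
# Sketch of the try/finally suggestion: make sure the monkeypatched
# __arrow_array__ is removed even when an assertion fails, so a failure
# here cannot contaminate other tests. FakeIntegerArray is a stand-in
# for pd.arrays.IntegerArray.
class FakeIntegerArray:
    pass

def test_arrow_array_protocol():
    FakeIntegerArray.__arrow_array__ = lambda self, type=None: "converted"
    try:
        assert FakeIntegerArray().__arrow_array__() == "converted"
    finally:
        # Runs even if the assertion above fails.
        del FakeIntegerArray.__arrow_array__

test_arrow_array_protocol()
assert not hasattr(FakeIntegerArray, "__arrow_array__")
```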

@@ -161,7 +161,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None,
    else:
        c_from_pandas = from_pandas

    if _is_array_like(obj):
        if hasattr(obj, '__arrow_array__'):
            return obj.__arrow_array__(type=type)
Would it make sense to check for a return value of NotImplemented and fall back to the default path?

Member:
I don't think so. Why would you define __arrow_array__ just to return NotImplemented?

Member Author:
Yes, I think if you don't want to implement this, you should simply not define __arrow_array__.

But we should probably add a check to ensure that what is returned from __arrow_array__ is actually a pyarrow Array (that is something that numpy also does).

@pitrou (Member) left a comment:

Question: if the Pandas series or dataframe column is backed by an Arrow array, does this allow pa.array() to extract the array without copying?

@@ -161,7 +161,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None,
    else:
        c_from_pandas = from_pandas

    if _is_array_like(obj):
        if hasattr(obj, '__arrow_array__'):
            return obj.__arrow_array__(type=type)
Member:

Should we raise if non-default size and mask arguments are passed here?

Member Author:

Yes, that sounds like a good idea.
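A plain-Python sketch of that guard (a simplified stand-in for pa.array's dispatch, not the actual implementation; `Wrapped` is an illustrative class):

```python
def array(obj, type=None, mask=None, size=None):
    # Simplified stand-in for pa.array: when an object implements
    # __arrow_array__, the mask and size arguments cannot be honored,
    # so raise instead of silently ignoring them.
    if hasattr(obj, '__arrow_array__'):
        if mask is not None or size is not None:
            raise ValueError(
                "Cannot specify a mask or a size when passing an object "
                "that is converted with the __arrow_array__ protocol.")
        return obj.__arrow_array__(type=type)
    raise TypeError("fallback conversion elided in this sketch")

class Wrapped:
    def __arrow_array__(self, type=None):
        return [1, 2, 3]  # stand-in for a pyarrow.Array

assert array(Wrapped()) == [1, 2, 3]
```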

@@ -178,7 +180,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None,
        mask = values.mask
        values = values.data

    if pandas_api.is_categorical(values):
        if hasattr(values, '__arrow_array__'):
            return values.__arrow_array__(type=type)
Member:

Same here: raise if size or mask is given?

python/pyarrow/tests/test_pandas.py (resolved conversation)
@jorisvandenbossche (Member Author):

> Question: if the Pandas series or dataframe column is backed by an Arrow array, does this allow pa.array() to extract the array without copying?

Yes, that would be the idea (also pa.array(numpy_array) does not create a copy of the data if possible). In that case the __arrow_array__ should return that backing Arrow array (which is what fletcher could do).

@jorisvandenbossche (Member Author):

Updated the PR and added some docs for it.

.. _extending:

Extending pyarrow
=================
Member Author:

The idea is that this file can also contain the documentation about creating your own ExtensionType (documentation which is missing at the moment), so I think extending.rst would be a good name.
However, we already have extending.rst, which is about using the pyarrow C++ / Cython APIs. Does anybody have an idea for another name for this file? ("extending2" is not meant to be kept :))

Maybe extending_types.rst? Or user_defined_types.rst, since ultimately __arrow_array__, ExtensionType, and (eventually) custom Arrow-to-Pandas conversions are all about user-defined types.

Member Author:

Thanks, renamed to extending_types.rst (the title can still include "user defined types" once we add documentation about that)

Member:

The original document could also be renamed extending_cpp.rst or something.

@xhochy (Member) left a comment:

Thanks @jorisvandenbossche for taking this up. LGTM from my side.

@pitrou (Member) left a comment:

LGTM, just one small issue.

python/pyarrow/tests/test_pandas.py (resolved conversation)
@rok (Member) commented Aug 21, 2019

Looks good @jorisvandenbossche! LGTM.

I'm looking to add support for pd.SparseArray, so setting __arrow_array__ to something like:

def __arrow_array__(self, type=None):
    return pa.SparseTensorCOO.from_numpy(
        data=self.sp_values,
        coords=self.sp_index.indices.reshape(1, -1),
        shape=self.shape)

However, the returned object would then not be a pa.Array. Do you think it would make sense for the __arrow_array__ interface to support cases where the returned type is not a pa.Array?

@jorisvandenbossche (Member Author):

@rok at the moment, the protocol is specifically meant to convert to a pyarrow.Array (it is also only used internally in the pa.array(..) code).

So I would personally keep it at that, for now. The question is also: what would be the exact purpose of extending it? (But I am not very familiar with the tensor part of the library / the message spec.)

@rok (Member) commented Aug 21, 2019

@jorisvandenbossche agreed with keeping this as is.
My motivation is enabling conversion of sparse columns (pd.SparseArray) contained in an arbitrary pd.DataFrame to pyarrow type (probably pa.SparseTensorCOO).
I'm not sure what the best place to do this would be.

@jorisvandenbossche (Member Author):

> My motivation is enabling conversion of sparse columns (pd.SparseArray) contained in an arbitrary pd.DataFrame to a pyarrow type (probably pa.SparseTensorCOO).

The main problem is that sparse data currently does not fit in the Arrow tabular format, so in general DataFrames with such sparse columns can't be converted to Arrow tables. As you mention, the target could be pa.SparseTensorCOO, but currently the main target of this protocol is the record batch message format, not the (sparse) tensor message format.

For the Tensors, I don't think there is currently a "catch all" constructor like you have for pa.array? That might be a start if you want to enable conversion of general objects to (sparse) tensors. But I would propose to open a new issue for that if you want to continue the discussion.

@rok (Member) commented Aug 22, 2019

Thanks @jorisvandenbossche, I'll look into that direction and here's an issue I opened for it.

            "converted with the __arrow_array__ protocol.")
    res = obj.__arrow_array__(type=type)
    if not isinstance(res, Array):
        raise ValueError("The object's __arrow_array__ method does not "
Member:

TypeError, no?

Member Author:

NumPy raises a ValueError in this case, but yes, a TypeError seems more appropriate.
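A plain-Python sketch of the agreed-upon check (`Array`, `Good`, `Bad`, and `convert_via_protocol` are illustrative stand-ins; the real check lives in pa.array's dispatch code):

```python
class Array:
    """Stand-in for pyarrow.Array in this sketch."""

def convert_via_protocol(obj, type=None):
    # Validate that __arrow_array__ actually returned an Array, raising
    # TypeError (rather than ValueError, as NumPy does) if it did not.
    res = obj.__arrow_array__(type=type)
    if not isinstance(res, Array):
        raise TypeError("The object's __arrow_array__ method does not "
                        "return a pyarrow Array.")
    return res

class Good:
    def __arrow_array__(self, type=None):
        return Array()

class Bad:
    def __arrow_array__(self, type=None):
        return [1, 2, 3]

assert isinstance(convert_via_protocol(Good()), Array)
```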

@codecov-io
Codecov Report

Merging #5106 into master will decrease coverage by 22.35%.
The diff coverage is 85.71%.


@@             Coverage Diff             @@
##           master    #5106       +/-   ##
===========================================
- Coverage   87.62%   65.26%   -22.36%     
===========================================
  Files        1014      497      -517     
  Lines      145908    67514    -78394     
  Branches     1437        0     -1437     
===========================================
- Hits       127857    44066    -83791     
- Misses      17689    23448     +5759     
+ Partials      362        0      -362
Impacted Files Coverage Δ
python/pyarrow/tests/test_pandas.py 95.26% <100%> (+0.08%) ⬆️
python/pyarrow/pandas-shim.pxi 64.35% <70%> (-0.91%) ⬇️
python/pyarrow/tests/test_array.py 95.65% <71.42%> (-0.35%) ⬇️
python/pyarrow/array.pxi 80.12% <92.3%> (+0.43%) ⬆️
cpp/src/arrow/util/memory.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/date_utils.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/memory.cc 0% <0%> (-100%) ⬇️
cpp/src/gandiva/decimal_type_util.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/filesystem/util_internal.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/logical_type.h 0% <0%> (-100%) ⬇️
... and 796 more

Last update a40d6b6...bab01f1.

Labels: none yet
Projects: none yet
5 participants