[Python] Add Python protocol for the Arrow C (Data/Stream) Interface #35531
It doesn't need to. You can have …

I'm not sure that's useful. @lidavidm Thoughts?

Also, this proposal doesn't dwell on the consumer side. Would there be higher-level APIs to construct …?
Yes, indeed, I didn't touch on that aspect yet. I think that could certainly be useful, but I thought to start with the producer side of things. And some consumers might already have an entry point that could be reused for this (for example, duckdb already implicitly reads from any object that is a pandas DataFrame, pyarrow Table, RecordBatch, Dataset/Scanner, RecordBatchReader, polars DataFrame, ..., and they could just extend this to any object implementing this protocol).
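A generic consumer entry point could then be duck-typed instead of hardcoding pyarrow types. A minimal sketch, assuming the `__arrow_c_stream__` dunder from this proposal and pyarrow's `RecordBatchReader.from_stream` helper for the import step:

```python
import pyarrow as pa

def read_arrow_stream(obj) -> pa.RecordBatchReader:
    # Duck-typed consumption: accept anything exposing the stream protocol,
    # whether it is a pyarrow Table, a polars DataFrame, a duckdb result, ...
    if hasattr(obj, "__arrow_c_stream__"):
        return pa.RecordBatchReader.from_stream(obj)
    raise TypeError(f"expected an object implementing __arrow_c_stream__, "
                    f"got {type(obj).__name__}")
```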
Yes, I don't think we have to recommend a consumer API. But we'll have to choose one for ourselves ;-)

I'm also not sure it's useful, but it seems we could define …
Just a note: I don't know if it was mentioned in the discussion, but I think it's fairly important that the PyCapsule have a finalizer that calls the struct's release callback if the capsule was never consumed.
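As a rough ctypes illustration of what that finalizer has to check (the struct layout is copied from the C Data Interface definition; a NULL `release` marks an already-consumed struct, which must not be released again):

```python
import ctypes

class ArrowArray(ctypes.Structure):
    """Mirror of the ArrowArray struct from the Arrow C Data Interface."""

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))),
    ("private_data", ctypes.c_void_p),
]

def finalize_arrow_array(ptr):
    # What the capsule destructor must do: call release() exactly once,
    # and only if a consumer has not already taken ownership (consumers
    # signal that by setting release to NULL).
    if ptr and ptr.contents.release:
        ptr.contents.release(ptr)
```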
Most ways that I know about to create an ArrowArray (…). Did you envision that …?

Indeed. And for pyarrow, it could also be something like …
To clarify this part a bit, let's assume we are talking about the ArrowArray version to keep it simple (not the stream). Currently, a pyarrow.Array can be exported to an ArrowArray, and a pyarrow.RecordBatch as well (but in the second case, you know you always have a struct type).
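To make that concrete with plain pyarrow (method names assume a recent pyarrow where `RecordBatch.to_struct_array` and `from_struct_array` are available):

```python
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
                        names=["x", "y"])

# A record batch is structurally the same data as a struct-typed array...
struct_arr = batch.to_struct_array()
assert pa.types.is_struct(struct_arr.type)

# ...and can be rebuilt from one, which is what exporting a RecordBatch
# through the C interface as an ArrowArray relies on.
roundtrip = pa.RecordBatch.from_struct_array(struct_arr)
assert roundtrip.equals(batch)
```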
Yes, it currently essentially returns an array, not a table. We just mostly use it for tables in practice. As a concrete example: in the arrow-rs implementation, the RecordBatch conversion to/from pyarrow actually iterates over each field, converting field by field using the C interface on each array, instead of making a single C interface call with a struct array for the full RecordBatch (https://github.com/apache/arrow-rs/blob/3adca539ad9e1b27892a5ef38ac2780aff4c0bff/arrow/src/pyarrow.rs#L167-L204).
Yes, that's certainly the idea, and discussed in #34031
That's a good question. Other protocols like … I think one aspect here is that this second set of methods assumes you "just" give access to the actual, underlying data and don't do any conversion (and so are always zero-copy), or otherwise raise an error if the type is not supported through the protocol (so there is never a question of what the "correct" type would be for the C interface).
That doesn't sound particularly useful? Especially as pandas will probably have to check for other things (such as the datatypes it supports).
I think you could do both via something like:

```python
class Integerish:
    def __arrow_c_array__(self, schema=None):
        # schema_matches_int() is a stand-in for the check that the
        # requested schema equals this object's one supported export type.
        if schema is not None and not schema_matches_int(schema):
            raise ValueError("Only default export supported")
        return make_array_capsule(self), make_int_schema_capsule()
```

Of course, if …
I like Dewey's suggestion for having an API to request casting types you don't know how to use, given there is also an API to get the zero-copy schema. This seems especially useful for Arrow "types" that are just encodings of the same logical type, such as Utf8 / LargeUtf8 (and soon StringView).
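A sketch of how a producer could honor such a request (hypothetical wrapper class; `DataType._import_from_c_capsule` is assumed as the pyarrow helper to turn a requested-schema capsule back into a type, and the dunders on pyarrow objects are the ones from the draft spec):

```python
import pyarrow as pa

class StringColumn:
    """Hypothetical producer that can re-encode between utf8 variants."""

    def __init__(self, arr: pa.Array):
        self._arr = arr

    def __arrow_c_schema__(self):
        return self._arr.type.__arrow_c_schema__()

    def __arrow_c_array__(self, requested_schema=None):
        arr = self._arr
        if requested_schema is not None:
            requested = pa.DataType._import_from_c_capsule(requested_schema)
            if requested in (pa.utf8(), pa.large_utf8()):
                # Same logical type, different offset width: a cheap cast.
                arr = arr.cast(requested)
            elif requested != arr.type:
                raise ValueError(f"unsupported requested type: {requested}")
        # Delegate to pyarrow for the actual (schema, array) capsule pair.
        return arr.__arrow_c_array__()
```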
Is this issue a duplicate of #34031?

I think so!

I suppose the other title is more specific to the PyCapsule than the interface, but there seems to be general agreement that the protocol methods should return a PyCapsule?
To move this forward, I tried to capture what's been discussed so far into a coherent draft specification: https://docs.google.com/document/d/1Xjh-N7tQTfc2xowxsEv1Pymx4grvsiSEBQizDUXAeNs/edit?usp=sharing @paleolimbot I'm particularly curious whether I understood your idea of passing in a requested schema correctly. LMK what you think. If the folks on this thread so far are pleased with this, I'm happy to make a PR and share it on the mailing list.
Thank you for writing this up! You nailed what I had in mind with respect to schema negotiation. What you have here strikes (in my opinion) the perfect balance of easy-to-use, because any complexity only comes up if the caller chooses to use the requested-schema argument. FWIW, the order I usually see is schema first, then array. I don't know if it's worth noting that you can differentiate between a "schema-like object" and an "array-like object" by checking whether it implements `__arrow_c_array__`.
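For example, a consumer could distinguish the two with plain attribute checks:

```python
def is_arrow_data(obj):
    # Arrays, batches, tables, and readers export data capsules.
    return hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")

def is_arrow_schema_only(obj):
    # Types, fields, and schemas export only a schema capsule.
    return hasattr(obj, "__arrow_c_schema__") and not is_arrow_data(obj)
```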
@wjones127 That looks basically good to me. I would make the … Relatedly, I've just opened python/cpython#109562.
Thanks a lot @wjones127 for picking this up! The proposal looks solid to me. One open item from the top post that I want to mention again: what to do with the Device support that landed in the meantime? Do we directly use those structs instead? Probably not, because they are not yet widely supported, so exclusively using them would hinder adoption of this protocol. But it would be good to at least think about it now, so we can ensure we can later expand the protocol in a compatible way if there is demand for having it handle different device types as well.
I suppose one could argue their low adoption is partly because we haven't done enough to push non-CPU data as a first-class thing in Arrow. I'm open to making the protocol implementation always return the device variants instead. It shouldn't be much harder to construct the CPU variants than using the normal C Data Interface. Although for the device interface, I'm not sure whether additional negotiation for devices is expected. For example, could a caller request the buffers be moved to CPU if they aren't already there? Or does that not make sense?
I would be ever-so-slightly in favour of not using the device version for the initial protocol, but I also agree that it's worth ensuring it is implemented in such a way that it doesn't block a future protocol that supports the device-enabled version. I don't think it would be all that difficult for producers where this matters to implement a (hypothetical, future) `__arrow_c_device_array__` as well. If a future consumer wants a device array but is given an object that only implements `__arrow_c_array__`, …
The device interface is still experimental at this point while the C Data Interface is stable. Besides, most developers do not have any experience with non-CPU devices, making the device interface more difficult to implement for them. |
If there's no enthusiasm for the device interface here right now, I'm fine with deferring that to an extension of this API once it's established.
For reference, @wjones127 opened a PR describing what is discussed above in a spec document, with an implementation for pyarrow: #37797
Pinging some people from libraries that already use the Arrow C Data Interface to consume (or produce) Arrow data, and currently typically do so through pyarrow's private `_export_to_c`/`_import_from_c` methods:

* cc @Mytherin @pdet for duckdb (you currently call …)
* @wjones127 it seems you have recently been committing to the arrow-rs code that does this conversion (https://github.com/apache/arrow-rs/blob/master/arrow/src/pyarrow.rs)
* @xwu99 for xgboost, which uses the C interface to support Arrow data (dmlc/xgboost@613ec36)
* @ritchie46 for polars (https://github.com/pola-rs/polars/blob/main/py-polars/src/arrow_interop/to_rust.rs and https://github.com/pola-rs/pyo3-polars/blob/main/pyo3-polars/src/ffi/to_rust.rs)
* @amunra for py-questdb-client (https://github.com/questdb/py-questdb-client/blob/4584366f6afafcdac4f860354c48b78da8589eb4/src/questdb/dataframe.pxi#L808)

Some other places where we also want to update this in the Arrow projects themselves are nanoarrow, adbc, the R package, and arrow-rs.
Just pointing to the nanoarrow/Python example, where I am excited to replace the existing …
Good idea! This API should be public and documented, and dunder methods are a great way to do it. There should also be equivalent APIs to do the mirror opposite: C pointer to pyarrow.

That said, there ought to be documentation on how to support both APIs (via duck typing) and any differences between them. E.g. what is a PyCapsule? Nice efforts!
The APIs based on raw C pointers (…) offer neither guarantee. The goal of the PyCapsule-based protocols is to 1) be reasonably type-safe, and 2) ensure proper memory deallocation when the PyCapsule goes out of scope. The documentation should probably provide examples of how to deal with the PyCapsule objects: 1) in Cython, 2) in pure C.
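For example, in pure Python the capsules can already be unwrapped with ctypes; the capsule names checked here ("arrow_schema", "arrow_array") are the ones from the proposed spec, and a name mismatch raises ValueError, which is the type-safety property mentioned above:

```python
import ctypes

PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
PyCapsule_GetPointer.restype = ctypes.c_void_p
PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

def arrow_array_ptrs(obj):
    # Ask the producer for its data, then unwrap the named capsules into
    # raw struct addresses that C/Cython code can consume.
    schema_capsule, array_capsule = obj.__arrow_c_array__()
    schema_ptr = PyCapsule_GetPointer(schema_capsule, b"arrow_schema")
    array_ptr = PyCapsule_GetPointer(array_capsule, b"arrow_array")
    return schema_ptr, array_ptr
```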
FYI I've created rendered docs for the proposed protocol here: http://crossbow.voltrondata.com/pr_docs/37797/format/CDataInterface/PyCapsuleInterface.html
I am creating RC0 at the moment; if there are further release candidates, we can try to include it.
### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: #34031
* Closes: #35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
I opened two follow-up issues now that the experimental spec and a part of the pyarrow implementation are merged: …
… Data support (#40708)

### Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (#35531 / #37797). We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972). This expands the Python exposure of the interface with support for the newer Device structs.

### What changes are included in this PR?

Update the specification to define two additional dunders:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream"

### Are these changes tested?

Spec-only change

* GitHub Issue: #38325

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
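In protocol terms, the two added dunders mirror the CPU variants. A minimal sketch (signatures assumed from the spec text above; the bodies are up to the producer):

```python
class DeviceArrayLike:
    def __arrow_c_device_array__(self, requested_schema=None):
        # Must return (schema_capsule, device_array_capsule), where the
        # second capsule is named "arrow_device_array".
        ...

    def __arrow_c_device_stream__(self, requested_schema=None):
        # Must return a single capsule named "arrow_device_array_stream".
        ...
```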
Context: we want Arrow to be usable as the format to share data between (Python) libraries/applications, ideally in a generic way that doesn't need to hardcode for specific libraries.

We already have `__arrow_array__` for objects that know how to convert themselves to a `pyarrow.Array` or `ChunkedArray`. But this protocol is for actual pyarrow objects (so a better name might have been `__pyarrow_array__`...), thus tied to the pyarrow library (and also only for arrays, not for tables/batches). For projects that have an (optional) dependency on pyarrow, that is fine, but we want to avoid that this is required (e.g. nanoarrow).

However, we also have the Arrow C Data Interface as a more generic way to share Arrow data in-memory, focusing on the actual Arrow spec without relying on a specific library implementation. Right now, the way to use the C Interface is the `_export_to_c` and `_import_from_c` methods. But those methods are 1) private, advanced APIs (although we can of course decide to make them "official", since many projects are already using them, and document them that way), and 2) again specific to pyarrow (I don't think other projects have adopted the same names).
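For context, the current private pattern looks roughly like this (using the `pyarrow.cffi` helper module that ships with pyarrow; the raw integer addresses are exactly what makes this unsafe):

```python
import pyarrow as pa
from pyarrow.cffi import ffi

arr = pa.array([1, 2, 3])

# The caller allocates the C structs and passes their addresses as plain
# ints; nothing validates that those integers point at real, live structs.
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
arr._export_to_c(int(ffi.cast("uintptr_t", c_array)),
                 int(ffi.cast("uintptr_t", c_schema)))

roundtrip = pa.Array._import_from_c(int(ffi.cast("uintptr_t", c_array)),
                                    int(ffi.cast("uintptr_t", c_schema)))
assert roundtrip.equals(arr)
```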
So other projects (polars, datafusion, duckdb, etc.) use those to convert from pyarrow to their own representation. But those projects don't have a similar API to use the C Data Interface to share their data with one another (e.g. to pyarrow, or polars to duckdb, ...).

If we had a standard Python protocol (dunder) method for this, libraries could implement support for consuming (and producing) objects that expose their data through the Arrow C Interface without having to hardcode for specific implementations (as those libraries currently do for pyarrow).
The most generic protocol would be one supporting the Stream interface, and that could look something like this:
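(A minimal sketch; the dunder name and the "arrow_array_stream" capsule name follow the spec as eventually merged, and the body is left to the producer.)

```python
class DataFrameLike:
    def __arrow_c_stream__(self, requested_schema=None):
        """
        Export the data as a stream of Arrow record batches.

        Returns a PyCapsule wrapping a C ArrowArrayStream struct, with the
        capsule name "arrow_array_stream" and a destructor that releases
        the stream if it was never consumed.
        """
        ...
```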
And in addition we could have variants that do the same for the other structs, such as `__arrow_c_data__` or `__arrow_c_array__`, `__arrow_c_schema__`, ...

Some design questions:
* Do we want to distinguish tabular data from arbitrary arrays? Compare `_export_to_c` on a RecordBatch or RecordBatchReader, where you know this will always return a StructArray representation of one batch, vs. the same method on an Array, where it can return an array of any type. It could be nice to distinguish those use cases for consumers.