GH-38325: [Python] Expand the Arrow PyCapsule Interface with C Device Data support #40708

67 changes: 63 additions & 4 deletions docs/source/format/CDataInterface/PyCapsuleInterface.rst
@@ -27,8 +27,8 @@ The Arrow PyCapsule Interface
Rationale
=========

The :ref:`C data interface <c-data-interface>` and
:ref:`C stream interface <c-stream-interface>` allow moving Arrow data between
The :ref:`C data interface <c-data-interface>`, :ref:`C stream interface <c-stream-interface>`
and :ref:`C device interface <c-device-data-interface>` allow moving Arrow data between
different implementations of Arrow. However, these interfaces don't specify how
Python libraries should expose these structs to other libraries. Prior to this,
many libraries simply provided export to PyArrow data structures, using the
@@ -43,7 +43,7 @@ Goals
-----

* Standardize the `PyCapsule`_ objects that represent ``ArrowSchema``, ``ArrowArray``,
and ``ArrowArrayStream``.
``ArrowArrayStream``, ``ArrowDeviceArray`` and ``ArrowDeviceArrayStream``.
* Define standard methods that export Arrow data into such capsule objects,
so that any Python library wanting to accept Arrow data as input can call the
corresponding method instead of hardcoding support for specific Arrow
@@ -80,7 +80,10 @@ Arrow structures are recognized, the following names must be used:
- ``arrow_array``
* - ArrowArrayStream
- ``arrow_array_stream``

* - ArrowDeviceArray
- ``arrow_device_array``
* - ArrowDeviceArrayStream
- ``arrow_device_array_stream``
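
As a rough illustration of how a consumer might rely on these names, the sketch below checks a capsule's name through the CPython C API function ``PyCapsule_IsValid``, reached via ``ctypes.pythonapi``. The ``CAPSULE_NAMES`` table and the ``capsule_holds`` helper are hypothetical names introduced for this example only; they are not part of the specification or of any library.

```python
import ctypes

# Capsule names from the table above, keyed by the C struct they wrap.
# (CAPSULE_NAMES is a hypothetical helper table for this sketch.)
CAPSULE_NAMES = {
    "ArrowSchema": b"arrow_schema",
    "ArrowArray": b"arrow_array",
    "ArrowArrayStream": b"arrow_array_stream",
    "ArrowDeviceArray": b"arrow_device_array",
    "ArrowDeviceArrayStream": b"arrow_device_array_stream",
}

# Bind the CPython C API function that compares a capsule's name.
_capsule_is_valid = ctypes.pythonapi.PyCapsule_IsValid
_capsule_is_valid.argtypes = [ctypes.py_object, ctypes.c_char_p]
_capsule_is_valid.restype = ctypes.c_int


def capsule_holds(capsule, struct_name):
    """Return True if `capsule` is a PyCapsule named for `struct_name`."""
    return bool(_capsule_is_valid(capsule, CAPSULE_NAMES[struct_name]))
```

A consumer written in C or Cython would normally call ``PyCapsule_IsValid`` or ``PyCapsule_GetPointer`` directly; the ``ctypes`` route is used here only to keep the sketch in pure Python.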

Lifetime Semantics
------------------
@@ -95,6 +98,10 @@ the data and marked the release callback as null, so there isn’t a risk of
releasing data the consumer is using.
:ref:`Read more in the C Data Interface specification <c-data-interface-released>`.

In the case of a device struct, the release callback mentioned above is the
``release`` member of the embedded ``ArrowArray`` structure.
:ref:`Read more in the C Device Interface specification <c-device-data-interface-semantics>`.
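
To make the location of that callback concrete, here is a minimal ``ctypes`` sketch of the two struct layouts. The field lists are transcribed from the C data and C device data interface definitions as understood at the time of writing and should be checked against the actual headers; ``ReleaseFn`` and ``is_released`` are hypothetical names used only in this example.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """ctypes mirror of the C Data Interface ArrowArray struct."""


# Signature of the release callback: void (*release)(struct ArrowArray*)
ReleaseFn = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ReleaseFn),
    ("private_data", ctypes.c_void_p),
]


class ArrowDeviceArray(ctypes.Structure):
    """ctypes mirror of ArrowDeviceArray; the ArrowArray is embedded by value."""

    _fields_ = [
        ("array", ArrowArray),
        ("device_id", ctypes.c_int64),
        ("device_type", ctypes.c_int32),
        ("sync_event", ctypes.c_void_p),
        ("reserved", ctypes.c_int64 * 3),
    ]


def is_released(device_array):
    """A device array counts as released when the embedded release is NULL."""
    return not device_array.array.release
```

The device struct has no release member of its own, which is why the check goes through ``device_array.array.release``.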

Just like in the C Data Interface, the PyCapsule objects defined here can only
be consumed once.

@@ -110,6 +117,11 @@ The interface consists of three separate protocols:
* ``ArrowArrayExportable``, which defines the ``__arrow_c_array__`` method.
* ``ArrowStreamExportable``, which defines the ``__arrow_c_stream__`` method.

Two additional protocols are defined for the Device interface:

* ``ArrowDeviceArrayExportable``, which defines the ``__arrow_c_device_array__`` method.
* ``ArrowDeviceStreamExportable``, which defines the ``__arrow_c_device_stream__`` method.

ArrowSchema Export
------------------

@@ -142,6 +154,22 @@ Arrays and record batches (contiguous tables) can implement the method
respectively. The schema capsule should have the name ``"arrow_schema"``
and the array capsule should have the name ``"arrow_array"``.

Libraries supporting the Device interface can implement a ``__arrow_c_device_array__``
method on those objects, which works the same as ``__arrow_c_array__`` except
that it returns an ArrowDeviceArray structure instead of an ArrowArray structure:

.. py:method:: __arrow_c_device_array__(self, requested_schema: object | None = None) -> Tuple[object, object]
Member commented:

I see we already did it above, but it's not useful to add machine-oriented type annotations to a human-readable doc. The parameter types are described explicitly below.

The HTML rendering is not terrible but it's not great either, as the signature looks crowded: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export

Member commented:

I'm fine removing them. In general, I like having the type hints as they make signatures easy to understand from my human perspective. They can be less ambiguous than a description of a type. But given we don't have a PyCapsule type, I agree they don't add much value here.

Member Author commented:

We also still have the type hints version in the https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints section a bit below

Will remove them here.


Export the object as a pair of ArrowSchema and ArrowDeviceArray structures.

:param requested_schema: A PyCapsule containing a C ArrowSchema representation
of a requested schema. Conversion to this schema is best-effort. See
`Schema Requests`_.
:type requested_schema: PyCapsule or None

:return: A pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray,
respectively. The schema capsule should have the name ``"arrow_schema"``
and the array capsule should have the name ``"arrow_device_array"``.
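
On the consuming side, calling this method and validating the returned pair might look like the following sketch. ``get_device_array_capsules`` is a hypothetical helper, not an API of any library, and raising ``TypeError`` or ``ValueError`` on failure is an assumption of this example rather than something the specification prescribes.

```python
import ctypes

# Bind the CPython C API function that compares a capsule's name.
_capsule_is_valid = ctypes.pythonapi.PyCapsule_IsValid
_capsule_is_valid.argtypes = [ctypes.py_object, ctypes.c_char_p]
_capsule_is_valid.restype = ctypes.c_int


def get_device_array_capsules(obj, requested_schema=None):
    """Call __arrow_c_device_array__ and validate the returned capsule pair."""
    export = getattr(obj, "__arrow_c_device_array__", None)
    if export is None:
        raise TypeError(
            f"{type(obj).__name__} does not implement __arrow_c_device_array__"
        )
    schema_capsule, array_capsule = export(requested_schema)
    # Check both capsule names before handing the pointers to native code.
    if not _capsule_is_valid(schema_capsule, b"arrow_schema"):
        raise ValueError("expected a PyCapsule named 'arrow_schema'")
    if not _capsule_is_valid(array_capsule, b"arrow_device_array"):
        raise ValueError("expected a PyCapsule named 'arrow_device_array'")
    return schema_capsule, array_capsule
```

After validation, the consumer would extract the struct pointers (for example with ``PyCapsule_GetPointer``) and move the data out of them, which marks the capsules as consumed.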

ArrowStream Export
------------------
@@ -160,6 +188,23 @@ Tables / DataFrames and streams can implement the method ``__arrow_c_stream__``.
:return: A PyCapsule containing a C ArrowArrayStream representation of the
object. The capsule must have a name of ``"arrow_array_stream"``.

Libraries supporting the Device interface can implement a ``__arrow_c_device_stream__``
method on those objects, which works the same as ``__arrow_c_stream__`` except
that it returns an ArrowDeviceArrayStream structure instead of an ArrowArrayStream
structure:

.. py:method:: __arrow_c_device_stream__(self, requested_schema: object | None = None) -> object

Export the object as an ArrowDeviceArrayStream.

:param requested_schema: A PyCapsule containing a C ArrowSchema representation
of a requested schema. Conversion to this schema is best-effort. See
`Schema Requests`_.
:type requested_schema: PyCapsule or None

:return: A PyCapsule containing a C ArrowDeviceArrayStream representation of the
object. The capsule must have a name of ``"arrow_device_array_stream"``.

Schema Requests
---------------

@@ -217,6 +262,20 @@ function accepts an object implementing one of these protocols.
) -> object:
...

class ArrowDeviceArrayExportable(Protocol):
def __arrow_c_device_array__(
self,
requested_schema: object | None = None
) -> Tuple[object, object]:
...

class ArrowDeviceStreamExportable(Protocol):
def __arrow_c_device_stream__(
self,
requested_schema: object | None = None
) -> object:
...
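
If a library wants to test for these protocols at runtime rather than only in a type checker, one possible approach is to mark them ``@runtime_checkable``. This is a sketch under that assumption: the decorator is not part of the definitions above, ``supported_device_export`` is a hypothetical helper, and trying the array protocol before the stream protocol is an arbitrary choice made for the example.

```python
from typing import Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ArrowDeviceArrayExportable(Protocol):
    def __arrow_c_device_array__(
        self, requested_schema: Optional[object] = None
    ) -> Tuple[object, object]:
        ...


@runtime_checkable
class ArrowDeviceStreamExportable(Protocol):
    def __arrow_c_device_stream__(
        self, requested_schema: Optional[object] = None
    ) -> object:
        ...


def supported_device_export(obj: object) -> Optional[str]:
    """Return the name of the device export method `obj` provides, if any."""
    # isinstance() on a runtime-checkable Protocol only checks that the
    # method exists; it does not verify the signature or return type.
    if isinstance(obj, ArrowDeviceArrayExportable):
        return "__arrow_c_device_array__"
    if isinstance(obj, ArrowDeviceStreamExportable):
        return "__arrow_c_device_stream__"
    return None
```

A plain ``hasattr`` check gives the same result and avoids the ``typing`` machinery; the Protocol form mainly helps when the same classes are also used for static annotations.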

Examples
========

1 change: 1 addition & 0 deletions docs/source/format/CDeviceDataInterface.rst
@@ -344,6 +344,7 @@ Notes:
synchronization is needed for an extension device, the producer
should document the type.

.. _c-device-data-interface-semantics:

Semantics
=========