Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38325: [Python] Expand the Arrow PyCapsule Interface with C Device Data support #40708

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Mar 21, 2024

Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods __arrow_c_schema/array/stream__ (#35531 / #37797).

We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).

This expands the Python exposure of the interface with support for the newer Device structs.

What changes are included in this PR?

Update the specification to defined two additional dunders:

  • __arrow_c_device_array__ returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
  • __arrow_c_device_stream__ returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream"

Are these changes tested?

Spec-only change

Copy link

⚠️ GitHub issue #38325 has been automatically assigned in GitHub to PR creator.

method on those objects, which works the same as ``__arrow_c_array__`` except
for returning a ArrowDeviceArray structure instead of a ArrowArray structure:

.. py:method:: __arrow_c_device_array__(self, requested_schema: object | None = None) -> Tuple[object, object]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we already did it above, but it's not useful to add machine-oriented type annotations to a human-readable doc. The parameter types are described explicitly below.

The HTML rendering is not terrible but it's not great either, as the signature looks crowded: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine removing them. In general, I like having the type hints as they make signatures easy to understand from my human perspective. They can be less ambiguous than a description of a type. But given we don't have a PyCapsule type, I agree they don't add much value here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also still have the type hints version in the https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints section a bit below

Will remove them here.

@pitrou
Copy link
Member

pitrou commented Mar 21, 2024

I don't think this is controversial, but perhaps this can be publicized on the ML to entice more feedback?

It could perhaps even be publicized towards the DLPack and Numba developers / communities.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me and I look forward to implementing in nanoarrow!

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 26, 2024
protocol (e.g. only add ``__arrow_c_device_array__``, and not add ``__arrow_c_array__``).
Libraries that have data structures that can live both on CPU or non-CPU devices
can implement both versions of the protocol (in that case, the standard methods
should raise an error when trying to export non-CPU data).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, perhaps we should leave that up to the producer for the time being, instead of making a recommendation?

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, for a CPU-only consumer that only checks the __arrow_c_array__ version, it would be good that this consumer can be ensured the pointers it is going to interpret are valid for CPU memory? How can this consumer otherwise avoid segfaults if getting passed a non-CPU object?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the producer should be allowed to make a CPU copy.

non-CPU memory, it is recommeded to _only_ implement the device version of the
protocol (e.g. only add ``__arrow_c_device_array__``, and not add ``__arrow_c_array__``).
Libraries that have data structures that can live both on CPU or non-CPU devices
can implement both versions of the protocol (in that case, the standard methods
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"standard" is a bit misleading as all methods are part of the spec. Perhaps "CPU-only"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did "define" the standard methods as the CPU only ones in the second paragraph listing the two sets of methods, in the idea to not have to constantly repeat CPU-only (eg also the paragraph just above (the third paragraph) uses that terminology).
But maybe that's not helping, and I can certainly also just consistently use "CPU-only" and "device-aware" to refer to the different versions of the methods.

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 26, 2024
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Mar 26, 2024
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these changes!

docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 21, 2024
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 21, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 21, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, modulo minor nits already pointed to by @zeroshade . Thank you!

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after my previous comments are addressed

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Jun 24, 2024
@jorisvandenbossche
Copy link
Member Author

@github-actions crossbow submit preview-docs

Copy link

Revision: 182910e

Submitted crossbow builds: ursacomputing/crossbow @ actions-add7f4281d

Task Status
preview-docs GitHub Actions

@jorisvandenbossche
Copy link
Member Author

@jorisvandenbossche jorisvandenbossche merged commit 9dec272 into apache:main Jun 26, 2024
7 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Jun 26, 2024
@jorisvandenbossche jorisvandenbossche deleted the 38325-capsule-device-spec branch June 26, 2024 09:46
jorisvandenbossche added a commit that referenced this pull request Jun 26, 2024
…yArrow (#40717)

### Rationale for this change

PyArrow implementation for the specification additions being proposed in
#40708

### What changes are included in this PR?

New `__arrow_c_device_array__` method to `pyarrow.Array` and
`pyarrow.RecordBatch`, and support in the `pyarrow.array(..)`,
`pyarrow.record_batch(..)` and `pyarrow.table(..)` functions to consume
objects that have those methods.

### Are these changes tested?

Yes (for CPU only for now, #40385 is
a prerequisite to test this for CUDA)


* GitHub Issue: #38325
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 9dec272.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Jul 9, 2024
…Device Data support (apache#40708)

### Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (apache#35531 / apache#37797).

We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (apache#34972).

This expands the Python exposure of the interface with support for the newer Device structs.

### What changes are included in this PR?

Update the specification to defined two additional dunders:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream" 

### Are these changes tested?

Spec-only change

* GitHub Issue: apache#38325

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Jul 9, 2024
…a in PyArrow (apache#40717)

### Rationale for this change

PyArrow implementation for the specification additions being proposed in
apache#40708

### What changes are included in this PR?

New `__arrow_c_device_array__` method to `pyarrow.Array` and
`pyarrow.RecordBatch`, and support in the `pyarrow.array(..)`,
`pyarrow.record_batch(..)` and `pyarrow.table(..)` functions to consume
objects that have those methods.

### Are these changes tested?

Yes (for CPU only for now, apache#40385 is
a prerequisite to test this for CUDA)


* GitHub Issue: apache#38325
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants