-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38325: [Python] Expand the Arrow PyCapsule Interface with C Device Data support #40708
GH-38325: [Python] Expand the Arrow PyCapsule Interface with C Device Data support #40708
Conversation
…Device Data support
|
method on those objects, which works the same as ``__arrow_c_array__`` except | ||
for returning a ArrowDeviceArray structure instead of a ArrowArray structure: | ||
|
||
.. py:method:: __arrow_c_device_array__(self, requested_schema: object | None = None) -> Tuple[object, object] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we already did it above, but it's not useful to add machine-oriented type annotations to a human-readable doc. The parameter types are described explicitly below.
The HTML rendering is not terrible but it's not great either, as the signature looks crowded: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @wjones127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine removing them. In general, I like having the type hints as they make signatures easy to understand from my human perspective. They can be less ambiguous than a description of a type. But given we don't have a PyCapsule
type, I agree they don't add much value here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also still have the type hints version in the https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints section a bit below
Will remove them here.
I don't think this is controversial, but perhaps this can be publicized on the ML to entice more feedback? It could perhaps even be publicized towards the DLPack and Numba developers / communities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me and I look forward to implementing in nanoarrow!
protocol (e.g. only add ``__arrow_c_device_array__``, and not add ``__arrow_c_array__``). | ||
Libraries that have data structures that can live both on CPU or non-CPU devices | ||
can implement both versions of the protocol (in that case, the standard methods | ||
should raise an error when trying to export non-CPU data). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, perhaps we should leave that up to the producer for the time being, instead of making a recommendation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, for a CPU-only consumer that only checks the __arrow_c_array__
version, it would be good that this consumer can be ensured the pointers it is going to interpret are valid for CPU memory? How can this consumer otherwise avoid segfaults if getting passed a non-CPU object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the producer should be allowed to make a CPU copy.
non-CPU memory, it is recommeded to _only_ implement the device version of the | ||
protocol (e.g. only add ``__arrow_c_device_array__``, and not add ``__arrow_c_array__``). | ||
Libraries that have data structures that can live both on CPU or non-CPU devices | ||
can implement both versions of the protocol (in that case, the standard methods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"standard" is a bit misleading as all methods are part of the spec. Perhaps "CPU-only"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did "define" the standard methods as the CPU only ones in the second paragraph listing the two sets of methods, in the idea to not have to constantly repeat CPU-only (eg also the paragraph just above (the third paragraph) uses that terminology).
But maybe that's not helping, and I can certainly also just consistently use "CPU-only" and "device-aware" to refer to the different versions of the methods.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like these changes!
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, modulo minor nits already pointed to by @zeroshade . Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after my previous comments are addressed
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
@github-actions crossbow submit preview-docs |
Revision: 182910e Submitted crossbow builds: ursacomputing/crossbow @ actions-add7f4281d
|
There rendered changes look good to me: http://crossbow.voltrondata.com/pr_docs/40708/format/CDataInterface/PyCapsuleInterface.html |
…yArrow (#40717) ### Rationale for this change PyArrow implementation for the specification additions being proposed in #40708 ### What changes are included in this PR? New `__arrow_c_device_array__` method to `pyarrow.Array` and `pyarrow.RecordBatch`, and support in the `pyarrow.array(..)`, `pyarrow.record_batch(..)` and `pyarrow.table(..)` functions to consume objects that have those methods. ### Are these changes tested? Yes (for CPU only for now, #40385 is a prerequisite to test this for CUDA) * GitHub Issue: #38325
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 9dec272. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…Device Data support (apache#40708) ### Rationale for this change We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (apache#35531 / apache#37797). We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (apache#34972). This expands the Python exposure of the interface with support for the newer Device structs. ### What changes are included in this PR? Update the specification to defined two additional dunders: * `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name * `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream" ### Are these changes tested? Spec-only change * GitHub Issue: apache#38325 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Dewey Dunnington <dewey@dunnington.ca> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…a in PyArrow (apache#40717) ### Rationale for this change PyArrow implementation for the specification additions being proposed in apache#40708 ### What changes are included in this PR? New `__arrow_c_device_array__` method to `pyarrow.Array` and `pyarrow.RecordBatch`, and support in the `pyarrow.array(..)`, `pyarrow.record_batch(..)` and `pyarrow.table(..)` functions to consume objects that have those methods. ### Are these changes tested? Yes (for CPU only for now, apache#40385 is a prerequisite to test this for CUDA) * GitHub Issue: apache#38325
Rationale for this change
We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods
__arrow_c_schema/array/stream__
(#35531 / #37797).We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).
This expands the Python exposure of the interface with support for the newer Device structs.
What changes are included in this PR?
Update the specification to defined two additional dunders:
__arrow_c_device_array__
returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name__arrow_c_device_stream__
returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream"Are these changes tested?
Spec-only change