[Python] Expose the device interface through the Arrow PyCapsule protocol #38325
We added a new protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and new dunder methods `__arrow_c_schema/array/stream__` (#35531 / #37797). We recently also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).

The currently merged PyCapsule protocol uses the stable non-device interface, so the question is how to integrate the device version into the protocol in order to expose the C Device Data Interface in Python as well. Some options:

1. Only support the device versions going forward (just as currently only the CPU version is supported), i.e. the returned capsules always contain a device array/stream. (This is a backwards incompatible change, but given we labeled the protocol as experimental, we can still make such changes if we think this is the best long-term option. The capsule names would reflect this change, thus this will generate a proper Python error if a consumer or producer has not yet been updated, and we can actually first deprecate the non-device support in pyarrow before removing it. All to say that, AFAIU, this is perfectly possible if we want it.)
2. Add separate dunder methods `__arrow_c_device_array__` and `__arrow_c_device_stream__`; it is then up to the producer to implement those dunders if they can (and we can strongly recommend doing that, also for CPU-only libraries), and up to consumers to check which ones are present.
3. Allow the consumer to request a device array with some keyword (e.g. `__arrow_c_array__(device=True)`), which gives the consumer the option to request it while also still giving the producer the possibility to raise an error if they don't (yet) support the device version.
4. Support both options in the current methods without a keyword, i.e. allow `__arrow_c_array__` to return either an "arrow_array" or an "arrow_device_array" capsule (their capsule names distinguish the two), with the recommendation to always return a device version if you can, but allowing producers to still return a CPU version if they don't support the device one. This only gives some flexibility to the producer, and no control to the consumer to request the CPU version (so this essentially expects that all consumers will handle the device version).

Options 2/3/4 are probably just variants of how to expose both interfaces, and thus the main initial question is whether we want to, long term, move towards an ecosystem where everyone uses the C Device Data Interface, or to keep using both interfaces side by side (as the main interchange mechanism, I mean; the device interface of course still embeds the standard struct).
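To make option 4's cost concrete: every consumer would then need to inspect the capsule name before importing. A minimal sketch of that check, using `ctypes` to call CPython's capsule API (the capsule names are the ones defined by the protocol; the actual import helpers are omitted):

```python
import ctypes

PyCapsule_GetName = ctypes.pythonapi.PyCapsule_GetName
PyCapsule_GetName.restype = ctypes.c_char_p
PyCapsule_GetName.argtypes = [ctypes.py_object]

def is_device_array_capsule(capsule):
    # Under option 4, __arrow_c_array__ may return either flavor, so the
    # consumer must branch on "arrow_array" vs "arrow_device_array".
    return PyCapsule_GetName(capsule) == b"arrow_device_array"
```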
Looking at the DLPack Python page (https://dmlc.github.io/dlpack/latest/python_spec.html), they also define a separate `__dlpack_device__` method.
I would be in favor of the second approach (separate dunder methods). Python methods are cheap, and these are really two different protocols.
I also would favour a separate dunder method. A consumer that is expecting an …
The downside of using a separate dunder method is that you run into ecosystem fragmentation and inefficiencies, i.e. there was a similar effect that happened with … Compare that to dlpack (https://github.com/dmlc/dlpack) and the associated Python protocol surrounding it (https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html), which has had device support as a primary goal from its inception. Deep learning libraries that have multiple device backends, like PyTorch, TensorFlow, JAX, etc., all support dlpack, but they don't support …
I don't get the fragmentation and inefficiency argument. Implementing the C Data Interface is necessary to also implement the C Device Data Interface (since the latter is based on the former). Almost no implementation effort is saved by eschewing the former to go with the latter. In any case, the C Data Interface already exists in C++, C#, Go, Java, Rust, nanoarrow and several third-party projects... It's simply logical to expose it in Python as well.
… the C Device Interface (#39980)

### Rationale for this change

We have low-level methods `_import_from_c`/`_export_to_c` for the C Data Interface; we can add similar methods for the C Device Data Interface. Expanding the Arrow PyCapsule protocol (i.e. a better public API for other libraries) is covered by #38325. Because of that, we might not want to keep those low-level methods long term (or at least we need to have the equivalents using capsules), but for testing it's useful to already add those.

### What changes are included in this PR?

Added methods to Array and RecordBatch classes. Currently import only works for CPU devices.

* GitHub Issue: #39979

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
I would like to revive this, now that there is some movement in exposing the Device Interface in pyarrow (low-level bindings have been added, #39979) and in supporting it in cudf (rapidsai/cudf#15047). Concretely, I think adding a separate set of dunder methods, `__arrow_c_device_array__` and `__arrow_c_device_stream__`, is the way to go. The actual methods can mimic the current ones (see https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export), just with adapted names:
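As a rough sketch of what those signatures could look like (a sketch only; the capsule construction is elided, and the struct names follow the C Device Data Interface):

```python
class Producer:
    def __arrow_c_device_array__(self):
        # Device-aware analogue of __arrow_c_array__: returns a pair of
        # PyCapsules wrapping ArrowSchema and ArrowDeviceArray C structs.
        ...

    def __arrow_c_device_stream__(self):
        # Device-aware analogue of __arrow_c_stream__: returns a single
        # PyCapsule wrapping an ArrowDeviceArrayStream C struct.
        ...
```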
Both methods can then similarly also accept a `requested_schema` keyword. Some questions: …
Yes, this is correct.
This can always be added later if there proves to be a need for it. I would push to keep things minimal for now to avoid needing to make breaking changes.
+1 to not defining …
Thanks for the feedback @kkraus14. I started a PR updating the documentation page about the PyCapsule protocol to expand it with those two new dunder methods -> #40708. Apart from just defining …
+0. I'm not sure it makes sense to ask producers for more effort in this regard.
+1
+1
+1
Yes, we should. The expectation of the (regular) C Data Interface is that data lives on the CPU.
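For illustration, a minimal sketch of a producer honoring that expectation; `HypotheticalColumn`, its `device` attribute, and the export helpers are hypothetical names, not an existing API:

```python
class HypotheticalColumn:
    def __arrow_c_array__(self, requested_schema=None):
        # The plain C Data Interface promises CPU-resident buffers, so a
        # device-aware producer should raise here rather than silently
        # export device memory (or implicitly copy it to the host).
        if self.device != "cpu":
            raise ValueError(
                "data lives on a non-CPU device; use __arrow_c_device_array__"
            )
        return self._export_schema_capsule(), self._export_array_capsule()
```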
In the producer or in the consumer? IMHO the consumer is free to do whatever suits them. On the producer side the question is a bit more delicate. Perhaps we need to pass some options to …
+1 to all! Perhaps just +0 to:
I am not sure I would recommend that (although it is easy to provide producers with boilerplate Cython or helper functions to help them do it if they would like).
As a consumer, I suppose I would be very surprised if a producer performed a cross-device copy in a call to …
My preference would be that at the outset the interface recommends that no implicit copies occur, but that there is no normative language prohibiting them. I agree with @paleolimbot that in a vacuum a copy would be surprising, but for objects that support data resident on either host or device (e.g. a PyTorch tensor) it could be reasonable for this to happen. If a library already supports implicit back-and-forth copying via its public APIs, then it should be acceptable for the same to happen via the capsule interface (in either direction, i.e. you could access the device interface and trigger H2D or vice versa), whereas if a library requires some explicit transfer mechanism (e.g. APIs like …)
I would recommend that we wait on this and see how the implementations look in practice. We could always add a …
Since @jorisvandenbossche mentioned the ongoing work in cudf's C++ to support the C device data interface in rapidsai/cudf#15047, I put together a (very, very preliminary) concept of what exposing this in Python could look like in rapidsai/cudf#15370. There is a lot of work to be done before that PR can come out of draft, but I'm happy to help people on this thread use it as a testing ground for ideas on the Python capsule interface if that's useful. Feedback is of course welcome, but given how preliminary the implementation is I wouldn't waste any time on details. For now I haven't implemented the dunder methods on any classes, but the free functions are doing the same capsule construction and should be trivial to toss into a class's dunder method once cudf decides exactly what to expose. In the meantime the free functions offer the same interface for testing.
I would be +1 on encouraging being able to consume both protocols, but +0.5 on encouraging producing both protocols. Agreed with @pitrou's comment about it being more effort with the only return being future-proofing for potential non-CPU usage. I do think we should start discussing what it looks like for a CPU-only library to request data from a non-CPU library, i.e. if say Pandas wants to consume data produced by cuDF. Presuming we figure out a convenient API to pass that information to cuDF, would it return an …
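A minimal sketch of what consuming both protocols could look like on the consumer side; `import_device_array` and `import_cpu_array` are placeholder helpers, not an existing API:

```python
def import_arrow(obj):
    # Prefer the device-aware protocol when the producer offers it, and
    # fall back to the CPU-only C Data Interface otherwise.
    if hasattr(obj, "__arrow_c_device_array__"):
        schema_capsule, array_capsule = obj.__arrow_c_device_array__()
        return import_device_array(schema_capsule, array_capsule)  # placeholder
    if hasattr(obj, "__arrow_c_array__"):
        schema_capsule, array_capsule = obj.__arrow_c_array__()
        return import_cpu_array(schema_capsule, array_capsule)  # placeholder
    raise TypeError("object does not implement the Arrow PyCapsule protocol")
```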
While I agree that supporting a version attribute is more work, it also feels more natural than introspecting an API to determine what is supported. If I read code relying on signature inspection, I assume it's doing something fairly hacky by default (which isn't to say that I haven't written such code on multiple occasions before...). If we don't want versioning, another option (that I'm not necessarily a huge fan of, but is worth considering) is defining the protocol as a function that accepts arbitrary keyword arguments.
This way all options are only honored if they are supported. Otherwise, it's up to the consumer whether they want this to fail loudly or whether they want to handle things themselves. One downside of this approach is that it assumes that consumers are fine stating the superset of all options that they want and then accepting whether or not producers will honor those options. If consumers want to dynamically adjust the calls based on the options available from a particular producer, they will have to resort to doing so more manually (e.g. via `inspect.signature`).
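For example, such manual feature detection could look like the following sketch (assuming the dunder name proposed above):

```python
import inspect

def producer_accepts(obj, keyword):
    # Check whether the producer's dunder declares the keyword explicitly,
    # or swallows arbitrary options through a catch-all **kwargs.
    params = inspect.signature(obj.__arrow_c_device_array__).parameters
    return keyword in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
```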
It's moreso that we're talking about this topic because of the desire to introduce something like a `requested_device` keyword. @vyasr's proposal above is interesting, and we could possibly restrict it even further, in that we could document that all evolutions to the protocol that introduce new parameters should have a default value of `None`.

v1 producer:

```python
class Foo:
    def __arrow_device_array__(self, *, strict=True, **kwargs):
        for key, value in kwargs.items():
            if value is not None:
                raise NotImplementedError(f"{key} parameter is not supported")
        return capsule
```

v2 producer:

```python
class Foo:
    def __arrow_device_array__(self, *, strict=True, requested_device=None, **kwargs):
        for key, value in kwargs.items():
            if value is not None:
                raise NotImplementedError(f"{key} parameter is not supported")
        # handle `requested_device` here in a meaningful way, where None
        # implies the same behavior as v1
        return capsule
```

From a consumer perspective, passing any new parameter becomes purely optional, and they can handle unsupported parameters via …
Sorry, I don't understand this paragraph. How is adding the …
I don't have a strong objection to adding that, but to be clear, this all still means that the consumer has to do the work to check whether the keyword is supported and whether it has been honored. To use a concrete example: with the current proposal of no special mechanism, if we add a `requested_device` keyword in the future, consumer code looks like:

```python
try:
    capsules = object.__arrow_c_device_array__(requested_device=kCPU)
except TypeError:
    capsules = object.__arrow_c_device_array__()
# manually check the returned array capsule has a CPU device and otherwise error
```

With the proposal above to add the `**kwargs` handling:

```python
try:
    capsules = object.__arrow_c_device_array__(requested_device=kCPU)
except NotImplementedError:
    capsules = object.__arrow_c_device_array__()
# manually check the returned array capsule has a CPU device and otherwise error
```

or if the consumer opts out of errors with `strict=False`:

```python
capsules = object.__arrow_c_device_array__(requested_device=kCPU, strict=False)
# manually check the returned array capsule has a CPU device and otherwise error
```
Exactly the same is true for no explicit specification of adding …
I think that is something that we will do anyway in practice: if we add a new keyword in the future, it should always have a default that matches the previous behaviour, such that not specifying it preserves behaviour (otherwise it would be a breaking change). To summarize, if there is a strong desire to add the …
I think we should avoid vague and confusing terms such as "strict". As for … As for always taking `**kwargs`, a producer can simply do something like:

```python
non_default_kwargs = [name for name, value in kwargs.items() if value is not None]
if non_default_kwargs:
    raise NotImplementedError(f"some nice error message with {non_default_kwargs}")
```
Do we then require that the producer raise an error if unknown kwargs / kwargs with non-default values are passed? (like your example code snippet does)
Yes, we probably want to. If a producer ignores a hypothetical …
Python does that automatically for you (raising an error for an unknown keyword) if you don't add `**kwargs` to the signature.
The specifics for …

Does that sound OK? If we add this keyword, do we keep the notion that a non-CPU library is allowed to also implement the non-device-aware methods like `__arrow_c_array__`?
Exactly.
Only mentioning this because I just ran across it, but dlpack apparently uses a consumer-provided version:
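If I'm reading the array API standard correctly, that consumer-provided version appears as a `max_version` keyword on `__dlpack__`, roughly as follows (other keywords elided; a sketch, not the full spec):

```python
class DLPackProducer:
    def __dlpack__(self, *, stream=None, max_version=None):
        # max_version is the highest DLPack C ABI version the consumer
        # understands, e.g. (1, 0); the producer should return a capsule
        # no newer than that, or raise BufferError if it cannot.
        ...
```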
For what it's worth, I think a producer-defined …
Would the request also have to specify the …
From a CPU calling library perspective, I would prefer to only have to deal with … Adding …
Note that for DLPack this is about the C ABI version, not the Python interface. So it allows the user to request a certain version of the struct to be put in the returned capsule (DLPack has different versions of the C struct). That's something different from what we are discussing here.
That's a good question; hopefully @kkraus14 or @vyasr could answer that. If needed, then the argument would be a tuple of two ints for `(device_type, device_id)` …
Likewise here, happy to spec and implement any agreed proposal, but my personal preference would still be to just start with the simplest option (no keywords, no special kwarg handling for future evolution), clearly labeling it as experimental, and re-evaluate in some time after we have initial implementations and usage.
To clarify, it sounds like we're trying to make two separate decisions here. It's important to decouple the requirements imposed by the need for a painless syntactic evolution of the protocol from the semantics of any particular feature of the protocol. The motivation behind the … That discussion is separate from the semantics of any particular option. Some options may not make sense to allow defaults for, or to have optional behavior, and we might still need to make such arguments explicit the way that Keith did with … On the specific topic of the …
My understanding was that we only want to allow the consumer to pass a default without raising an exception. As a producer, if any parameter with a non-default value is passed that the producer in question does not support, it should still raise an error? But from your above post, it seems you are arguing for actually silencing any errors?
Hmm my interpretation of various comments above like this one:
was that we wanted a way for consumers to be producer-agnostic, up to and including allowing producers to silently ignore arguments that they do not support. How would we distinguish whether a parameter was provided with a default value explicitly vs. because that value is the default in the signature itself? To do that we would need to expose something like pandas' `NoDefault`, which doesn't work here since we're defining a protocol without any centralized implementation where such a "no default" default object could live. Perhaps we should start sketching out some slightly more realistic examples than the ones in this thread so far, both of the v0 proposal and potential changes we could envision in v1 (even if we don't plan to implement them right now). Making that concrete ought to help us decide what behavior we should expect and what failure modes we should be handling. I know that a few folks have expressed a preference for getting started on a simpler prototype sooner rather than spending too long discussing, which I am in favor of too. So maybe a real v0 prototype, plus a sample v1 prototype that makes some set of changes which we could imagine being realistic but don't actually want to deal with in v0. Then we can assess based on that.
The idea mentioned earlier was to require that any new keyword will have a default of `None`. Keith proposed this first (#38325 (comment)):

And later also Antoine (#38325 (comment)):
(each time with example code of how handling that would look; see the links)

To me, the main reason that I don't really like the idea of silently ignoring unsupported keywords is that, as a consumer, you then still need to check in the resulting data whether your keyword was honored, and have a fallback for when it was not. At that point it becomes very similar to a try/except. For the example of a …
Fully agreed that concrete examples are helpful. But I think we can essentially treat the … In #38325 (comment) above, I already showed some code snippets of how it would look for a consumer to use this keyword, depending on which mechanism we use to add new keywords (no special handling; let the producer raise for non-default values; allow asking to silence errors and ignore unsupported keywords).
I apologize for the exceedingly slow responses here, work has been in a state of transition recently and I'm still getting my bearings a bit. Right, so basically there are three proposals for the API here:
Your message above effectively compares 1 and 2. Antoine's and Keith's snippets are both versions of 2. I think 2 is a bit more work for the producer for nearly identical consumer code, but I do think there is a real improvement and it's worth the extra code. A producer-generated … I thought 3 might be of interest because of some of the prior discussions around making a best effort to honor some keywords while allowing them to also not be honored. In that case it might be easier via …
Apologies for the slow response from my side now. Re-reading the discussion above, I think there is general support for what is called option 2 in the last comment just above: the interface methods accept `**kwargs` and raise an error for any unsupported keyword passed with a non-default value. So then I would propose to just go with that, adding those …
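Spelled out, that mechanism would make a producer look roughly like this sketch (capsule construction elided; `_schema_capsule`/`_device_array_capsule` are placeholders):

```python
class Producer:
    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        # Convention: every future keyword must default to None, so consumers
        # passing only defaults work against any producer version; a producer
        # raises only for non-default values of keywords it doesn't understand.
        unsupported = [k for k, v in kwargs.items() if v is not None]
        if unsupported:
            raise NotImplementedError(f"unsupported keyword arguments: {unsupported}")
        return self._schema_capsule(), self._device_array_capsule()  # placeholders
```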
Review of the spec additions in #40708 would be very welcome.
… Device Data support (#40708)

### Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (#35531 / #37797). We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972). This expands the Python exposure of the interface with support for the newer Device structs.

### What changes are included in this PR?

Update the specification to define two additional dunders:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream"

### Are these changes tested?

Spec-only change

* GitHub Issue: #38325

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Issue resolved by pull request 40708
… data in PyArrow (#40717)

### Rationale for this change

PyArrow implementation for the specification additions being proposed in #40708.

### What changes are included in this PR?

New `__arrow_c_device_array__` method added to `pyarrow.Array` and `pyarrow.RecordBatch`, and support in the `pyarrow.array(..)`, `pyarrow.record_batch(..)` and `pyarrow.table(..)` functions to consume objects that have those methods.

### Are these changes tested?

Yes (for CPU only for now; #40385 is a prerequisite to test this for CUDA)

* GitHub Issue: #38325
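Assuming a pyarrow build that includes #40717, usage could look like:

```python
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

# Producer side: RecordBatch now also exposes the device-aware dunder.
schema_capsule, device_array_capsule = batch.__arrow_c_device_array__()

# Consumer side: pa.record_batch() accepts any object implementing the
# protocol, so device-aware objects round-trip through the same constructor.
same_batch = pa.record_batch(batch)
```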
Thanks @jorisvandenbossche! Once some of the other dust settles I'll look into getting this implemented in cudf and we can start testing. |