Interchange between two dataframe types which use the same native storage representation #48

Open
rgommers opened this issue Jul 22, 2021 · 4 comments

@rgommers
Member

This was brought up by @jorisvandenbossche: if two libraries both use the same library for in-memory data storage (e.g. buffers/columns are backed by NumPy or Arrow arrays), can we avoid iterating through each buffer on each column by directly handing over that native representation?

This is a similar question to https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol_summary.md#what-is-wrong-with-to_numpy-and-to_arrow, but it's not the same; there is one important difference. The key point of that FAQ entry is that it's consumers who should rely on NumPy/Arrow, not producers. Having a to_numpy() method somewhere is at odds with that. Here is an alternative:

  1. A Column instance may define __array__ or __arrow_array__ if and only if the column itself is backed by a single NumPy array or a single Arrow array.
  2. DataFrame and Buffer instances must not define __array__ or __arrow_array__.

(1) is motivated by wanting a simple shortcut like this:

    # inside `from_dataframe` constructor
    for name in df.column_names():
        col = df.get_column_by_name(name)
        # say my library natively uses Arrow:
        if hasattr(col, '__arrow_array__'):
            # apparently we're both using Arrow, take the shortcut
            columns[name] = col.__arrow_array__()
        elif ...: # continue parsing dtypes, null values, etc.
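
On the producer side, rule (1) could look roughly like the sketch below. The ArrowBackedColumn class and its attributes are hypothetical; only the __arrow_array__(type=None) hook is pyarrow's documented protocol for handing over a native Arrow array:

    import pyarrow as pa

    class ArrowBackedColumn:
        """Hypothetical protocol Column whose data lives in a single pyarrow.Array."""

        def __init__(self, data: pa.Array):
            self._data = data

        def __arrow_array__(self, type=None):
            # Hand over the native Arrow representation directly (zero copy);
            # cast only if the consumer requests a specific type.
            if type is not None and type != self._data.type:
                return self._data.cast(type)
            return self._data

        # The regular protocol methods (dtype, describe_null, get_buffers(), ...)
        # would still be implemented for consumers that don't take the shortcut.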

However, there are other constraints. For __array__ this also implies:

  • the column either has no missing values, or uses NaN or a sentinel value for nulls (and this needs to be checked first in the code above, otherwise the consumer may still misinterpret the data; see the sketch after this list)
  • this does not work for categorical or string dtypes, since those are not representable by a single array
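
As a consumer-side sketch of that first check (assuming describe_null behaves as in the protocol prototype: a property returning a (kind, value) tuple, with kind 0 = non-nullable, 1 = NaN/NaT, 2 = sentinel value, 3 = bit mask, 4 = byte mask):

    import numpy as np

    def numpy_shortcut(col):
        # Only take the __array__ shortcut when nulls (if any) are encoded in
        # the values themselves, so no separate mask buffer can get lost.
        null_kind, _null_value = col.describe_null
        if hasattr(col, '__array__') and null_kind in (0, 1, 2):
            return np.asarray(col)  # calls col.__array__()
        return None  # fall back to the generic buffer-by-buffer path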

For __arrow_array__ I cannot think of issues right away. Of course the producer should also be careful to ensure that adding one of these methods does not change behavior. For example, if a dataframe has a nested dtype that is supported by Arrow but not by the protocol, calling __dataframe__() should still raise because of the unsupported dtype.

The main pro of doing this is:

  • A potential performance gain in the dataframe conversion (TBD how significant)

The main con is:

  • Extra code complexity to get that performance gain, because now there are two code paths on the consumer side and both must be equivalent.

My impression: this may be useful to do for __arrow_array__, but I don't think it's a good idea for __array__, because the gain is fairly limited and there are too many constraints or ways to get it wrong (e.g. describe_null must always be checked before using __array__). If __array__ is to be added at all, then maybe at the Buffer level, where it would play the same role as __dlpack__ (see the sketch below).
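
A rough sketch of that last option, with __array__ living on Buffer next to __dlpack__. The NumpyBackedBuffer class is hypothetical; bufsize, ptr and __dlpack__ are attributes the protocol's Buffer already defines, and ndarray.__dlpack__ needs a recent NumPy:

    import numpy as np

    class NumpyBackedBuffer:
        """Hypothetical protocol Buffer backed by a NumPy array."""

        def __init__(self, arr: np.ndarray):
            self._arr = arr

        @property
        def bufsize(self) -> int:
            return self._arr.nbytes

        @property
        def ptr(self) -> int:
            return self._arr.__array_interface__['data'][0]

        def __dlpack__(self):
            return self._arr.__dlpack__()

        def __array__(self, dtype=None):
            # Like __dlpack__, this only hands over raw memory; how to interpret
            # it (dtype, nulls, offsets) still comes from the Column metadata.
            return np.asarray(self._arr, dtype=dtype)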

@kkraus14
Collaborator

I haven't been in the discussions lately, but I'm drive-by commenting since I have some strong opinions about this and was one of the main people voicing concerns about the to_numpy and to_arrow approach:

The main pro of doing this is:

  • A potential performance gain in the dataframe conversion (TBD how significant)

Is this performance gain just about eliminating the control flow code of constructing, say, Arrow's containers around memory that we'd be passing around zero copy anyway? If this were implemented in C/C++ (which I imagine most Python libraries would end up doing), then I'd argue it becomes negligible anyway.

For __arrow_array__ I cannot think of issues right away.

Arrow Array objects are backed by Arrow Buffer objects, which are an abstract interface that can be backed by CPU, GPU, or future devices' memory. This wouldn't make any guarantees about where the memory is, only what the container is, and would possibly give a standard API for working with the container (though most Arrow APIs will currently throw exceptions or segfault if you try to use them with GPU memory).
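
For comparison, the interchange protocol's own Buffer does tell the consumer where the memory lives. A minimal sketch of the check that an __arrow_array__ shortcut would skip (assuming __dlpack_device__ as the protocol defines it, with DLPack device type 1 meaning CPU):

    def buffer_is_on_cpu(buf) -> bool:
        # The protocol's Buffer reports its device via __dlpack_device__();
        # DLPack device type 1 is CPU.
        device_type, _device_id = buf.__dlpack_device__()
        return int(device_type) == 1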

My 2c: we should keep the interchange protocol limited to a memory layout description and focus on ensuring we can make the memory interchange zero copy and then doing our best to ensure libraries can use it as efficiently as possible.

@rgommers
Member Author

Thanks for the input, @kkraus14.

Is this performance gain just about eliminating the control flow code of constructing, say, Arrow's containers around memory that we'd be passing around zero copy anyway? If this were implemented in C/C++ (which I imagine most Python libraries would end up doing), then I'd argue it becomes negligible anyway.

Yes indeed, just about control flow. And I agree it'd be a very minor gain.

Arrow Array objects are backed by Arrow Buffer objects, which are an abstract interface that can be backed by CPU, GPU, or future devices' memory. This wouldn't make any guarantees about where the memory is, only what the container is, and would possibly give a standard API for working with the container (though most Arrow APIs will currently throw exceptions or segfault if you try to use them with GPU memory).

I'm actually not quite sure how to interpret this bit. Why would these guarantees be needed (if __arrow_array__ is used by both consumer and producer, it seems like this should "just work")?

My 2c: we should keep the interchange protocol limited to a memory layout description and focus on ensuring we can make the memory interchange zero copy and then doing our best to ensure libraries can use it as efficiently as possible.

This does sound like the better option to me too - it's less complexity overall.

@kkraus14
Collaborator

I'm actually not quite sure how to interpret this bit. Why would these guarantees be needed (if __arrow_array__ is used by both consumer and producer, it seems like this should "just work")?

Because you're not guaranteed that everything downstream of every consumer is just using high-level dataframe code / PyArrow code. Someone could have an extension written in C/C++ that assumes buffers are in CPU memory, for example.

So then we'd still need to inspect the device flag to determine whether to copy the data to the CPU, and presumably call a PyArrow-specific API to get a new PyArrow array backed by CPU memory. It adds a bunch of complexity for basically zero gain.

@jorisvandenbossche
Member

My 2c: we should keep the interchange protocol limited to a memory layout description and focus on ensuring we can make the memory interchange zero copy and then doing our best to ensure libraries can use it as efficiently as possible.

This does sound like the better option to me too - it's less complexity overall.

I opened #279 as an alternative to this issue that achieves the same goal. That proposal really is only about a memory layout, without being tied to a specific library (i.e. pyarrow in this case).

It wouldn't yet support GPUs (since the Arrow PyCapsule interface doesn't support that yet), but GPU dataframe interchange objects can simply omit those methods for now to indicate they don't support this.
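
For illustration, a consumer-side sketch of that alternative (assuming a pyarrow version recent enough to consume the Arrow PyCapsule interface via pa.table()):

    import pyarrow as pa

    def to_arrow_table(obj):
        # The contract here is the Arrow C Data Interface (a memory layout),
        # not a pyarrow object handed over by the producer.
        if hasattr(obj, '__arrow_c_stream__'):
            return pa.table(obj)
        raise TypeError("object does not expose the Arrow PyCapsule interface")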
