Implement the Array API #2249

alippai · 2022-01-01T22:12:57Z

Implementing the Array API (https://data-apis.org/array-api/latest/purpose_and_scope.html) would improve the long-term interoperability of data science libraries.

Conformance can be tested with: https://github.com/data-apis/array-api-tests

I know Polars is a much higher-level library, but I believe conforming to this protocol while using Polars components could make sense.

ritchie46 · 2022-01-02T19:40:12Z

This indeed sounds very interesting. I saw that the API is in its own namespace so it would not interfere with our Series API.

It would be pretty neat if consumers like scikit-learn could work with the array API. This prevents a copy to numpy.
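To make the namespace point concrete, here is a minimal sketch of how a consumer could discover an array's functions through the standard's `__array_namespace__` method instead of importing a specific library. `TinyArray`, `tiny_namespace`, and `mean_centered` are hypothetical names invented for illustration; none of them is part of Polars or scikit-learn. A conforming namespace would need far more functions, and `mean` here returns a plain float rather than a zero-dimensional array, which is a simplification.

```python
from types import SimpleNamespace


class TinyArray:
    """Hypothetical 1D array with just enough surface for this sketch."""

    def __init__(self, values):
        self._values = [float(v) for v in values]

    def __array_namespace__(self, /, *, api_version=None):
        # The Array API standard uses this method to hand consumers the
        # namespace that holds the library's array functions.
        return tiny_namespace

    def __sub__(self, other):
        if isinstance(other, TinyArray):
            other_values = other._values
        else:
            # Broadcast a scalar across every element.
            other_values = [other] * len(self._values)
        return TinyArray(a - b for a, b in zip(self._values, other_values))

    def tolist(self):
        return list(self._values)


def _mean(x):
    # Simplification: returns a Python float, not a 0-d array.
    return sum(x._values) / len(x._values)


# Separate namespace, so it cannot clash with methods on the array class.
tiny_namespace = SimpleNamespace(asarray=TinyArray, mean=_mean)


def mean_centered(x):
    """Library-agnostic consumer: it never imports a concrete array library."""
    xp = x.__array_namespace__()
    return x - xp.mean(x)


centered = mean_centered(TinyArray([1, 2, 3]))
print(centered.tolist())  # [-1.0, 0.0, 1.0]
```

The consumer only touches `x.__array_namespace__()` and operators on `x`, which is what would let a library accept different array types without converting them first.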

jorgecarleitao · 2022-01-02T19:43:04Z

I agree. Awesome initiative. The API seems directed towards tensors, but offering the 1D experience is still quite powerful.

alippai · 2022-01-02T20:23:33Z

A little bit off-topic, but I have always wondered:
My understanding is that a 1D series is pretty straightforward with Arrow and Polars, and that 2D arrays, while clunky, still work well (as a vector of Series).

Does Arrow support or prevent efficient tensor representation? Is it materially different, so that we don't support things that numpy / ndarray / tensorflow handle well? Could we run e.g. BLAS / LAPACK over data in Arrow (directly, efficiently)?

jorgecarleitao · 2022-01-02T20:28:13Z

Arrow's IPC specification includes both tensors and sparse tensors.

The reason I have not added them to arrow2 is that there are no integration tests for them atm, so it is pretty much a wild west. This is an area I wish we could improve in the future, so that we can e.g. have tensors in polars.

ritchie46 · 2022-01-02T20:29:24Z

There is the FixedSizeList type. Otherwise you could build a matrix or tensor type around a 1D array. I think all serious tensors are backed by contiguous memory and get their dimensions from their indexing magic.
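A minimal sketch of that idea, assuming row-major layout: one flat buffer plus a shape and per-axis strides is enough to index in n dimensions. `StridedView` is a hypothetical class written for illustration, not a Polars or Arrow type, and its strides count elements rather than bytes.

```python
from array import array


class StridedView:
    """Hypothetical n-dimensional view over one flat, contiguous buffer."""

    def __init__(self, buffer, shape, strides=None):
        self.buffer = buffer
        self.shape = tuple(shape)
        if strides is None:
            # Row-major (C order): the last axis advances one element at a time.
            strides, step = [], 1
            for dim in reversed(self.shape):
                strides.append(step)
                step *= dim
            strides.reverse()
        self.strides = tuple(strides)

    def __getitem__(self, index):
        if len(index) != len(self.shape):
            raise IndexError("need one index per dimension")
        for i, dim in zip(index, self.shape):
            if not 0 <= i < dim:
                raise IndexError("index out of bounds")
        # The "indexing magic": a dot product of index and strides.
        offset = sum(i * s for i, s in zip(index, self.strides))
        return self.buffer[offset]

    def transpose(self):
        # No data is copied; only shape and strides are reversed.
        return StridedView(self.buffer, self.shape[::-1], self.strides[::-1])


data = array("d", range(6))    # 0.0 .. 5.0 in one contiguous block
m = StridedView(data, (2, 3))  # rows: [0, 1, 2] and [3, 4, 5]
t = m.transpose()              # shape (3, 2), same buffer
print(m[1, 2], t[2, 1])        # 5.0 5.0
```

Because the transpose only swaps metadata, both views read the same memory, which is why this representation is attractive for tensor-like types.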

alippai · 2022-01-02T20:46:48Z

So a numpy array is a Series of uniform 1d tensors or a FixedSizeList of integer/float vectors? Interesting, thanks a lot for the details.

ritchie46 · 2022-01-02T21:00:36Z

Yes, matrices are typically backed by 1D memory because a Vec<Vec<_>> would incur a cache miss at every row/column traversal (and more in higher dimensions).

I assume arrow lists are backed by linear memory for the same reason.
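For reference, Arrow's list layouts do keep the values in one linear child array. The sketch below imitates the two layouts with plain Python lists: a variable-size list uses an offsets buffer, while a fixed-size list needs only the list size. The function names are hypothetical, and validity (null) bitmaps are left out to keep it short.

```python
def list_slot(offsets, values, i):
    """Variable-size list layout: slot i spans values[offsets[i]:offsets[i + 1]]."""
    return values[offsets[i]:offsets[i + 1]]


def fixed_size_list_slot(values, list_size, i):
    """Fixed-size list layout: no offsets; slot i starts at i * list_size."""
    start = i * list_size
    return values[start:start + list_size]


# [[1, 2], [], [3, 4, 5]] stored as one flat child array plus offsets.
values = [1, 2, 3, 4, 5]
offsets = [0, 2, 2, 5]
nested = [list_slot(offsets, values, i) for i in range(len(offsets) - 1)]
print(nested)  # [[1, 2], [], [3, 4, 5]]

# A 2x3 matrix stored as a fixed-size list with list_size = 3.
matrix_values = [10, 11, 12, 20, 21, 22]
print(fixed_size_list_slot(matrix_values, 3, 1))  # [20, 21, 22]
```

The fixed-size case is the one that maps most directly onto a matrix, since every row has the same length and its position follows from arithmetic alone.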

dhirschfeld · 2022-03-23T05:57:00Z

There is also the DataFrame API which would seem a better fit for polars:

https://github.com/data-apis/dataframe-api

It would be pretty neat if consumers like scikit-learn could work with the array API.

+:100:

cnpryer · 2022-06-05T02:11:35Z

There is also the DataFrame API which would seem a better fit for polars:

https://github.com/data-apis/dataframe-api

It would be pretty neat if consumers like scikit-learn could work with the array API.

+:100:

I'd think we'd want to conform with both, no?

dhirschfeld · 2022-06-05T10:42:58Z

I'd think we'd want to conform with both, no?

I think they're two separate things. You're either trying to provide a 2D DataFrame api or an nD Tensor api. It may be that the DataFrame api is implemented as a collection of 1D arrays conforming to the api, but I'd imagine that the DataFrame standard would specify that.

As an outside, occasional user, it seems to me that polars is trying to implement a 2D DataFrame api so would best conform to the DataFrame standard. I'm not an expert in polars though!

cnpryer · 2022-06-05T14:25:24Z

I imagine the Array API is targeted at projects like NumPy, so I'm not sure whether Series fits here. If it doesn't, the next question is where the line gets drawn with respect to the upstream structures used.

But I'd assume both DataFrames and Series will consume arrays conforming to the API.

Found this comment.

kylebarron · 2022-09-21T15:17:32Z

It may be of interest that Pandas now implements the DataFrame interchange protocol (from the same data-apis consortium as the Array API) as of yesterday's 1.5.0 release: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.5.0.html#dataframe-interchange-protocol-implementation

ritchie46 · 2022-09-21T15:20:52Z

Yeap, I want that too. Any help on this would be very much appreciated.

jjerphan · 2022-10-21T19:51:39Z

It would be pretty neat if consumers like scikit-learn could work with the array API. This prevents a copy to numpy.

This is WIP! See scikit-learn/scikit-learn#22352 for a general overview and scikit-learn/scikit-learn#22554 for its first experimental support.

As @dhirschfeld pointed out, the DataFrame API might make more sense for polars.

zundertj added the "python" label on Jan 2, 2022

ritchie46 added the "help wanted" label on Jan 2, 2022

ritchie46 added the "good first issue" label on Mar 20, 2022

cnpryer mentioned this issue on Jun 17, 2022: from_pandas converts object data type to f64 if all values null #3725 (closed)

jjerphan mentioned this issue on Nov 28, 2022: Remove all compilation warnings? #5659 (closed)

stinodego mentioned this issue on Nov 29, 2022: feat(python): DataFrame interchange protocol implementation #5662 (closed)

stinodego removed the "good first issue" label on Nov 29, 2022

stinodego mentioned this issue on Jan 30, 2023: feat(python): Implement DataFrame Interchange Protocol through pyarrow #6581 (merged)

ritchie46 closed this as completed in #6581 on Jan 31, 2023
