Implement the Array API #2249

Closed
alippai opened this issue Jan 1, 2022 · 14 comments · Fixed by #6581
Labels
help wanted Extra attention is needed python Related to Python Polars

Comments

@alippai

alippai commented Jan 1, 2022

Implementing the Array API (https://data-apis.org/array-api/latest/purpose_and_scope.html) would improve the long-term interoperability of data science libraries.

The conformance can be tested using: https://github.com/data-apis/array-api-tests

I know Polars is a much higher-level library, but I believe conforming to this protocol while using Polars components could make sense.

@zundertj zundertj added the python Related to Python Polars label Jan 2, 2022
@ritchie46
Member

This indeed sounds very interesting. I saw that the API is in its own namespace so it would not interfere with our Series API.

It would be pretty neat if consumers like scikit-learn could work with the array API. That would avoid a copy to NumPy.

@ritchie46 ritchie46 added the help wanted Extra attention is needed label Jan 2, 2022
@jorgecarleitao
Collaborator

I agree. Awesome initiative. The API seems directed towards tensors, but offering the 1D experience is still quite powerful.

@alippai
Author

alippai commented Jan 2, 2022

A little bit off-topic, but I was always wondering:
My understanding is that a 1D series is pretty straightforward with Arrow and Polars and, while it's clunky, 2D arrays still work well (as a vector of Series).

Does Arrow support or prevent efficient tensor representation? Is it materially different, so that we don't support things that numpy / ndarray / tensorflow handle well? Could we run e.g. BLAS / LAPACK over data in Arrow (directly, efficiently)?

@jorgecarleitao
Collaborator

Arrow's IPC specification includes both tensors and sparse tensors.

The reason I have not added them to arrow2 is that there are no integration tests for them at the moment, so it is pretty much the wild west. This is an area I wish we could improve in the future so that we can, e.g., have tensors in Polars.

@ritchie46
Member

There is the FixedSizeList type. Otherwise you could build a matrix or tensor type around a 1D array. I think all serious tensors are backed by contiguous memory and get their dimensions from their indexing magic.
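As a rough illustration of that "indexing magic", the sketch below wraps one flat buffer and maps an N-dimensional index to a flat offset using row-major strides. `StridedView` is a hypothetical name for this example, not a type from Arrow or Polars, and strides are counted in elements rather than bytes.

```python
class StridedView:
    """Hypothetical N-dimensional view over one flat, contiguous buffer."""

    def __init__(self, buffer, shape):
        self.buffer = buffer
        self.shape = tuple(shape)
        # Row-major (C-order) strides: the last axis moves one element
        # at a time, each earlier axis skips a whole block of the later ones.
        strides = []
        step = 1
        for dim in reversed(self.shape):
            strides.append(step)
            step *= dim
        self.strides = tuple(reversed(strides))
        if step != len(buffer):
            raise ValueError("shape does not match buffer length")

    def __getitem__(self, index):
        if len(index) != len(self.shape):
            raise IndexError("expected one index per dimension")
        for i, dim in zip(index, self.shape):
            if not 0 <= i < dim:
                raise IndexError("index out of bounds")
        # The only 'tensor' logic: a dot product of index and strides.
        offset = sum(i * s for i, s in zip(index, self.strides))
        return self.buffer[offset]


flat = list(range(24))                     # one contiguous 1D buffer
tensor = StridedView(flat, shape=(2, 3, 4))
print(tensor.strides)
print(tensor[1, 2, 3])
```

The data never leaves the 1D buffer; only the shape and strides give it dimensions.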

@alippai
Author

alippai commented Jan 2, 2022

So a NumPy array is a Series of uniform 1D tensors or a FixedSizeList of integer/float vectors? Interesting, thanks a lot for the details.

@ritchie46
Member

Yes, matrices are typically backed by 1D memory because a Vec<Vec<_>> would have a cache miss at every row/column traversal (and more in higher dimensions).

I assume Arrow lists are backed by linear memory for the same reason.
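For reference, the Arrow columnar format does store list arrays over one flat child buffer: variable-size lists add an offsets buffer, and fixed-size lists need only the element width. The sketch below imitates those two layouts with plain Python lists. The helper names are hypothetical, and validity bitmaps (nulls) are ignored.

```python
def list_element(values, offsets, i):
    """Variable-size list layout: element i spans offsets[i]..offsets[i + 1]."""
    return values[offsets[i]:offsets[i + 1]]


def fixed_size_list_element(values, size, i):
    """Fixed-size list layout: no offsets, every element holds `size` items."""
    return values[i * size:(i + 1) * size]


# One flat child buffer shared by both examples.
values = [1, 2, 3, 4, 5, 6]

# [[1, 2], [3], [4, 5, 6]] as a variable-size list: n + 1 offsets for n lists.
offsets = [0, 2, 3, 6]
rows = [list_element(values, offsets, i) for i in range(len(offsets) - 1)]

# A 2x3 matrix as a fixed-size list of width 3 over the same buffer.
matrix_rows = [fixed_size_list_element(values, 3, i) for i in range(2)]

print(rows)
print(matrix_rows)
```

In both cases a row lookup is a slice of contiguous memory, not a pointer chase into a separately allocated inner vector.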

@ritchie46 ritchie46 added the good first issue Good for newcomers label Mar 20, 2022
@dhirschfeld

There is also the DataFrame API which would seem a better fit for polars:

It would be pretty neat if consumers like scikit-learn could work with the array API.

+💯

@cnpryer
Contributor

cnpryer commented Jun 5, 2022

There is also the DataFrame API which would seem a better fit for polars:

It would be pretty neat if consumers like scikit-learn could work with the array API.

+💯

I'd think we'd want to conform with both, no?

@dhirschfeld

I'd think we'd want to conform with both, no?

I think they're two separate things. You're either trying to provide a 2D DataFrame API or an nD Tensor API. It may be that the DataFrame API is implemented as a collection of 1D arrays conforming to the Array API, but I'd imagine that the DataFrame standard would specify that.

As an outsider and occasional user, it seems to me that polars is trying to implement a 2D DataFrame API, so it would best conform to the DataFrame standard. I'm not an expert in polars though!

@cnpryer
Contributor

cnpryer commented Jun 5, 2022

I imagine the Array API targets projects like NumPy, so I'm not sure Series fits here. If it doesn't, the next question is where that line gets drawn with respect to the upstream structures being used.

But I'd assume both DataFrames and Series will consume arrays conforming to the API.

Found this comment.

@kylebarron
Contributor

It may be of interest that pandas now implements the DataFrame interchange protocol, which comes from the same Data APIs consortium as the Array API specification, as of yesterday's 1.5.0 release: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.5.0.html#dataframe-interchange-protocol-implementation

@ritchie46
Member

Yep, I want that too. Any help on this would be very much appreciated.

@jjerphan
Contributor

It would be pretty neat if consumers like scikit-learn could work with the array API. This prevents a copy to numpy.

This is WIP! See scikit-learn/scikit-learn#22352 for a general overview and scikit-learn/scikit-learn#22554 for its first experimental support.

As @dhirschfeld pointed out, the DataFrame API might make more sense for polars.

9 participants