Implement the Array API #2249
This indeed sounds very interesting. I saw that the API is in its own namespace, so it would not interfere with our Series API. It would be pretty neat if consumers like scikit-learn could work with the array API. This prevents a copy to numpy.
I agree. Awesome initiative. The API seems directed towards tensors, but offering the 1D experience is still quite powerful.
A little bit off-topic, but I was always wondering: does Arrow support or prevent efficient tensor representation? Is it materially different, i.e. are there things that numpy / ndarray / tensorflow handle well that we don't support? Could we run e.g. BLAS / LAPACK over data in Arrow (directly, efficiently)?
Arrow's IPC specification includes both tensors and sparse tensors. The reason I have not added them to arrow2 is that there are no integration tests for them at the moment, so it is pretty much a wild west. This is an area I wish we could improve in the future, so that we can e.g. have tensors in polars.
There is the FixedSizeList type. And otherwise you could build a matrix or tensor type around a 1D array. I think all serious tensor implementations are backed by contiguous memory and get their dimensions through indexing magic.
So a numpy array is a Series of uniform 1d tensors, or a FixedSizeList of integer/float vectors? Interesting, thanks a lot for the details.
Yes, matrices are typically backed by 1D memory, and I assume Arrow lists are backed by linear memory for the same reason.
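The "indexing magic" mentioned above can be sketched in a few lines. This is a hypothetical illustration (the `FlatTensor` class is not a real polars or arrow2 type): an n-D view over a flat, contiguous 1-D buffer, using row-major strides the way NumPy does internally.

```python
# Illustrative sketch only: an n-D view over flat (contiguous) 1-D storage,
# the same trick NumPy arrays and Arrow's FixedSizeList rely on.
# FlatTensor is a hypothetical name, not a real polars/arrow2 API.

class FlatTensor:
    def __init__(self, data, shape):
        self.data = list(data)          # contiguous 1-D buffer
        self.shape = tuple(shape)
        # Row-major (C-order) strides: the last axis varies fastest.
        strides = []
        step = 1
        for dim in reversed(self.shape):
            strides.append(step)
            step *= dim
        self.strides = tuple(reversed(strides))

    def __getitem__(self, idx):
        # Map an n-D index to a flat offset: sum(i_k * stride_k).
        flat = sum(i * s for i, s in zip(idx, self.strides))
        return self.data[flat]

t = FlatTensor(range(6), shape=(2, 3))
t[(1, 2)]  # → 5, the last element of a 2x3 row-major matrix
```

Because only `shape` and `strides` change, reshaping or transposing such a tensor never has to touch the underlying buffer.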
There is also the DataFrame API which would seem a better fit for
+:100:
I'd think we'd want to conform with both, no?
I think they're two separate things. You're either trying to provide a 2D DataFrame API or an nD Tensor API. It may be that the DataFrame API is implemented as a collection of 1D arrays conforming to the Array API, but I'd imagine that the DataFrame standard would specify that. As an occasional outside user, it seems to me that polars is trying to implement a 2D DataFrame API, so it would best conform to the DataFrame standard. I'm not an expert in
I imagine projects like NumPy are the targets of the Array API. So I'm not sure if Series fits here, and if it doesn't, the next question is where that line gets drawn with the upstream structures used. But I'd assume both DataFrames and Series will consume arrays conforming to the API. Found this comment.
It may be of interest that Pandas now implements the DataFrame interchange protocol as of yesterday's 1.5.0 release: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.5.0.html#dataframe-interchange-protocol-implementation
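For context, the DataFrame interchange protocol works by having a producer expose a `__dataframe__()` entry point that returns a small inspection object. The sketch below is a hedged toy illustration of that shape; `ToyFrame` and `ToyInterchange` are invented names, not the pandas implementation, and the real protocol defines more methods (column access, buffers, dtypes) than shown here.

```python
# Toy illustration of the DataFrame interchange protocol's shape.
# ToyFrame / ToyInterchange are hypothetical stand-ins, not pandas code.

class ToyInterchange:
    def __init__(self, columns):
        self._columns = columns        # dict: column name -> list of values

    def num_columns(self):
        return len(self._columns)

    def num_rows(self):
        return len(next(iter(self._columns.values()), []))

    def column_names(self):
        return list(self._columns)

class ToyFrame:
    def __init__(self, columns):
        self._columns = columns

    def __dataframe__(self, *, allow_copy=True):
        # The entry point consumers call; pandas >= 1.5 provides a real one.
        return ToyInterchange(self._columns)

df = ToyFrame({"a": [1, 2], "b": [3, 4]})
df.__dataframe__().column_names()  # → ["a", "b"]
```

A consumer that only needs column metadata can then accept any DataFrame library implementing this entry point, without importing that library directly.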
Yep, I want that too. Any help on this would be very much appreciated.
This is WIP! See scikit-learn/scikit-learn#22352 for a general overview and scikit-learn/scikit-learn#22554 for its first experimental support. As @dhirschfeld pointed out, the DataFrame API might make more sense for
Implementing the Array API (https://data-apis.org/array-api/latest/purpose_and_scope.html) would improve the long-term interoperability of data science libraries.
Conformance can be tested using https://github.com/data-apis/array-api-tests.
I know Polars is a much higher-level library, but I believe conforming to this protocol while using Polars components could make sense.
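The core dispatch pattern of the Array API can be sketched in miniature: instead of hard-coding `import numpy`, a consumer library asks the array object for its namespace via `__array_namespace__()`, which is the standard's entry point. The `MiniArray` class and `mini_namespace` object below are illustrative stand-ins under stated assumptions, not part of polars or the specification.

```python
# Hedged sketch of Array API dispatch (https://data-apis.org/array-api/latest/).
# MiniArray / mini_namespace are hypothetical stand-ins for a conforming
# array type and its implementing module.
import types

def _mean(x):
    return sum(x.values) / len(x.values)

# A stand-in for the module that implements the standard's functions.
mini_namespace = types.SimpleNamespace(mean=_mean)

class MiniArray:
    def __init__(self, values):
        self.values = list(values)

    def __array_namespace__(self, *, api_version=None):
        # The standard's entry point: return the namespace (module)
        # implementing the Array API for this array type.
        return mini_namespace

def column_mean(x):
    # Library code (e.g. scikit-learn) stays backend-agnostic:
    xp = x.__array_namespace__()
    return xp.mean(x)

column_mean(MiniArray([1.0, 2.0, 3.0]))  # → 2.0
```

This is what lets a consumer run unchanged on any conforming backend: `column_mean` never names a concrete array library, so no copy to numpy is needed.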