Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

partial adoption of Array API for Xarray's NamedArray #698

Closed
andersy005 opened this issue Oct 31, 2023 · 14 comments
Closed

partial adoption of Array API for Xarray's NamedArray #698

andersy005 opened this issue Oct 31, 2023 · 14 comments

Comments

@andersy005
Copy link

andersy005 commented Oct 31, 2023

we've been having a discussion over in pydata/xarray#8333 about adopting Array API and the possibility of offering an array-API compatible interface via namedarray, a new Xarray data structure that aims to be a lighter version of xarray's Variable: pydata/xarray#8080

although there's general agreement that xarray.NamedArray should not offer an array-API compatible interface --mainly due to Xarray's unique semantics like broadcasting by dimension name rather than dimension order (e.g. xp.mean(xarray.NamedArray, axis=1) vs xarray.NamedArray.mean(dim="time")--we're keen on adopting element-wise functions and how they would interact with xarray.

the most important point at issue is, we find that these element-wise functions like xp.abs(xarray.NamedArray) --> xarray.NamedArray are critical to our use-case. however, we're not looking to adopt the entire Array API standard, but are there any guidelines or recommendations for adopting a subset of it? we thought it's worth raising this issue, to see if folks here think this is a good idea or not.

@asmeurer
Copy link
Member

asmeurer commented Oct 31, 2023

I think the main thing to consider is how usable this would be for consumers of the array API. If you don't accept numbered axes, then any library written against the array API that has code using numbered axes won't actually work with xarray.

@vnmabus
Copy link

vnmabus commented Nov 1, 2023

I planned to use NamedArray just to keep track of dimension names across operations (so I do not have to implement that logic myself), but I needed it to be compatible with the array API standard. It is ok for me if you compare dimension names for some operations (such as attempting to sum different things) and raise an exception in that case. It ideally should also support operations between NamedArrays and the underlying wrapped (unnamed) array type. Would that not be possible?

@andersy005
Copy link
Author

Ccing @shoyer / @Illviljan / @TomNicholas / @dcherian for visibility

@TomNicholas
Copy link

needed it to be compatible with the array API standard.

It is ok for me if you compare dimension names for some operations (such as attempting to sum different things)

@vnmabus I may have misunderstood you but my understanding is that these are two conflicting requirements. Compatibility with the full API standard implies broadcasting by axis number, including in binary operations like sum. But if we compare dimension names to use those for broadcasting during a sum operation then that is a different behaviour, which is inconsistent with broadcasting by axis number.

To see what I mean consider summing two arrays, which are identical up to a transposition.

In [1]: import xarray as xr

In [2]: import numpy as np

In [3]: arr = np.array([[1, 2, 3], [4, 5, 6]])

In [4]: arr.shape
Out[4]: (2, 3)

In [5]: arr.T
Out[5]: 
array([[1, 4],
       [2, 5],
       [3, 6]])

In [6]: arr.T.shape
Out[6]: (3, 2)

In [7]: arr + arr.T
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 arr + arr.T

ValueError: operands could not be broadcast together with shapes (2,3) (3,2) 

In [8]: var = xr.Variable(data=arr, dims=['x', 'y'])

In [9]: var.shape
Out[9]: (2, 3)

In [10]: var.T
Out[10]: 
<xarray.Variable (y: 3, x: 2)>
array([[1, 4],
       [2, 5],
       [3, 6]])

In [11]: var.T.shape
Out[11]: (3, 2)

In [12]: var + var.T
Out[12]: 
<xarray.Variable (x: 2, y: 3)>
array([[ 2,  4,  6],
       [ 8, 10, 12]])

You can see that in numpy (broadcasting by axis number) then the array shapes (2,3), (3,2) are not broadcastable, but in xarray they are. (You could argue about the definition of the transpose here but I think then you would just end up with xarray breaking a different part of the array API standard.)

@TomNicholas
Copy link

TomNicholas commented Nov 2, 2023

Now perhaps we can re-explain @andersy005 's question above: operations on scalars are the only part of the array API where axis number & broadcasting rules are irrelevant, so it's the only part of the array API standard that xarray could obey. The question is should xarray obey only a part of the full standard, or is it better to either obey it all or none of it?

@TomNicholas
Copy link

It ideally should also support operations between NamedArrays and the underlying wrapped (unnamed) array type.

Finally you could imagine allowing this, but it would create a weird situation where xarray-only code was robust to transposition, but any operations between xarray and numpy objects were not invariant to transposition.

@vnmabus
Copy link

vnmabus commented Nov 3, 2023

Well, I was not aware of that behavior of xarray, and for me it seems dangerous, and not very consistent. What if I wanted to divide, for example, an array with a "distance" dimension and one with a "time" dimension. Would xarray do it or it would refuse as the dimensions do not match? What if the arrays have additional dimensions with similar names? For me that is not a desirable behavior.

What I like is simply an array type (supporting the full array API standard) that wraps other array types and adds named dimensions, that work across operations. It could be ok if they check that the dim name is the same for some operations, like adding two arrays, although even in that case it seems dubious to me (what if I wanted to sum dims "initialization time" and "computation time"? Honestly, I think that check should be done at the unit level and not at the name level).

@shoyer
Copy link
Contributor

shoyer commented Nov 3, 2023

Well, I was not aware of that behavior of xarray, and for me it seems dangerous, and not very consistent. What if I wanted to divide, for example, an array with a "distance" dimension and one with a "time" dimension. Would xarray do it or it would refuse as the dimensions do not match? What if the arrays have additional dimensions with similar names? For me that is not a desirable behavior.

If you divide an array with a "distance" dimension by an array with a "time" dimension in Xarray, you get an array with two dimensions: "distance" and "time".

Some other operations might will raise an error with different dimension names, but in no cases will dimensions names be ignored.

What I like is simply an array type (supporting the full array API standard) that wraps other array types and adds named dimensions, that work across operations. It could be ok if they check that the dim name is the same for some operations, like adding two arrays, although even in that case it seems dubious to me (what if I wanted to sum dims "initialization time" and "computation time"? Honestly, I think that check should be done at the unit level and not at the name level).

I agree, this is certainly a reasonable data structure to build, but it isn't Xarray. This is basically the model used by PyTorch's Named Tensor.

@asmeurer
Copy link
Member

asmeurer commented Nov 3, 2023

The question is should xarray obey only a part of the full standard, or is it better to either obey it all or none of it?

I guess it's worth considering it just in the sense that the function names are standardized so it's useful to use them for the sake of consistency across libraries (this even for functions that otherwise have an axis parameter).

It's likely that most nontrivial code written against the array API won't work with named arrays but perhaps there is some code that will. Perhaps it would be instructive to take a look at scipy and scikit-learn's array API code and see whether it would otherwise work with xarray.

@rgommers
Copy link
Member

rgommers commented Nov 7, 2023

The question is should xarray obey only a part of the full standard, or is it better to either obey it all or none of it?

I could interpret this question in two ways:

  1. Is it useful to implement a subset of the functionality with the same syntax and semantics as the array API standard has for that functionality?
  2. Should we add a __array_namespace__ method to xarray.NamedArray to declare compatibility even though we're only partially compatible?

I think the answer to (1) is yes, why not - a lot of effort went into making every function in the standard have a clean signature and well-defined semantics, and comparing with many array libraries to make sure it matches them as well as possible (unless there's a good reason not to).

The answer to (2) should be no I believe. That will be incorrect, and for libraries that try to write array-library-agnostic code it may lead to silently incorrect results.

@dcherian
Copy link

dcherian commented Nov 8, 2023

That will be incorrect, and for libraries that try to write array-library-agnostic code it may lead to silently incorrect results.

We could raise for every op that is not an elementwise function but that seems against the spirit of the standard. Would you agree?

@rgommers
Copy link
Member

rgommers commented Nov 8, 2023

Probably against the letter too, and not just the spirit - since there's many places where the wording is "must behave like X".

@dcherian
Copy link

dcherian commented Nov 8, 2023

Point taken! 😂

Thank you

@kgryte
Copy link
Contributor

kgryte commented Mar 21, 2024

As I don't think there is anything more to discuss here as pertains to revising the array API standard, I'll go ahead and close this.

@kgryte kgryte closed this as completed Mar 21, 2024
@kgryte kgryte removed the use case label Apr 4, 2024
@data-apis data-apis locked and limited conversation to collaborators Apr 4, 2024
@kgryte kgryte converted this issue into discussion #775 Apr 4, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants