Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add invert option to DataArray/Dataset.stack() #8332

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

carschandler
Copy link
Contributor

@carschandler carschandler commented Oct 18, 2023

This brings in the option to stack all dimensions except for one or more dimensions listed. I find this very useful for quickly iterating over all the combinations of dimensions except for one (i.e. you have a number of input parameters that are parameterized and one time dimension, and you want to calculate some time response for all the combinations of these input parameters and store them in the time-row corresponding to the appropriate combination of inputs).

I played around with implementing to_stacked_array() for DataArray, but this made less sense in the end since that method was really designed for Datasets.

  • Addresses A way to stack all dimensions but one? #8278
  • Tests added (added one for DataArray, just now realizing I should have one for Dataset)
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@welcome
Copy link

welcome bot commented Oct 18, 2023

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @carschandler ! I left some comments. Interested what others think too.

xarray/core/dataset.py Show resolved Hide resolved
for dim in dims:
if dim is Ellipsis:
raise ValueError(
"Ellipsis syntax cannot be used when invert=True"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Ellipsis syntax cannot be used when invert=True"
"Ellipsis cannot be used when invert=True"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Comment on lines +2759 to +2764
invert : bool, default: False
When `True`, all dimensions of the DataArray except for the sequence of
dimensions specified in the `dimensions` parameter will be stacked.
`dimensions` must have a length of 1 in this case, all dimensions listed
must exist in the current DataArray. Neither the `**dimensions_kwargs` nor
the ellipsis syntaxes may be used.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I empathize with the use case here. That said:

Low-ish confidence, but I'm not a great fan of a parameter which totally changes another parameter's meaning.

I'm not sure of a great way of doing it, though. Would adding this to .groupby possibly make sense?? That's generally how we "invert" the dimensions. Or another param except_dimensions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that producing side-effects on other parameters is icky. I'm not sure how well it fits in groupby though... I'm open to an except_dimensions parameter... do you think it would need to be mutually exclusive with dimensions? Also open to something like a stack_except method, but don't know how sensitive the project is to introducing new methods.

Copy link
Collaborator

@max-sixty max-sixty Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(deleted a comment I submitted too early)

I think except_dimensions could be a reasonable option. I don't love it, since this would be the only place in the library we use it.

Edit: this doesn't really work for group_by, because the semantics are different where the indexes aren't unique, would need something like group_over as suggested later. Am leaving the original comment here regardless


.groupby could maybe work:

original from the docstring:

        >>> stacked = arr.stack(newdim=("x", "y"))
        >>> stacked
        <xarray.DataArray (z: 2, newdim: 6)>
        array([[ 0,  2,  4,  6,  8, 10],
               [ 1,  3,  5,  7,  9, 11]])
        Coordinates:
          * z        (z) <U5 'alpha' 'beta'
          * newdim   (newdim) object MultiIndex
          * x        (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b'
          * y        (newdim) int64 0 1 2 0 1 2

Using groupby:

        >>> stacked = arr.groupby('z').stack(newdim=...)
        >>> stacked
        <xarray.DataArray (z: 2, newdim: 6)>
        array([[ 0,  2,  4,  6,  8, 10],
               [ 1,  3,  5,  7,  9, 11]])
        Coordinates:
          * z        (z) <U5 'alpha' 'beta'
          * newdim   (newdim) object MultiIndex
          * x        (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b'
          * y        (newdim) int64 0 1 2 0 1 2

It does kinda make sense, since the z is the groupby key, and remains as a dimension like its general usage (i.e. try with .sum if this isn't clear).

On reflection, I quite like it. But it's quite possible I'm missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not bad... it's definitely not the first place I would look if I were trying to figure out how to do this because I typically think of using groupby for arranging groups of data with matching coordinate values, and this concept stands separately from coordinates, but maybe that's just my misconception of what groupby is. If we took it that route, I think we'd want to make sure to put a note in the DataArray/Dataset stack documentation about it.

Copy link
Collaborator

@max-sixty max-sixty Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anyone else @pydata/xarray have views on the API? I think two broad options (along with a third of "do nothing, folks can write it themselves):

  • The groupby option — potentially very nice, but also maybe too "out there"
  • The exclude_dims option — this would be the only method we have this (the others use .groupby), so it's not really orthogonal

There's another which is to make xr.ALL_DIMS into a class, and allow xr.ALL_DIMS.exclude('foo', 'bar'). But introduces new functionality for possibly a fairly small use-case.

Copy link
Member

@benbovy benbovy Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @carschandler for the PR!

Maybe I'm too conservative, but do we really need some specific API for this case? Couldn't we just do

arr.stack(newdim=[d for d in arr.dims if d != "z"])

A potential source of confusion for Dataset.stack() is the order of the stacked dimensions given as argument, which may be important in some cases especially if a multi-index is built from the stacked coordinates (it is by default). The order of the dimensions of a Dataset has no particular meaning so I think it is always preferable to pass them in a more explicit way to .stack.

Copy link
Contributor Author

@carschandler carschandler Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course you could! I think you can always make this argument though (couldn't you just use np.zeros + 1 instead of np.ones, np.zeros(a.shape) instead of np.zeros_like, and couldn't you just use a MultiIndex in a DataFrame or some numpy arrays in a dict to store multidimensional data?), and I’m in favor of adding conveniences when they don't obscure the rest of the code. If everyone agrees that this does indeed obscure the project, then I will gladly rest my case, but I think that the true value of a project like xarray is in how the data can be quickly and simply indexed, so I figured this would be a nice convenience since all the other ways of achieving it are kind of ugly.

I do agree that the lack of order on a Dataset does pose a problem for setting the MultiIndex order... that being said, I think that there are still use cases where the order may not matter. Thanks for your feedback.

MultiIndex([('a', 0),
('a', 1),
('a', 2),
('b', 0),
('b', 1),
('b', 2)],
name='z')
name='newdim')
>>> inverted_stack = arr.stack({"newdim": "z"}, invert=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a clearer example would be exactly the same as https://github.com/pydata/xarray/pull/8332/files#diff-386f35dda14ca12d315f2eb71f0ab35d7fdc73419b5dc5175ea5720f3206903bR2793, but with the addition of invert=True (or whatever construction we end up using).

(If copying the new output is annoying, we can use pytest-accept...)

Comment on lines +2821 to +2825
>>> np.all(inverted_stack.indexes["newdim"] == stacked.indexes["newdim"])
True
>>> np.all(inverted_stack == stacked)
<xarray.DataArray ()>
array(True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are great tests, but possibly less good examples...

@dcherian
Copy link
Contributor

Yeah perhaps group_over is the way: #324

@max-sixty
Copy link
Collaborator

max-sixty commented Oct 20, 2023

Yeah perhaps group_over is the way: #324

I added a note to my comment above caveating my .groupby suggestion. .groupby has different semantics from "excluding these dimensions" when the indexes aren't unique, and so is not so great a suggestion. .group_over could work, though would be a bigger effort.

Currently ranked vote:

  • Offer xr.ALL_DIMS.exclude("foo") edit: since it needs to be an instance, would need to be xr.ALL_DIMS(exclude='foo') — it wouldn't be too difficult, since the code could be added to the infix_dims func and it would work everywhere. I don't love it, but it doesn't intrude on existing APIs
  • Do nothing — a shame, but a list comp is indeed not too bad. Could wait for .group_over
  • exclude_dims arg to .stack
  • invert

@carschandler
Copy link
Contributor Author

Yeah perhaps group_over is the way: #324

This is a good reference. Thanks!

@benbovy
Copy link
Member

benbovy commented Oct 21, 2023

Sorry @carschandler if you feel that my #8332 (comment) was dismissive, it wasn't intended to be so. I definitely don't want to discourage anyone making suggestions / contributions to improve Xarray, and I don't think that any suggestion would obscure the project!

I just wanted to say that in this specific case the addition of an argument like invert or exclude_dims, while it would certainly make some cases a bit more convenient and succinct, is problematic in several ways. Not only regarding the order of the dimensions of a Dataset, but also on how to handle the case where multiple stacked dimensions are given as kwargs to stack. I'm sure we could come up with some basic rules to easily handle those cases (e.g., do not allow multiple stack dim kwargs) but is it worth the trouble? About the ugliness of all other ways of achieving it, I respectfully disagree. While a list comp is slightly more verbose, it is easy to understand. At a first glance the .group_over or xr.ALL_DIMS-based alternatives look also good IMHO.

@carschandler
Copy link
Contributor Author

@benbovy oh no not at all! Hope you didn't think I had taken offense after reading my response. I did genuinely appreciate your feedback and fully understand your sentiment. I understand your concerns about the invert and exclude_dims options, and your point about the list comprehension is valid as well. I agree that pursuing one of the other alternatives mentioned is probably better. At the end of the day, I just think it would be nice to be able to quickly iterate over rows across one (or more) dimension(s); whether it's in stack or some other method is less important.

@max-sixty
Copy link
Collaborator

max-sixty commented Oct 23, 2023

Thanks @carschandler ! Very much appreciate your attitude and approach here!

What do folks think about xr.ALL_DIMS(exclude=['foo', 'bar'])?

To do this, we would make xr.ALL_DIMS into a small class and store exclusions. And then in infix_dims, we'd check for the class and materialize them based on the current dims.

I don't love it — it's not that much cleaner than [d for d in ds.dims if d not in ['foo', 'bar']]. But it's also not intrusive, doesn't require widespread changes or maintenance — I don't see much downside (maybe another item on the Dims type definition; though maybe not since it's Hashable...)

@carschandler
Copy link
Contributor Author

carschandler commented Oct 23, 2023

@max-sixty I agree with all your points. Not as concise as it could be in group_over, but I do think that the syntax is still marginally cleaner than the list comprehension, and I think its meaning is more quickly conveyed to the reader. It also provides a canonical way to get all dims except one, and I always appreciate when a project gives me a way to do something; gives me more confidence in my code. On the other hand, it's nice to have pure equality between ... and ALL_DIMS, but I don't think that that's a good enough reason to reject it.

If it is much work, then I would say it's not worth worrying about it and that list comprehension (or even da.to_dataset().to_stacked_array()) would work for now until group_over is ready. If it's a simple change, I think it's a nice convenience.

@carschandler
Copy link
Contributor Author

One other thought is that when doing chained operations, list comprehensions can become cumbersome if what you want to stack is actually some indexed version of an array. Of course you could assign to another array intermediately, but once again this is clunkier and may not always create a view depending on your indexing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants