Extend xarray with custom "coordinate wrappers" #1961

Closed
benbovy opened this issue Mar 4, 2018 · 10 comments

Comments

@benbovy
Member

benbovy commented Mar 4, 2018

Recent and ongoing developments in xarray turn DataArray and Dataset more and more into data wrappers that are extensible at (almost) every level:

Regarding the latter, I’m thinking about the idea of extending xarray at an even more abstract level, i.e., the possibility of adding / registering "coordinate wrappers" to DataArray or Dataset objects. Basically, it would correspond to adding any object that can perform some operation based on one or several coordinates (I haven’t found any better name than "coordinate agent" to describe that).

EDIT: "coordinate agents" may not be quite right here, so I changed that to "coordinate wrappers".

Indexes are a specific case of coordinate wrappers that serve the purpose of indexing. This is built into xarray.

While indexing is enough in 80% of cases, I see a couple of use cases where other coordinate wrappers (built outside of xarray) would be nice to have:

  • Grids. For example, xgcm implements operations (interp, diff) on physical axes that may each include several coordinates, depending on the position of the coordinate labels on the axis (center, left…). Other grids define their topology using a greater number of coordinates (e.g., ugrid). Storing regridding weights might be another use case?
  • Clocks. For example, xarray-simlab uses one or several coordinates to define the timeline of a computational simulation.

In those examples we usually rely on coordinate attributes and/or classes that encapsulate xarray objects to implement the specific features that we need. While it works, it has limitations and I think it can be improved.

Custom coordinate wrappers would be a way of extending xarray that is very consistent with other current (or considered) extension mechanisms.

This is still a very vague idea and I’m sure that there are lots of details that can be discussed (serialization, etc.).

But before going further, I’d like to know your thoughts @pydata/xarray. Do you think it is a silly idea? Do you have in mind other use cases where custom coordinate wrappers would be useful?

@benbovy
Member Author

benbovy commented Mar 4, 2018

As an example, in xgcm we would have something like

>>> ds = ds_original.xgcm.generate(...)
>>> ds.xgcm.interp('var', axis='X')

instead of

>>> ds = xgcm.generate_grid_ds(ds_original, ...)
>>> grid = xgcm.Grid(ds)
>>> grid.interp(ds['var'], axis='X')

The advantage of the first example is that the information on the grid’s physical axes is bound to a Dataset object (as coordinate wrappers), so we don’t need to deal with an instance of another class (i.e., Grid in the second example) to perform grid operations like interpolation on a given axis. These operations can instead be implemented in a Dataset accessor (i.e., Dataset.xgcm in the first example).

@rabernat I don't have much experience with xgcm so maybe this isn't a good example?

I guess we could just use Dataset attributes and/or private instance attributes in the Dataset accessor class for that, but

  • coordinate attributes are not really made for storing complex information
  • attributes in the accessor class are lost when creating a new Dataset
  • important information like grid axes should be exposed to the user
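To illustrate the second point, here is a minimal pure-Python sketch of why state kept on an accessor does not survive the creation of a new object. ToyDataset and ToyAccessor are hypothetical stand-ins (not xarray classes); they only mimic the fact that an accessor is created and cached per instance.

```python
class ToyAccessor:
    """Stands in for a Dataset accessor such as ds.xgcm (hypothetical)."""

    def __init__(self, dataset):
        self._dataset = dataset
        self.grid_axes = {}  # state that lives on the accessor only


class ToyDataset:
    """Minimal stand-in for a Dataset; not the real xarray class."""

    def __init__(self, coords):
        self.coords = dict(coords)
        self._accessor = None

    @property
    def toy(self):
        # The accessor is created lazily and cached on this instance.
        if self._accessor is None:
            self._accessor = ToyAccessor(self)
        return self._accessor

    def rename(self, mapping):
        # Like most Dataset methods, this returns a *new* object.
        return ToyDataset({mapping.get(k, k): v for k, v in self.coords.items()})


ds = ToyDataset({"x_c": [1, 2, 3], "x_g": [0.5, 1.5, 2.5]})
ds.toy.grid_axes["X"] = ("x_c", "x_g")

renamed = ds.rename({"x_c": "xc"})
print(ds.toy.grid_axes)       # state is still on the original object
print(renamed.toy.grid_axes)  # the new object gets a fresh, empty accessor
```

Binding the axes to the data object itself, rather than to the accessor, is what would let them follow the data through such operations.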

@benbovy benbovy changed the title from Extend xarray with custom "coordinate agents" to Extend xarray with custom "coordinate wrappers" on Mar 4, 2018
@shoyer
Member

shoyer commented Mar 4, 2018

This has some similarity to what we would need for a KDTreeIndex (e.g., as discussed in #1603). If we can use the same interface for both, then it would be natural to support other "derived indexes", too.

What would the proposed interface be here?

@shoyer
Member

shoyer commented Mar 4, 2018

I guess the common pattern for "coordinate wrappers"/"indexes" looks like:

  • They are derived from/associated with one or more coordinate variables.
  • Operations that preserve associated coordinates should also preserve coordinate wrappers. Conversely, operations that drop any associated coordinates should drop coordinate wrappers.
  • If associated coordinates are subset, coordinate wrappers can be lazily updated (in the worst case from scratch).
  • Serialization to disk netCDF entails losing coordinate wrappers, which will need to be recreated.
  • Coordinate wrappers may implement indexing for one or more coordinates.
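The propagation rules in this list could be sketched roughly as follows. This is a toy model under my own assumptions, not xarray code: ToyWrapper, ToyDataset, drop, isel and invalidated are all hypothetical names, and coordinates are plain lists.

```python
class ToyWrapper:
    """Hypothetical coordinate wrapper tied to one or more coordinate names."""

    def __init__(self, coord_names):
        self.coord_names = tuple(coord_names)
        self._cache = None  # derived structure, built on first use

    def invalidated(self):
        # Same associated coordinates, but the derived structure is reset.
        return ToyWrapper(self.coord_names)

    def get(self, coords):
        # Lazily (re)build the derived structure, from scratch if needed.
        if self._cache is None:
            self._cache = {n: tuple(coords[n]) for n in self.coord_names}
        return self._cache


class ToyDataset:
    """Minimal stand-in holding coordinates and their wrappers."""

    def __init__(self, coords, wrappers=None):
        self.coords = dict(coords)
        self.wrappers = dict(wrappers or {})

    def drop(self, name):
        coords = {k: v for k, v in self.coords.items() if k != name}
        # Drop every wrapper that depends on the removed coordinate.
        wrappers = {
            k: w for k, w in self.wrappers.items() if name not in w.coord_names
        }
        return ToyDataset(coords, wrappers)

    def isel(self, name, indices):
        coords = dict(self.coords)
        coords[name] = [self.coords[name][i] for i in indices]
        # Wrappers touching the subset coordinate are reset and rebuilt lazily;
        # all other wrappers are carried over unchanged.
        wrappers = {
            k: (w.invalidated() if name in w.coord_names else w)
            for k, w in self.wrappers.items()
        }
        return ToyDataset(coords, wrappers)


ds = ToyDataset(
    {"x_c": [1, 2, 3], "x_g": [0.5, 1.5, 2.5], "time": [0.0, 0.1]},
    {"X": ToyWrapper(["x_c", "x_g"]), "clock": ToyWrapper(["time"])},
)
subset = ds.isel("x_c", [0, 1])
dropped = ds.drop("x_g")
print(sorted(dropped.wrappers))
print(subset.wrappers["X"].get(subset.coords))
```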

Possible future features for coordinate wrappers:

  • A protocol for saving metadata to netCDF files to allow them to be automatically recreated when loading a file from disk.
  • Implementations for other indexing based operations, e.g., resampling or interpolation.

I'm open to other names, but my inclination would be to still call all of these indexes, even if they don't actually implement indexing.

@benbovy
Member Author

benbovy commented Mar 4, 2018

I don't have a full idea yet of what the interface would be, but taking the repr() in your comment and mixing it with a simplified version of an example of repr(xgcm.Grid) found in the docs, this could look like

<xarray.Dataset (exp_time: 5, x_c: 9, x_g: 9)>
Coordinates:
  * experiment  (exp_time) int64 0 0 0 1 1 
  * time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
  * x_g         (x_g) float64 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
  * x_c         (x_c) int64 1 2 3 4 5 6 7 8 9
Indexes:
    exp_time: pandas.MultiIndex[experiment, time] 
Grid axes:
    X: xgcm.Axis[x_c, x_g]

Just as Dataset.indexes returns all Index objects, Dataset.xgcm.grid_axes would return all xgcm.Axis objects.

Just as Dataset.sel or Dataset.set_index use/act on indexes, Dataset.xgcm.interp or Dataset.xgcm.generate_grid would use/act on grid axes.

3rd-party coordinate wrappers thus make sense only if there are accessors to handle them.

If we add an indexes argument in Dataset and DataArray constructors, we might even think about adding **kwargs as well in the constructors for, e.g., grid_axes. But I can see it is something that we probably don't want :-).

I use xgcm here because I think it is a nice example of application. This might co-exist with other pairs of custom coordinate wrappers / accessors.

More generally, on the xarray side we would need

  • a container (e.g., a dictionary) attached to Dataset or DataArray objects so that we can bind coordinate wrappers to them.
  • a way to ensure that these are propagated correctly to new data objects.
  • maybe an AbstractCoordinateWrapper class that would provide a unified interface for dealing with issues of serialization, etc.
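As a rough sketch of the first and third items, assuming a plain dictionary as the container: the AbstractCoordinateWrapper name comes from the list above, but its methods (to_metadata, from_metadata) and the ToyAxis subclass are hypothetical and only illustrate what a unified serialization interface might look like.

```python
from abc import ABC, abstractmethod


class AbstractCoordinateWrapper(ABC):
    """Hypothetical base class; nothing like this exists in xarray."""

    def __init__(self, coord_names):
        self.coord_names = tuple(coord_names)

    @abstractmethod
    def to_metadata(self):
        """Return a plain dict from which the wrapper can be recreated."""

    @classmethod
    @abstractmethod
    def from_metadata(cls, metadata):
        """Recreate a wrapper from the dict produced by to_metadata."""


class ToyAxis(AbstractCoordinateWrapper):
    """Toy grid axis associated with several coordinate names."""

    def to_metadata(self):
        return {"kind": "axis", "coords": list(self.coord_names)}

    @classmethod
    def from_metadata(cls, metadata):
        return cls(metadata["coords"])


# The container bound to a data object could be as simple as a dict.
wrappers = {"X": ToyAxis(["x_c", "x_g"])}

# Round trip: save plain metadata, then rebuild the wrappers from it.
saved = {name: w.to_metadata() for name, w in wrappers.items()}
restored = {name: ToyAxis.from_metadata(m) for name, m in saved.items()}
print(restored["X"].coord_names)
```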

@benbovy
Member Author

benbovy commented Mar 4, 2018

Agreed with all your points @shoyer.

> I'm open to other names, but my inclination would be to still call all of these indexes, even if they don't actually implement indexing.

Except here where, instead of a flat collection of coordinate wrappers, I was rather thinking about a 1-level nested collection that separates them depending on what they implement. Indexes would represent one of these sub-collections.

@shoyer
Member

shoyer commented Mar 4, 2018

> Except here where, instead of a flat collection of coordinate wrappers, I was rather thinking about a 1-level nested collection that separates them depending on what they implement. Indexes would represent one of these sub-collections.

This seems messier to me. I would rather stick with adding a single OrderedDict to the data model for Dataset and DataArray.

Would it be that confusing to see an xgcm grid or xarray-simlab clock listed in the repr as an "Index"? Letting third-party libraries add their own repr categories seems like possibly going too far.

@benbovy
Member Author

benbovy commented Mar 4, 2018

> Letting third-party libraries add their own repr categories seems like possibly going too far.

Yes you're probably right.

I can imagine in the example above that Dataset.xgcm.grid_axes returns a subset of a flat collection, for convenience.

It is just that the name "Index" feels a bit wrong to me in this case, and also that xgcm.Axis (and potentially other wrappers) can do things very different than Index classes, which may be confusing.
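For the grid_axes idea, a minimal sketch of an accessor-side helper that returns a subset of a single flat collection, filtered by wrapper type. ToyIndex, ToyAxis and the grid_axes function are hypothetical placeholders, not xarray or xgcm classes.

```python
class ToyIndex:
    """Placeholder for a real index (e.g., one wrapping a pandas index)."""


class ToyAxis:
    """Placeholder for a grid axis wrapper such as xgcm.Axis."""


def grid_axes(wrappers):
    # Keep only the entries of the flat collection that are grid axes.
    return {name: w for name, w in wrappers.items() if isinstance(w, ToyAxis)}


# One flat collection on the data object, mixing indexes and other wrappers.
flat = {"exp_time": ToyIndex(), "X": ToyAxis()}
print(list(grid_axes(flat)))
```

With this approach the data model keeps one flat mapping, and each accessor is free to present its own view of it.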

@benbovy
Member Author

benbovy commented Mar 4, 2018

> It is just that the name "Index" feels a bit wrong to me in this case, and also that xgcm.Axis (and potentially other wrappers) can do things very different than Index classes, which may be confusing.

That said, as real indexes cover most of the use cases, I'd be fine if we keep calling these indexes.

@stale

stale bot commented Feb 2, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Feb 2, 2020
@jhamman jhamman removed the stale label Feb 4, 2020
@benbovy
Member Author

benbovy commented Sep 19, 2022

I think we can close this issue. The flexible index refactor now provides a nice framework for the suggestions made here.

@benbovy benbovy closed this as completed Sep 19, 2022