xarray contrib module #1850

shoyer · 2018-01-22T19:50:08Z

> Overall, I think the xarray community could really benefit from some kind of centralized contrib package which has a low barrier to entry for these kinds of functions.

Yes, I agree that we should explore this. There are a lot of interesting projects building on xarray now but not great ways to discover them.

Are there other open source projects with a good model we should copy here?

Scikit-Learn has a separate GitHub org/repositories for contrib projects: https://github.com/scikit-learn-contrib.
TensorFlow has a contrib module within the TensorFlow namespace: tensorflow.contrib

This gives us two different models to consider. The first, the "separate repository" model, might be easier and more flexible from a maintenance perspective. Any preferences/thoughts?

There's also some nice overlap with the Pangeo project.

nbren12 · 2018-01-22T21:25:53Z

Thanks for starting this issue @shoyer. One thing I would be interested to know is how sklearn and tensorflow balance code quality and API consistency with a low barrier to entry. For instance, most of the sklearn contrib packages provide classes which inherit from sklearn's BaseEstimator class and its TransformerMixin or RegressorMixin mixins, which ensures that all the contrib packages share a common interface.

benbovy · 2018-01-23T01:21:42Z

I like the idea of regrouping contrib projects.

I'd be +1 for the "separate repository" model, which indeed looks easier from a maintenance perspective. However, with this model it would probably be a good thing to also follow some package naming convention (see #1447 for discussion) so that we could easily identify contrib projects in, e.g., import statements or with package managers. I don't have a strong opinion on this, though. Maybe it is too restrictive...

> ... which ensures that all the contrib packages share a common interface.

I'd see xarray contrib packages mainly provide Dataset or DataArray accessors that are too domain-specific to be added as "core" methods.

benbovy · 2018-01-23T01:50:23Z

Some additional thoughts:

One thing that I like about contrib modules "protected" within the xarray namespace is that it would really help us choose module names that are short, relevant and ideally the same as the Dataset or DataArray accessors they provide.

However, it is likely that contrib modules will need domain-specific dependencies other than the ones used in xarray "core". With the xarray.contrib model we may end up with a lot of optional dependencies, which may be annoying, e.g., for CI or packaging with conda-forge. To me, it would be too restrictive not to allow such specific dependencies in contrib projects.

shoyer · 2018-01-23T22:54:51Z

I think domain specific dependencies are a pretty decisive argument in favor of the separate repository model.

TensorFlow doesn't relax its code quality standards for contrib packages -- it's more about reducing guarantees of API stability or maintenance. That works OK for TensorFlow in part because the authors of most contrib packages are Google software engineers.

gajomi · 2018-01-23T23:01:11Z

I don't have any strong opinion about separate repos or contrib submodules, so long as there is some way to improve discoverability of methods.

Having said that, many of the methods mentioned in #1288 are in the numpy namespace, and at least naively applicable to all domains. Would you consider numpy methods with semantics compatible with DataArrays and/or Datasets as appropriate to contribute to core xarray?

nbren12 · 2018-01-23T23:09:21Z

I agree that the separate repository model is probably best. However, should it be in just one repository or in many?

Using many repos would solve the domain-specific dependency problem, but the sklearn-contrib packages are not that discoverable IMO. I found two of them via Google on separate occasions before realizing that they were part of the same GitHub organization.

benbovy · 2018-01-23T23:36:23Z

> should it be in just one repository or in many?

One repository for all contrib projects would be hard to maintain if we allow very specific projects, like a little xarray extension to work with the 'xyz' GCM model (which seems to be a common case for extensions). That said, it doesn't prevent us from adding bigger, generic repositories like xarray-scipy.

> but the sklearn-contrib packages are not that discoverable IMO.

Hence the suggestion to choose some convention for package naming, e.g., something similar to dask related packages: dask-learn, dask-glm, dask-xgboost, etc.
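To illustrate how a naming convention helps discoverability, here is a minimal sketch that lists installed distributions whose names share a prefix. The `xarray-` prefix and both function names are assumptions made for illustration; no convention has been agreed on in this thread.

```python
from importlib.metadata import distributions


def matching_names(names, prefix="xarray-"):
    """Return the sorted, de-duplicated names that start with ``prefix``.

    Names are lower-cased and underscores are replaced by hyphens so that
    "Xarray_Extras" and "xarray-extras" are treated as the same package.
    """
    normalized = {name.lower().replace("_", "-") for name in names if name}
    return sorted(n for n in normalized if n.startswith(prefix))


def installed_contrib_packages(prefix="xarray-"):
    """List installed distributions that follow the (hypothetical) prefix convention."""
    names = (dist.metadata["Name"] for dist in distributions())
    return matching_names(names, prefix)


if __name__ == "__main__":
    # Prints whatever matching packages happen to be installed locally.
    print(installed_contrib_packages())
```

The same prefix match is what makes a search such as `conda search "xarray-*"` or a PyPI query useful, which is the main practical benefit of a shared convention.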

benbovy · 2018-01-23T23:54:33Z

To make methods even more discoverable, we might also add the x prefix to DataArray or Dataset accessors. This would work quite well with auto-completion, even though x alone is very often used as a coordinate name. As suggested in #1447, we could have something like

$ conda install xarray-scipy -c conda-forge

>>> import xarray as xr
>>> import xscipy

>>> da = xr.DataArray(...)
>>> da.xscipy.method()

But maybe that's too much x...
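For readers unfamiliar with the accessor mechanism, here is a minimal sketch of how a contrib package could attach a namespaced accessor to DataArray objects using xarray's `register_dataarray_accessor` decorator. The accessor name `xdemo` and the `anomaly` method are made up for illustration and do not correspond to any existing package.

```python
import numpy as np
import xarray as xr


# "xdemo" is a hypothetical accessor name used only for this example.
@xr.register_dataarray_accessor("xdemo")
class DemoAccessor:
    def __init__(self, xarray_obj):
        # xarray passes in the DataArray the accessor is attached to.
        self._obj = xarray_obj

    def anomaly(self, dim):
        """Return the data minus its mean along ``dim``."""
        return self._obj - self._obj.mean(dim)


da = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims="x")
result = da.xdemo.anomaly("x")
print(result.values)
```

Importing the module that defines the accessor is what makes `da.xdemo` available, which is why the `import xscipy` line in the snippet above matters even though the name is never used directly.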

jhamman · 2018-01-24T06:15:08Z

My 2 cents: I think we could consider setting up an xarray-contrib organization. I don't see how an xr.contrib namespace buys us all that much, except for some additional bookkeeping in the core xarray package. My thought would be to let individual projects decide 1) if they want to reside inside the xarray-contrib organization, and 2) whether or not to use the accessor API available in xarray now. We could easily add a page to the xarray docs that points to a collection of projects.

Side note, we don't have to use it but I did grab the xarray-contrib organization name just in case.

max-sixty · 2018-02-22T23:34:39Z

Re the comment from @benbovy

Even before this, let's put a list of projects that are closely integrated with xarray somewhere?

nbren12 · 2018-02-24T05:23:04Z

@maxim-lian There is a very short list of such packages hidden in the xarray documentation.

In general, there are a ton of these awesome-... repos floating around the internet which just list useful tools and libraries related to ... . For example, there are repos out there like awesome-python and awesome-bash. Maybe someone could start an awesome-xarray repo.

shoyer · 2018-02-24T05:30:01Z

Personally I'd rather have "awesome xarray" listed somewhere prominently in the xarray docs, along with mentions inline in the docs wherever they are particularly relevant. The very short list that is currently there is based upon a handful of projects that I knew about a few years ago, but it's definitely woefully out of date now.

rabernat · 2019-04-04T19:41:59Z

FYI, we have started https://github.com/pangeo-data/awesome-open-climate-science. It is not xarray specific, but contains many xarray-related packages. Please contribute!

nbren12 · 2019-04-04T21:04:37Z

Thanks @rabernat, that awesome list looks pretty awesome.

However, I would still advocate for a more centralized approach to this problem. For instance, NCL has a huge library of contributed functions which is distributed along with the code. By now, I am sure that xarray users have basically reimplemented equivalents to all of these functions, but without a centralized home it is still too difficult to find existing code or contribute new code.

For instance, I have a useful wrapper to scipy.ndimage that I use all the time, but it seems overkill to release/support a whole package for this one module. I would be much more likely to contribute a PR to a community-run repository. I am also much more likely to use such a repo.

I would be more than willing to volunteer for such an effort, but I think it needs to involve multiple people. Various individuals have tried to make such repos on their own, but none seem to have reached critical mass. For example,
https://github.com/crusaderky/xarray_extras
https://github.com/fujiisoup/xr-scipy
I think there should be multiple maintainers, so that if one person drops out, there still appears to be activity on the repo.

rabernat · 2019-04-04T23:40:12Z

Just to add to the mix, we have our own package for spectra! https://xrft.readthedocs.io/en/latest/

teoliphant · 2019-04-15T16:50:59Z

A few comments:

1. You need to have separately managed repos, as you don't want the natural limits of group organization to bottleneck and limit the growth of the ecosystem (there is a reason SciPy broke up into scikits, and it hasn't gone far enough).
2. xarray should reify its API as soon as possible and own it (you may be too late to pull back on ill-advised APIs already).
3. A simple list like awesome xarray in a GitHub repo that is referenced by the xarray docs goes a long way towards a discoverable set of packages and helping people find each other. A namespace like xscipy would also work (but see next comment).
4. We are working on producing scipy-like libraries that can work on arbitrary arrays (we call this informally uscipy). Perhaps uscipy and xscipy can join forces and define interfaces that assume labels may exist.
5. xarray can be a nice intermediate between upstream scipy-like libraries and implementation details like NumPy or xnd (or even dask). I'm quite sure that xarray could be backed by the low-level libraries in xnd (and it is a goal of xnd to support projects like xarray).
6. Long term, we at Quansight Labs are working on getting an array protocol into Python core itself. I suspect we should get labels put into that definition from the beginning, and will need feedback from this community to make that happen. Timing for this is a PEP by end of 2019. If someone is eager to work on this now, it could go faster.

shoyer · 2019-04-15T16:57:08Z

For what it's worth, TensorFlow has decided that bundling contrib modules into TensorFlow as tensorflow.contrib was a big mistake. It helped with discoverability, but resulted in a lot of confusion about what is a supported API and what isn't.

shoyer · 2019-04-15T17:03:27Z

@teoliphant thanks for sharing your thoughts!

I would be very happy to collaborate on what a protocol for labeled arrays in Python could look like. Xarray is one useful implementation of labeled arrays, but it's definitely not the only one.

nbren12 · 2019-04-15T17:22:37Z

I'd also like to thank @teoliphant for weighing in!

Bearing in mind the history of scipy, I agree that the xarray community doesn't need 100% centralization, but there should be some conglomeration. IMO, the current situation of "one graduate student/postdoc per package" is not sustainable.

rabernat · 2019-04-15T17:40:07Z

The approach we have been taking is to develop "micro-packages". We currently have three:

xgcm - for finite volume cell operations on top of xarray DataArrays
xrft - for coordinate-aware Fourier transforms of Xarray DataArrays
xhistogram - (this one is brand new) - for multidimensional histograms applied along specified axes

These packages share some common design principles. In particular, they are all fully lazy and dask-friendly, meaning that we can apply them to very large datasets (which is the main focus in our group). By keeping the packages small, they are more maintainable. Xgcm and Xrft probably have O(3) active contributors, primarily myself and grad students in my group. Small, but significantly different from 1. We use these packages heavily in everyday scientific work, so I know they are useful.

I would love to combine forces on a larger effort, but we have limited time and energy. For now, this situation doesn't seem too bad. It's kind of compatible with what @teoliphant was suggesting in the first point of his comment above. I'm not sure that some mega xarray-contrib package would have critical mass to be sustainable either.

nbren12 · 2019-04-15T18:41:22Z

To be clear, I think there is some optimal middle ground between the "mega xarray-contrib" package and the current situation. I think the "micro-package" approach works when the collection of micro-packages is being maintained by an active/permanent entity (e.g. Ryan's research group). On the other hand, postdocs and grad students are very likely to leave the field entirely within a few years, at which point they will probably stop maintaining their "micro-packages".

rabernat · 2019-04-15T18:43:45Z

@nbren12 - the key difference for our micro-packages is that the primary maintainer is me, not my grad students, and I'm not going anywhere for now. 😉

I still agree that there is probably a better way to organize all of this. Just trying to share our perspective as an xarray-centric small research group.

andersy005 · 2019-12-10T21:56:19Z

The gentlest of bumps on this. Any updates or progress here? 😄 A couple of us at @NCAR (cc @kmpaul, @matt-long) are interested in the outcome of this issue.

dcherian · 2019-12-10T22:15:17Z

@andersy005 what kind of update are you looking for? I assume you are about to implement some general functionality but want to know where to put it?

andersy005 · 2019-12-10T22:55:13Z

> I assume you are about to implement some general functionality but want to know where to put it?

This is correct.

One of the things we've been exploring is a "general resample utility" that would both enable fluid translation between data at different temporal intervals (this is one of the use cases) and be aware of things like time boundary variables. The fundamental concepts here are analogous to those of ESMF's regridder:

1. Create a source axis, i.e. the axis that your original data is on.
2. Create a destination axis, i.e. the axis that you want to convert your data to.
3. Create an AxisRemapper object by passing the source and destination axes you created previously.
4. Finally, convert your data from the source axis to the destination axis, using the AxisRemapper object you created in the previous step.
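The four steps above can be sketched in plain Python. This is only an illustration of the workflow, assuming axes defined by contiguous cell boundaries and an overlap-weighted average. The `Axis` class, the `average` method and their signatures are hypothetical and are not the AxisUtilities API.

```python
class Axis:
    """A 1-D axis described by contiguous, increasing cell boundaries."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # Each cell is a (lower, upper) pair, e.g. the time bounds of one day.
        self.cells = list(zip(self.bounds[:-1], self.bounds[1:]))


class AxisRemapper:
    """Precompute overlap lengths between destination and source cells."""

    def __init__(self, source, destination):
        # overlaps[i][j] = length of destination cell i covered by source cell j
        self.overlaps = [
            [max(0.0, min(d_hi, s_hi) - max(d_lo, s_lo)) for s_lo, s_hi in source.cells]
            for d_lo, d_hi in destination.cells
        ]

    def average(self, values):
        """Overlap-weighted mean of ``values`` on each destination cell."""
        result = []
        for row in self.overlaps:
            covered = sum(row)
            if covered == 0:
                # Destination cell lies entirely outside the source axis.
                result.append(None)
            else:
                result.append(sum(w * v for w, v in zip(row, values)) / covered)
        return result


# Step 1: source axis with six unit-length cells (e.g. daily data).
source = Axis([0, 1, 2, 3, 4, 5, 6])
# Step 2: destination axis with two cells of length three (e.g. 3-day means).
destination = Axis([0, 3, 6])
# Step 3: build the remapper once; it can be reused for many variables.
remapper = AxisRemapper(source, destination)
# Step 4: convert data from the source axis to the destination axis.
print(remapper.average([1, 2, 3, 4, 5, 6]))  # [2.0, 5.0]
```

Separating the construction of the remapper from its application mirrors the regridder pattern mentioned above: the overlap weights depend only on the two axes, so they are computed once and reused.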

We have a general, low-level prototype in https://github.com/coderepocenter/AxisUtilities. We think that it would be beneficial to have this functionality in xarray instead of it residing in yet another xarray-related package.

For the time being, my main question is: where (in xarray) would something like this reside?

Note:

I am happy to open a separate issue to discuss the merits of having this functionality in xarray.

Cc @maboualidev

shoyer mentioned this issue Jan 22, 2018

Add trapz to DataArray for mathematical integration #1288

Closed

jhamman mentioned this issue Feb 20, 2018

how to harness community enthusiasm pangeo-data/pangeo#121

Closed

benbovy mentioned this issue Feb 22, 2018

Representing & checking Dataset schemas #1900

Open

aymeric-spiga mentioned this issue Apr 3, 2018

simple command line interface for xarray #2034

Closed

fujiisoup mentioned this issue Apr 24, 2018

New feature: interp1d #2079

Closed

dcherian mentioned this issue May 11, 2018

Add "awesome xarray" list to faq. #2118

Merged

rabernat mentioned this issue Apr 4, 2019

Support real='coord' xgcm/xrft#57

Closed

rabernat mentioned this issue Apr 17, 2019

Weekly checkin meeting 2019-04-17 4:00pm EST pangeo-data/pangeo#595

Closed

rabernat mentioned this issue Sep 30, 2019

Implement polyfit? #3349

Closed

spencerkclark mentioned this issue Dec 22, 2019

interp with long cftime coordinates raises an error #3641

Closed

jhamman mentioned this issue May 8, 2020

Sustainability Plan for xESMF JiaweiZhuang/xESMF#98

Open

jhamman mentioned this issue Jul 22, 2020

update docs to point to xarray-contrib and xarray-tutorial #4252

Merged

dcherian closed this as completed in #4252 Jul 23, 2020
