xarray contrib module #1850

Closed
shoyer opened this issue Jan 22, 2018 · 25 comments · Fixed by #4252

Comments

@shoyer
Member

shoyer commented Jan 22, 2018

Over in #1288 @nbren12 wrote:

Overall, I think the xarray community could really benefit from some kind of centralized contrib package which has a low barrier to entry for these kinds of functions.

Yes, I agree that we should explore this. There are a lot of interesting projects building on xarray now but not great ways to discover them.

Are there other open source projects with a good model we should copy here?

  • Scikit-Learn has a separate GitHub org/repositories for contrib projects: https://github.com/scikit-learn-contrib.
  • TensorFlow has a contrib module within the TensorFlow namespace: tensorflow.contrib

This gives us two different models to consider. The first "separate repository" model might be easier and more flexible from a maintenance perspective. Any preferences/thoughts?

There's also some nice overlap with the Pangeo project.

@nbren12
Contributor

nbren12 commented Jan 22, 2018

Thanks for starting this issue @shoyer. One thing I would be interested to know is how sklearn and tensorflow balance code-quality and API consistency with low barrier to entry. For instance, most of the sklearn contrib packages provide classes which inherit from sklearn's Transformer, BaseEstimator, or Regressor classes, which ensures that all the contrib packages share a common interface.

@benbovy
Member

benbovy commented Jan 23, 2018

I like the idea of regrouping contrib projects.

I'd be +1 for the "separate repository" model, which indeed looks easier from a maintenance perspective. However, with this model it would probably be good to also follow some package naming convention (see #1447 for discussion) so that we could easily identify contrib projects in, e.g., import statements or with package managers. I don't have a strong opinion on this, though. Maybe it is too restrictive...

... which ensures that all the contrib packages share a common interface.

I'd see xarray contrib packages mainly provide Dataset or DataArray accessors that are too domain-specific to be added as "core" methods.
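
For readers new to accessors: xarray provides `xarray.register_dataarray_accessor` and `xarray.register_dataset_accessor`, which let a third-party package attach its own namespace to `DataArray` or `Dataset` objects without touching core. The sketch below only imitates that pattern in plain Python so that it runs without xarray installed. `LabeledArray`, `register_accessor`, and the `stats` accessor are hypothetical stand-ins for illustration, not xarray code.

```python
class _CachedAccessor:
    """Descriptor that builds the accessor lazily, once per object."""

    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, owner):
        if obj is None:
            # Looked up on the class itself: hand back the accessor class.
            return self._accessor_cls
        accessor = self._accessor_cls(obj)
        # Cache on the instance so later lookups bypass this descriptor.
        obj.__dict__[self._name] = accessor
        return accessor


def register_accessor(name, target_cls):
    """Return a class decorator that attaches an accessor under `name`."""

    def decorator(accessor_cls):
        if hasattr(target_cls, name):
            raise AttributeError(f"{target_cls.__name__} already has {name!r}")
        setattr(target_cls, name, _CachedAccessor(name, accessor_cls))
        return accessor_cls

    return decorator


class LabeledArray:
    """Hypothetical stand-in for a DataArray: values plus one dimension name."""

    def __init__(self, values, dim):
        self.values = list(values)
        self.dim = dim


@register_accessor("stats", LabeledArray)
class StatsAccessor:
    """Domain-specific methods live here instead of on the core class."""

    def __init__(self, array):
        self._array = array

    def mean(self):
        return sum(self._array.values) / len(self._array.values)


arr = LabeledArray([1.0, 2.0, 3.0], dim="time")
print(arr.stats.mean())
```

The point of the design is that the core class never needs to know about `stats`; importing the contrib package is what registers the namespace.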

@benbovy
Member

benbovy commented Jan 23, 2018

Some additional thoughts:

One thing that I like about contrib modules "protected" within the xarray namespace is that it would really help us choose module names that are short, relevant and ideally the same as the Dataset or DataArray accessors they provide.

However, it is likely that contrib modules may need domain-specific dependencies other than the ones used in xarray "core". With the xarray.contrib model we may end up with a lot of optional dependencies, which may be annoying, e.g., for CI or packaging with conda-forge. To me it would be too restrictive not to allow such specific dependencies in contrib projects.
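
To make the optional-dependency concern concrete, here is a minimal sketch of the guard that each contrib function would need if everything lived in one package. The helper `require_optional` and the function `smooth` are hypothetical names, and the standard-library `statistics` module merely stands in for a heavy domain-specific dependency.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} needs the optional dependency {module_name!r}; "
            "install it to use this feature."
        ) from err


def smooth(values):
    # Hypothetical contrib function: the dependency is imported only when
    # the function is called, so importing the package itself stays cheap.
    statistics = require_optional("statistics", "smooth()")
    return statistics.mean(values)


print(smooth([1.0, 2.0, 3.0]))
```

Every such guard is one more combination to cover in CI and packaging, which is the maintenance cost described above.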

@shoyer
Member Author

shoyer commented Jan 23, 2018

I think domain specific dependencies are a pretty decisive argument in favor of the separate repository model.

TensorFlow doesn't relax its code quality standards for contrib packages -- it's more about reducing guarantees of API stability or maintenance. That works OK for TensorFlow in part because the authors of most contrib packages are Google software engineers.

@gajomi
Contributor

gajomi commented Jan 23, 2018

I don't have any strong opinion about separate repos or contrib submodules, so long as there is some way to improve discoverability of methods.

Having said that, many of the methods mentioned in #1288 are in the numpy namespace, and at least naively applicable to all domains. Would you consider numpy methods with semantics compatible with DataArrays and/or Datasets as appropriate to contribute to core xarray?

@nbren12
Contributor

nbren12 commented Jan 23, 2018

I agree that the separate repository model is probably best. However, should it be in just one repository or in many?

Using many repos would solve the domain-specific dependency problem, but the sklearn-contrib packages are not that discoverable IMO. I found two of them via Google on separate occasions before realizing that they were part of the same GitHub organization.

@benbovy
Member

benbovy commented Jan 23, 2018

should it be in just one repository or in many?

One repository for all contrib projects would be hard to maintain if we allow very specific projects, like a little xarray extension to work with the 'xyz' GCM model (which seems to be a common case for extensions). That said, it doesn't prevent us from adding bigger, generic repositories like xarray-scipy.

but the sklearn-contrib packages are not that discoverable IMO.

Hence the suggestion to choose some convention for package naming, e.g., something similar to dask-related packages: dask-learn, dask-glm, dask-xgboost, etc.
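
As a rough illustration of why a naming convention helps discovery, a shared prefix lets users (or tools) list related packages with a simple filter. This is only a sketch: the `xarray-` prefix and the example package names are made up for the demonstration, and `filter_by_prefix` and `installed_with_prefix` are hypothetical helpers.

```python
from importlib import metadata


def filter_by_prefix(names, prefix):
    """Return the sorted, de-duplicated names that start with `prefix`."""
    return sorted({n for n in names if n and n.lower().startswith(prefix)})


def installed_with_prefix(prefix):
    # Distribution names come from the metadata of installed packages.
    names = (dist.metadata["Name"] for dist in metadata.distributions())
    return filter_by_prefix(names, prefix)


# Made-up package names, used only to show the filter.
example = ["xarray-scipy", "numpy", "xarray-gcm", "dask-glm"]
print(filter_by_prefix(example, "xarray-"))
```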

@benbovy
Member

benbovy commented Jan 23, 2018

To make methods even more discoverable, we might also add the x prefix to DataArray or Dataset accessors. This would work quite well with auto-completion, even though x alone is very often used as a coordinate. As suggested in #1447, we could have something like

$ conda install xarray-scipy -c conda-forge
>>> import xarray as xr
>>> import xscipy
>>> da = xr.DataArray(...)
>>> da.xscipy.method()

But maybe that's too much x...

@jhamman
Member

jhamman commented Jan 24, 2018

My 2-cents. I think we could consider setting up an xarray-contrib organization. I don't see how an xr.contrib namespace buys us all that much, except for some additional book-keeping in the core xarray package. My thought would be to let individual projects decide 1) if they want to reside inside the xarray-contrib organization, and 2) whether or not to use the accessor API available in xarray now. We could easily add a page to the xarray docs that points to a collection of projects.

Side note, we don't have to use it but I did grab the xarray-contrib organization name just in case.

@max-sixty
Collaborator

Re the comment from @benbovy

Even before this, let's put a list of projects that are closely integrated with xarray somewhere?

@nbren12
Contributor

nbren12 commented Feb 24, 2018

@maxim-lian There is a very short list of such packages hidden in the xarray documentation.

In general, there are a ton of these awesome-... repos floating around the internet which simply list the useful tools and libraries related to ... . For example, there are repos out there like awesome-python and awesome-bash. Maybe someone could start an awesome-xarray list.

@shoyer
Member Author

shoyer commented Feb 24, 2018 via email

@rabernat
Contributor

rabernat commented Apr 4, 2019

FYI, we have started https://github.com/pangeo-data/awesome-open-climate-science. It is not xarray specific, but contains many xarray-related packages. Please contribute!

@nbren12
Contributor

nbren12 commented Apr 4, 2019

Thanks @rabernat, that awesome list looks pretty awesome.

However, I would still advocate for a more centralized approach to this problem. For instance, NCL has a huge library of contributed functions which they distribute along with the code. By now, I am sure that xarray users have basically reimplemented equivalents to all of these functions, but without a centralized home it is still too difficult to find existing code or contribute new code.

For instance, I have a useful wrapper to scipy.ndimage that I use all the time, but it seems overkill to release/support a whole package for this one module. I would be much more likely to contribute a PR to a community run repository. I am also much more likely to use such a repo.

I would be more than willing to volunteer for such an effort, but I think it needs to involve multiple people. Various individuals have tried to make such repos on their own, but none seem to have reached critical mass. For example,
https://github.com/crusaderky/xarray_extras
https://github.com/fujiisoup/xr-scipy
I think there should be multiple maintainers, so that if one person drops out, there still appears to be activity on the repo.

@rabernat
Contributor

rabernat commented Apr 4, 2019 via email

@teoliphant
Member

teoliphant commented Apr 15, 2019

A few comments:

  1. You need to have separately managed repos as you don't want the natural limits of group organization to bottleneck and limit the growth of the ecosystem (there is a reason SciPy broke up into scikits --- and it hasn't gone far enough)

  2. xarray should reify its API as soon as possible and own it (you may be too late to pull back on ill-advised APIs already).

  3. A simple list like awesome xarray in a Github repo that is referenced by xarray docs goes a long way towards a discoverable set of packages and helping people find each other. A namespace like xscipy would also work (but see next comment).

  4. We are working on producing scipy-like libraries that can work on arbitrary arrays (we call this informally uscipy). Perhaps uscipy and xscipy can join forces and define interfaces that assume labels may exist.

  5. xarray can be a nice intermediate between up-stream scipy-like libraries, and implementation details like NumPy or xnd (or even dask). I'm quite sure that xarray could be backed by the low-level libraries in xnd (and it is a goal of xnd to support projects like xarray).

  6. Long term, we at Quansight Labs are working on getting an array protocol into Python core itself. I suspect we should get labels put into that definition from the beginning, and will need feedback from this community to make that happen. Timing for this is a PEP by end of 2019. If someone is eager to work on this now, it could go faster.

@shoyer
Member Author

shoyer commented Apr 15, 2019

For what it's worth, TensorFlow has decided that bundling contrib modules into TensorFlow as tensorflow.contrib was a big mistake. It helped with discoverability, but resulted in a lot of confusion about what is a supported API and what isn't.

@shoyer
Member Author

shoyer commented Apr 15, 2019

@teoliphant thanks for sharing your thoughts!

I would be very happy to collaborate on what a protocol for labeled arrays in Python could look like. Xarray is one useful implementation of labeled arrays, but it's definitely not the only one.

@nbren12
Contributor

nbren12 commented Apr 15, 2019

I'd also like to thank @teoliphant for weighing in!

Bearing in mind the history of scipy, I agree that the xarray community doesn't need 100% centralization, but there should be some conglomeration. IMO, the current situation of "one graduate student/postdoc per package" is not sustainable.

@rabernat
Contributor

The approach we have been taking is to develop "micro-packages". We currently have three:

  • xgcm - for finite volume cell operations on top of xarray DataArrays
  • xrft - for coordinate-aware Fourier transforms of Xarray DataArrays
  • xhistogram - (this one is brand new) - for multidimensional histograms applied along specified axes

These packages share some common design principles. In particular, they are all fully lazy and dask-friendly, meaning that we can apply them to very large datasets (which is the main focus in our group). By keeping the packages small, they are more maintainable. Xgcm and Xrft probably have O(3) active contributors, primarily myself and grad students in my group. Small, but significantly different from 1. We use these packages heavily in everyday scientific work, so I know they are useful.

I would love to combine forces on a larger effort, but we have limited time and effort. For now, however, this situation doesn't seem too bad. It's kind of compatible with what @teoliphant was suggesting in his comment 1 above. I'm not sure that some mega xarray-contrib package would have critical mass to be sustainable either.

@nbren12
Contributor

nbren12 commented Apr 15, 2019

To be clear, I think there is some optimal middle ground between the "mega xarray-contrib" package and the current situation. I think the "micro-package" approach works when the collection of micro-packages is being maintained by an active/permanent entity (e.g. Ryan's research group). On the other hand, postdocs and grad students are very likely to leave the field entirely within a few years, at which point they will probably stop maintaining their "micro-packages".

@rabernat
Contributor

@nbren12 - the key difference for our micro-packages is that the primary maintainer is me, not my grad students, and I'm not going anywhere for now. 😉

I still agree that there is probably a better way to organize all of this. Just trying to share our perspective as an xarray-centric small research group.

@andersy005
Member

The gentlest of bumps on this. Any updates or progress here?? 😄 A couple of us @NCAR ( Cc @kmpaul, @matt-long ) are interested in the outcome of this issue.

@dcherian
Contributor

@andersy005 what kind of update are you looking for? I assume you are about to implement some general functionality but want to know where to put it?

@andersy005
Member

I assume you are about to implement some general functionality but want to know where to put it?

This is correct.

One of the things we've been exploring is a "general resample utility" that would both enable fluid translation between data at different temporal intervals (this is one of the use cases) and be aware of things like time boundary variables. The fundamental concepts here are analogous to those of ESMF's regridder:

  • Create a source axis, i.e. the axis that your original data is on,
  • Create a destination axis, i.e. the axis that you want to convert your data to,
  • Create an AxisRemapper object by passing the source and destination axis you created previously,
  • Finally, convert your data from the source axis to the destination axis, using the AxisRemapper object you created in the previous step.
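
The four steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the AxisUtilities API: the `Axis` class, the constructor and `convert` signatures, and the choice of an overlap-weighted mean are all hypothetical. Each axis is described by its cell boundaries, which is one way of being "aware" of time bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    """A 1-D axis described by its cell boundaries, e.g. (0, 1, 2, 3)."""

    edges: tuple

    @property
    def cells(self):
        # Consecutive (lower, upper) bounds, one pair per cell.
        return list(zip(self.edges[:-1], self.edges[1:]))


class AxisRemapper:
    """Hypothetical remapper: overlap-weighted mean from source to destination."""

    def __init__(self, source, destination):
        # weights[i][j] = overlap length of destination cell i and source cell j
        self.weights = [
            [max(0.0, min(d_hi, s_hi) - max(d_lo, s_lo)) for s_lo, s_hi in source.cells]
            for d_lo, d_hi in destination.cells
        ]

    def convert(self, data):
        out = []
        for row in self.weights:
            total = sum(row)
            if total == 0:
                # Destination cell has no overlap with the source axis.
                out.append(float("nan"))
            else:
                out.append(sum(w * x for w, x in zip(row, data)) / total)
        return out


source = Axis(edges=(0, 1, 2, 3, 4, 5, 6))  # step 1: six daily cells
destination = Axis(edges=(0, 3, 6))         # step 2: two three-day cells
remapper = AxisRemapper(source, destination)  # step 3
print(remapper.convert([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))  # step 4
```

Because the weights depend only on the two axes, the remapper can be built once and reused for many variables that share the same source axis.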

We have a general, low-level prototype in https://github.com/coderepocenter/AxisUtilities. We think that it would be beneficial to have this functionality in xarray instead of it residing in yet another xarray related package.

For the time being, my main question is: where (in xarray) would something like this reside?

Note:

I am happy to open a separate issue to discuss the merits of having this functionality in xarray.

Cc @maboualidev


10 participants