
Multidimensional groupby #818
Merged · 10 commits · Jul 8, 2016
Conversation

@rabernat (Contributor) commented Apr 6, 2016

Many datasets have a two dimensional coordinate variable (e.g. longitude) which is different from the logical grid coordinates (e.g. nx, ny). (See #605.) For plotting purposes, this is solved by #608. However, we still might want to split / apply / combine over such coordinates. That has not been possible, because groupby only supports creating groups on one-dimensional arrays.

This PR overcomes that issue by using stack to collapse multiple dimensions in the group variable. A minimal example of the new functionality is

>>> import xarray as xr
>>> da = xr.DataArray([[0,1],[2,3]],
                coords={'lon': (['ny','nx'], [[30,40],[40,50]] ),
                        'lat': (['ny','nx'], [[10,10],[20,20]] )},
                dims=['ny','nx'])
>>> da.groupby('lon').sum()
<xarray.DataArray (lon: 3)>
array([0, 3, 3])
Coordinates:
  * lon      (lon) int64 30 40 50

This feature could have broad applicability for many realistic datasets (particularly model output on irregular grids): for example, averaging non-rectangular grids zonally (i.e. in latitude), binning in temperature, etc.
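For instance, with the da from the example above, a zonal average becomes a single groupby over the two-dimensional lat coordinate (a minimal sketch of the new behavior):

# zonal (latitude-band) average on the curvilinear grid above;
# grouping on the 2-D 'lat' coordinate reduces over both ny and nx
zonal_mean = da.groupby('lat').mean()
# -> <xarray.DataArray (lat: 2)> with lat = [10, 20] and values [0.5, 2.5]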

If you think this is worth pursuing, I would love some feedback.

The PR is not complete. Some items to address are

  • Create a specialized grouper to allow coarser bins. By default, if no grouper is specified, the GroupBy object uses all unique values to define the groups. With a high resolution dataset, this could balloon to a huge number of groups. With the latitude example, we would like to be able to specify e.g. 1-degree bins. Usage would be da.groupby('lon', bins=range(-90,90)).
  • Allow specification of which dims to stack. For example, stack in space but keep time dimension intact. (Currently it just stacks all the dimensions of the group variable.)
  • A nice example for the docs.

@shoyer (Member) commented Apr 6, 2016

Yes, this is awesome! I had a vague idea that stack could make something like this possible but hadn't really thought it through.

As for the specialized "grouper", I agree that makes sense. It's basically an extension of resample from dates to floating point -- noting that pandas recently changed the resample API so it works a little more like groupby. pandas.cut could probably handle most of the logic here.
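For concreteness, a minimal sketch of how pandas.cut could supply the group labels (the variable names here are illustrative):

import numpy as np
import pandas as pd

lat = np.array([-42.5, 10.1, 10.9, 55.0])
bins = np.arange(-90, 91, 1)  # 1-degree bins
cat = pd.cut(lat, bins)
print(cat.codes)        # integer group label for each value
print(cat.categories)   # the bin intervals themselves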

# pandas/hashtable.pyx in pandas.hashtable.Float64HashTable.get_labels (pandas/hashtable.c:10302)()
# pandas/hashtable.so in View.MemoryView.memoryview_cwrapper (pandas/hashtable.c:29882)()
# pandas/hashtable.so in View.MemoryView.memoryview.__cinit__ (pandas/hashtable.c:26251)()
# ValueError: buffer source array is read-only
Member: yikes, this is messy. not much else we can do about it, though

Member: Probably better to move this traceback to a followup issue and just reference it by number in the code. Otherwise it's pretty distracting.

rabernat (author):

Are you suggesting I file a new issue with pandas? The minimal example seems to be

import numpy as np
import pandas as pd

a = np.arange(10)
a.flags.writeable = False
pd.factorize(a)  # raises ValueError: buffer source array is read-only

@shoyer (Member) commented Apr 6, 2016

This will need to unstack to handle .apply. That will be nice for things like normalization.

@rabernat (Contributor, Author) commented Apr 6, 2016

> This will need to unstack to handle .apply. That will be nice for things like normalization.

Can you clarify what you mean by this? At what point should the unstack happen?

With the current code, apply seems to work ok:

>>> da.groupby('lon').apply(lambda x : (x**2).sum())
<xarray.DataArray (lon: 3)>
array([0, 5, 9])
Coordinates:
  * lon      (lon) int64 30 40 50

But perhaps I am missing a certain use case you have in mind?

@rabernat (Contributor, Author) commented Apr 6, 2016

> As for the specialized "grouper", I agree that makes sense. It's basically an extension of resample from dates to floating point -- noting that pandas recently changed the resample API so it works a little more like groupby. pandas.cut could probably handle most of the logic here.

I normally used numpy.digitize for this type of thing, but pandas.cut indeed seems like the obvious choice.
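For reference, a quick sketch of how the two compare on a trivial case (note the differing edge conventions):

import numpy as np
import pandas as pd

x = np.array([0.2, 1.5, 2.7])
bins = [0, 1, 2, 3]
np.digitize(x, bins)  # array([1, 2, 3]): bare integer bin indices, bins closed on the left
pd.cut(x, bins)       # labelled intervals (0, 1], (1, 2], (2, 3]: bins closed on the right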

Should this go into a separate PR?

@rabernat (Contributor, Author) commented Apr 6, 2016

Let me try to clarify what I mean in item 2:

> Allow specification of which dims to stack.

Say you have the following dataset

>>> ds = xr.Dataset(
        {'temperature': (['time','nx'], [[1,1,2,2],[2,2,3,3]] ),
         'humidity': (['time','nx'], [[1,1,1,1],[1,1,1,1]] )})

Now imagine you want to average humidity in temperature coordinates. (This might sound like a bizarre operation, but it is actually the foundation of a sophisticated sort of thermodynamic analysis.)

Currently this works as follows

>>> ds = ds.set_coords('temperature')
>>> ds.humidity.groupby('temperature').sum()
<xarray.DataArray 'humidity' (temperature: 3)>
array([2, 4, 2])
Coordinates:
  * temperature  (temperature) int64 1 2 3

However, this sums over all time. What if you wanted to preserve the time dependence, but replace the nx coordinate with temperature? I would like to be able to say

ds.humidity.groupby('temperature', group_over='nx').sum()

and get back a DataArray with dimensions ('time', 'temperature').

Maybe this is already possible with a sophisticated use of apply. But I don't see how to do it.
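One conceivable spelling, as an untested sketch (it loops over time slices rather than offering a true group_over, and relies on outer-join alignment in concat to pad missing temperature values with NaN):

import xarray as xr

slices = [ds.humidity.isel(time=t).groupby('temperature').sum()
          for t in range(ds.sizes['time'])]
result = xr.concat(slices, dim='time')  # dims ('time', 'temperature')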

@shoyer (Member) commented Apr 6, 2016

(Oops, pressed the wrong button to close)

> Can you clarify what you mean by this? At what point should the unstack happen?

Consider ds.groupby('latitude').apply(lambda x: x - x.mean()) or ds.groupby('latitude') - ds.groupby('latitude').mean() (these are two ways of writing the same thing). In each of these cases, the result of a groupby has the same dimensions as the original instead of replacing one or more of the original dimensions.

@jhamman (Member) commented Apr 6, 2016

@rabernat - I don't have much to add right now, but I'm very excited about this addition. Once you've filled in a few more of the features, ping me and I'll give it a full review and test it out in some applications we have in house.

@rabernat (Contributor, Author) commented Apr 7, 2016

@shoyer I'm having a tough time figuring out where to put the unstacking logic...maybe you can give me some advice.

My first idea was to add a method to the GroupBy class called _maybe_unstack_array and make a call to it here. The problem with that approach is that the group iteration happens over Variables, not full DataArrays, which makes unstacking harder: we would need to store lots of metadata about the stacked / unstacked dimension names, sizes, etc.

If you think that is the right approach, I will forge ahead. But maybe, as the author of both the groupby and stack / unstack logic, you can see an easier way.

@shoyer (Member) commented Apr 7, 2016

@rabernat That looks like exactly the right place to me.

We only use variables for the concatenation in the shortcut=True path. With shortcut=False, we use DataArray/Dataset objects. For now, get it working with shortcut=False (hard code it if necessary) and I can help figure out how to extend it to shortcut=True.

self._unstacked_dims = orig_dims
# we also need to rename the group name to avoid a conflict when
# concatenating
group_name += '_groups'
rabernat (author):

This was something I had to add in order to handle the inevitable namespace conflict that happens when trying to concat the unstacked arrays in apply. They will inevitably have a coordinate with the same name as the groupby dimension (group.name), so I add a suffix to the group name.

I don't think this is necessarily a bad thing, since there is a conceptual difference between the multi-dimensional coordinate (e.g. lon) and the one-dimensional group (e.g. lon_groups).

@rabernat (Contributor, Author) commented Apr 7, 2016

My new commit supports unstacking in apply with shortcut=True. However, the behavior is kind of weird, in a way that is unique to the multidimensional case.

Consider the behavior of the test case:

>>> da = xr.DataArray([[0,1],[2,3]],
        coords={'lon': (['ny','nx'], [[30,40],[40,50]] ),
                'lat': (['ny','nx'], [[10,10],[20,20]] ),},
        dims=['ny','nx'])
>>> da.groupby('lon').apply(lambda x : x - x.mean(), shortcut=False)
<xarray.DataArray (lon_groups: 3, ny: 2, nx: 2)>
array([[[ 0. ,  nan],
        [ nan,  nan]],

       [[ nan, -0.5],
        [ 0.5,  nan]],

       [[ nan,  nan],
        [ nan,  0. ]]])
Coordinates:
  * ny          (ny) int64 0 1
  * nx          (nx) int64 0 1
    lat         (lon_groups, ny, nx) float64 10.0 nan nan nan nan 10.0 20.0 ...
    lon         (lon_groups, ny, nx) float64 30.0 nan nan nan nan 40.0 40.0 ...
  * lon_groups  (lon_groups) int64 30 40 50

When unstacking, the indices that are not part of the group get filled with nans. We are not able to put these arrays back together into a single array.

Note that if we do not rename the group name here:
https://github.com/pydata/xarray/pull/818/files#diff-96b65e0bfec9fd2b9d562483f53661f5R121

Then we get an error here:
https://github.com/pydata/xarray/pull/818/files#diff-96b65e0bfec9fd2b9d562483f53661f5R407

ValueError: the variable 'lon' has the same name as one of its dimensions ('lon', 'ny', 'nx'), but it is not 1-dimensional and thus it is not a valid index

applied = (maybe_wrap_array(arr, func(arr, **kwargs)) for arr in grouped)
applied = (self._maybe_unstack_array(
maybe_wrap_array(arr,func(arr, **kwargs)))
for arr in grouped)
combined = self._concat(applied, shortcut=shortcut)
result = self._maybe_restore_empty_groups(combined)
Member: I think we want to unstack result once, not each array in applied.

rabernat (author):

Yes, that seems obvious now! (But only in retrospect. ;)

I just tried putting the call to _maybe_unstack_arrays around combined.

But the problem is that the multi-index is no longer present in combined. If I add a call to print(combined.indexes), I get

stacked_ny_nx: Index([(0, 0), (0, 1), (1, 0), (1, 1)], dtype='object', name=u'stacked_ny_nx')

i.e. just a regular index. This makes unstacking impossible (or a lot harder). I guess this is related to #769.

Unfortunately I have to copy the group variable because of the pandas / cython bug referenced above.

So kind of stuck.
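To illustrate the problem, unstack only round-trips when the stacked dimension keeps its MultiIndex (a minimal sketch):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4).reshape(2, 2), dims=['ny', 'nx'])
stacked = da.stack(stacked_ny_nx=['ny', 'nx'])
print(stacked.indexes['stacked_ny_nx'])  # a pandas MultiIndex
stacked.unstack('stacked_ny_nx')         # round-trips only because the MultiIndex survived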

rabernat (author):

Never mind...I think I see the way forward.

Member:

I think the issue here is actually that Variable.concat with a multiindex does not preserve the multiindex.

@shoyer (Member) commented Apr 7, 2016

I think that if we unstack things properly (only once, instead of on each applied array) we should get something like this, alleviating the need for the new group name:

<xarray.DataArray (ny: 2, nx: 2)>
array([[ 0. , -0.5],
       [ 0.5,  0. ]])
Coordinates:
  * ny          (ny) int64 0 1
  * nx          (nx) int64 0 1

@rabernat (Contributor, Author) commented Apr 7, 2016

I think I got it working.

@rabernat (Contributor, Author) commented Apr 7, 2016

The Travis build failure is a conda problem, not my commit.

@rabernat (Contributor, Author) commented Apr 8, 2016

@shoyer regarding the binning, should I modify resample to allow for non-time dimensions? Or a new function?

@shoyer (Member) commented Apr 8, 2016

@rabernat I'm not quite sure resample is the right place to put this, given that we aren't resampling on an axis. Just opened a pandas issue to discuss: pandas-dev/pandas#12828

@rabernat (Contributor, Author) commented Apr 8, 2016

I have tried adding a new bins keyword argument to groupby, which should accomplish what I want and more. (It will also work on regular one-dimensional groupby operations.)

The way it works is like this:

>>> ar = xr.DataArray(np.arange(4), dims='dim_0')
>>> ar
<xarray.DataArray (dim_0: 4)>
array([0, 1, 2, 3])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
>>> ar.groupby('dim_0', bins=[2,4]).sum()
<xarray.DataArray (dim_0: 2)>
array([1, 5])
Coordinates:
  * dim_0    (dim_0) int64 2 4

The only problem is that it seems to overwrite the original dimension of the array! After calling groupby:

>>> ar
<xarray.DataArray (dim_0: 4)>
array([0, 1, 2, 3])
Coordinates:
  * dim_0    (dim_0) int64 2 4

I think that resample overcomes this issue by renaming the dimension:
https://github.com/pydata/xarray/blob/master/xarray/core/common.py#L437

I guess something similar should be possible here...
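A sketch of what the renamed-dimension behavior would look like, written with the groupby_bins spelling this PR later adopts:

import numpy as np
import xarray as xr

ar = xr.DataArray(np.arange(4), dims='dim_0',
                  coords={'dim_0': np.arange(4)})
binned = ar.groupby_bins('dim_0', bins=[-0.5, 1.5, 3.5]).sum()
print(binned.dims)         # ('dim_0_bins',): the grouped dimension gets a new name
print(ar['dim_0'].values)  # [0 1 2 3]: the original coordinate is untouched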

@@ -338,6 +356,11 @@ def lookup_order(dimension):
new_order = sorted(stacked.dims, key=lookup_order)
return stacked.transpose(*new_order)

def _restore_multiindex(self, combined):
if self._stacked_dim is not None and self._stacked_dim in combined.dims:
combined[self._stacked_dim] = self.group[self._stacked_dim]
Member: Is this where we somehow modify the original object being grouped? Maybe better to try using assign instead of mutation.

rabernat (author):

Ok, I can try that.

rabernat (author):

But to clarify, we are just restoring the dimension to what it was supposed to be before the multiindex index got mangled by _concat.

Member:

Yeah, this doesn't seem like the likely source -- I'm just going out on a limb here. Also, we might want to actually fix this in concat eventually instead (that would be the more general solution).

@rabernat (Contributor, Author) commented:

So I tracked down the cause of the original array dimensions being overwritten. It happens within _concat_shortcut here:
https://github.com/pydata/xarray/blob/master/xarray/core/groupby.py#L325

result._coords[concat_dim.name] = as_variable(concat_dim, copy=True)

At this point, self.obj gets modified directly.

@shoyer should I just focus on the case where shortcut==False? Or should I try to debug the _concat_shortcut method? Your inline comments ("don't worry too much about maintaining this method") suggest that it is not going to be around forever.

@clarkfitzg (Member) commented:

Ah, now I see what you were going for. More going on here than I realized. That's a nice plot :)

@rabernat (Contributor, Author) commented:

Just a little update--I realized that calling apply on multidimensional binned groups fails when the group is not reduced. For example

ds.groupby_bins('lat', lat_bins).apply(lambda x: x - x.mean())

raises errors because of conflicting coordinates when trying to concat the results. I only discovered this when making my tutorial notebook. I think I know how to fix it, but I haven't had time yet.

So it is moving along... I am excited about this feature and am confident it can make it into the next release.

@rabernat (Contributor, Author) commented Jun 5, 2016

@shoyer, @jhamman, could you give me some feedback on one outstanding issue with this PR? I am stuck on a kind of obscure edge case, but I really want to get this finished.

Consider the following groupby operation, which creates bins which are finer than the original coordinate. In other words, some bins are empty because there are too many bins.

import numpy as np
import xarray as xr

dat = xr.DataArray(np.arange(4))
dim_0_bins = np.arange(0, 4.5, 0.5)
gb = dat.groupby_bins('dim_0', dim_0_bins)
print(gb.groups)

gives

{'(0.5, 1]': [1], '(2.5, 3]': [3], '(1.5, 2]': [2]}

If I try a reducing apply operation, e.g. gb.mean(), it works fine. However, if I do

gb.apply(lambda x: x - x.mean())

I get an error on the concat step

--> 433         combined = self._concat(applied, shortcut=shortcut)
... [long stack trace]
IndexError: index 3 is out of bounds for axis 1 with size 3

I'm really not sure what the "correct behavior" should even be in this case. It is not even possible to reconstitute the original data array by doing gb.apply(lambda x: x). The same problem arises when the groups do not span the entire coordinate (e.g. dim_0_bins = [1,2,3]).

Do you have any thoughts / suggestions? I'm not sure I can solve this issue right now, but I would at least like to have a more useful error message.

@shoyer (Member) commented Jun 6, 2016

I think I can fix this, by making concatenation work properly on index objects. Stay tuned...

@rabernat (Contributor, Author) commented Jun 6, 2016

@shoyer: I'm not sure this is as simple as a technical fix. It is a design question.

With regular groupby, the groups are guaranteed to span the original coordinates exactly, so you can always put the original dataarrays back together from the groupby object, i.e. ds.groupby('dim_0').apply(lambda x: x).

With groupby_bins, the user specifies the bins and might do so in such a way that

  • there are empty groups
  • there are indices which don't belong to any group

In both cases, it is not obvious to me what should happen when calling .apply(lambda x: x). Especially for the latter, I would probably want to raise an error informing the user that their bins are not sufficient to reconstitute the full index.

@shoyer (Member) commented Jun 6, 2016

Empty groups should be straightforward -- we should be able to handle them.

Indices which don't belong to any group are indeed more problematic. I think we have three options here:

  1. Raise an error when calling .groupby_bins(...)
  2. Raise an error when calling .groupby_bins(...).apply(...)
  3. Simply concatenate back together whatever items were grouped, and give up on the guarantee that applying the identity function restores the original item.

I think my preference would be for option 3, though 1 or 2 could be reasonable workarounds for now (raising NotImplementedError), because 3 is likely to be a little tricky to implement.

@shoyer (Member) commented Jun 8, 2016

I think #875 should fix the issue with concatenating index objects.

@rabernat (Contributor, Author) commented Jun 8, 2016

> I think #875 should fix the issue with concatenating index objects.

Should I try to merge your branch with my branch...or wait for your branch to get merged into master?

@shoyer (Member) commented Jun 8, 2016

Looks like I still have a bug (failing Travis builds). Let me see if I can get that sorted out first.

@rabernat (Contributor, Author) commented Jul 6, 2016

I just rebased and updated this PR. I have not resolved all of the edge cases, such as what to do about non-reducing groupby_bins operations that don't span the entire coordinate. Unfortunately merging @shoyer's fix from #875 did not resolve this problem, at least not in a way that was obvious to me.

My feeling is that this PR in its current form introduces some very useful new features. For my part, I am eager to start using it for actual science projects. Multidimensional grouping is unfamiliar territory. I don't think every potential issue can be resolved by me right now via this PR--I don't have the necessary skills, nor can I anticipate every use case. I think that getting this merged and out in the wild will give us some valuable user feedback which will help figure out where to go next. Plus it would get exposed to developers with the skills to resolve some of the issues. By waiting much longer, we risk it going stale, since lots of other xarray elements are also in flux.

Please let me know what you think.

@@ -343,6 +343,60 @@ def groupby(self, group, squeeze=True):
group = self[group]
return self.groupby_cls(self, group, squeeze=squeeze)

def groupby_bins(self, group, bins, right=True, labels=None, precision=3,
include_lowest=False, squeeze=True):
Member:
indent should match the opening parentheses

@shoyer (Member) commented Jul 6, 2016

@rabernat I agree. I have a couple of minor style/pep8 issues, and we need an entry for "what's new", but let's merge this. I can then play around a little bit with potential fixes.

@shoyer (Member) commented Jul 8, 2016

OK, merging.....

@shoyer merged commit a0a3860 into pydata:master on Jul 8, 2016