scalar_level in MultiIndex #1426

Closed

Conversation

@fujiisoup (Member) commented May 25, 2017

[Edit for more clarity]
I started a new branch to fix #1408 (I closed the older one, #1412).

(figure: my_proposal)

Because the changes I made are relatively large, I summarize this PR here.

Summary

In this PR, I introduce two kinds of levels in MultiIndex: index-level and scalar-level.
An index-level is an ordinary level of a MultiIndex (as in the current implementation),
while a scalar-level indicates a dropped level (newly added in this PR).

Changes in behavior

  1. Indexing a scalar at a particular level now changes that level to a scalar-level instead of dropping it (changed from #767, MultiIndex and data selection).
  2. When indexing a scalar from a MultiIndex, the selected value now becomes a MultiIndex-scalar rather than a scalar tuple.
  3. Indexing along an index-level is enabled if the MultiIndex has only a single index-level.

Examples of the output are shown below.
Any suggestions for these behaviors are welcome.

In [1]: import numpy as np
   ...: import xarray as xr
   ...: 
   ...: ds1 = xr.Dataset({'foo': (('x',), [1, 2, 3])}, {'x': [1, 2, 3], 'y': 'a'})
   ...: ds2 = xr.Dataset({'foo': (('x',), [4, 5, 6])}, {'x': [1, 2, 3], 'y': 'b'})
   ...: # example data
   ...: ds = xr.concat([ds1, ds2], dim='y').stack(yx=['y', 'x'])
   ...: ds
Out[1]: 
<xarray.Dataset>
Dimensions:  (yx: 6)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'  # <--- this is index-level
  - x        (yx) int64 1 2 3 1 2 3            # <--- this is also index-level
Data variables:
    foo      (yx) int64 1 2 3 4 5 6

In [2]: # 1. indexing a scalar converts `index-level` x to `scalar-level`.
   ...: ds.sel(x=1)
Out[2]: 
<xarray.Dataset>
Dimensions:  (yx: 2)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'b'  # <--- this is index-level
  - x        int64 1              # <--- this is scalar-level
Data variables:
    foo      (yx) int64 1 4

In [3]: # 2. indexing a single element from MultiIndex makes a `MultiIndex-scalar`
   ...: ds.isel(yx=0)
Out[3]: 
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    yx       MultiIndex          # <--- this is MultiIndex-scalar
  - y        <U1 'a'
  - x        int64 1
Data variables:
    foo      int64 1

In [6]: # 3. Enables selecting along an `index-level` if only one `index-level` exists in the MultiIndex
   ...: ds.sel(x=1).isel(y=[0,1])
Out[6]: 
<xarray.Dataset>
Dimensions:  (yx: 2)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'b'
  - x        int64 1
Data variables:
    foo      (yx) int64 1 4

Changes in the public APIs

Some changes to the public APIs were necessary, though I tried to minimize them.

  • The level_names and get_level_values methods were moved from IndexVariable to Variable.
    This is because IndexVariable cannot handle a 0-d array, which I want to support in item 2 above.

  • scalar_level_names and all_level_names properties were added to Variable.

  • A reset_levels method was added to the Variable class to control scalar-level and index-level.
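For reference, level access in released xarray already goes through the IndexVariable and its wrapped pandas index; a minimal, hedged sketch against a contemporary xarray (v0.9+), rebuilding the example dataset from In [1] above:

import xarray as xr

ds1 = xr.Dataset({'foo': (('x',), [1, 2, 3])}, {'x': [1, 2, 3], 'y': 'a'})
ds2 = xr.Dataset({'foo': (('x',), [4, 5, 6])}, {'x': [1, 2, 3], 'y': 'b'})
ds = xr.concat([ds1, ds2], dim='y').stack(yx=['y', 'x'])

# Level metadata currently lives on the wrapped pandas MultiIndex.
print(ds['yx'].to_index().names)               # the level names, i.e. y and x
print(ds.indexes['yx'].get_level_values('x'))  # the values of level x: 1 2 3 1 2 3

With this PR, the same information would also live on Variable itself (including 0-d variables), via the methods and properties listed above.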

Implementation summary

The main change in the implementation is the addition of our own wrapper of pd.MultiIndex, PandasMultiIndexAdapter.
This handles most MultiIndex-related operations, such as indexing, concatenation, and conversion between `scalar-level` and `index-level`; a hypothetical sketch of the idea is shown below.
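To illustrate the idea only (this is not the PR's actual PandasMultiIndexAdapter code, and all names here are invented), a minimal adapter around pd.MultiIndex that keeps dropped levels around as scalar values instead of discarding them could look like this:

import pandas as pd

class MultiIndexAdapterSketch:
    """Hypothetical sketch, not the PR implementation: wrap a pd.MultiIndex
    and remember which levels have been reduced to scalars."""

    def __init__(self, index, scalar_levels=None):
        self.index = index                              # the wrapped pd.MultiIndex
        self.scalar_levels = dict(scalar_levels or {})  # level name -> scalar value

    @property
    def index_level_names(self):
        return [n for n in self.index.names if n not in self.scalar_levels]

    def sel_scalar(self, **labels):
        """Select a single label at one or more levels; instead of dropping
        the level, record its value as a scalar-level."""
        index = self.index
        scalars = dict(self.scalar_levels)
        for name, value in labels.items():
            index = index[index.get_level_values(name) == value]
            scalars[name] = value
        return MultiIndexAdapterSketch(index, scalars)

idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]], names=['y', 'x'])
selected = MultiIndexAdapterSketch(idx).sel_scalar(x=1)
print(selected.index_level_names)  # ['y']
print(selected.scalar_levels)      # {'x': 1}

The actual adapter in this PR additionally covers concatenation and the conversion back between the two kinds of levels, as summarized above.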

What we can do now

The main merit of this proposal is that it enables us to handle a MultiIndex in a way that is more consistent with a normal Variable.
Now we can

  • recover a MultiIndex with a dropped level.
In [5]: ds.sel(x=1).expand_dims('x')
Out[5]: 
<xarray.Dataset>
Dimensions:  (yx: 2)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'b'
  - x        (yx) int64 1 1
Data variables:
    foo      (yx) int64 1 4
  • construct a MultiIndex by concatenation of MultiIndex-scalars.
In [8]: xr.concat([ds.isel(yx=i) for i in range(len(ds['yx']))], dim='yx')
Out[8]: 
<xarray.Dataset>
Dimensions:  (yx: 6)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'
  - x        (yx) int64 1 2 3 1 2 3
Data variables:
    foo      (yx) int64 1 2 3 4 5 6

What we cannot do now

With the current implementation, we can do

ds.sel(y='a').rolling(x=2)

but with this PR we cannot, because x is not an ordinary coordinate but a MultiIndex with a single index-level.
I think it would be better if we could handle such a MultiIndex with a single index-level in a way very similar to an ordinary coordinate.

Similarly, we cannot do ds.sel(y='a').mean(dim='x') either.
Also, ds.sel(y='a').to_netcdf('file') does not work (#719).

What are to be decided

  • How should we repr these new levels? (The current formatting is shown in Out[2] and Out[3] above.)
  • Are terms such as index-level, scalar-level, and MultiIndex-scalar clear enough?
  • How many operations should we support for a MultiIndex with a single index-level?
    Should we support ds.sel(y='a').rolling(x=2) and ds.sel(y='a').mean(dim='x')?

TODOs

Dimensions:  (x: 2)
Coordinates:
  * x        (x) MultiIndex
  - level_1  <U1 'a'
@fujiisoup (Member Author):

The test now fails here for Python 2.7, which seems to interpret the dtype of str as |S1 rather than <U1. This should be fixed, but it is not related to the scalar-level of MultiIndex.

@shoyer (Member) commented May 30, 2017

Sorry for the delay getting back to you here -- I'm still thinking through the implications of this change.

This does make the handling of MultiIndex type data much more consistent, but calling scalars MultiIndex-scalar seems quite confusing to me. I think of the data-type here as closer to NumPy's structured types, except without the implied storage format for the data.

However, taking a step back, I wonder if this is the right approach. In many ways, structured dtypes are similar to xarray's existing data structures, so supporting them fully means a lot of duplicated functionality. MultiIndexes (especially with scalars) should work similarly to separate variables, but they are implemented very differently under the hood (all the data lives in one variable).

(See pandas-dev/pandas#3443 for related discussion about pandas and
why it doesn't support structured dtypes.)

It occurs to me that if we had full support for indexing on coordinate levels, we might not need a notion of a "MultiIndex" in the public API at all. To make this more concrete, what if this was the repr() for the result of ds.stack(yx=['y', 'x']) in your first example?

<xarray.Dataset>
Dimensions:  (yx: 6)
Coordinates:
    y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'
    x        (yx) int64 1 2 3 1 2 3
Data variables:
    foo      (yx) int64 1 2 3 4 5 6

If we supported MultiIndex-like indexing for x and y, this could be nearly equivalent to a MultiIndex with much less code duplication. The important practical difference is that here there are no labels along the yx, so ds['yx'][0] would not return a tuple. Also, we would need to figure out some way to explicitly signal what should become part of a MultiIndex when we convert to a pandas DataFrame.

Pandas has a MultiIndex because it needed a way to group multiple arrays together into a single index array. In xarray this is less necessary, because we have multiple coordinates to represent levels, and xarray itself no longer needs a MultiIndex notion because we no longer require coordinate labels for every dimension (as of v0.9).
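A rough, hedged illustration of how this model can already be approximated in today's xarray: reset_index turns the levels into plain (non-index) coordinates along yx, and selection on them can be emulated positionally (flat and rows are just local names for this sketch, not proposed API):

import numpy as np
import xarray as xr

ds1 = xr.Dataset({'foo': (('x',), [1, 2, 3])}, {'x': [1, 2, 3], 'y': 'a'})
ds2 = xr.Dataset({'foo': (('x',), [4, 5, 6])}, {'x': [1, 2, 3], 'y': 'b'})
stacked = xr.concat([ds1, ds2], dim='y').stack(yx=['y', 'x'])

# Drop the MultiIndex: y and x become ordinary coordinates along yx,
# which is roughly the repr sketched above.
flat = stacked.reset_index('yx')

# Emulate label-based selection on the coordinate x.
rows = np.flatnonzero(flat['x'].values == 1)
print(flat.isel(yx=rows))  # the two rows where x == 1

Making ds.sel(x=1) do this directly (and efficiently, through a real index) is essentially what full support for indexing on coordinate levels would mean.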

CC @benbovy

@fujiisoup (Member Author):

@shoyer Thanks for the comment.

It occurs to me that if we had full support for indexing on coordinate levels, we might not need a notion of a "MultiIndex" in the public API at all.

Actually, I am not yet fully comfortable with my implementation,
and I like your idea, as it might be much cleaner and simpler than mine.

If my understanding is correct, does it mean that we will support
ds.sel(x='a'), ds.isel(x=[0, 1]) and ds.mean(dim='x') with your example data?
Will it raise an error if a coordinate is more than one-dimensional?
How about ds.sel(x='a', y=[1, 2])?

@fmaussion (Member):

It occurs to me that if we had full support for indexing on coordinate levels, we might not need a notion of a "MultiIndex" in the public API at all.

This would be awesome and so much clearer for many users including me, who understand "coordinates" much better than "MultiIndex".

@benbovy (Member) commented May 30, 2017

I also fully agree that using multiple coordinate (index) variables instead of a MultiIndex would greatly simplify things both internally and for users!

A dimension with a single 'real' coordinate (i.e., an IndexVariable) that wraps a MultiIndex with multiple 'levels' that can be accessed (and indexed) as 'virtual' coordinates indeed represents a lot of unnecessary complexity! A dimension having multiple 'real' coordinates that can be used with .sel - or even .isel - is much simpler to understand and maybe to implement.

Using multiple 'real' coordinates, I don't see any reason why ds.sel(x='a'), ds.isel(x=[0, 1]) or ds.sel(x='a', y=[1, 2]) would not be supported. However, we need to choose what to do in case of conflicts, e.g., ds.isel(x=[0, 1], y=[1, 2]). Raise an error? Return a result equivalent to ds.isel(yx=1) ("and"), or equivalent to ds.isel(x=[0, 1, 2]) ("or")?

The important practical difference is that here there are no labels along the yx, so ds['yx'][0] would not return a tuple. Also, we would need to figure out some way to explicitly signal what should become part of a MultiIndex when we convert to a pandas DataFrame.

I'm thinking about something like this:

<xarray.Dataset>
Dimensions:  (yx: 6)
Coordinates:
  * yx       (yx) CoordinateGroup
  - y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'
  - x        (yx) int64 1 2 3 1 2 3
Data variables:
    foo      (yx) int64 1 2 3 4 5 6

It may present several advantages:

  • Instead of being listed as a dimension without coordinates (which is not true), yx would have a CoordinateGroup that would simply consist of a lightweight object that only contains references to the x and y coordinates.

  • CoordinateGroup may behave like a virtual coordinate so that ds['yx'][0] still returns a tuple (there may not be many use cases for this, though).

  • set_index, reset_index and reorder_levels can still be used to explicitly create, modify or remove a CoordinateGroup for a given dimension.

  • It is trivial to convert a CoordinateGroup to a MultiIndex when we convert to a pandas DataFrame. Following @fmaussion's comment above, I think a name like CoordinateGroup is much easier for xarray users to understand than the name MultiIndex.

  • In repr(), x and y are still shown next to each other.
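To make the "lightweight object" concrete, here is a purely hypothetical sketch (the class name and methods are invented for illustration, not part of this PR or of xarray): it stores only the dimension name and the names of the grouped coordinates, and materializes a pandas.MultiIndex on demand for DataFrame conversion.

import pandas as pd
import xarray as xr

class CoordinateGroupSketch:
    """Hypothetical: references to the grouped coordinates, nothing more."""

    def __init__(self, dim, coord_names):
        self.dim = dim
        self.coord_names = list(coord_names)

    def to_pandas_index(self, ds):
        # Build a pandas.MultiIndex only when needed (e.g. for to_dataframe).
        arrays = [ds[name].values for name in self.coord_names]
        return pd.MultiIndex.from_arrays(arrays, names=self.coord_names)

ds1 = xr.Dataset({'foo': (('x',), [1, 2, 3])}, {'x': [1, 2, 3], 'y': 'a'})
ds2 = xr.Dataset({'foo': (('x',), [4, 5, 6])}, {'x': [1, 2, 3], 'y': 'b'})
flat = xr.concat([ds1, ds2], dim='y').stack(yx=['y', 'x']).reset_index('yx')

group = CoordinateGroupSketch('yx', ['y', 'x'])
print(group.to_pandas_index(flat))  # a pandas MultiIndex with levels y and x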

@shoyer (Member) commented May 31, 2017

If my understanding is correct, does it mean that we will support
ds.sel(x='a'), ds.isel(x=[0, 1]) and ds.mean(dim='x') with your example data?
Will it raise an Error if Coordinate is more than 1 dimensional?
How about ds.sel(x='a', y=[1, 2])?

I was only thinking about .sel() (as works currently with MultiIndex). I'm not sure about the others yet.

@benbovy although a CoordinateGroup is definitely better than MultiIndex-scalar, it still feels like a very similar notion. It could make for a nice internal clean-up, but from a user perspective I think it's about as confusing as a MultiIndex -- it's just as many terms to keep track of.

Right now, our user-facing API in xarray exposes three related concepts:

  • Coordinate
  • Index
  • MultiIndex

Eliminating any of these concepts would be an improvement.

To this end, I have two (vague) proposals:

  1. Eliminate MultiIndex. We only have a notion of "indexed" coordinates, marked by * in the repr, which don't necessarily correspond to dimensions. Indexed coordinates, which are immutable, can have any number of dimensions, and you can have any number of "indexed" coordinates per dimension. Indexing, concatenating and expanding dimensions should not change their nature.
  2. Eliminate both MultiIndex and explicit indexes. Indexes required for efficient operations are created on the fly when necessary. This might be too magical.

@fujiisoup (Member Author):

@shoyer
I personally think 2 is more intuitive for users,
because it might be difficult to distinguish

<xarray.Dataset>
Dimensions:  (yx: 6)
Coordinates:
    y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'
Data variables:
    foo      (yx) int64 1 2 3 4 5 6

(which may be generated by indexing from x in your example) from

<xarray.Dataset>
Dimensions:  (y: 6)
Coordinates:
 *  y        (y) object 'a' 'a' 'a' 'b' 'b' 'b'
Data variables:
    foo      (y) int64 1 2 3 4 5 6

What possible confusion would arise if we adopt 2?

@benbovy (Member) commented Jun 1, 2017

@fujiisoup I agree that, given your example, proposal 2 might be more intuitive; however, IMHO implicit indexes do seem a bit too magical. Although I don't have any concrete example in mind, I guess that sometimes it would be hard to really understand what's going on.

Exposing fewer concepts to users would indeed be an improvement, unless it makes things too implicit or magical.

Let me try to give a more detailed proposal than in my previous comment, which generalizes to potential features like multi-dimensional indexers (see @shoyer's comment, which I'd be happy to start working on soon).

It is actually very much like proposal 1, with only one additional concept (called "super index" below).

  • DataArray and Dataset objects may have coordinates, which are the variables listed in da.coords or ds.coords. These variables may be 1-dimensional or n-dimensional.

  • Among these coordinates, some are "indexed" coordinates. These are marked by * in the repr and can be used in .sel and .isel as keyword arguments.

  • Some coordinates may be grouped together and wrapped by some kind of "super index". These super indexes are also marked by * in the repr, and the coordinates that are part of one are shown just below it with the - marker. Each coordinate wrapped by a super index is considered an indexed coordinate: it is still listed in da.coords or ds.coords and it can also be used in .sel and .isel as a keyword argument. This is different for the super index itself, which is not listed in .coords. If needed, we might make super indexes accessible as virtual coordinates: they would then return arrays of tuples with the values of the wrapped coordinates.

Examples of super indexes:

  • KDTree. It allows multi-dimensional coordinates to be indexed using a KDTree.
  • Similarly, BallTree or RTree...
  • MultiIndex (or CoordinateGroup, or any better name). It allows us to explicitly define multiple indexes for a given dimension and to define the behavior when, for example, we select data with conflicting labels in different coordinates. It also naturally converts to a pandas.MultiIndex when we want to convert to a DataFrame.

"Super index" is an additional concept that has to be understood by users, which is in principle bad, but here I think it's worth as it potentially gives a good generic model for explicit handling of various, advanced indexes that involve multiple coordinates.

@fujiisoup (Member Author):

@benbovy
Sorry for my late reply.

I think I like your proposal, which bundles multiple xarray concepts such as MultiIndex and multi-dimensional coordinates into one and may result in a simpler API.
But I cannot yet fully imagine how your proposal works with multi-dimensional coordinates
(maybe because I am not very accustomed to multi-dimensional coordinates).

Currently, the 'rasm' example looks like

In [1]: import xarray as xr
In [2]: xr.tutorial.load_dataset('rasm', decode_times=False)
Out[2]: 
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) float64 7.226e+05 7.226e+05 7.227e+05 7.227e+05 ...
    xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
    yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes: 
   ...

Does your proposal (automatically) change this to something like

<xarray.Dataset>
Dimensions:  (time: 36, xy: 56375)
Coordinates:
  * time     (time) float64 7.226e+05 7.226e+05 7.227e+05 7.227e+05 ...
    xc       (xy) float64 189.2 189.0 188.7 188.5 188.2 187.9 187.7 187.4 ...
    yc       (xy) float64 16.53 16.69 16.85 17.01 17.17 17.32 17.48 17.63 ...
  * xy       (xy) SuperIndex
  - x        (xy) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y        (xy) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    Tair     (time, xy) float64 nan nan nan nan nan nan nan nan nan nan nan ...
Attributes:
   ...

?

@benbovy (Member) commented Jun 21, 2017

Although I haven't thought about all the details regarding this, I think that in the case of multi-dimensional coordinates a "super index" would rather allow directly using these coordinates for indexing, which is currently not possible.

In your 'rasm' example, it would rather look like

<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) float64 7.226e+05 7.226e+05 7.227e+05 7.227e+05 ...
  * spatial_index  (y, x) KDTree
  - xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
  - yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes: 
   ...

and it would allow writing

In [1]: ds.sel(xc=<...>, yc=<...>, method='nearest')

Note that x and y dimensions still don't have coordinates.

That's actually what @shoyer suggested here.

The proposal above is more about having the same API for groups of coordinates that can be indexed using a "wrapped" index object (maybe "wrapped index" is a better name than "super index"?), but the logic can be very different from one index object to another.
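As a hedged, stand-alone illustration of what a KDTree-backed index could do under the hood for the ds.sel(xc=..., yc=..., method='nearest') call sketched above, here is what one can already write today with scipy.spatial.cKDTree on a small synthetic grid (the data and sizes are made up; only the xc/yc naming mirrors the rasm example):

import numpy as np
import xarray as xr
from scipy.spatial import cKDTree

# Small synthetic stand-in for a rasm-like grid: 2-D xc/yc coordinates
# on dimensions (y, x) that have no coordinate labels of their own.
ny, nx = 5, 7
xc = np.linspace(180.0, 200.0, nx) * np.ones((ny, 1))
yc = np.linspace(16.0, 20.0, ny)[:, None] * np.ones((1, nx))
ds = xr.Dataset(
    {'Tair': (('y', 'x'), np.random.rand(ny, nx))},
    coords={'xc': (('y', 'x'), xc), 'yc': (('y', 'x'), yc)},
)

# Nearest-neighbour lookup on (xc, yc), done by hand with a KDTree.
tree = cKDTree(np.column_stack([ds['xc'].values.ravel(),
                                ds['yc'].values.ravel()]))
_, flat_idx = tree.query([190.0, 17.5])
j, i = np.unravel_index(flat_idx, (ny, nx))
print(ds.isel(y=int(j), x=int(i)))  # the grid cell closest to xc=190, yc=17.5

A "spatial_index" super index would essentially hide this bookkeeping behind .sel().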

@fujiisoup (Member Author):

I'll close this, given the recent discussion about MultiIndex.

@fujiisoup fujiisoup closed this Jan 14, 2019
Successfully merging this pull request may close these issues.

.sel does not keep selected coordinate value in case with MultiIndex