Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistency in whether index is created with new dimension coordinate? #4417

Closed
johnomotani opened this issue Sep 10, 2020 · 6 comments
Closed

Comments

@johnomotani
Copy link
Contributor

It seems like set_coords() doesn't create an index variable. Is there a reason for this? I was surprised that the following code snippets produce different Datasets (first one has empty indexes, second one has x in indexes), even though both Datasets have a 'dimension coordinate' x:

(1)

import numpy as np
import xarray as xr

ds = xr.Dataset()
ds['a'] = ('x', np.linspace(0,1))
ds['b'] = ('x', np.linspace(3,4))
ds = ds.rename(b='x')
ds = ds.set_coords('x')

print(ds)
print('indexes', ds.indexes)

(2)

import numpy as np
import xarray as xr

ds = xr.Dataset()
ds['a'] = ('x', np.linspace(0,1))
ds['x'] = ('x', np.linspace(3,4))

print(ds)
print('indexes', ds.indexes)

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-47-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.16.0
pandas: 1.1.1
numpy: 1.18.5
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.23.0
distributed: 2.25.0
matplotlib: 3.2.2
cartopy: None
seaborn: None
numbagg: None
pint: 0.13
setuptools: 49.6.0.post20200814
pip: 20.2.3
conda: 4.8.4
pytest: 5.4.3
IPython: 7.15.0
sphinx: 3.2.1

johnomotani added a commit to boutproject/xBOUT that referenced this issue Sep 10, 2020
Workaround for possible bug with xarray,
pydata/xarray#4417, needed for ensuring
consistency so we can test for 'identical'.
@max-sixty
Copy link
Collaborator

Thanks for the clear example @johnomotani !

IIRC, the second creates an index because ds['x'] = ('x', np.linspace(3,4)) is setting a 1D variable where the name of the variable matches the name of its dimension.

I can see why this is a bit confusing, though it's not an unreasonable default imo. There have been some discussions on this on the issue tracker.

Does ds.set_index(x='b') align the examples? Or do you think there's still some confusion from the API?

@johnomotani
Copy link
Contributor Author

I haven't managed to get set_index to line up the results. This:

import numpy as np
import xarray as xr

ds = xr.Dataset()
ds['a'] = ('x', np.linspace(0,1))
ds['b'] = ('x', np.linspace(3,4))
ds = ds.rename(b='x')
ds = ds.set_coords('x')
ds = ds.set_index(x='x')

print(ds)
print('indexes', ds.indexes)

Produces a Dataset that now has an index x, but no x coordinate. The 'assignment of 1D variable' version produces both a coordinate and an index.

One suggested solution in #2461 was to use swap_dims(), and the following does produce the desired result

import numpy as np
import xarray as xr

ds = xr.Dataset()
ds['a'] = ('x', np.linspace(0,1))
ds['b'] = ('x', np.linspace(3,4))
ds = ds.swap_dims({'x': 'b'})
ds = ds.rename(b='x')

print(ds)
print('indexes', ds.indexes)

But (1) having to use swap_dims() to create a coordinate seems bizarre, (2) I think it's a bit ugly to get rid of the x dimension with swap_dims(), and then have to rename the new dimension back to x, when what I wanted was to add a coordinate to x.

I found the behaviour confusing because I wasn't aware of the index variables at all... I create a coordinate with set_coords(); then do a bunch of manipulations involving slicing pieces out of the Dataset and re-combining them with concat (at least I think this is the relevant part of my code...); finally test the result by taking slices again (with isel()) of the result and comparing them to slices of the original Dataset with xarray.testing.assert_identical(), which failed because of the missing indexes. I guess somewhere in the manipulations I did, some operation created the coordinate in a new Dataset object by assignment, at which point it generated an index for the coordinate too.

I may well be missing other context that makes this an undesirable thing to do, but for my use-case at least, I think it would make more sense if set_coords() created an index if the coordinate is a dimension coordinate (or whatever the actual criterion is for assignment of a 1d variable to create an index).

@dcherian
Copy link
Contributor

I think I was trying to fix this in #4108

@johnomotani
Copy link
Contributor Author

I think I was trying to fix this in #4108

👍 On a quick glance it looks to me like #4108 would fix my issue here.

@benbovy
Copy link
Member

benbovy commented Sep 7, 2022

From discussions in #4825 and elsewhere (e.g., #6607 (comment)), could we say that we're going towards a consensus that the index vs. coordinate API should be decoupled wherever possible?

If that's the case, then set_coords() shouldn't create any index, but it could be chained with set_index() if necessary (or swap_dims() can be used instead).

Apparently set_coords() never created an index? I tested @johnomotani's example back to v0.14. Before v2022.6.0, there was a source of confusion where the Dataset and DataArray reprs still showed an * for the dimension coordinate while it had no index. This is fixed in v2022.6.0 where the reprs are now consistent: no * and no index. Does this solve the issue here?

@johnomotani from your example it looks like the two alternatives below would produce the desired result with v2022.6.0:

  • ds.set_index(x='b')
  • ds.rename(b='x').set_coords('x').set_index(x='x')

I would encourage using the latter. It is more verbose but clearer (each method is explicit).

Note also that Xarray now also supports non-dimension coordinates with an index. Once #6971 is merged, renaming "b" to "x" won't be needed. Instead you'll be able to do something like:

ds = xr.Dataset()
ds['a'] = ('x', np.linspace(0,1))
ds['b'] = ('x', np.linspace(3,4))
ds = ds.set_coords('b').set_xindex('b')
ds.sel(b=...)

@johnomotani
Copy link
Contributor Author

Thanks @benbovy - with explicit methods now to produce the result with or without index, I think we can close this now 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants