
.reset_index()/.reset_coords() maintain MultiIndex status #8743

Closed
5 tasks done
ks905383 opened this issue Feb 13, 2024 · 6 comments
Comments

@ks905383
Contributor

What happened?

Trying to save a dataset to NetCDF using ds.to_netcdf() fails when one of the coordinates is a MultiIndex. The error message suggests using .reset_index() to remove the MultiIndex. However, saving still fails after resetting the index, and even after moving the offending coordinates to data variables with .reset_coords().

What did you expect to happen?

After calling .reset_index(), and especially after calling .reset_coords(), the save should be successful.

As shown in the example below, a dataset that xr.testing.assert_identical() confirms is identical to the failing dataset saves without a problem. (This also points to a current workaround: recreate the Dataset from scratch.)

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

# Create random dataset
ds = xr.Dataset({'test': (['lat', 'lon'], np.random.rand(2, 3))},
                coords={'lat': (['lat'], [0, 1]),
                        'lon': (['lon'], [0, 1, 2])})

# Create multiindex by stacking
ds = ds.stack(locv=('lat','lon'))
# The index shows up as a MultiIndex
print(ds.indexes)

# Try to export (this fails as expected, since multiindex)
#ds.to_netcdf('test.nc')

# Now, get rid of multiindex by resetting coords (i.e., 
# turning coordinates into data variables)
ds = ds.reset_index('locv').reset_coords()

# The index is no longer a MultiIndex
print(ds.indexes)

# Try to export - this also fails! 
#ds.to_netcdf('test.nc')

# A reference comparison dataset that is successfully asserted
# as identical 
ds_compare = xr.Dataset({k: (['locv'], ds[k].values) for k in ds})
xr.testing.assert_identical(ds_compare,ds)

# Try exporting (this succeeds)
ds_compare.to_netcdf('test.nc')

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[109], line 1
----> 1 ds.to_netcdf('test.nc')

File ~/opt/anaconda3/envs/xagg_test2/lib/python3.12/site-packages/xarray/core/dataset.py:2303, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   2300     encoding = {}
   2301 from xarray.backends.api import to_netcdf
-> 2303 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   2304     self,
   2305     path,
   2306     mode=mode,
   2307     format=format,
   2308     group=group,
   2309     engine=engine,
   2310     encoding=encoding,
   2311     unlimited_dims=unlimited_dims,
   2312     compute=compute,
   2313     multifile=False,
   2314     invalid_netcdf=invalid_netcdf,
   2315 )

File ~/opt/anaconda3/envs/xagg_test2/lib/python3.12/site-packages/xarray/backends/api.py:1315, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1310 # TODO: figure out how to refactor this logic (here and in save_mfdataset)
   1311 # to avoid this mess of conditionals
   1312 try:
   1313     # TODO: allow this work (setting up the file for writing array data)
   1314     # to be parallelized with dask
-> 1315     dump_to_store(
   1316         dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1317     )
   1318     if autoclose:
   1319         store.close()

File ~/opt/anaconda3/envs/xagg_test2/lib/python3.12/site-packages/xarray/backends/api.py:1362, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1359 if encoder:
   1360     variables, attrs = encoder(variables, attrs)
-> 1362 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File ~/opt/anaconda3/envs/xagg_test2/lib/python3.12/site-packages/xarray/backends/common.py:352, in AbstractWritableDataStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    349 if writer is None:
    350     writer = ArrayWriter()
--> 352 variables, attributes = self.encode(variables, attributes)
    354 self.set_attributes(attributes)
    355 self.set_dimensions(variables, unlimited_dims=unlimited_dims)

File ~/opt/anaconda3/envs/xagg_test2/lib/python3.12/site-packages/xarray/backends/common.py:441, in WritableCFDataStore.encode(self, variables, attributes)
    438 def encode(self, variables, attributes):
    439     # All NetCDF files get CF encoded by default, without this attempting
    440     # to write times, for example, would fail.
--> 441     variables, attributes = cf_encoder(variables, attributes)
    442     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    443     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}

File ~/opt/anaconda3/envs/xagg_test2/lib/python3.12/site-packages/xarray/conventions.py:791, in cf_encoder(variables, attributes)
    788 # add encoding for time bounds variables if present.
    789 _update_bounds_encoding(variables)
--> 791 new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    793 # Remove attrs from bounds variables (issue #2921)
    794 for var in new_vars.values():

File ~/opt/anaconda3/envs/xagg_test2/lib/python3.12/site-packages/xarray/conventions.py:179, in encode_cf_variable(var, needs_copy, name)
    157 def encode_cf_variable(
    158     var: Variable, needs_copy: bool = True, name: T_Name = None
    159 ) -> Variable:
    160     """
    161     Converts a Variable into a Variable which follows some
    162     of the CF conventions:
   (...)
    177         A variable which has been encoded as described above.
    178     """
--> 179     ensure_not_multiindex(var, name=name)
    181     for coder in [
    182         times.CFDatetimeCoder(),
    183         times.CFTimedeltaCoder(),
   (...)
    190         variables.BooleanCoder(),
    191     ]:
    192         var = coder.encode(var, name=name)

File ~/opt/anaconda3/envs/xagg_test2/lib/python3.12/site-packages/xarray/conventions.py:88, in ensure_not_multiindex(var, name)
     86 def ensure_not_multiindex(var: Variable, name: T_Name = None) -> None:
     87     if isinstance(var._data, indexing.PandasMultiIndexingAdapter):
---> 88         raise NotImplementedError(
     89             f"variable {name!r} is a MultiIndex, which cannot yet be "
     90             "serialized. Instead, either use reset_index() "
     91             "to convert MultiIndex levels into coordinate variables instead "
     92             "or use https://cf-xarray.readthedocs.io/en/latest/coding.html."
     93         )

NotImplementedError: variable 'lat' is a MultiIndex, which cannot yet be serialized. Instead, either use reset_index() to convert MultiIndex levels into coordinate variables instead or use https://cf-xarray.readthedocs.io/en/latest/coding.html.

Anything else we need to know?

This is a recent error that came up in some automated tests; an older environment still works, so xarray v2023.1.0 does not have this issue.

Given that saving works with a dataset that xr.testing.assert_identical() confirms is identical to the dataset that fails, and that ds.indexes no longer shows a MultiIndex on the failing dataset, perhaps the issue is in the check itself - i.e., in xarray.conventions.ensure_not_multiindex?

Looks like it was added recently in f9f4c73 to address another bug.

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:05:03) [Clang 16.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.1.1
pandas: 2.2.0
numpy: 1.26.3
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.8.2
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: 0.15.1
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 24.0
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.21.0
sphinx: None

@ks905383 ks905383 added bug needs triage Issue that has not been reviewed by xarray team member labels Feb 13, 2024
@ks905383
Contributor Author

Looks like it's actually related to the issue @benbovy raised in the #8672 discussion.

ks905383 added a commit to ks905383/xagg that referenced this issue Feb 13, 2024
Workaround for issues stemming from pydata/xarray#8743

By replicating the source_grid dataset from scratch / from .values before exporting it.
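The workaround referenced in that commit can be sketched roughly as follows. This is a minimal sketch using the toy dataset from the MVCE above, not the actual xagg code; the key step is rebuilding the Dataset from the raw .values arrays, which discards any lingering MultiIndex wrapper on the underlying variables:

```python
import numpy as np
import xarray as xr

# Reproduce the problematic state: stack into a MultiIndex, then reset it
# and move the former index levels to data variables.
ds = xr.Dataset(
    {"test": (["lat", "lon"], np.random.rand(2, 3))},
    coords={"lat": [0, 1], "lon": [0, 1, 2]},
)
ds = ds.stack(locv=("lat", "lon")).reset_index("locv").reset_coords()

# Workaround: rebuild the Dataset from plain numpy arrays via .values,
# so no variable carries a MultiIndex-backed data wrapper anymore.
ds_clean = xr.Dataset({k: (["locv"], ds[k].values) for k in ds})
```

After this, ds_clean.to_netcdf(...) should succeed on the affected xarray versions, since every variable is backed by an ordinary numpy array.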
@JustAmane

I have the same issue as described above. It still works with xarray v2023.12.0.

@dcherian dcherian added regression and removed needs triage Issue that has not been reviewed by xarray team member labels Feb 22, 2024
@veni-vidi-vici-dormivi

veni-vidi-vici-dormivi commented Mar 1, 2024

Same issue for me. Maybe related to #6946. Posting the same example here, modified so the reset MultiIndex levels are not dropped:

import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')

ds = ds.stack(spatial=['lon', 'lat'])
ds = ds.reset_index('spatial')  # note: without drop=True

ds.to_netcdf('test.nc')

Raises:
NotImplementedError: variable 'lon' is a MultiIndex, which cannot yet be serialized. Instead, either use reset_index() to convert MultiIndex levels into coordinate variables instead or use https://cf-xarray.readthedocs.io/en/latest/coding.html.

@mathause
Collaborator

mathause commented Mar 1, 2024

I think this was fixed in #8672 and should now work again in the newest version (v2024.02)

@veni-vidi-vici-dormivi

Yup can confirm. Thanks.

@ks905383
Contributor Author

ks905383 commented Mar 1, 2024

Can also confirm this is fixed by #8672 , thanks y'all!

@ks905383 ks905383 closed this as completed Mar 1, 2024