'drop_duplicates' behaves differently when using 1 vs many coordinates for an index #8499

Open
5 tasks done
jbweston opened this issue Dec 1, 2023 · 4 comments

Comments


jbweston commented Dec 1, 2023

What happened?

I am trying to drop_duplicates from a DataArray based on the values of some of the coordinates,
starting from a DataArray with coordinates, but no indexes.

To accomplish this, I call 'DataArray.set_xindex' with the appropriate coordinate names,
and then call 'drop_duplicates' on the resulting DataArray, like so:
 

from xarray import DataArray
import numpy as np

test_array = DataArray(
    np.random.rand(5), 
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

# output DataArray's 'sample' dimension has length 2, as expected
good = test_array.set_xindex(["x", "y"]).drop_duplicates("sample")
assert len(good) == 2

The above functions as expected; 'good' has had its duplicates dropped,
and we are left with a DataArray of length 2.

However, the following does not function as I would expect:

# All the 'y's are '-1', so we expect the same duplicates as before to be dropped,
# even if we don't include the 'y' values in the index.
bad = test_array.set_xindex("x").drop_duplicates("sample")
# But this assert fails! 'drop_duplicates' does not drop anything
assert not bad.equals(test_array)

What did you expect to happen?

I expected drop_duplicates to drop the duplicates when I was using only a single coordinate for the index.

Minimal Complete Verifiable Example

from xarray import DataArray
import numpy as np

test_array = DataArray(
    range(5), 
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

# output DataArray's 'sample' dimension has length 2, as expected
good = test_array.set_xindex(["x", "y"]).drop_duplicates("sample")
# And indeed there are only 2 elements left after dropping duplicates.
assert len(good) == 2

# All the 'y's are '-1', so we expect the same duplicates as before to be dropped.
bad = test_array.drop_vars("y").set_xindex("x").drop_duplicates("sample")
# But this assert fails! 'drop_duplicates' does not drop anything
assert not bad.equals(test_array.drop_vars("y"))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:34:09) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.133.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2023.11.0
pandas: 2.1.0
numpy: 1.24.4
scipy: 1.11.2
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.2.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.9.1
distributed: 2023.9.1
matplotlib: 3.7.2
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.1.2
pip: 23.2.1
conda: 23.7.3
pytest: 7.4.2
mypy: None
IPython: 8.15.0
sphinx: None

jbweston added the bug and needs triage labels Dec 1, 2023

welcome bot commented Dec 1, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

Author

jbweston commented Dec 1, 2023

FWIW I suspect this behavior is due to the different code-paths that are followed when constructing single-coordinate vs multiple-coordinate indexes in set_xindex.

Member

benbovy commented Dec 1, 2023

Thanks for the report @jbweston. I think drop_duplicates has not yet been refactored to fully support multi-coordinate indexes, so its behavior is not yet consistent with the recent explicit index refactor in Xarray. I added it to the list in #6293.

benbovy added the topic-indexing label and removed the needs triage label Dec 1, 2023
Member

benbovy commented Dec 1, 2023

To give more context, .drop_duplicates("sample") relies on .get_index("sample"), which returns either:

  • the (pandas) index of the "sample" dimension coordinate
  • a pandas.RangeIndex if no "sample" dimension coordinate is found.

Since test_array.set_xindex(["x", "y"]) creates a dimension coordinate "sample" together with the multi-index, drop_duplicates works as expected (note: we intend to deprecate the creation of that dimension coordinate for a pandas multi-index).

However, test_array.set_xindex(["x"]) creates an index for the "x" coordinate but keeps its name unchanged, i.e. it doesn't create a "sample" dimension coordinate. .get_index("sample") therefore falls back to a default RangeIndex, which contains no duplicates, so .drop_duplicates drops nothing.
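The difference described above can be inspected directly. The following is a minimal sketch that reuses the array from the report and prints what .get_index("sample") returns in each case. The variable names are illustrative, and the multi-index result depends on the xarray version, since the thread notes that creating the "sample" dimension coordinate is planned for deprecation.

```python
import numpy as np
import pandas as pd
from xarray import DataArray

test_array = DataArray(
    np.arange(5),
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

# Index built from two coordinates vs. index built from one coordinate.
multi = test_array.set_xindex(["x", "y"])
single = test_array.set_xindex("x")

# drop_duplicates("sample") looks at the result of get_index("sample").
multi_index = multi.get_index("sample")
single_index = single.get_index("sample")

print(type(multi_index).__name__, multi_index.duplicated())
print(type(single_index).__name__, single_index.duplicated())

# With a single-coordinate index there is no "sample" coordinate at all,
# so get_index has nothing to return except a default integer index.
print("sample" in single.coords)
```

In the single-coordinate case the duplicated mask is all False, which is why nothing is dropped.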

We need to refactor .drop_duplicates such that it takes into account that having multiple indexed (dimension or non-dimension) coordinates along a common dimension is now possible in Xarray.
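Until such a refactor lands, one possible workaround is to compute the duplicate mask from the coordinate values and select with isel. This is only a sketch: drop_duplicates_by_coord is a hypothetical helper, not part of xarray, and it assumes a one-dimensional coordinate along the given dimension.

```python
import numpy as np
import pandas as pd
from xarray import DataArray


def drop_duplicates_by_coord(da, coord, dim, keep="first"):
    """Hypothetical helper: drop entries along `dim` whose `coord` value repeats.

    `keep` is passed to pandas.Index.duplicated ("first", "last" or False).
    """
    # Build a throwaway pandas index from the coordinate values and
    # mark every repeated value according to `keep`.
    duplicated = pd.Index(da[coord].values).duplicated(keep=keep)
    # Keep only the positions that are not marked as duplicates.
    return da.isel({dim: ~duplicated})


test_array = DataArray(
    np.arange(5),
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

deduplicated = drop_duplicates_by_coord(test_array, "x", "sample")
print(deduplicated)
```

This does not require calling set_xindex first, because the mask is built from the coordinate values rather than from an xarray index.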
