Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

32- vs 64-bit coordinates coordinates in where() #6573

Open
4 tasks done
forman opened this issue May 5, 2022 · 6 comments
Open
4 tasks done

32- vs 64-bit coordinates coordinates in where() #6573

forman opened this issue May 5, 2022 · 6 comments

Comments

@forman
Copy link

forman commented May 5, 2022

What happened?

I'm struggling whether this is a bug or not. At least I faced a very unexpected behaviour.

For two given data arrays a and b with same dimensions and equal coordinates, c for c = a.where(b) should have equal dimensions and coordinates.

However if the coordinates of a have dtype of float32 and those of b are float64, then the dimension sizes of c will always be two. Of course, this way the coordinates of a and b are no longer exactly equal, but from a user perspective they represent the same labels.

The behaviour is likely caused by the fact that the indexes generated for the coordinates are no longer strictly equal, therefore where() picks only the two outer cells of each dimension. Allowing to explicitly pass indexes may help here, see #6392.

What did you expect to happen?

In the case described above, the dimensions and coordinates of c should be equal to a (and b).

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float32), dims='x')
c64 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float64), dims='x')

c3 = c32.where(c64 > 0.5)
assert len(c32) == len(c3)

v32 = xr.DataArray(np.random.random(10), dims='x', coords=dict(x=c32))
v64 = xr.DataArray(np.random.random(10), dims='x', coords=dict(x=c64))

v3 = v32.where(v64 > 0.5)
assert len(v32) == len(v3)
# --> Assertion error, Expected :10, Actual :2

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:17:03) [MSC v.1929 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: AMD64 Family 25 Model 80 Stepping 0, AuthenticAMD byteorder: little LC_ALL: None LANG: None LOCALE: ('de_DE', 'cp1252') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 2022.3.0 pandas: 1.4.2 numpy: 1.21.6 scipy: 1.8.0 netCDF4: 1.5.8 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.11.3 cftime: 1.6.0 nc_time_axis: None PseudoNetCDF: None rasterio: 1.2.10 cfgrib: None iris: None bottleneck: None dask: 2022.04.1 distributed: 2022.4.1 matplotlib: 3.5.1 cartopy: None seaborn: None numbagg: None fsspec: 2022.3.0 cupy: None pint: None sparse: None setuptools: 62.1.0 pip: 22.0.4 conda: None pytest: 7.1.2 IPython: 8.2.0 sphinx: None
@forman forman added bug needs triage Issue that has not been reviewed by xarray team member labels May 5, 2022
@max-sixty
Copy link
Collaborator

This does seem very odd. Does anyone have any ideas? As per @forman , changing

-c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float32), dims='x')
+c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float64), dims='x')

...causes the assertion to pass.

I'm not sure using floats as indexes is great, but I wouldn't have expected the results to be like this...

@benbovy
Copy link
Member

benbovy commented May 5, 2022

The behaviour is likely caused by the fact that the indexes generated for the coordinates are no longer strictly equal, therefore where() picks only the two outer cells of each dimension.

Yes that's right:

v32.indexes["x"].intersection(v64.indexes["x"])
# Float64Index([0.0, 1.0], dtype='float64', name='x')

I think the issue is more general than where() and relates to the alignment of Xarray objects with 32- vs 64-bit indexed coordinates:

xr.align(c32, c64)
# (<xarray.DataArray (x: 10)>
#  array([0.        , 0.11111111, 0.22222222, 0.33333334, 0.44444445,
#         0.5555556 , 0.6666667 , 0.7777778 , 0.8888889 , 1.        ],
#        dtype=float32)
#  Dimensions without coordinates: x,
#  <xarray.DataArray (x: 10)>
#  array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
#         0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])
#  Dimensions without coordinates: x)

xr.align(v32.x, v64.x)
# (<xarray.DataArray 'x' (x: 2)>
#  array([0., 1.], dtype=float32)
#  Coordinates:
#    * x        (x) float64 0.0 1.0,
#  <xarray.DataArray 'x' (x: 2)>
#  array([0., 1.])
#  Coordinates:
#    * x        (x) float64 0.0 1.0)

A possible solution would be to handle this special case internally by converting one of the index according to the dtype of the coordinate labels of the other index, similarly to what we are currently doing for the labels that are passed to .sel() (#3153). This should be pretty easy to implement in PandasIndex.join() I think.

However, I'm also wondering whether or not we should consider this as a bug. It would make sense to have a behavior that is consistent with .sel(), even though it is not a free nor transparent operation (implicit creation of a temporary pandas index). But how about .equals()? I.e.,

v32.x.equals(v64.x)
# False  -- Should we return True here?

This would be quite weird and wouldn't match the Xarray, Pandas and Numpy behavior below:

v32.indexes["x"].equals(v64.indexes["x"])
# False

c64.equals(c32)
# False

np.all(c32.values == c64.values)
# False

@max-sixty
Copy link
Collaborator

It could be coherent to have:

  • v32.x.equals(v64.x) be false — the indexes themselves aren't the same
  • the join allow some float imprecision (similar to method=nearest), which would conveniently allow cases like this to work

I could also imagine raising an error here and having the user coerce the type. That seems less surprising that the current situation. Other languages don't allow floats to be compared for equality at all...

@dcherian
Copy link
Contributor

dcherian commented May 5, 2022

Maybe we should add an explicit join kwarg, so the safe thing to specify is join="exact"

@forman
Copy link
Author

forman commented May 16, 2022

the join allow some float imprecision (similar to method=nearest), which would conveniently allow cases like this to work

I like that.

@dcherian dcherian removed the needs triage Issue that has not been reviewed by xarray team member label May 17, 2022
@benbovy
Copy link
Member

benbovy commented Sep 28, 2022

I also like the idea of alignment with some tolerance. There is an open PR #4489, which needs to be reworked in the context of the explicit index refactor.

Alternatively to a new kwarg we could add an index build option, e.g., ds.set_xindex("x", index_cls=PandasIndex, align_tolerance=1e-6), but then it is not obvious how to handle different tolerance values given for the indexes to compare. Maybe this could depend on the given join method? E.g., pick the smallest tolerance for join=inner, the largest for join=outer, the tolerance of the left index for join=left, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Would enable this
Development

No branches or pull requests

4 participants