Update contains_cftime_datetimes to avoid loading entire variable array #7494

Merged
merged 23 commits into from
Mar 7, 2023
Changes from 10 commits
2 changes: 2 additions & 0 deletions doc/whats-new.rst
@@ -99,6 +99,8 @@ Bug fixes
 - :py:func:`xarray.Dataset.to_zarr` now drops variable encodings that have been added by xarray during reading
   a dataset. (:issue:`7129`, :pull:`7500`).
   By `Hauke Schulz <https://github.com/observingClouds>`_.
+- Improved performance in ``open_dataset`` for datasets with large object arrays (:issue:`7484`, :pull:`7494`).
+  By `Alex Goodman <https://github.com/agoodm>`_.

 Documentation
 ~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion xarray/core/accessor_dt.py
@@ -574,7 +574,7 @@ def __new__(cls, obj: T_DataArray) -> CombinedDatetimelikeAccessor:
         # we need to choose which parent (datetime or timedelta) is
         # appropriate. Since we're checking the dtypes anyway, we'll just
         # do all the validation here.
-        if not _contains_datetime_like_objects(obj):
+        if not _contains_datetime_like_objects(obj.variable):
             raise TypeError(
                 "'.dt' accessor only available for "
                 "DataArray with datetime64 timedelta64 dtype or "
36 changes: 19 additions & 17 deletions xarray/core/common.py
@@ -40,6 +40,7 @@
         ScalarOrArray,
         SideOptions,
         T_DataWithCoords,
+        T_Variable,
     )
     from xarray.core.variable import Variable

@@ -1770,31 +1771,32 @@ def is_np_timedelta_like(dtype: DTypeLike) -> bool:
     return np.issubdtype(dtype, np.timedelta64)


-def _contains_cftime_datetimes(array) -> bool:
+def _contains_cftime_datetimes(array: Any) -> bool:
     """Check if an array contains cftime.datetime objects"""
-    if cftime is None:
-        return False
+    from xarray.core.variable import Variable
+
+    if isinstance(array, Variable):
+        var = array
     else:
-        if array.dtype == np.dtype("O") and array.size > 0:
-            sample = np.asarray(array).flat[0]
-            if is_duck_dask_array(sample):
-                sample = sample.compute()
-                if isinstance(sample, np.ndarray):
-                    sample = sample.item()
-            return isinstance(sample, cftime.datetime)
-        else:
-            return False
+        var = Variable(dims=tuple(f"dim_{v}" for v in range(array.ndim)), data=array)
+
+    return contains_cftime_datetimes(var)


-def contains_cftime_datetimes(var) -> bool:
+def contains_cftime_datetimes(var: T_Variable) -> bool:
     """Check if an xarray.Variable contains cftime.datetime objects"""
-    if var.dtype == np.dtype("O") and var.size > 0:
-        return _contains_cftime_datetimes(var.data)
-    else:
+    if cftime is None:
         return False
+
+    if var.dtype == np.dtype("O") and var.size > 0:
+        first_idx = (0,) * var.ndim
+        sample = var[first_idx]
+        return isinstance(sample.to_numpy().item(), cftime.datetime)
Contributor

Very clean. It'd be nice to add a test along the lines of DuckBackendArrayWrapper in https://github.com/pydata/xarray/pull/6874/files, whose __getitem__ should raise if it would return more than one value.

Contributor Author

Thanks for taking a look at this. I am a little confused about what you are suggesting here. Are you looking for a simple test in test_variable.py that applies the same logic as this block to extract the very first element via Variable.__getitem__ and checks that it returns one value, a more general contains_cftime_datetimes test, or both?

Contributor

Sorry, that was a bit complicated and intended for Illviljan.

I pushed a commit with a test. I also changed the code to account for those lazily indexed backend arrays explicitly.
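
The kind of test discussed in this thread can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the test that was pushed and not xarray's DuckBackendArrayWrapper: the class name `SingleElementOnlyArray` and the helper `first_element` are hypothetical. The idea is an array-like object whose `__getitem__` raises unless exactly one element is requested, so any code path that tries to load more than one value fails loudly.

```python
class SingleElementOnlyArray:
    """Hypothetical stand-in for a lazily indexed backend array.

    Simplified illustration only; not xarray's DuckBackendArrayWrapper.
    """

    def __init__(self, values, shape):
        self._values = list(values)  # flat, row-major storage
        self.shape = tuple(shape)
        self.ndim = len(self.shape)

    def __getitem__(self, key):
        # Accept only a full tuple of integers, i.e. a request for one element.
        if not isinstance(key, tuple):
            key = (key,)
        if len(key) != self.ndim or not all(isinstance(k, int) for k in key):
            raise IndexError("this array only supports loading a single element")
        # Convert the N-dimensional index to a row-major flat offset.
        flat = 0
        for idx, size in zip(key, self.shape):
            flat = flat * size + idx
        return self._values[flat]


def first_element(array):
    """Fetch element (0, ..., 0) without touching the rest of the array."""
    return array[(0,) * array.ndim]
```

With such a wrapper, `first_element(arr)` succeeds, while a slice such as `arr[(slice(None), 0)]` raises `IndexError`, which is the behaviour a regression test could assert on.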


+
+    return False


-def _contains_datetime_like_objects(var) -> bool:
+def _contains_datetime_like_objects(var: T_Variable) -> bool:
     """Check if a variable contains datetime like objects (either
     np.datetime64, np.timedelta64, or cftime.datetime)
     """
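
To make the performance difference in the `contains_cftime_datetimes` change concrete, here is a small standard-library-only sketch. It is an illustration under stated assumptions, not xarray's implementation: `CountingLazyArray` and both check functions are hypothetical names, `load_all` plays the role of `np.asarray(array)`, and `datetime.datetime` stands in for `cftime.datetime` so the example has no third-party dependencies.

```python
import datetime


class CountingLazyArray:
    """Hypothetical lazy array that records how many elements were read."""

    def __init__(self, values):
        self._values = list(values)
        self.shape = (len(self._values),)
        self.ndim = 1
        self.size = len(self._values)
        self.elements_read = 0

    def load_all(self):
        # Analogous to np.asarray(array): materializes every element.
        self.elements_read += self.size
        return list(self._values)

    def __getitem__(self, key):
        # Only single-element tuple indexing is modelled here.
        (idx,) = key
        self.elements_read += 1
        return self._values[idx]


def contains_datetimes_eager(array):
    """Old-style check: load everything, then inspect the first element."""
    if array.size == 0:
        return False
    return isinstance(array.load_all()[0], datetime.datetime)


def contains_datetimes_sampled(array):
    """New-style check: index only element (0, ..., 0)."""
    if array.size == 0:
        return False
    sample = array[(0,) * array.ndim]
    return isinstance(sample, datetime.datetime)


values = [datetime.datetime(2000, 1, 1)] * 1000
eager, sampled = CountingLazyArray(values), CountingLazyArray(values)
print(contains_datetimes_eager(eager), eager.elements_read)
print(contains_datetimes_sampled(sampled), sampled.elements_read)
```

Both checks give the same answer, but the eager variant reads every element while the sampled variant reads one, which is the effect the changelog entry describes for large object arrays.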