datetime handling seems broken #9387

Closed · 5 tasks done
ThomWorm opened this issue Aug 20, 2024 · 8 comments
Labels: plan to close (May be closeable, needs more eyeballs), topic-cftime

Comments


ThomWorm commented Aug 20, 2024

What happened?

I've recently run into a few datetime issues with xarray. I've provided two separate reproducible examples below that seem to be connected to the same issue.

What did you expect to happen?

#issue 1 - I expect to get back a DataArray containing the number of days between my input array (January 5th) and my target day (January 1st). Instead, I'm getting 64-bit ints that represent the nanosecond equivalent of the timedelta values.
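For reference, the same subtraction in plain NumPy keeps day precision and returns the value I expect (4):

import numpy as np

# Plain NumPy: datetime64[D] minus datetime64[D] stays at day precision,
# so the integer cast gives whole days rather than nanoseconds.
(np.datetime64("2000-01-05") - np.datetime64("2000-01-01")).astype("timedelta64[D]").astype(int)  # 4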

######################################

#issue 2

I'm getting some unexpected behavior when working with NumPy arrays that should be returned as datetime64 objects. In the example below, when I set output_dtypes=[datetime64[ns]] I get: TypeError: Cannot cast NumPy timedelta64 scalar from metadata [ns] to according to the rule 'same_kind'.

I have tried many variations of explicitly setting input and output dtypes with no change in the error.

If I set output_dtypes=[] I am able to get back float64 values that I can convert after the fact to the expected datetimes. Although converting after the fact isn't a huge problem, it seems to suggest that there is either an underlying issue or I have some misunderstanding.
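A sketch of the kind of after-the-fact conversion I mean (this assumes the returned float64 values are nanoseconds since the epoch, which is how datetime64[ns] data looks when viewed as plain numbers):

# Hypothetical post-hoc conversion: reinterpret the float64 result as
# nanoseconds since the epoch and cast back to datetime64[ns].
result_datetimes = result_raw.astype("int64").astype("datetime64[ns]")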

If I remove dask and replace degree_days with a NumPy-backed DataArray, I get the same error.

Minimal Complete Verifiable Example

#issue 1
import xarray as xr
import numpy as np

# Create a 10x10 array filled with the datetime 2000-01-05
lat = np.arange(10)
lon = np.arange(10)
date = np.datetime64("2000-01-05")

data = np.full((10, 10), date)

# Create the xarray DataArray
data_array = xr.DataArray(
    data, coords={"latitude": lat, "longitude": lon}, dims=["latitude", "longitude"]
)

# Calculate the timedelta in days from 2000-01-01
start_date = np.datetime64("2000-01-01")
timedelta_days = (data_array - start_date).astype("timedelta64[D]").astype(int)

print("Original DataArray:")
print(data_array)
print("\nTimedelta in days from 2000-01-01:")
print(timedelta_days)

#######################################################################################
#######################################################################################
#issue 2

import numpy as np
import pandas as pd
import xarray as xr
import dask.array as da

#############################################
#Function
#############################################
def day_cumsum_reaches_threshold_linear(
    degree_days, start_index, start_time_values, threshold
):
    cumsum = np.cumsum(degree_days[start_index:])
    threshold_reached = np.where(cumsum >= threshold)[0]
    if len(threshold_reached) == 0:
        print("error")
        return np.datetime64("NaT", "ns")
    first_reached_index = threshold_reached[0]
    result_date = start_time_values[start_index + first_reached_index]
    return result_date

#############################################
#Input data
#############################################

vday_cumsum_reaches_threshold_linear = np.vectorize(day_cumsum_reaches_threshold_linear)


time = pd.date_range("2000-01-01", periods=50, freq="D").to_numpy(
    dtype="datetime64[ns]"
)
lat = np.linspace(-90, 90, 10)
lon = np.linspace(-180, 180, 10)
degree_days = xr.DataArray(
    da.random.random((10, 10, 50), chunks=(10, 10, -1)),  # No chunking along time
    coords=[lat, lon, time],
    dims=["lat", "lon", "time"],
)
start_dates = xr.DataArray(
    np.random.choice(time[:5], size=(10, 10)), coords=[lat, lon], dims=["lat", "lon"]
)
start_indices = np.array(
    [np.where(degree_days.time.values == d)[0][0] for d in start_dates.values.flatten()]
).reshape(start_dates.shape)
threshold = 15

#############################################
#Apply function
#############################################


result_raw = xr.apply_ufunc(
    day_cumsum_reaches_threshold_linear,
    degree_days,
    start_indices,
    degree_days.time.values.astype("datetime64[ns]"),
    threshold,
    input_core_dims=[["time"], [], ["time"], []],
    output_core_dims=[[]],
    vectorize=True,
    dask="parallelized",
    output_dtypes=[],
)


result_raw.compute()

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

Dask version: 2024.7.1
NumPy version: 1.26.4 (used because 2.0 is currently incompatible with netCDF4)
Xarray version: 2024.6.0
Python version: 3.12.4
Operating System: Ubuntu 22.04
Install method (conda, pip, source): conda
ThomWorm added the bug and needs triage (Issue that has not been reviewed by xarray team member) labels on Aug 20, 2024

welcome bot commented Aug 20, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

max-sixty (Collaborator) commented

Thanks for the issue. Is it possible to reduce the size of the example further? Does it require apply_ufunc or does the issue show up with a simpler implementation?

ThomWorm (Author) commented

Issue 1 above is probably minimally simple - I'll work on simplifying issue 2. Issue 2 might require a ufunc, but I'm not exactly sure what's triggering it and I haven't been able to recreate it without one.

max-sixty (Collaborator) commented

Issue 1 gives the following warning:


<ipython-input-3-91373b81fea4>:1: UserWarning: Converting non-nanosecond precision timedelta values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
  (data_array - start_date).astype("timedelta64[D]")

So I think it's a matter of waiting until that's relaxed...
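In the meantime, a minimal sketch of the workaround the warning itself suggests: construct the issue-1 data at nanosecond precision up front so no conversion is needed.

# Sketch (reusing the issue-1 example): build the values as datetime64[ns]
# from the start, which avoids the non-nanosecond conversion warning.
date = np.datetime64("2000-01-05", "ns")
start_date = np.datetime64("2000-01-01", "ns")
data = np.full((10, 10), date)  # dtype is datetime64[ns]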

TomNicholas added the topic-cftime label and removed the needs triage (Issue that has not been reviewed by xarray team member) label on Aug 21, 2024
spencerkclark (Member) commented

For issue (1) I would recommend using floor division with a unit timedelta:

>>> timedelta_days = (data_array - start_date) // np.timedelta64(1, "D")
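Applied to the issue-1 example, a quick sketch of what that yields (reusing data_array and start_date from the MVCE above):

# Floor division by a one-day timedelta returns integer day counts;
# for the example data every element is 4.
timedelta_days = (data_array - start_date) // np.timedelta64(1, "D")
assert (timedelta_days == 4).all()
assert timedelta_days.dtype == np.dtype("int64")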

The following seems to be a simpler reproducer for (2) still involving apply_ufunc, but there must be a more fundamental issue underneath:

da = xr.DataArray(range(5), dims=["x"])
xr.apply_ufunc(
    lambda x: np.datetime64("2000-01-01", "ns"),
    da.chunk(),
    vectorize=True, 
    output_dtypes=[np.dtype("datetime64[ns]")],
    dask="parallelized"
).compute()

spencerkclark (Member) commented

This is an xarray-free reproducer for issue (2):

import dask.array
import numpy as np

dask.array.apply_gufunc(
    lambda x: np.datetime64("2000-01-01", "ns"),
    "()->()",
    dask.array.arange(5),
    vectorize=True,
    output_dtypes=[np.dtype("datetime64[ns]")]
).compute()

max-sixty added the plan to close (May be closeable, needs more eyeballs) label and removed the bug label on Aug 23, 2024
dcherian (Contributor) commented

That MCVE works for me on dask 2024.8.1. @ThomWorm please update dask and see if the error persists

spencerkclark (Member) commented

Thanks @dcherian — I went back and checked my versions, and this error did not go away until I upgraded to the latest NumPy (regardless, it is an upstream issue that fortunately seems to be fixed). @ThomWorm hopefully that helps.
