[Enhancement]: Consider how to test for correct weights generation in xCDAT #699
@xCDAT/core-developers Any input would be appreciated here. Thanks.
I think you can go with option 1 (drop the assertion). Even though I raised a concern about removing this, after reviewing the code carefully I don't think it is needed. Do you remember why it was put there in the first place? I don't think I helped write this section of code, but I sometimes add these kinds of tests to make sure NaNs aren't messing things up. I don't think that should be an issue in this instance.
I based my initial implementation of …
I ran all of the grouping-based temporal averaging APIs with a dummy dataset that has bad bounds. They all worked and did not throw an `AssertionError`. Weights should always correctly add up to 1.0 for each group based on how the `_group_data()` method works. This method groups time coordinates by their group label, and the time bounds are used to calculate the weight for each specific coordinate. If we sum up the weights for each group, they should still add up to 1.0. If there is a bad time coordinate (e.g., one very off time coordinate), it will be its own group with a summed weight of 1.0. This is a data quality issue that the user should address beforehand.

```python
# %%
import numpy as np
import pandas as pd
import xarray as xr
import xcdat as xc
# Define a time range with valid dates
times = pd.date_range(start="2000-01-01", periods=30, freq="ME")
# %%
# Create time bounds (lower and upper values)
time_bnds = np.array([times[:-1], times[1:]]).T
time_bnds = np.concatenate(
(
time_bnds,
np.array([[times[-1], pd.Timestamp("2010-12-31")]], dtype="datetime64[ns]"),
)
)
# Introduce incorrect time bounds
time_bnds[4, 1] = pd.Timestamp("3000-01-01")
time_bnds[8, 0] = pd.Timestamp("3000-01-01")
# %%
# Create some dummy data
data = np.random.rand(len(times))
# Create a dataset with these time coordinates and time bounds
ds = xr.Dataset(
{"data": (["time"], data)},
coords={"time": times, "time_bnds": (["time", "bnds"], time_bnds)},
)
# Add CF attributes
ds.attrs = {
"Conventions": "CF-1.8",
"title": "Sample Dataset",
"institution": "Your Institution",
"source": "Synthetic data",
"history": "Created for demonstration purposes",
"references": "http://example.com",
}
ds["data"].attrs = {
"long_name": "Random Data",
"units": "1",
"standard_name": "random_data",
}
ds["time"].attrs = {
"long_name": "Time",
"standard_name": "time",
"axis": "T",
"bounds": "time_bnds",
}
ds["time_bnds"].attrs = {
"long_name": "Time Bounds",
"standard_name": "time_bnds",
}
# %%
ds_day = ds.temporal.group_average("data", freq="day")
ds_month = ds.temporal.group_average("data", freq="month")
ds_month = ds.temporal.group_average("data", freq="season")
ds_year = ds.temporal.group_average("data", freq="year")
# %%
ds_climo_day = ds.temporal.climatology("data", freq="day")
ds_climo_month = ds.temporal.climatology("data", freq="month")
ds_climo_season = ds.temporal.climatology("data", freq="season")
# %%
ds_depart_day = ds.temporal.departures("data", freq="day")
ds_depart_month = ds.temporal.departures("data", freq="month")
ds_depart_season = ds.temporal.departures("data", freq="season")
```
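A quick way to check the claim that per-group weight sums stay at 1.0 even with bad bounds is to recompute the weights from the corrupted time bounds in the script above using plain xarray/numpy. This sketch mirrors the described logic rather than calling xCDAT internals, and the monthly grouping is just one example of a group label:

```python
# %%
# Recompute per-coordinate weights directly from the (deliberately bad) time
# bounds and confirm that the weights within each monthly group still sum to
# ~1.0. This is a plain xarray/numpy mirror of the described logic, not
# xCDAT's actual code.
time_lengths = (ds.time_bnds[:, 1] - ds.time_bnds[:, 0]) / np.timedelta64(1, "D")
grouped_lengths = time_lengths.groupby("time.month")
weights = grouped_lengths / grouped_lengths.sum()
np.testing.assert_allclose(weights.groupby("time.month").sum().values, 1.0)
```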
Closing this issue based on my findings.
Is your feature request related to a problem?
In PR #689, in the `TemporalAccessor._get_weights()` method, I removed validation that checks that the sums of weights for each time group add up to 1 (related code). This was done because:

1. Assertions can be disabled at runtime (e.g., with the Python `-O` flag), so relying on them for validation is not good practice. I copied this validation code from an Xarray notebook (link) without realizing the implications. (A small standalone demonstration follows this list.)
2. `np.testing.assert_allclose()` degrades performance -- I expect the performance degradation to scale with the size of the data; maybe it loads data into memory with `.values`?
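As context for the first point, here is a tiny standalone demonstration (not xCDAT code) of why bare `assert` statements are unreliable for validation: the same script behaves differently depending on whether Python's `-O` flag is used. This shows the behavior of plain `assert` statements specifically; the file name and values are made up.

```python
# demo_assert.py -- standalone illustration, unrelated to xCDAT internals.
# `python demo_assert.py` raises AssertionError and stops;
# `python -O demo_assert.py` strips the assert statement entirely, so the
# bad value passes through without any check.
group_weight_sum = 0.9  # pretend one group's weights summed to the wrong total

assert abs(group_weight_sum - 1.0) < 1e-6, "group weights do not sum to 1.0"
print("reached without any validation when run with -O")
```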
Describe the solution you'd like
What is an assertion and when to use it
Option 1: Don't keep this assertion
I think the sum of weights for each group should always be 1.0 (100%) based on our implementation logic. Otherwise, a failing assertion would indicate a coding error on our behalf rather than bad data (although we can raise an exception if it is indeed due to bad time bounds). As far as I can tell, our implementation logic is right and nobody has run into `_get_weights()` throwing an `AssertionError` from the validation code.

The `_get_weights()` method works by:

1. Calculating the length of time covered by each coordinate from its time bounds (`time_lengths`)
2. Grouping the time lengths by their group label (`grouped_time_lengths`)
3. Dividing each time length by the sum of the lengths in its group (`grouped_time_length / grouped_time_lengths.sum()`)

xcdat/xcdat/temporal.py
Lines 1213 to 1262 in e9e73dd
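To make these steps concrete, here is a rough illustration using plain xarray/pandas rather than xCDAT's actual implementation (the permalink above points to the real code). The monthly bounds, the seasonal group label, and the variable names are all assumptions for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Monthly time axis: each coordinate is a month start, bounded by the next edge.
edges = pd.date_range("2000-01-01", periods=13, freq="MS")
time = xr.DataArray(edges[:-1], dims="time", name="time")

# 1. Length of time covered by each coordinate (upper bound minus lower bound), in days.
time_lengths = xr.DataArray(
    (edges[1:] - edges[:-1]) / np.timedelta64(1, "D"),
    dims="time",
    coords={"time": time},
)

# 2. Group the time lengths by their group label (seasons here).
grouped_time_lengths = time_lengths.groupby("time.season")

# 3. Divide each time length by the sum of the lengths in its group.
weights = grouped_time_lengths / grouped_time_lengths.sum()

# By construction, the weights of every group sum to ~1.0, which is why a
# failing assertion would point to a coding error rather than bad data.
np.testing.assert_allclose(weights.groupby("time.season").sum().values, 1.0)
```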
Option 2: Keep this assertion
Maybe the time bounds are not correct for some reason and they produce incorrect weights for certain groups (e.g., missing data)? Not sure here. I will try experimenting with bad time bounds.
If we want to keep this assertion:

- Don't use `np.testing.assert_allclose()` directly in production since it raises an `AssertionError` that can be turned off by the user
- Wrap `np.testing.assert_allclose()` and raise a `RuntimeError` if the numpy assertion error is raised (a sketch follows this list)
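A minimal sketch of the wrapping idea in the second bullet, assuming a hypothetical helper name, tolerance, and error message (this is not xCDAT's actual API):

```python
import numpy as np

def _validate_group_weight_sums(weight_sums: np.ndarray) -> None:
    """Hypothetical helper: surface numpy's AssertionError as a RuntimeError."""
    try:
        # Default tolerances; the real check could pass explicit rtol/atol.
        np.testing.assert_allclose(weight_sums, 1.0)
    except AssertionError as err:
        raise RuntimeError(
            "The sum of weights for one or more time groups does not equal 1.0, "
            "which may indicate malformed time bounds."
        ) from err
```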
Describe alternatives you've considered
No response
Additional context
After this ticket is addressed, we can proceed with releasing v0.7.2.