Update departures and climatology APIs with reference period #417
Conversation
Hi @lee1043 and @pochedls, this PR for implementing reference_period in climatology() and departures() is ready for review.
I decided not to implement reference period subsetting for average() and group_average() because the user can easily do that themselves before calling those temporal APIs, e.g., ds.isel(time=slice(...)).
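For readers who want reference-period behavior with group_average(), the manual subsetting mentioned above can be sketched as follows. This is a hypothetical example with synthetic data; the temporal accessor call is commented out because it requires xcdat:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly dataset spanning 1990-01 through 2009-12 (240 months).
time = xr.DataArray(pd.date_range("1990-01-01", periods=240, freq="MS"), dims="time")
ds = xr.Dataset({"sst": ("time", np.random.rand(240))}, coords={"time": time})

# Restrict to a reference period with label-based slicing before averaging.
ds_ref = ds.sel(time=slice("1991-01-01", "2000-12-31"))  # 120 months

# With xcdat installed, the grouped average over the reference period would be:
# ds_season = ds_ref.temporal.group_average("sst", freq="season")
```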
Validation
In addition to the updated unit tests, below is the validation script that compares method 1 (@lee1043's cookbook) to method 2 (this PR implementation) for calculating climatologies and departures relative to a reference period. The result of the script shows that both methods produce identical results.
# flake8: noqa F401 E256
#%%
import matplotlib.pyplot as plt
import xcdat as xc
#%%
infile = "./input/HadISST_sst.nc"
data_var = "sst" # name of data variable
ds = xc.open_dataset(infile)
ds[data_var] = ds[data_var].where(ds[data_var] >= -0.5) # exclude missing value
#%%
# =======================================
# METHOD 1 - xCDAT + xarray
# Source: https://github.com/xCDAT/xcdat_test/blob/main/cookbook/seasonal_departure_from_baseline_clim.ipynb
# =======================================
# 1. Get seasonal average of the full time series
ds_season = ds.temporal.group_average(data_var, freq="season")
#%%
# 2. Temporal subset of original dataset to get reference climatology
ds_reference = ds.sel(time=slice("1971-01-01", "2000-12-31"))
ds_clim1 = ds_reference.temporal.climatology(data_var, freq="season")
ds_clim1[data_var].isel(time=0).plot(cmap="viridis")
#%%
# 3. Calculate departures
ds_departs1 = ds_season.copy()
ds_departs1[data_var + "_departure"] = ds_departs1[data_var].groupby(
"time.season"
) - ds_clim1[data_var].groupby("time.season").mean("time")
ds_departs1[data_var + "_departure"].isel(time=0).plot()
# %%
# =======================================
# METHOD 2 - xCDAT
# =======================================
# 1. Temporal subset of original dataset to get reference climatology
ds_clim2 = ds.temporal.climatology(
data_var, freq="season", reference_period=("1971-01-01", "2000-12-31")
)
ds_clim2[data_var].isel(time=0).plot(cmap="viridis")
# Check result equals method 1
ds_clim2["sst"].equals(ds_clim1["sst"])  # True
#%%
# 2. Calculate departures
ds_departs2 = ds.temporal.departures(
data_var, freq="season", reference_period=("1971-01-01", "2000-12-31")
)
ds_departs2[data_var].isel(time=0).plot()
#%%
# =======================================
# Compare Outputs
# =======================================
# Drop season sub-coordinates
ds_departs1_new = ds_departs1.drop_vars("season")
# Get the expected and resulting departures
method1_anom = ds_departs1_new["sst_departure"].rename("sst")
method2_anom = ds_departs2["sst"]
#%%
# Compare equality
method2_anom.equals(method1_anom) # True
#%%
# Compare identical (drop attributes)
expected2 = method1_anom.copy()
result2 = method2_anom.copy()
expected2.attrs = {}
result2.attrs = {}
expected2.identical(result2)  # True
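The equals() vs identical() distinction used above is worth spelling out: equals() ignores attributes, while identical() also compares them, which is why the script clears attrs before the identical() check. A minimal illustration:

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3.0), dims="x", attrs={"units": "degC"})
b = a.copy()
b.attrs = {}  # drop the attributes on the copy

print(a.equals(b))     # True: equals() compares dims, coords, and values only
print(a.identical(b))  # False: identical() also compares attributes
```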
@@ -341,6 +349,7 @@ def climatology(
         freq: Frequency,
         weighted: bool = True,
         keep_weights: bool = False,
+        reference_period: Optional[Tuple[str, str]] = None,
climatology() has a new reference_period parameter.
xcdat/temporal.py
reference_period : Optional[Tuple[str, str]], optional
    The climatological reference period, which is a subset of the entire
    time series. This parameter accepts a tuple of strings in the format
    'yyyy-mm-dd'. For example, ``('1850-01-01', '1899-12-31')``.
Description of the reference_period parameter.
@@ -485,6 +504,7 @@ def departures(
         freq: Frequency,
         weighted: bool = True,
         keep_weights: bool = False,
+        reference_period: Optional[Tuple[str, str]] = None,
departures() has a new reference_period parameter.
xcdat/temporal.py
# 1. Get the grouped average of the observation data variable.
ds_avg = ds.temporal.group_average(
    data_var, self._freq, weighted, keep_weights, season_config=season_config
)
dv_obs = ds_avg[data_var].copy()

# 2. Group the observation data variable by the departures frequency.
# This step allows us to perform xarray's grouped arithmetic to calculate
# departures.
self._labeled_time = self._label_time_coords(dv_obs[self.dim])
dv_obs_grouped = self._group_data(dv_obs)
departures() now includes updated logic for performing a group average of the observation data variable based on the selected freq argument.
Group averaging was not previously implemented, which meant the departures data variable stayed on the same time frequency as the input time coordinates (which does not align with CDAT's behavior).
For example, with input time coordinates on a daily frequency:
- Before: passing freq="season" returned departures time coordinates as daily
- Now: passing freq="season" returns departures time coordinates as seasonal
Thanks for catching and fixing this. But what if the input/output frequencies are the same? Should we still call group_average()?
def _is_valid_reference_period(self, reference_period: Tuple[str, str]):
    try:
        datetime.strptime(reference_period[0], "%Y-%m-%d")
        datetime.strptime(reference_period[1], "%Y-%m-%d")
    except (IndexError, ValueError):
        raise ValueError(
            "Reference periods must be a tuple of strings with the format "
            "'yyyy-mm-dd'. For example, reference_period=('1850-01-01', "
            "'1899-12-31')."
        )
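The same validation logic can be exercised standalone. Below is a free-function adaptation (the function name is hypothetical, and it returns a bool instead of raising, purely for illustration):

```python
from datetime import datetime
from typing import Tuple

def is_valid_reference_period(reference_period: Tuple[str, str]) -> bool:
    # Both endpoints must parse as 'yyyy-mm-dd' dates; a too-short tuple
    # raises IndexError, a malformed date string raises ValueError.
    try:
        datetime.strptime(reference_period[0], "%Y-%m-%d")
        datetime.strptime(reference_period[1], "%Y-%m-%d")
    except (IndexError, ValueError):
        return False
    return True

print(is_valid_reference_period(("1850-01-01", "1899-12-31")))  # True
print(is_valid_reference_period(("1850-01-01", "1899-13-31")))  # False (month 13)
```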
The method that validates the reference_period argument.
Codecov Report

@@ Coverage Diff @@
##              main      #417   +/- ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           14        14
  Lines         1293      1308    +15
=========================================
+ Hits          1293      1308    +15
Overall, this PR seems to work great and makes sense to me. It takes me a long time to re-orient myself to the inner workings of the .temporal accessor, so I'm hoping @lee1043 can also give this a careful review.
- One thing I noticed is that calling ds.temporal.departures drops time_bnds, which I don't think needs to happen (I assume xarray doesn't know how to deal with this during the subtraction of the climatology). I think the time bounds can be added back in after subtraction. I noticed this while verifying that this PR was accurate (I get accuracy on the order of 10^-4 to 10^-7, I think because I added the time bounds back in using add_missing_bounds, which is problematic).
- Note to others that I did need to update my dev environment to pass the unit tests.
- Made some minor documentation suggestions.
- Steps zero and one of departures call _set_arg_attrs(), .copy(), _preprocess_dataset(), and group_average(). This overlaps strongly with _averager(), so maybe things could be streamlined here (though I see that some of the order of operations is different, which may be why they differ).
- As noted, departures calls group_average() on each call. I think this may not be necessary if the input/output frequencies are the same?
Thanks for catching and fixing this. But what if the input/output frequencies are the same? Should we still call group_average()?
I saw your point and played around with different implementations. I updated the order of operations and made clearer comments in the departures() method.
I also don't think we need to perform group averaging if the input/output frequencies are the same (it still works either way). We can try determining whether the frequencies match and skip group averaging in that case.
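One way to sketch the "skip group averaging when frequencies match" idea is with xarray's built-in xr.infer_freq, which infers a pandas-style offset alias from the time coordinates. Note this is an assumption for illustration only: xcdat's actual _infer_freq helper may return its own frequency labels rather than pandas aliases.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Monthly (month-start) time coordinates.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
ds = xr.Dataset({"tas": ("time", np.zeros(24))}, coords={"time": time})

inferred = xr.infer_freq(ds.time)
print(inferred)  # "MS" -- the pandas offset alias for month-start data

# Hypothetical guard: only group-average when the input frequency differs
# from the requested output frequency.
target_freq = "MS"
needs_group_average = inferred != target_freq  # False here, so skip it
```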
# The climatology and grouped average APIs are called on the copied
# dataset to create separate instances of the `TemporalAccessor` class.
# This is done to avoid overriding the attributes of the current
# instance of `TemporalAccessor` (set in step #1 above).
If departures() performance or memory usage ever becomes an issue, we can look into refactoring these steps to avoid copying. Copying datasets multiple times might not be an issue though, because xr.Dataset.copy() defaults to deep=False ("... a shallow copy of each of the component variable is made, so that the underlying memory region of the new dataset is the same as in the original dataset."). On the other hand, xr.DataArray.copy() defaults to deep=True.
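The copy-semantics point above can be verified directly. This sketch shows that a shallow Dataset copy shares the underlying memory with the original, while a deep copy does not:

```python
import numpy as np
import xarray as xr

# Shallow copy (the Dataset.copy() default): variables share memory.
ds = xr.Dataset({"tas": ("time", np.arange(3.0))})
ds_shallow = ds.copy()  # deep=False by default
ds_shallow["tas"].values[0] = 99.0  # in-place write to the shared array
print(ds["tas"].values[0])  # 99.0 -- the change is visible in the original

# Deep copy: the underlying data is duplicated.
ds2 = xr.Dataset({"tas": ("time", np.arange(3.0))})
ds_deep = ds2.copy(deep=True)
ds_deep["tas"].values[0] = 99.0
print(ds2["tas"].values[0])  # 0.0 -- the original is untouched
```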
# Group averaging is only required if the dataset's frequency (input)
# differs from the `freq` arg (output).
ds_obs = ds.copy()
inferred_freq = self._infer_freq()
if inferred_freq != freq:
    ds_obs = ds_obs.temporal.group_average(
        data_var,
        freq,
        weighted,
        keep_weights,
        season_config,
    )
Added a conditional so that group averaging is performed only if the input and output frequencies differ. This should save on performance costs for that case.
Related to #417 (comment) and #417 (review).
This looks good to me. Thank you for adding the conditional for group averaging. I think this may speed up the code and also has the side effect of preserving the bounds.
If you accept my modifications in #418, there will need to be some minor changes when calling _infer_freq.
My final self-review. I'm going to merge after the build passes.
@jasonb5's code review for departures() is still useful.
-        # Preprocess the dataset based on method argument values.
+        # 2. Copy the original dataset and preprocess (if needed) for reuse.
+        # ----------------------------------------------------------------------
         ds = self._dataset.copy()

+        # Preprocess ahead of time to avoid needing to preprocess again when
+        # calling the group average or climatology APIs in step #3.
         ds = self._preprocess_dataset(ds)

-        # Group the observation data variable.
-        dv_obs = _get_data_var(ds, data_var)
-        self._labeled_time = self._label_time_coords(dv_obs[self.dim])
-        dv_obs_grouped = self._group_data(dv_obs)
+        # 3. Calculate the grouped average and climatology of the data variable.
+        # ----------------------------------------------------------------------
+        # The climatology and grouped average APIs are called on the copied
+        # dataset to create separate instances of the `TemporalAccessor` class.
+        # This is done to avoid overriding the attributes of the current
+        # instance of `TemporalAccessor` (set in step #1 above).

+        # Group averaging is only required if the dataset's frequency (input)
+        # differs from the `freq` arg (output).
+        ds_obs = ds.copy()
+        inferred_freq = self._infer_freq()
+        if inferred_freq != freq:
+            ds_obs = ds_obs.temporal.group_average(
+                data_var,
+                freq,
+                weighted,
+                keep_weights,
+                season_config,
+            )

         # Calculate the climatology of the data variable.
         ds_climo = ds.temporal.climatology(
-            data_var, freq, weighted, keep_weights, season_config
+            data_var,
+            freq,
+            weighted,
+            keep_weights,
+            reference_period,
+            season_config,
         )
-        dv_climo = ds_climo[data_var]

-        # Rename the time dimension using the name from the grouped observation
-        # data. The dimension names must align for xarray's grouped arithmetic
-        # to work. Otherwise, the error below is thrown: `ValueError:
-        # incompatible dimensions for a grouped binary operation: the group
-        # variable '<CHOSEN_FREQ>' is not a dimension on the other argument`
+        # 4. Group the averaged data variable values by the time `freq`.
+        # ----------------------------------------------------------------------
+        # This step allows us to perform xarray's grouped arithmetic to
+        # calculate departures.
+        dv_obs = ds_obs[data_var].copy()
+        self._labeled_time = self._label_time_coords(dv_obs[self.dim])
+        dv_obs_grouped = self._group_data(dv_obs)

+        # 5. Align time dimension names using the labeled time dimension name.
+        # ----------------------------------------------------------------------
+        # The climatology's time dimension is renamed to the labeled time
+        # dimension in step #4 above (e.g., "time" -> "season"). xarray requires
+        # dimension names to be aligned to perform grouped arithmetic, which we
+        # use for calculating departures in step #5. Otherwise, this error is
+        # raised: "`ValueError: incompatible dimensions for a grouped binary
+        # operation: the group variable '<FREQ ARG>' is not a dimension on the
+        # other argument`".
+        dv_climo = ds_climo[data_var]
         dv_climo = dv_climo.rename({self.dim: self._labeled_time.name})

-        # Calculate the departures for the data variable, which uses the formula
-        # observation - climatology.
+        # 6. Calculate the departures for the data variable.
+        # ----------------------------------------------------------------------
+        # departures = observation - climatology
         with xr.set_options(keep_attrs=True):
-            ds_departs = ds.copy()
-            ds_departs[data_var] = dv_obs_grouped - dv_climo
-            ds_departs[data_var] = self._add_operation_attrs(ds_departs[data_var])
+            dv_departs = dv_obs_grouped - dv_climo
+            dv_departs = self._add_operation_attrs(dv_departs)
+            ds_obs[data_var] = dv_departs

-            # The original time coordinates are restored after grouped
-            # subtraction. The grouped time coordinates are dropped since they
-            # are no longer required.
-            ds_departs = ds_departs.drop_vars(self._labeled_time.name)
+            # The original time dimension name is restored after grouped
+            # arithmetic, so the labeled time dimension name is no longer needed
+            # and therefore dropped.
+            ds_obs = ds_obs.drop_vars(self._labeled_time.name)

         if weighted and keep_weights:
             self._weights = ds_climo.time_wts
-            ds_departs = self._keep_weights(ds_departs)
+            ds_obs = self._keep_weights(ds_obs)

-        return ds_departs
+        return ds_obs
Hey @jasonb5, I'm tagging you for code review, mainly for the departures() method.
Let me know if the comments make sense. As mentioned in the meeting, we can revisit refactoring the implementation if needed. In my opinion I think it currently works well and the new comments explain the intention behind the implementation design.
Thanks for the review, Steve. After this PR is merged, we can rebase #418 on main.
- Add `reference_period` argument to subset climatology - Get the grouped average of the observation data for calculating departures
Co-authored-by: Stephen Po-Chedley <pochedley@gmail.com>
Description
- Add a reference_period parameter to climatology() and departures() to subset the climatology
- In departures(), calculate the grouped average of the observation data before calculating anomalies
- Make departures() align with CDAT's APIs for calculating departures
- If the inferred input frequency matches the freq arg, no grouped average is performed since it is unnecessary and incurs a performance cost
- Update the departures() notebook with a reference_period example
Checklist
If applicable: