
Update departures and climatology APIs with reference period #417

Merged
merged 11 commits into main from feature/387-departures-ref-period on Mar 8, 2023

Conversation

@tomvothecoder (Collaborator) commented Feb 13, 2023

Description

  • Closes [Feature]: Allow user selection of reference period in departures #387
  • Add a reference_period parameter to climatology() and departures() to subset the time series used to compute the climatology (usage sketch below)
  • In departures(), calculate the grouped average of the observation data before calculating anomalies
    • This behavior makes departures() align with CDAT's APIs for calculating departures
    • If the dataset's input frequency is the same as the output frequency (the freq arg), no grouped average is performed since it is unnecessary and incurs a performance cost
  • Update departures() notebook with reference_period example
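
For reference, a minimal usage sketch of the new parameter (the file path, variable name, and dates below are illustrative, not taken from this PR):

import xcdat as xc

ds = xc.open_dataset("ts.nc")  # illustrative input file

# Climatology computed only over the reference period.
ds_climo = ds.temporal.climatology(
    "ts", freq="season", reference_period=("1981-01-01", "2010-12-31")
)

# Departures relative to the same reference period.
ds_departs = ds.temporal.departures(
    "ts", freq="season", reference_period=("1981-01-01", "2010-12-31")
)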

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder added the type: enhancement New enhancement request label Feb 13, 2023
@tomvothecoder tomvothecoder self-assigned this Feb 13, 2023
@tomvothecoder tomvothecoder force-pushed the feature/387-departures-ref-period branch 2 times, most recently from a574360 to 3c24e1b on February 13, 2023 23:54
@tomvothecoder (Collaborator, Author) left a comment

Hi @lee1043 and @pochedls, this PR for implementing reference_period in climatology() and departures() is ready for review.

I decided not to implement reference period subsetting for average() and group_average() because the user can easily do that themselves before calling those temporal APIs, e.g., ds.isel(time=slice(....)).
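
For example, a hypothetical manual subset before calling group_average() (the input file and time range below are illustrative):

import xcdat as xc

ds = xc.open_dataset("./input/HadISST_sst.nc")

# Subset the time axis to the desired reference period first, then average.
ds_ref = ds.sel(time=slice("1971-01-01", "2000-12-31"))
ds_avg = ds_ref.temporal.group_average("sst", freq="season")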

Validation

In addition to the updated unit tests, below is the validation script that compares method 1 (@lee1043's cookbook) to method 2 (this PR implementation) for calculating climatologies and departures relative to a reference period. The result of the script shows that both methods produce identical results.

# flake8: noqa F401 E256
#%%
import matplotlib.pyplot as plt

import xcdat as xc

#%%
infile = "./input/HadISST_sst.nc"
data_var = "sst"  #  name of data variable

ds = xc.open_dataset(infile)
ds[data_var] = ds[data_var].where(ds[data_var] >= -0.5)  # exclude missing value

#%%
# =======================================
#  METHOD 1 - xCDAT + xarray
#  Source: https://github.com/xCDAT/xcdat_test/blob/main/cookbook/seasonal_departure_from_baseline_clim.ipynb
# =======================================
# 1. Get seasonal average of the full time series
ds_season = ds.temporal.group_average(data_var, freq="season")

#%%
# 2. Temporal subset of original dataset to get reference climatology
ds_reference = ds.sel(time=slice("1971-01-01", "2000-12-31"))
ds_clim1 = ds_reference.temporal.climatology(data_var, freq="season")
ds_clim1[data_var].isel(time=0).plot(cmap="viridis")

#%%
# 3. Calculate departures
ds_departs1 = ds_season.copy()
ds_departs1[data_var + "_departure"] = ds_departs1[data_var].groupby(
    "time.season"
) - ds_clim1[data_var].groupby("time.season").mean("time")
ds_departs1[data_var + "_departure"].isel(time=0).plot()

# %%
# =======================================
#  METHOD 2 - xCDAT
# =======================================
# 1. Temporal subset of original dataset to get reference climatology
ds_clim2 = ds.temporal.climatology(
    data_var, freq="season", reference_period=("1971-01-01", "2000-12-31")
)

ds_clim2[data_var].isel(time=0).plot(cmap="viridis")

# Check equals method 1
ds_clim2["sst"].equals(ds_clim1["sst"])  # True

#%%
# 2. Calculate departures
ds_departs2 = ds.temporal.departures(
    data_var, freq="season", reference_period=("1971-01-01", "2000-12-31")
)
ds_departs2[data_var].isel(time=0).plot()

#%%
# =======================================
#  Compare Outputs
# =======================================
# Drop season sub-coordinates
ds_departs1_new = ds_departs1.drop_vars("season")

# Get the expected and resulting departures
method1_anom = ds_departs1_new["sst_departure"].rename("sst")
method2_anom = ds_departs2["sst"]
#%%
# Compare equality
method2_anom.equals(method1_anom)  # True

#%%
# Compare identical (drop attributes)
expected2 = method1_anom.copy()
result2 = method2_anom.copy()

expected2.attrs = {}
result2.attrs = {}

expected2.identical(result2)  # True

@@ -341,6 +349,7 @@ def climatology(
freq: Frequency,
weighted: bool = True,
keep_weights: bool = False,
reference_period: Optional[Tuple[str, str]] = None,
@tomvothecoder (Collaborator, Author):

The new reference_period parameter for climatology().

Comment on lines 387 to 390
reference_period : Optional[Tuple[str, str]], optional
The climatological reference period, which is a subset of the entire
time series. This parameter accepts a tuple of strings in the format
'yyyy-mm-dd'. For example, ``('1850-01-01', '1899-12-31')``.
@tomvothecoder (Collaborator, Author):

Description of the reference_period parameter.

@@ -485,6 +504,7 @@ def departures(
freq: Frequency,
weighted: bool = True,
keep_weights: bool = False,
reference_period: Optional[Tuple[str, str]] = None,
@tomvothecoder (Collaborator, Author):

The new reference_period parameter for departures().

Comment on lines 644 to 654
# 1. Get the grouped average of the observation data variable.
ds_avg = ds.temporal.group_average(
    data_var, self._freq, weighted, keep_weights, season_config=season_config
)
dv_obs = ds_avg[data_var].copy()

# 2. Group the observation data variable by the departures frequency.
# This step allows us to perform xarray's grouped arithmetic to calculate
# departures.
self._labeled_time = self._label_time_coords(dv_obs[self.dim])
dv_obs_grouped = self._group_data(dv_obs)
@tomvothecoder (Collaborator, Author) commented Feb 14, 2023:

departures() now includes updated logic for performing a group average of the observation data variable based on the selected freq argument.

Group averaging was not previously implemented, which meant the departures data variable stayed on the same time frequency as the input time coordinates (which doesn't align with CDAT's behavior).

For example, with input time coordinates on a daily frequency (sketch below):

  • Before: pass freq="season" -> departures time coordinates return as daily
  • Now: pass freq="season" -> departures time coordinates return as seasonal
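
A minimal sketch of the new behavior, assuming a dataset with daily time coordinates (the file and variable names are illustrative):

import xcdat as xc

ds = xc.open_dataset("tas_daily.nc")  # daily time coordinates (illustrative)

ds_departs = ds.temporal.departures("tas", freq="season")

# Before this PR: ds_departs["tas"] stayed on the daily time coordinates.
# After this PR: the observation data is first group-averaged to `freq`, so
# ds_departs["tas"] is returned on seasonal time coordinates instead.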

Collaborator:
Thanks for catching and fixing this. But what if the input/output frequencies are the same? Should we still call group_average()?

Comment on lines +886 to +931
def _is_valid_reference_period(self, reference_period: Tuple[str, str]):
    try:
        datetime.strptime(reference_period[0], "%Y-%m-%d")
        datetime.strptime(reference_period[1], "%Y-%m-%d")
    except (IndexError, ValueError):
        raise ValueError(
            "Reference periods must be a tuple of strings with the format "
            "'yyyy-mm-dd'. For example, reference_period=('1850-01-01', "
            "'1899-12-31')."
        )
@tomvothecoder (Collaborator, Author):
The method that validates the reference_period argument.
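
A small sketch of what this validation accepts and rejects, assuming ds is an xcdat-opened dataset like the one in the validation script above:

# Valid: a tuple of two 'yyyy-mm-dd' strings.
ds.temporal.climatology(
    "sst", freq="season", reference_period=("1850-01-01", "1899-12-31")
)

# Invalid: endpoints that don't parse as 'yyyy-mm-dd' raise the ValueError above.
ds.temporal.climatology("sst", freq="season", reference_period=("1850", "1899"))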

@tomvothecoder tomvothecoder marked this pull request as ready for review February 14, 2023 21:02
@codecov-commenter commented Feb 14, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change

Comparison is base (e6dbed5) 100.00% compared to head (dbc2c63) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #417   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines         1293      1308   +15     
=========================================
+ Hits          1293      1308   +15     
Impacted Files Coverage Δ
xcdat/temporal.py 100.00% <100.00%> (ø)

@pochedls (Collaborator) left a comment

Overall, this PR seems to work great and makes sense to me. It takes me a long time to re-orient myself to the inner workings of the .temporal accessor, so I'm hoping @lee1043 can also give this a careful review.

  • One thing I noticed is that calling ds.temporal.departures drops time_bnds – which I don't think needs to happen (I assume xarray doesn't know how to deal with this during the subtraction of the climatology). I think the time bounds can be added back in after subtraction. I noticed this in verifying that this PR was accurate (I get accuracy of order 10⁻⁴ – 10⁻⁷, I think because I added time bounds back in using add_missing_bounds, which is problematic).
  • Note to others that I did need to update my dev environment to pass the unit tests
  • Made some minor documentation suggestions
  • Steps zero and one of departures call _set_arg_attrs(), .copy(), _preprocess_dataset(), and group_average(). This overlaps strongly with _averager() – maybe things could be streamlined here (though I see that some of the order-of-operations is different, which may be why they are different).
  • As noted, departures calls group_average() on each call – I think this may not be necessary if the input/output frequencies are the same?

@tomvothecoder (Collaborator, Author) commented Feb 23, 2023

  • One thing I noticed is that calling ds.temporal.departures drops time_bnds – which I don't think needs to happen (I assume xarray doesn't know how to deal with this during the subtraction of the climatology). I think the time bounds can be added back in after subtraction. I noticed this in verifying that this PR was accurate (I get accuracy of order 10⁻⁴ – 10⁻⁷, I think because I added time bounds back in using add_missing_bounds, which is problematic).

Before this PR:

  • departures() used to maintain time_bnds because the output frequency was the same as the input frequency (no group_average() call).

In this PR:

  • Since we now call group_average(), time_bnds are dropped because the time coordinates are grouped which changes the time frequency and shape.

Solutions/Workarounds:

  1. As you mentioned, get new time_bnds for the new grouped time coordinates using add_missing_bounds() (a sketch follows below). I think the accuracy should improve with [Feature]: Add functionality to accurately set time bounds #412.
  2. Keep a separate time_original dimension/coordinates to keep time_bnds in the dataset (otherwise we get time_bnds with NaT values since time frequency and shape changes).
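
A sketch of workaround 1, using the dataset from the validation script above (the exact add_missing_bounds() signature may differ across xcdat versions):

ds_departs = ds.temporal.departures(
    "sst", freq="season", reference_period=("1971-01-01", "2000-12-31")
)

# Regenerate time bounds for the new grouped (seasonal) time coordinates.
ds_departs = ds_departs.bounds.add_missing_bounds()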

  • Steps zero and one of departures call _set_arg_attrs(), .copy(), _preprocess_dataset(), and group_average(). This overlaps strongly with _averager() – maybe things could be streamlined here (though I see that some of the order-of-operations is different, which may be why they are different).

I saw your point and played around with different implementations using _average() to try to find ways to streamline the implementation. I wasn't able to find a simple way to call self._average() without making significant modifications to how the TemporalAccessor class attributes are set (e.g., self._mode). We can probably try refactoring this in the future if we want to.

I updated the order-of-operations and made clearer comments in the departures() method so it is hopefully easier to understand. Please check out this commit diff and let me know if those comments make sense.


  • As noted, departures calls group_average() on each call – I think this may not be necessary if the input/output frequencies are the same?

I also don't think we need to perform group averaging if the input/output frequencies are the same. It still works either way (and converts time coordinates to cftime if they aren't already), but skipping the group average might improve performance by removing unnecessary operations.

We can try determining if group_average() is necessary by using the existing _infer_freq() method and comparing the result to the input freq argument.

Comment on lines +659 to +662
# The climatology and grouped average APIs are called on the copied
# dataset to create separate instances of the `TemporalAccessor` class.
# This is done to avoid overriding the attributes of the current
# instance of `TemporalAccessor` (set in step #1 above).
@tomvothecoder (Collaborator, Author):
If the departures() performance or memory usage is ever an issue, we can look into refactoring these steps to avoid copying. Copying datasets multiple times might not be an issue though because xr.Dataset.copy() defaults to deep=False ("... a shallow copy of each of the component variable is made, so that the underlying memory region of the new dataset is the same as in the original dataset.")

On the other hand, xr.DataArray.copy() defaults to deep=True.
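
A minimal sketch illustrating those default copy semantics with toy data (not xcdat-specific):

import numpy as np
import xarray as xr

ds = xr.Dataset({"sst": ("time", np.arange(3.0))})

# xr.Dataset.copy() defaults to deep=False: the copied variables still share
# the original's underlying memory.
ds_shallow = ds.copy()
print(np.shares_memory(ds_shallow["sst"].values, ds["sst"].values))  # True

# xr.DataArray.copy() defaults to deep=True: the values are duplicated.
da_deep = ds["sst"].copy()
print(np.shares_memory(da_deep.values, ds["sst"].values))  # False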

Comment on lines +664 to +675
# Group averaging is only required if the dataset's frequency (input)
# differs from the `freq` arg (output).
ds_obs = ds.copy()
inferred_freq = self._infer_freq()
if inferred_freq != freq:
    ds_obs = ds_obs.temporal.group_average(
        data_var,
        freq,
        weighted,
        keep_weights,
        season_config,
    )
@tomvothecoder (Collaborator, Author):
Added a conditional so group averaging is only performed if the input and output frequencies differ. This should save on performance costs when they are the same.

Related to #417 (comment) and #417 (review).

@tomvothecoder tomvothecoder requested a review from jasonb5 March 1, 2023 21:04
@pochedls (Collaborator) left a comment

This looks good to me. Thank you for adding the conditional for group averaging. I think this may speed up the code, and it also has the side effect of preserving the time bounds.

If you accept my modifications in #418, there will need to be some minor changes when calling _infer_freq.

@tomvothecoder (Collaborator, Author) left a comment

My final self-review. I'm going to merge after the build passes.

A code review from @jasonb5 for departures() would still be useful.

Comment on lines 641 to +723

-        # Preprocess the dataset based on method argument values.
+        # 2. Copy the original dataset and preprocess (if needed) for reuse.
+        # ----------------------------------------------------------------------
         ds = self._dataset.copy()
+
+        # Preprocess ahead of time to avoid needing to preprocess again when
+        # calling the group average or climatology APIs in step #3.
         ds = self._preprocess_dataset(ds)

-        # Group the observation data variable.
-        dv_obs = _get_data_var(ds, data_var)
-        self._labeled_time = self._label_time_coords(dv_obs[self.dim])
-        dv_obs_grouped = self._group_data(dv_obs)
+        # 3. Calculate the grouped average and climatology of the data variable.
+        # ----------------------------------------------------------------------
+        # The climatology and grouped average APIs are called on the copied
+        # dataset to create separate instances of the `TemporalAccessor` class.
+        # This is done to avoid overriding the attributes of the current
+        # instance of `TemporalAccessor` (set in step #1 above).
+
+        # Group averaging is only required if the dataset's frequency (input)
+        # differs from the `freq` arg (output).
+        ds_obs = ds.copy()
+        inferred_freq = self._infer_freq()
+        if inferred_freq != freq:
+            ds_obs = ds_obs.temporal.group_average(
+                data_var,
+                freq,
+                weighted,
+                keep_weights,
+                season_config,
+            )

         # Calculate the climatology of the data variable.
         ds_climo = ds.temporal.climatology(
-            data_var, freq, weighted, keep_weights, season_config
+            data_var,
+            freq,
+            weighted,
+            keep_weights,
+            reference_period,
+            season_config,
         )
-        dv_climo = ds_climo[data_var]

-        # Rename the time dimension using the name from the grouped observation
-        # data. The dimension names must align for xarray's grouped arithmetic
-        # to work. Otherwise, the error below is thrown: `ValueError:
-        # incompatible dimensions for a grouped binary operation: the group
-        # variable '<CHOSEN_FREQ>' is not a dimension on the other argument`
+        # 4. Group the averaged data variable values by the time `freq`.
+        # ----------------------------------------------------------------------
+        # This step allows us to perform xarray's grouped arithmetic to
+        # calculate departures.
+        dv_obs = ds_obs[data_var].copy()
+        self._labeled_time = self._label_time_coords(dv_obs[self.dim])
+        dv_obs_grouped = self._group_data(dv_obs)
+
+        # 5. Align time dimension names using the labeled time dimension name.
+        # ----------------------------------------------------------------------
+        # The climatology's time dimension is renamed to the labeled time
+        # dimension in step #4 above (e.g., "time" -> "season"). xarray requires
+        # dimension names to be aligned to perform grouped arithmetic, which we
+        # use for calculating departures in step #5. Otherwise, this error is
+        # raised: "`ValueError: incompatible dimensions for a grouped binary
+        # operation: the group variable '<FREQ ARG>' is not a dimension on the
+        # other argument`".
+        dv_climo = ds_climo[data_var]
         dv_climo = dv_climo.rename({self.dim: self._labeled_time.name})

-        # Calculate the departures for the data variable, which uses the formula
-        # observation - climatology.
+        # 6. Calculate the departures for the data variable.
+        # ----------------------------------------------------------------------
+        # departures = observation - climatology
         with xr.set_options(keep_attrs=True):
-            ds_departs = ds.copy()
-            ds_departs[data_var] = dv_obs_grouped - dv_climo
-            ds_departs[data_var] = self._add_operation_attrs(ds_departs[data_var])
+            dv_departs = dv_obs_grouped - dv_climo
+            dv_departs = self._add_operation_attrs(dv_departs)
+            ds_obs[data_var] = dv_departs

-        # The original time coordinates are restored after grouped
-        # subtraction. The grouped time coordinates are dropped since they
-        # are no longer required.
-        ds_departs = ds_departs.drop_vars(self._labeled_time.name)
+        # The original time dimension name is restored after grouped
+        # arithmetic, so the labeled time dimension name is no longer needed
+        # and therefore dropped.
+        ds_obs = ds_obs.drop_vars(self._labeled_time.name)

         if weighted and keep_weights:
             self._weights = ds_climo.time_wts
-            ds_departs = self._keep_weights(ds_departs)
+            ds_obs = self._keep_weights(ds_obs)

-        return ds_departs
+        return ds_obs
@tomvothecoder (Collaborator, Author):
Hey @jasonb5, I'm tagging you for code review, mainly for the departures() method.

Let me know if the comments make sense. As mentioned in the meeting, we can revisit refactoring the implementation if needed. In my opinion, it currently works well, and the new comments explain the intention behind the implementation design.

@tomvothecoder (Collaborator, Author):

This looks good to me. Thank you for adding the conditional for group averaging. I think this may speed up the code, and it also has the side effect of preserving the time bounds.

If you accept my modifications in #418, there will need to be some minor changes when calling _infer_freq.

Thanks for the review, Steve. After this PR is merged, we can rebase #418 on main and update that _infer_freq() call.

@tomvothecoder tomvothecoder force-pushed the feature/387-departures-ref-period branch from db2e5f9 to dbc2c63 on March 8, 2023 17:38
@tomvothecoder tomvothecoder merged commit 1d88dac into main Mar 8, 2023
@tomvothecoder tomvothecoder deleted the feature/387-departures-ref-period branch March 8, 2023 17:44
Labels
type: enhancement New enhancement request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Allow user selection of reference period in departures
3 participants