
Update departures and climatology APIs with reference period #417

Merged
merged 11 commits into main from feature/387-departures-ref-period on Mar 8, 2023

Conversation

@tomvothecoder (Collaborator) commented Feb 13, 2023

Description

  • Closes [Feature]: Allow user selection of reference period in departures #387
  • Add a reference_period parameter to climatology() and departures() to subset the time series used to compute the climatology (usage sketch below)
  • In departures(), calculate the grouped average of the observation data before calculating anomalies
    • This behavior makes departures() align with CDAT's APIs for calculating departures
    • If the dataset's input frequency is the same as the output frequency (the freq arg), no grouped average is performed since it is unnecessary and incurs a performance cost
  • Update departures() notebook with reference_period example
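
For reference, a minimal usage sketch of the new parameter (the file path, variable name, and dates below are illustrative, not taken from this PR):

import xcdat as xc

ds = xc.open_dataset("ts.nc")  # illustrative input file

# Climatology computed only over the reference period.
ds_climo = ds.temporal.climatology(
    "ts", freq="season", reference_period=("1981-01-01", "2010-12-31")
)

# Departures relative to the same reference period.
ds_departs = ds.temporal.departures(
    "ts", freq="season", reference_period=("1981-01-01", "2010-12-31")
)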

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder added the type: enhancement New enhancement request label Feb 13, 2023
@tomvothecoder tomvothecoder self-assigned this Feb 13, 2023
@tomvothecoder tomvothecoder force-pushed the feature/387-departures-ref-period branch 2 times, most recently from a574360 to 3c24e1b on February 13, 2023 23:54
@tomvothecoder (Collaborator, Author) left a comment

Hi @lee1043 and @pochedls, this PR for implementing reference_period in climatology() and departures() is ready for review.

I decided not to implement reference period subsetting for average() and group_average() because the user can easily do that themselves before calling those temporal APIs, e.g., ds.isel(time=slice(....)).
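
For example, a hypothetical manual subset before calling group_average() (the input file and time range below are illustrative):

import xcdat as xc

ds = xc.open_dataset("./input/HadISST_sst.nc")

# Subset the time axis to the desired reference period first, then average.
ds_ref = ds.sel(time=slice("1971-01-01", "2000-12-31"))
ds_avg = ds_ref.temporal.group_average("sst", freq="season")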

Validation

In addition to the updated unit tests, below is the validation script that compares method 1 (@lee1043's cookbook) to method 2 (this PR implementation) for calculating climatologies and departures relative to a reference period. The result of the script shows that both methods produce identical results.

# flake8: noqa F401 E256
#%%
import matplotlib.pyplot as plt

import xcdat as xc

#%%
infile = "./input/HadISST_sst.nc"
data_var = "sst"  #  name of data variable

ds = xc.open_dataset(infile)
ds[data_var] = ds[data_var].where(ds[data_var] >= -0.5)  # exclude missing value

#%%
# =======================================
#  METHOD 1 - xCDAT + xarray
#  Source: https://github.com/xCDAT/xcdat_test/blob/main/cookbook/seasonal_departure_from_baseline_clim.ipynb
# =======================================
# 1. Get seasonal average of the full time series
ds_season = ds.temporal.group_average(data_var, freq="season")

#%%
# 2. Temporal subset of original dataset to get reference climatology
ds_reference = ds.sel(time=slice("1971-01-01", "2000-12-31"))
ds_clim1 = ds_reference.temporal.climatology(data_var, freq="season")
ds_clim1[data_var].isel(time=0).plot(cmap="viridis")

#%%
# 3. Calculate departures
ds_departs1 = ds_season.copy()
ds_departs1[data_var + "_departure"] = ds_departs1[data_var].groupby(
    "time.season"
) - ds_clim1[data_var].groupby("time.season").mean("time")
ds_departs1[data_var + "_departure"].isel(time=0).plot()

# %%
# =======================================
#  METHOD 2 - xCDAT
# =======================================
# 1. Temporal subset of original dataset to get reference climatology
ds_clim2 = ds.temporal.climatology(
    data_var, freq="season", reference_period=("1971-01-01", "2000-12-31")
)

ds_clim2[data_var].isel(time=0).plot(cmap="viridis")

# Check equals method 1
ds_clim2["sst"].equals(ds_clim1["sst"])  # True

#%%
# 2. Calculate departures
ds_departs2 = ds.temporal.departures(
    data_var, freq="season", reference_period=("1971-01-01", "2000-12-31")
)
ds_departs2[data_var].isel(time=0).plot()

#%%
# =======================================
#  Compare Outputs
# =======================================
# Drop season sub-coordinates
ds_departs1_new = ds_departs1.drop_vars("season")

# Get the expected and resulting departures
method1_anom = ds_departs1_new["sst_departure"].rename("sst")
method2_anom = ds_departs2["sst"]
#%%
# Compare equality
method2_anom.equals(method1_anom)  # True

#%%
# Compare identical (drop attributes)
expected2 = method1_anom.copy()
result2 = method2_anom.copy()

expected2.attrs = {}
result2.attrs = {}

expected2.identical(result2)  # True

@@ -341,6 +349,7 @@ def climatology(
freq: Frequency,
weighted: bool = True,
keep_weights: bool = False,
reference_period: Optional[Tuple[str, str]] = None,
@tomvothecoder (Collaborator, Author):

The new reference_period parameter for climatology().

Comment on lines 387 to 390
reference_period : Optional[Tuple[str, str]], optional
The climatological reference period, which is a subset of the entire
time series. This parameter accepts a tuple of strings in the format
'yyyy-mm-dd'. For example, ``('1850-01-01', '1899-12-31')``.
@tomvothecoder (Collaborator, Author):

Description of the reference_period parameter.

@@ -485,6 +504,7 @@ def departures(
freq: Frequency,
weighted: bool = True,
keep_weights: bool = False,
reference_period: Optional[Tuple[str, str]] = None,
@tomvothecoder (Collaborator, Author):

The new reference_period parameter for departures().

Comment on lines 644 to 654
# 1. Get the grouped average of the observation data variable.
ds_avg = ds.temporal.group_average(
    data_var, self._freq, weighted, keep_weights, season_config=season_config
)
dv_obs = ds_avg[data_var].copy()

# 2. Group the observation data variable by the departures frequency.
# This step allows us to perform xarray's grouped arithmetic to calculate
# departures.
self._labeled_time = self._label_time_coords(dv_obs[self.dim])
dv_obs_grouped = self._group_data(dv_obs)
@tomvothecoder (Collaborator, Author) commented Feb 14, 2023:

departures() now includes updated logic for performing a group average of the observation data variable based on the selected freq argument.

Group averaging was not previously implemented, which meant the departures data variable stayed on the same time frequency as the input time coordinates (which doesn't align with CDAT's behavior).

For example, with input time coordinates on a daily frequency (sketch below):

  • Before: pass freq="season" -> departures time coordinates return as daily
  • Now: pass freq="season" -> departures time coordinates return as seasonal
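
A minimal sketch of the new behavior, assuming a dataset with daily time coordinates (the file and variable names are illustrative):

import xcdat as xc

ds = xc.open_dataset("tas_daily.nc")  # daily time coordinates (illustrative)

ds_departs = ds.temporal.departures("tas", freq="season")

# Before this PR: ds_departs["tas"] stayed on the daily time coordinates.
# After this PR: the observation data is first group-averaged to `freq`, so
# ds_departs["tas"] is returned on seasonal time coordinates instead.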

Collaborator:
Thanks for catching and fixing this. But what if the input/output frequencies are the same? Should we still call group_average()?

Comment on lines +886 to +931
def _is_valid_reference_period(self, reference_period: Tuple[str, str]):
    try:
        datetime.strptime(reference_period[0], "%Y-%m-%d")
        datetime.strptime(reference_period[1], "%Y-%m-%d")
    except (IndexError, ValueError):
        raise ValueError(
            "Reference periods must be a tuple of strings with the format "
            "'yyyy-mm-dd'. For example, reference_period=('1850-01-01', "
            "'1899-12-31')."
        )
@tomvothecoder (Collaborator, Author):
The method that validates the reference_period argument.
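
A small sketch of what this validation accepts and rejects, assuming ds is an xcdat-opened dataset like the one in the validation script above:

# Valid: a tuple of two 'yyyy-mm-dd' strings.
ds.temporal.climatology(
    "sst", freq="season", reference_period=("1850-01-01", "1899-12-31")
)

# Invalid: endpoints that don't parse as 'yyyy-mm-dd' raise the ValueError above.
ds.temporal.climatology("sst", freq="season", reference_period=("1850", "1899"))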

@tomvothecoder tomvothecoder marked this pull request as ready for review February 14, 2023 21:02
@codecov-commenter commented Feb 14, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change

Comparison is base (e6dbed5) 100.00% compared to head (dbc2c63) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #417   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines         1293      1308   +15     
=========================================
+ Hits          1293      1308   +15     
Impacted Files Coverage Δ
xcdat/temporal.py 100.00% <100.00%> (ø)

@pochedls (Collaborator) left a comment

Overall, this PR seems to work great and makes sense to me. It takes me a long time to re-orient myself to the inner workings of the .temporal accessor, so I'm hoping @lee1043 can also give this a careful review.

  • One thing I noticed is that calling ds.temporal.departures drops time_bnds – which I don't think needs to happen (I assume xarray doesn't know how to deal with this during the subtraction of the climatology). I think the time bounds can be added back in after subtraction. I noticed this in verifying that this PR was accurate (I get accuracy of order 10⁻⁴ – 10⁻⁷, I think because I added time bounds back in using add_missing_bounds, which is problematic).
  • Note to others that I did need to update my dev environment to pass the unit tests
  • Made some minor documentation suggestions
  • Steps zero and one of departures call _set_arg_attrs(), .copy(), _preprocess_dataset(), and group_average(). This overlaps strongly with _averager() – maybe things could be streamlined here (though I see that some of the order-of-operations is different, which may be why they are different).
  • As noted, departures calls group_average() on each call – I think this may not be necessary if the input/output frequencies are the same?

@tomvothecoder (Collaborator, Author) commented Feb 23, 2023

  • One thing I noticed is that calling ds.temporal.departures drops time_bnds – which I don't think needs to happen (I assume xarray doesn't know how to deal with this during the subtraction of the climatology). I think the time bounds can be added back in after subtraction. I noticed this in verifying that this PR was accurate (I get accuracy of order 10⁻⁴ – 10⁻⁷, I think because I added time bounds back in using add_missing_bounds, which is problematic).

Before this PR:

  • departures() used to maintain time_bnds because the output frequency was the same as the input frequency (no group_average() call).

In this PR:

  • Since we now call group_average(), time_bnds are dropped because the time coordinates are grouped which changes the time frequency and shape.

Solutions/Workarounds:

  1. As you mentioned, get new time_bnds for the new grouped time coordinates using add_missing_bounds() (a sketch follows below). I think the accuracy should improve with [Feature]: Add functionality to accurately set time bounds #412.
  2. Keep a separate time_original dimension/coordinates to keep time_bnds in the dataset (otherwise we get time_bnds with NaT values since time frequency and shape changes).
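
A sketch of workaround 1, using the dataset from the validation script above (the exact add_missing_bounds() signature may differ across xcdat versions):

ds_departs = ds.temporal.departures(
    "sst", freq="season", reference_period=("1971-01-01", "2000-12-31")
)

# Regenerate time bounds for the new grouped (seasonal) time coordinates.
ds_departs = ds_departs.bounds.add_missing_bounds()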

  • Steps zero and one of departures call _set_arg_attrs(), .copy(), _preprocess_dataset(), and group_average(). This overlaps strongly with _averager() – maybe things could be streamlined here (though I see that some of the order-of-operations is different, which may be why they are different).

I saw your point and played around with different implementations using _average() to try to find ways to streamline the implementation. I wasn't able to find a simple way to call self._average() without making significant modifications to how the TemporalAccessor class attributes are set (e.g., self._mode). We can probably try refactoring this in the future if we want to.

I updated the order-of-operations and made clearer comments in the departures() method so it is hopefully easier to understand. Please check out this commit diff and let me know if those comments make sense.


  • As noted, departures calls group_average() on each call – I think this may not be necessary if the input/output frequencies are the same?

I also don't think we need to perform group averaging if the input/output frequencies are the same. It still works either way (and converts time coordinates to cftime if they aren't already), but skipping the group average might improve performance by removing unnecessary operations.

We can try determining if group_average() is necessary by using the existing _infer_freq() method and comparing the result to the input freq argument.

Comment on lines +659 to +662
# The climatology and grouped average APIs are called on the copied
# dataset to create separate instances of the `TemporalAccessor` class.
# This is done to avoid overriding the attributes of the current
# instance of `TemporalAccessor` (set in step #1 above).
@tomvothecoder (Collaborator, Author):
If the departures() performance or memory usage is ever an issue, we can look into refactoring these steps to avoid copying. Copying datasets multiple times might not be an issue though because xr.Dataset.copy() defaults to deep=False ("... a shallow copy of each of the component variable is made, so that the underlying memory region of the new dataset is the same as in the original dataset.")

On the other hand, xr.DataArray.copy() defaults to deep=True.
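
A minimal sketch illustrating those default copy semantics with toy data (not xcdat-specific):

import numpy as np
import xarray as xr

ds = xr.Dataset({"sst": ("time", np.arange(3.0))})

# xr.Dataset.copy() defaults to deep=False: the copied variables still share
# the original's underlying memory.
ds_shallow = ds.copy()
print(np.shares_memory(ds_shallow["sst"].values, ds["sst"].values))  # True

# xr.DataArray.copy() defaults to deep=True: the values are duplicated.
da_deep = ds["sst"].copy()
print(np.shares_memory(da_deep.values, ds["sst"].values))  # False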

Comment on lines +664 to +675
# Group averaging is only required if the dataset's frequency (input)
# differs from the `freq` arg (output).
ds_obs = ds.copy()
inferred_freq = self._infer_freq()
if inferred_freq != freq:
    ds_obs = ds_obs.temporal.group_average(
        data_var,
        freq,
        weighted,
        keep_weights,
        season_config,
    )
@tomvothecoder (Collaborator, Author):
Added a conditional so group averaging is only performed if the input and output frequencies differ. This should save on performance costs when they are the same.

Related to #417 (comment) and #417 (review).

@tomvothecoder tomvothecoder requested a review from jasonb5 March 1, 2023 21:04
@pochedls (Collaborator) left a comment

This looks good to me. Thank you for adding the conditional for group averaging. I think this may speed up the code, and it also has the side effect of preserving the time bounds.

If you accept my modifications in #418, there will need to be some minor changes when calling _infer_freq.

@tomvothecoder (Collaborator, Author) left a comment

My final self-review. I'm going to merge after the build passes.

A code review from @jasonb5 for departures() would still be useful.

Comment on lines 641 to +723

-        # Preprocess the dataset based on method argument values.
+        # 2. Copy the original dataset and preprocess (if needed) for reuse.
+        # ----------------------------------------------------------------------
         ds = self._dataset.copy()
+
+        # Preprocess ahead of time to avoid needing to preprocess again when
+        # calling the group average or climatology APIs in step #3.
         ds = self._preprocess_dataset(ds)

-        # Group the observation data variable.
-        dv_obs = _get_data_var(ds, data_var)
-        self._labeled_time = self._label_time_coords(dv_obs[self.dim])
-        dv_obs_grouped = self._group_data(dv_obs)
+        # 3. Calculate the grouped average and climatology of the data variable.
+        # ----------------------------------------------------------------------
+        # The climatology and grouped average APIs are called on the copied
+        # dataset to create separate instances of the `TemporalAccessor` class.
+        # This is done to avoid overriding the attributes of the current
+        # instance of `TemporalAccessor` (set in step #1 above).
+
+        # Group averaging is only required if the dataset's frequency (input)
+        # differs from the `freq` arg (output).
+        ds_obs = ds.copy()
+        inferred_freq = self._infer_freq()
+        if inferred_freq != freq:
+            ds_obs = ds_obs.temporal.group_average(
+                data_var,
+                freq,
+                weighted,
+                keep_weights,
+                season_config,
+            )

         # Calculate the climatology of the data variable.
         ds_climo = ds.temporal.climatology(
-            data_var, freq, weighted, keep_weights, season_config
+            data_var,
+            freq,
+            weighted,
+            keep_weights,
+            reference_period,
+            season_config,
         )
-        dv_climo = ds_climo[data_var]

-        # Rename the time dimension using the name from the grouped observation
-        # data. The dimension names must align for xarray's grouped arithmetic
-        # to work. Otherwise, the error below is thrown: `ValueError:
-        # incompatible dimensions for a grouped binary operation: the group
-        # variable '<CHOSEN_FREQ>' is not a dimension on the other argument`
+        # 4. Group the averaged data variable values by the time `freq`.
+        # ----------------------------------------------------------------------
+        # This step allows us to perform xarray's grouped arithmetic to
+        # calculate departures.
+        dv_obs = ds_obs[data_var].copy()
+        self._labeled_time = self._label_time_coords(dv_obs[self.dim])
+        dv_obs_grouped = self._group_data(dv_obs)
+
+        # 5. Align time dimension names using the labeled time dimension name.
+        # ----------------------------------------------------------------------
+        # The climatology's time dimension is renamed to the labeled time
+        # dimension in step #4 above (e.g., "time" -> "season"). xarray requires
+        # dimension names to be aligned to perform grouped arithmetic, which we
+        # use for calculating departures in step #5. Otherwise, this error is
+        # raised: "`ValueError: incompatible dimensions for a grouped binary
+        # operation: the group variable '<FREQ ARG>' is not a dimension on the
+        # other argument`".
+        dv_climo = ds_climo[data_var]
         dv_climo = dv_climo.rename({self.dim: self._labeled_time.name})

-        # Calculate the departures for the data variable, which uses the formula
-        # observation - climatology.
+        # 6. Calculate the departures for the data variable.
+        # ----------------------------------------------------------------------
+        # departures = observation - climatology
         with xr.set_options(keep_attrs=True):
-            ds_departs = ds.copy()
-            ds_departs[data_var] = dv_obs_grouped - dv_climo
-            ds_departs[data_var] = self._add_operation_attrs(ds_departs[data_var])
+            dv_departs = dv_obs_grouped - dv_climo
+            dv_departs = self._add_operation_attrs(dv_departs)
+            ds_obs[data_var] = dv_departs

-        # The original time coordinates are restored after grouped
-        # subtraction. The grouped time coordinates are dropped since they
-        # are no longer required.
-        ds_departs = ds_departs.drop_vars(self._labeled_time.name)
+        # The original time dimension name is restored after grouped
+        # arithmetic, so the labeled time dimension name is no longer needed
+        # and therefore dropped.
+        ds_obs = ds_obs.drop_vars(self._labeled_time.name)

         if weighted and keep_weights:
             self._weights = ds_climo.time_wts
-            ds_departs = self._keep_weights(ds_departs)
+            ds_obs = self._keep_weights(ds_obs)

-        return ds_departs
+        return ds_obs
@tomvothecoder (Collaborator, Author):
Hey @jasonb5, I'm tagging you for code review, mainly for the departures() method.

Let me know if the comments make sense. As mentioned in the meeting, we can revisit refactoring the implementation if needed. In my opinion, it currently works well, and the new comments explain the intention behind the implementation design.

@tomvothecoder (Collaborator, Author):

This looks good to me. Thank you for adding the conditional for group averaging. I think this may speed up the code, and it also has the side effect of preserving the time bounds.

If you accept my modifications in #418, there will need to be some minor changes when calling _infer_freq.

Thanks for the review, Steve. After this PR is merged, we can rebase #418 on main and update that _infer_freq() call.

@tomvothecoder tomvothecoder force-pushed the feature/387-departures-ref-period branch from db2e5f9 to dbc2c63 on March 8, 2023 17:38
@tomvothecoder tomvothecoder merged commit 1d88dac into main Mar 8, 2023
@tomvothecoder tomvothecoder deleted the feature/387-departures-ref-period branch March 8, 2023 17:44
Labels
type: enhancement New enhancement request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Allow user selection of reference period in departures
3 participants