Enhance Gen-Ens-Prod to standardize ensemble members relative to climatology. #1918

Closed
21 tasks done
j-opatz opened this issue Sep 14, 2021 · 8 comments · Fixed by #2061 or #2076
Assignees
Labels
MET: Ensemble Verification requestor: NOAA/CPC NOAA Climate Prediction Center requestor: UK Met Office United Kingdom Met Office required: FOR OFFICIAL RELEASE Required to be completed in the official release for the assigned milestone type: new feature Make it do something new
Milestone

Comments

@j-opatz
Contributor

j-opatz commented Sep 14, 2021

Describe the New Feature

Based on feedback from both the Met Office and CPC, standardizing ensemble members relative to the ensemble's climatological mean and standard deviation is common practice. The most popular version is to subtract the mean and divide by the stdev, but subtracting the mean only is also used. Other methods exist, but these two were the most requested in discussions with both offices.
An improvement to EnsembleStat that allows for a user-specified standardization, perhaps through a configuration variable (e.g. normalize = CLIMO_MEAN), would benefit all offices that work with ensemble climatology data.

Acceptance Testing

The best dataset to use for CPC would be NMME. It can be moved to a more optimal location once development begins.
Because the subtract mean/divide by stdev standardization is currently being done in METplus via Python Embedding, there are existing results to compare the new functionality against. The subtract-mean-only option currently has no previous results.

Time Estimate

3 days

Sub-Issues

Consider breaking the new feature down into sub-issues.
No sub-issues needed

Relevant Deadlines

The reorg of EnsembleStat

Funding Source

2799991

Define the Metadata

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Select component(s)
  • Select priority
  • Select requestor(s)

Projects and Milestone

  • Select Repository and/or Organization level Project(s) or add alert: NEED PROJECT ASSIGNMENT label
  • Select Milestone as the next official version or Future Versions

Define Related Issue(s)

Consider the impact to the other METplus components.

New Feature Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding source.
  • Fork this repository or create a branch of develop.
    Branch name: feature_<Issue Number>_<Description>
  • Complete the development and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Push local changes to GitHub.
  • Submit a pull request to merge into develop.
    Pull request: feature <Issue Number> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Linked issues
    Select: Repository level development cycle Project for the next official release
    Select: Milestone as the next official version
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Close this issue.
@j-opatz j-opatz added type: new feature Make it do something new priority: high requestor: UK Met Office United Kingdom Met Office alert: NEED ACCOUNT KEY Need to assign an account key to this issue requestor: NOAA/CPC NOAA Climate Prediction Center required: FOR OFFICIAL RELEASE Required to be completed in the official release for the assigned milestone MET: Ensemble Verification labels Sep 14, 2021
@j-opatz j-opatz added this to the MET 10.1.0 milestone Sep 14, 2021
@JohnHalleyGotway JohnHalleyGotway added the alert: NEED MORE DEFINITION Not yet actionable, additional definition required label Sep 23, 2021
@j-opatz
Contributor Author

j-opatz commented Feb 10, 2022

In order to describe this slightly better, I discussed the issue with Johnna and she provided the following succinct summary:

fcst standardized anomalies are calculated in model space, e.g. fcst_std_anom = (fcst - fcst_clim)/fcst_sd

obs standardized anomalies are calculated in obs space, e.g. obs_std_anom = (obs-obs_clim)/obs_sd

I'm including a few snippets of Johnna's code, which is currently doing the standardizing process outside of MET in a Python script:

# Define climatology for the lead and member of interest
clim = np.nanmean(full_fcst_array,axis=0)

# Define standard deviation for the lead and member of interest
stddev = np.nanstd(full_fcst_array,axis=0)

# Define anomalies and standardized anomalies (perhaps unnecessary)
for y in range(len(years)):
    anom[y,:,:] = full_fcst_array[y,:,:] - clim
    std_anom[y,:,:] = anom[y,:,:]/stddev

return clim, stddev, anom, std_anom

I think if this is placed anywhere, it makes the most sense in gen-ens-prod. It operates on ensemble members, requires climatology data, and stays within the scope of gen-ens-prod:

The Gen-Ens-Prod tool generates simple ensemble products (mean, spread, probability, etc) from gridded ensemble member input files. While it processes model inputs, it does not compare them to observations or compute statistics.

An argument could be made that this would be useful in other tools (e.g. grid-stat), but unlike regridding, which applies to numerous situations, ensemble standardization is strictly for ensembles. Users wanting this function for non-ensemble work should be able to pass the files into gen-ens-prod in a mock-ensemble file list, and as long as the variable names are the same it should behave the same.

@TaraJensen TaraJensen removed the alert: NEED ACCOUNT KEY Need to assign an account key to this issue label Feb 17, 2022
@JohnHalleyGotway
Collaborator

@j-opatz this is great info! Thanks.

Sounds like we need to add an option to GenEnsProd to do the following:
For each ensemble member, be able to...

  1. subtract off the ens mean (FCST_ANOM)
  2. subtract off the ens mean and divide by the stdev (FCST_STD_ANOM)
  3. subtract off the climo mean (CLIMO_ANOM)
  4. subtract off the climo mean and divide by the stdev (CLIMO_STD_ANOM)

We could consider adding a new GenEnsProd config file option:

normalize_flag = NONE, FCST_ANOM, FCST_STD_ANOM, CLIMO_ANOM, CLIMO_STD_ANOM

Add a global attribute to the NetCDF output file indicating the normalization method applied.

Update GenEnsProd to compute the ensemble mean/stdev, as needed.
If required climo data isn't available, error out.

Once this is working well in GenEnsProd, consider adding options 3 and 4 for normalize_flag to Grid-Stat, Point-Stat, Series-Analysis, and Ensemble-Stat. However, if applying it to these tools, we'd need to normalize both the forecast and observation data. This work would be a separate issue.
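
The four options listed above can be sketched in NumPy. This is only an illustration of the arithmetic, not the MET implementation: the function name normalize_members, the [member, lat, lon] array layout, and the masking of zero spread are all assumptions made for this example. The "error out" behavior for missing climo data is represented by a ValueError.

```python
import numpy as np

STD_METHODS = ("FCST_STD_ANOM", "CLIMO_STD_ANOM")

def normalize_members(members, method="NONE", climo_mean=None, climo_stdev=None):
    """Normalize an ensemble array of shape [member, lat, lon].

    Hypothetical helper for illustration only; not part of MET.
    """
    members = np.asarray(members, dtype=float)

    if method == "NONE":
        return members.copy()

    if method in ("FCST_ANOM", "FCST_STD_ANOM"):
        # Reference is the ensemble itself: mean/stdev across the member axis.
        mean = np.nanmean(members, axis=0)
        stdev = np.nanstd(members, axis=0)
    elif method in ("CLIMO_ANOM", "CLIMO_STD_ANOM"):
        # Reference is external climatology; fail if required data is missing.
        if climo_mean is None:
            raise ValueError(f"{method} requires climo_mean")
        if method == "CLIMO_STD_ANOM" and climo_stdev is None:
            raise ValueError("CLIMO_STD_ANOM requires climo_stdev")
        mean, stdev = climo_mean, climo_stdev
    else:
        raise ValueError(f"unknown normalization method: {method}")

    anom = members - mean
    if method in STD_METHODS:
        # Assumption: mask zero spread as NaN rather than dividing by zero.
        stdev = np.where(np.asarray(stdev) == 0, np.nan, stdev)
        return anom / stdev
    return anom
```

The two FCST_* options need only the ensemble members, while the two CLIMO_* options depend on climatology fields supplied from elsewhere, which is why only the latter can fail for missing input.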

@JohnHalleyGotway JohnHalleyGotway changed the title Standardize ensemble members relative to climatology Enhance Gen-Ens-Prod to standardize ensemble members relative to climatology Feb 17, 2022
@JohnHalleyGotway JohnHalleyGotway removed the alert: NEED MORE DEFINITION Not yet actionable, additional definition required label Feb 17, 2022
JohnHalleyGotway added a commit that referenced this issue Feb 19, 2022
@JohnHalleyGotway
Collaborator

Making progress. Added the normalize_flag option, updated config files, added documentation.

Still need to:

  • Add a new unit test to test normalizing in all supported ways.
  • Figure out how to differentiate between the different methods in the output... note that it can be normalized differently for each entry in the ens.field array.

@JohnHalleyGotway
Collaborator

JohnHalleyGotway commented Feb 19, 2022

@JohnHalleyGotway internal development notes:

  • rename normalize_flag config file option to just be normalize to make it more similar to the censor and convert options
  • do not automatically include the normalization type in the gen_ens_prod output variable names
  • do not include normalization type as a NetCDF output variable attribute since we don't do that when the data's been converted or censored
  • update the new unit test to set nc_var_str to customize the output variable names
  • moved normalize_data() utility function over to data_plane_util.h/.cc since we'll likely want to call it from other MET tools and switch it to pass DataPlane pointers rather than references
  • update the docs accordingly
  • write up a new issue to extend the usage of normalize to ps, gs, es, and sa (all tools that read climo data)

JohnHalleyGotway added a commit that referenced this issue Feb 19, 2022
…ore similar to the convert and censor_thresh/censor_val options.
JohnHalleyGotway added a commit that referenced this issue Feb 20, 2022
…e names or attributes. Normalizing the input data is similar to converting it or censoring it and that information is not written to the NetCDF output files. The nc_var_str config option can be used to customize the output variable names as the user sees fit.
JohnHalleyGotway added a commit that referenced this issue Feb 20, 2022
… in the vx_util library so that that functionality is available to other MET tools. ci-run-unit
@JohnHalleyGotway JohnHalleyGotway linked a pull request Feb 20, 2022 that will close this issue
14 tasks
@j-opatz j-opatz changed the title Enhance Gen-Ens-Prod to standardize ensemble members relative to climatology Enhance Gen-Ens-Prod to standardize ensemble members relative to ensemble climatology Feb 23, 2022
@j-opatz
Contributor Author

j-opatz commented Feb 23, 2022

After discussions with CPC, it's become clearer what steps need to be taken to accomplish the goal of standardizing ensemble members to the ensemble climatology. Big thanks to Johnna for the clarification, including the majority of the description that follows.

Starting with new capabilities 1) and 2) from above,

subtract off the ens mean (FCST_ANOM)
subtract off the ens mean and divide by the stdev (FCST_STD_ANOM)

What's desired by CPC is the ability to calculate model anomaly and model standardized anomaly with respect to the temporal model climatology and temporal standard deviation. 1) and 2) were initially discussed as the ensemble mean, with no temporal aspect.

Described in a more Pythonic way, CPC's data are organized with 1 netCDF file per forecast initialization, with m ensemble members and l leads in each file. For example, for CFSv2, the file labeled 198201 is a forecast initialized in January 1982 and includes l=12 leads and m=24 members. Each file would have an array of [lead, member, lat, lon]. We'd open these in a loop over all the available initializations.

Loading t files into an array in Python gives a 4D array raw_model_data[init, member, lat, lon]. There's no lead dimension in this array, since the current use case focuses on a single lead (lead 0). Loading all initializations from 1982-2010 at lead 0, the array will be raw_model_data[init | 29, member | 24, lat | 180, lon | 360].

Finally, to calculate the temporal model climatology, the average is taken over the "init" dimension for e.g. 1982-2010 (this can be any period). This will give an array of model_clim[member,lat,lon] that holds the average over 1982-2010 for each member and gridpoint.

Likewise, to calculate the temporal model standard deviation, the standard deviation is calculated over the "init" dimension for e.g. 1982-2010 (this can be any period). This will give an array of model_std_dev[member,lat,lon] that holds the standard deviation over 1982-2010 for each member and gridpoint.

Then it's simply a matter of applying the equations below to obtain the anomalies and standardized anomalies for each ensemble member:

FCST_ANOM[member,lat,lon] = raw_model_data[init=t,member,lat,lon] - model_clim.

FCST_STD_ANOM[member,lat,lon] = (raw_model_data[init=t,member,lat,lon] - model_clim)/model_std_dev.
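
The steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the data are synthetic random values on a small grid standing in for the 180 x 360 grid, and the array is assumed to be laid out as [init, member, lat, lon] at a single lead.

```python
import numpy as np

def member_climatology(raw_model_data):
    """Return (model_clim, model_std_dev), each shaped [member, lat, lon].

    raw_model_data is assumed to be [init, member, lat, lon] at one lead.
    Hypothetical helper for illustration only.
    """
    # Average and standard deviation over the "init" dimension (axis 0).
    model_clim = np.nanmean(raw_model_data, axis=0)
    model_std_dev = np.nanstd(raw_model_data, axis=0)
    return model_clim, model_std_dev

def standardize_init(raw_model_data, t, model_clim, model_std_dev):
    """Return (FCST_ANOM, FCST_STD_ANOM) for initialization index t."""
    fcst_anom = raw_model_data[t] - model_clim
    fcst_std_anom = fcst_anom / model_std_dev
    return fcst_anom, fcst_std_anom

# Synthetic stand-in: 29 initializations, 24 members, small 4 x 8 grid.
rng = np.random.default_rng(0)
raw_model_data = rng.normal(size=(29, 24, 4, 8))

model_clim, model_std_dev = member_climatology(raw_model_data)
fcst_anom, fcst_std_anom = standardize_init(
    raw_model_data, 0, model_clim, model_std_dev
)
print(model_clim.shape, fcst_std_anom.shape)
```

Note that np.nanstd defaults to the population standard deviation (ddof=0). Whether CPC's workflow uses that or the sample standard deviation is not stated in this thread.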

This was briefly discussed in a meeting yesterday, and I think @JohnHalleyGotway correctly surmised that this is actually a two-tool solution: the temporal model climatology and temporal model standard deviation for each ensemble member should be obtained via series-analysis, and the FCST_ANOM, FCST_STD_ANOM variables could be obtained via GenEnsProd. In this way, it's actually more in line with capabilities 3) and 4),

subtract off the climo mean (CLIMO_ANOM)
subtract off the climo mean and divide by the stdev (CLIMO_STD_ANOM)

as the temporal model climatologies and standard deviations could be fed in via the climo_mean and climo_stdev dictionaries in the configuration file.

When asked if there is value in keeping the ability to calculate the ensemble mean and ensemble standard deviation as described in the original 1) and 2) capabilities, Johnna provided the following:

It's still a useful metric. For example, one could see how much the members deviate from the mean (for example, say your ensemble mean says the temperature fcst is 50C at a gridpoint, you could see how much the ensemble members range about that value). It would be a good contribution, though not necessarily what we're looking for in this particular instance.

If this functionality is already in place with the current work that's done, there's no reason to tear it back out. But if it requires additional work and code updates, I'm all in favor of dropping that functionality.

@JohnHalleyGotway
Collaborator

JohnHalleyGotway commented Feb 25, 2022

@j-opatz testing revealed that additional logic is needed.
CPC processes the CFSv2 24-member ensemble (and also the NMME 120-member ensemble). Each of these ensemble members must be normalized relative to a 30-year average of that INDIVIDUAL MEMBER. In Gen-Ens-Prod, the climo_mean and climo_stdev dictionaries provide climo data that is applied in the same way to all ensemble member inputs. However, what we need here is to define climo data separately for each member.

One solution is leveraging the existing MET_ENS_MEMBER_ID keyword in the config file. Recommend enhancing the processing of climatology data so that any instances of MET_ENS_MEMBER_ID are replaced by the actual string for the current member. That string could appear in the VarInfo name or level strings but could also appear in the file_name array.

The hope is that this change will simplify the application of Gen-Ens-Prod to NOAA/CPC evaluation of these ensembles.
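
The idea of substituting the member ID into climatology file names can be illustrated with a short Python sketch. Everything here is an assumption for illustration: the helper name, the file name template, and the member IDs are hypothetical, and plain string replacement is used, which may differ from how MET actually resolves the keyword.

```python
def expand_member_id(template, member_id, keyword="MET_ENS_MEMBER_ID"):
    """Replace every occurrence of keyword in template with member_id.

    Hypothetical helper for illustration only; not part of MET.
    """
    return template.replace(keyword, member_id)

# Hypothetical climatology file name template and ensemble member IDs.
climo_template = "climo/member_MET_ENS_MEMBER_ID_mean.nc"
ens_member_ids = ["01", "02", "03"]

# One climatology file per member instead of one shared by all members.
climo_files = {
    member_id: expand_member_id(climo_template, member_id)
    for member_id in ens_member_ids
}
for member_id, path in climo_files.items():
    print(member_id, path)
```

With a mapping like this, each member is normalized against its own climatology rather than a single set of climo fields shared by all members.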

JohnHalleyGotway added a commit that referenced this issue Feb 25, 2022
…s so that we can use it later, if needed, when reading climatological data which may also make use of that string.
JohnHalleyGotway added a commit that referenced this issue Feb 25, 2022
…nt variable when reading climatology data if the ens_member_ids config option has been set and the normalizing relative to climatology has been requested.
@JohnHalleyGotway JohnHalleyGotway linked a pull request Feb 25, 2022 that will close this issue
15 tasks
JohnHalleyGotway added a commit that referenced this issue Mar 1, 2022
…larify what data is being read from which climo data files.
JohnHalleyGotway added a commit that referenced this issue Mar 2, 2022
…BER_ID to read climo data separately for each member.
JohnHalleyGotway added a commit that referenced this issue Mar 2, 2022
…BER_ID to read climo data separately for each member.
JohnHalleyGotway added a commit that referenced this issue Mar 2, 2022
* Per #1918, store the ensemble_member_id string in the EnsVarInfo class so that we can use it later, if needed, when reading climatological data which may also make use of that string.

* Per #1918, update gen_ens_prod to set the MET_ENS_MEMBER_ID environment variable when reading climatology data if the ens_member_ids config option has been set and the normalizing relative to climatology has been requested.

* Per #1918, add log messages to read_climo.cc and gen_ens_prod.cc to clarify what data is being read from which climo data files.

* Added documentation on MET_ENS_MEMBER_ID usage in climo file name

* updated usage language

* Per #1918, adding gen_ens_prod unit test to demonstrate using ENS_MEMBER_ID to read climo data separately for each member.

* Per #1918, adding gen_ens_prod unit test to demonstrate using ENS_MEMBER_ID to read climo data separately for each member.

Co-authored-by: j-opatz <59586397+j-opatz@users.noreply.github.com>
@JohnHalleyGotway JohnHalleyGotway changed the title Enhance Gen-Ens-Prod to standardize ensemble members relative to ensemble climatology Enhance Gen-Ens-Prod to standardize ensemble members relative to climatology. Mar 2, 2022
@JohnHalleyGotway JohnHalleyGotway removed a link to a pull request Mar 2, 2022
15 tasks
@JohnHalleyGotway JohnHalleyGotway linked a pull request Mar 2, 2022 that will close this issue
@georgemccabe
Collaborator

Looking at the logic for EnsembleStat and GenEnsProd, I think you are right that the integer argument is not needed. I think it was needed previously while I was doing development, but the final solution does not actually require it.

If ctrl_info is not set, then I believe we still need the field info of the first field instead of NULL because we need to pass that information to read the control field.

@georgemccabe, while doing development for this feature, I got confused by the usage of 'ens_info->get_ctrl(int)'.

An existing call to EnsVarInfo::get_ctrl(int) can be seen on this line. And a new call that I added can be seen on this line.

I'm confused about the integer argument to that function. Here's the definition of it.

1. If a control file has been specified on the command line, presumably ctrl_info is set, and that VarInfo is returned.

2. If ctrl_info is not set, then we return the VarInfo from the ensemble input corresponding to that index.

Looking at how "get_ctrl(i_ens)" is called in ensemble_stat.cc and gen_ens_prod.cc... it's only called when reset is true, which is really only when i_ens = 0. So we'd always be using the VarInfo from the first ensemble input.

I'm wondering if EnsVarInfo::get_ctrl(int) should NOT have an integer argument? If ctrl_info is set, return it, and if not, just return NULL.

Or maybe I don't understand the logic here?

4 participants