Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Consolidate verification tasks using retrieve_data.py #864

Merged
merged 24 commits into from
Aug 15, 2023

Conversation

mkavulich
Copy link
Collaborator

@mkavulich mkavulich commented Jul 19, 2023

DESCRIPTION OF CHANGES:

As described in Issue #850, the scripts for retrieving observation data for verification are quite convoluted and hard-to-maintain due to the constant changing of naming conventions on HPSS. We can leverage the functionality of the retrieve_data.py script to greatly simplify and consolidate this code, and that's what this PR achieves.

In addition, many other bug fixes and improvements are included:

  • The j-Jobs and exscripts for the retrieving of CCPA, MRMS, NDAS, and NOHRSC data are consolidated into a single set of scripts.
  • Verification tests that use staged data on disk no longer need to manually deactivate "get_obs" tasks; now tasks check if staged data was specified and ensure it is all present
  • Obs locations (CCPA_OBS_DIR, MRMS_OBS_DIR, NDAS_OBS_DIR, and NOHRSC_OBS_DIR) no longer need a specific directory structure, though the file format remains the same
  • Major improvements to mrms_pull_topofhour.py; now has a main() function for importing, uses proper command-line arguments with argparse, uses f-strings to simplify print statements, and raises an exception if no observation file is found instead of failing silently
  • Fixes bug in retrieve_data.py where filenames in tarballs containing full paths are not properly extracted
  • Removes "choices" for --data_type argument in retrieve_data.py: the data type is customizable via the config file (which is also customizable), so we shouldn't hard-code arbitrary choices. Missing data types are gracefully handled later anyway.
  • Add all verification tests to a coverage suite
  • Add new test custom_ESGgrid_Great_Lakes_snow_8km that specifies a custom domain for a snowstorm in February 2023, re-using data that was staged for the AQM test. This is part of an effort to expand testing (finishing Issue Overhaul and consolidate WE2E tests, identify needed additional tests #587) that will be completed in a subsequent PR, including explicit tests for winter precipitation verification.
  • Add cold-air-damming test to all comprehensive test lists, not just orion.
  • Fixes bug described in Issue New test experiment MET_ensemble_verification_only_vx_time_lag can not be created due to invalid variable "NET" #851. Test MET_ensemble_verification_only_vx_time_lag now works on most platforms, not just hard-coded to Hera
  • Comments related to verification obs in config_defaults.yaml are consolidated and relevant information moved to new consolidated obs-pulling script

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

Ran all verification tests (including new one) on Jet and Hera successfully. Also ran all tests by retrieving data from HPSS, rather than using staged data; all passed. Ran Fundamental tests on Jet as well.

  • hera.intel
    • Ran all verification tests (including new one) and also modifed all verification tests to pull obs data from HPSS (bypassing staged data). All passed.
  • cheyenne.intel
    • Ran MET_verification_only_vx, MET_ensemble_verification, and MET_ensemble_verification_only_vx
  • jet.intel
    • Ran all verification tests (including new one) and also modifed all verification tests to pull obs data from HPSS (bypassing staged data). Also ran fundamental tests.
  • fundamental test suite
    • Jet

Note: After rebasing latest changes, ran fundamental and all verification tests on Hera; all passed.

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@mkavulich
Copy link
Collaborator Author

@MichaelLueken This branch has been rebased with the latest changes from develop, so should be in its final state pending reviewer comments.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich -

Overall these changes look good to me!

I do have some concerns about adding the 2020_CAD test to comprehensive.cheyenne and the new custom_ESGgrid_Great_Lakes_snow_8km to comprehensive.

The 2020_CAD test only has data staged on Cheyenne. All other machines are able to pull data from AWS, but that wasn't possible on Cheyenne. Since staged data isn't available on all machines, there is no USE_USER_STAGED_EXTRN_FILES in the test's config file. In order to add this test to comprehensive.cheyenne, the staged data on Cheyenne would need to be propagated to the rest of the machines and USE_USER_STAGED_EXTRN_FILES: true would need to be added. It should be fine to add this test to comprehensive, but should be removed from comprehensive.cheyenne (unless the data is added to the rest of the Tier-1 platforms).

The new custom_ESGgrid_Great_Lakes_snow_8km test requires access to HPSS to pull the necessary verification data. While this test has been successfully run on both Hera and Jet, Gaea doesn't have the necessary hpss module that can be loaded to pull the data from HPSS. Would adding this test to both the coverage.jet and coverage.hera.gnu.com (or coverage.hera.intel.nco) suites and removing it from comprehensive be the better path forward?

tests/WE2E/machine_suites/comprehensive.cheyenne Outdated Show resolved Hide resolved
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich -

Thanks for addressing my concern regarding the 2020_CAD test in the comprehensive.cheyenne file! Approving these changes now.

@VanderleiVargas-NOAA
Copy link

@mkavulich

It seems that some of the tests are failing on Hera. The following tests succeeded:

grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16
MET_ensemble_verification_only_vx
MET_verification_only_vx

While the following tests failed:

custom_ESGgrid_Great_Lakes_snow_8km
MET_ensemble_verification_only_vx_time_lag

On the develop branch all tests succeeded except for custom_ESGgrid_Great_Lakes_snow_8km. Since this last one is a new config file, I simply copied it and tried to use it on the develop branch.

In all the cases where the run did not succeed, the code stopped while executing the following tasks: get_obs_ccpa, get_obs_nohrsc, get_obs_mrms, get_obs_ndas.

@mkavulich
Copy link
Collaborator Author

@VanderleiVargas-NOAA can you point me to your tests on Hera? I have run those tests on Hera (with Intel) and they passed on my end.

@VanderleiVargas-NOAA
Copy link

@mkavulich

the outputs are at
/scratch2/BMC/fv3lam/Vanderlei.Vargas/expt_dirs/custom_grid_issue_850

the SRW app that I used is at
/scratch2/BMC/fv3lam/Vanderlei.Vargas/ufs-srweather-app_mike

…sion may have bugs, I am committing so I can test on another platform since my Jet "fairshare" is apparently off-the-charts low
…e tarball contains a full path, we need to lstrip the leading "/" before globbing
All tasks now working for simple test case! Next up is to try all WE2E
vx tests
…ctory, we need to create it prior to calling the script
obs for tests MET_verification, MET_verification_only_vx, and
MET_ensemble_verification (modified to pull rather than use staged data)

 - Remove old j-jobs and exscripts
 - Cleanup and update mrms_pull_topofhour.py to be more pythonic
 - Move to obs dir prior to pulling data to avoid potential conflicts
from temporarily extracted files from HPSS
 - Move many comments from config_defaults.yaml to more appropriate
location in new data pulling/staging script; add more details about data
quirks
…verification tests now pass when retrieving from HPSS!
 - Remove usage of "*_vrfy" functions due to issues mentioned in Issue
   861 (ufs-community#861)
 - Remove extra "mkdir" statement where it shouldn't be for MRMS
 - Use correct variable (vhh_noZero) for checking valid hour for NDAS
 - Remove stanzas eliminating the "task_get_obs*" tasks; these should
   now be run every time (if using staged data, they will check that the
   data exists)
 - Modify tasks RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP
   (MET_verification) and MET_ensemble_verification_only_vx_time_lag to
   retrieve obs data from HPSS
 - Fix MET_ensemble_verification_only_vx_time_lag test by correcting
   variable NET to NET_default (addresses issue 851)
 - Add new verification task custom_ESGgrid_Eastern_NA_snow_8km that
   will have winter weather verification tasks (for subsequent PR)
 - Don't retrieve obs data from HPSS for
   grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP; too many
   platforms use this test
 - Rename custom_ESGgrid_Eastern_NA_snow_8km to
   custom_ESGgrid_Great_Lakes_snow_8km; add to comprehensive and Jet
   coverage test
 - Add MET_ensemble_verification_only_vx to comprehensive for Cheyenne
   and Orion, as well as to coverage suite via Hera/GNU
 - Move MET_verification_only_vx coverage from Hera/GNU to Cheyenne/GNU
 - Add new mysterious "2020_CAD" test to all comprehensive lists, not
   just Orion (what's up with that?), sort alphabetically
 - Remove stand-alone j-jobs and exscripts for NOHRSC
 - Add NOHRSC_obs stanza to parm/data_locations.yml
 - Update documentation and comments to include NOHRSC obs
 - Add "ASNOW" (accumulated snow) verification to
   custom_ESGgrid_Great_Lakes_snow_8km
 -
…E testing purposes, similar to TEST_ALT_EXTRN_MDL_SYSBASEDIR_ICS
@mkavulich
Copy link
Collaborator Author

mkavulich commented Aug 15, 2023

@MichaelLueken The conflicts have been resolved and I commented out the two tests we discussed from the coverage suite files. Let me know if you need anything else from my end.

Edit: to ensure the conflict resolution didn't break anything major I ran the verification and fundamental tests on Hera/Intel

@MichaelLueken MichaelLueken added the run_we2e_jenkins_coverage_tests SRW App automated CI testing with modified Jenkinsfile label Aug 15, 2023
@MichaelLueken
Copy link
Collaborator

The Cheyenne Intel WE2E coverage tests were run on Hera and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE      COMPLETE               7.24
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              29.03
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              19.14
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              25.14
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE               8.43
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              16.38
pregen_grid_orog_sfc_climo                                         COMPLETE               6.25
specify_template_filenames                                         COMPLETE               7.25
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             118.86

The Cheyenne GNU WE2E coverage tests were run on Hera and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              20.57
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE             228.63
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              28.11
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              37.18
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              82.52
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              22.53
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP                 COMPLETE             314.56
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16               COMPLETE              49.54
MET_verification_only_vx                                           COMPLETE               0.12
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS                              COMPLETE              10.07
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             793.83

Manual testing of MET_ensemble_verification_only_vx_time_lag on Hera GNU:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_only_vx_time_lag                         COMPLETE               7.67
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE               7.67

The Gaea WE2E coverage tests were manually run and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community                                                          COMPLETE              20.68
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              26.89
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              30.83
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              33.32
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             354.19
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              31.69
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta     COMPLETE             355.89
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot      COMPLETE               9.69
nco_ensemble                                                       COMPLETE              81.59
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             349.73
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1294.50

The Jet WE2E coverage tests were manually run and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community                                                          COMPLETE              17.01
custom_ESGgrid                                                     COMPLETE              17.80
custom_GFDLgrid                                                    COMPLETE               8.46
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018         COMPLETE               9.85
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h     COMPLETE              52.35
get_from_HPSS_ics_RAP_lbcs_RAP                                     COMPLETE              16.02
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR                 COMPLETE             235.16
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              32.01
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               7.20
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta       COMPLETE             523.06
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR       COMPLETE              10.77
process_obs                                                        COMPLETE               0.22
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             929.91

Manual testing of custom_ESGgrid_Great_Lakes_snow_8km on Jet:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Great_Lakes_snow_8km                                COMPLETE              15.17
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              15.17

The Hera and Orion Jenkins tests successfully passed. Now moving forward with merging this work.

@MichaelLueken MichaelLueken merged commit 3dee70b into ufs-community:develop Aug 15, 2023
@MichaelLueken MichaelLueken mentioned this pull request Aug 18, 2023
33 tasks
MichaelLueken pushed a commit that referenced this pull request Sep 8, 2023
…sts, remove unsupported domains (#871)

Script improvements
 * run_WE2E_tests.py
    * Adds ability to run a group of all tests in a subdirectory of test_configs
    * Replace --use_cron_to_relaunch with a --launch argument that can take the values "python", "cron", or "none" (the last of which will create the experiments but not run them)
 * monitor_jobs.py
    * Adds --mode flag, which if specified as "advance" will only run rocotorun once for each experiment, then quit.
 * generate_FV3LAM_wflow.py and set_FV3nml_sfc_climo_filenames.py
    * Add --debug flag that will give more verbose output. No longer read "VERBOSE" variable from config.yaml for this script.
    * Make most prints "debug only"
 * get_crontab_contents.py
    * Overhaul script to remove global variables, add arguments as needed
    * Fixes bug where submitting jobs from cron won't work if your crontab is empty
    * Change functionality so that the script will remove the specified line from crontab if "--remove" flag is provided, otherwise it prints the crontab contents
New tests
 * Several new custom domain tests are added in a new custom_grids directory. The new custom domains were chosen to span a variety of locations, terrain types, and dates. Aside from custom_ESGgrid_Great_Lakes_snow_8km, these are basic tests that don't use many non-default settings, and so are good candidates for testing new SRW capabilities in the future. See the "Documentation" section for more details.
 * A new "long forecast" test (108 hours) starting at 2023060112 that retrieves FV3GFS grib2 input data from AWS (and so can not be run on Cheyenne).
Additional changes
 * Removes unsupported domains:
    * EMC_AK
    * EMC_HI
    * EMC_PR
    * EMC_GU
    * GSL_HAFSV0.A_25km
    * GSL_HAFSV0.A_13km
    * GSL_HAFSV0.A_3km
    * GSD_HRRR_AK_50km
 * For HPSS tests, ensure they all are explicitly set to look for HPSS data only
 * Test custom_ESGgrid_Great_Lakes_snow_8km (introduced in [develop] Consolidate verification tasks using retrieve_data.py #864) moved from verification to custom_tests, with a symbolic link (test_configs/verification/config.MET_verification_winter_wx.yaml) left behind. Also the ICs/LBCs are now from RAP output, retrieved from HPSS (since the observation data requires HPSS access anyway this is fine)
 * Make new comprehensive test file for Gaea (symlink to Orion's file) since that machine also does not have HPSS access
 * Various comment fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests run_we2e_jenkins_coverage_tests SRW App automated CI testing with modified Jenkinsfile
Projects
Status: Done
5 participants