[develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains #871

Merged
Changes from 34 commits
Commits
40 commits
160220c
Add functions to setup.py that remove metatask dependencies for any m…
mkavulich Apr 25, 2023
d1c5054
Improving testing scripts
mkavulich May 23, 2023
f5ddd17
- Lint run_WE2E_tests.py (not 10/10 yet), add checks to ensure that
mkavulich Jul 26, 2023
cdda9b3
Add new custom domains of varying resolutions and locations. NZ grid …
mkavulich Apr 28, 2023
9af5c69
- Replace "use_cron_to_relaunch" argument with "launch" argument tha…
mkavulich Mar 30, 2023
6d068d2
- Add new hi-res (1.1km) SF area case
mkavulich Jul 26, 2023
be1648b
Explicitly look only on HPSS for HPSS tests
mkavulich Jun 1, 2023
51dbda0
Revert vx_only changes, this isn't a viable route to simplification
mkavulich Jul 26, 2023
c1c7720
Swap tests to get successes...UPP on gnu still seems buggy
mkavulich Jun 1, 2023
57efa83
Add long forecast (over 100 hours) case
mkavulich Jun 2, 2023
99fa01c
Add long_fcst to machine test files
mkavulich Jul 26, 2023
0facdd5
- Move new snow verification test to "custom_grids" directory, leave
mkavulich Jul 26, 2023
4b69dd9
Remove unused predefined domains
mkavulich Jul 26, 2023
eae54e4
- Add argparse for generate_FV3LAM_wflow.py; now includes "debug" flag
mkavulich Jul 27, 2023
0c92b5e
Gaea needs to use the same comprehensive suite as Orion (which also d…
mkavulich Jul 27, 2023
aee5c95
long_fcst test uses AWS data, so can not be run on Cheyenne
mkavulich Jul 27, 2023
e4a510b
Fix write component grids for new custom domains
mkavulich Jul 27, 2023
a81d2a1
Add new '--mode' flag to monitor_jobs.py to allow for "advance" mode …
mkavulich Jul 27, 2023
2f0d123
Fix SF test write componant grid
mkavulich Jul 28, 2023
f646f3b
Add default (False) for new debug argument, should fix failing test
mkavulich Jul 28, 2023
2b9ac58
Fix linting for generate_FV3LAM_wflow.py
mkavulich Jul 28, 2023
8311a06
Update new nco variable names in comments where they were missed
mkavulich Aug 4, 2023
4692d31
Jenkins tests do not use cron, so this option is unnecessary
mkavulich Aug 4, 2023
4e7829b
Updates to fix empty crontab issue
mkavulich Aug 16, 2023
e0f9f96
Change new custom domain layouts from get_layout.sh script
mkavulich Aug 16, 2023
7567188
Fix unit test for ush/get_crontab_contents.py
mkavulich Aug 16, 2023
57323d8
Make the linter happy 😑
mkavulich Aug 16, 2023
a01051e
Consolidate test for Quilting=False into existing test "grid_RRFS_CON…
mkavulich Aug 16, 2023
5062499
Since custom_ESGgrid_Great_Lakes_snow_8km has to pull HPSS data anywa…
mkavulich Aug 16, 2023
14759d8
long_fcst test retrieves from HPSS, so can not be run on Orion/Gaea
mkavulich Aug 16, 2023
ed8cfca
Missed a line in docstring
mkavulich Aug 16, 2023
3dce95f
Some fixes to crontab process:
mkavulich Aug 16, 2023
cea2e3e
Up requested walltime to 1 hour for new custom grids, just in case th…
mkavulich Aug 16, 2023
59ce3a8
Move New Zealand test away from Hera GNU (GNU compilers run significa…
mkavulich Aug 22, 2023
4c512e2
Update description for long_fcst test to be more accurate to what is …
mkavulich Aug 29, 2023
297008b
Add logic to not check commented crontab lines for matching command; …
mkavulich Aug 29, 2023
ecfc8a4
Better clarifying comments
mkavulich Aug 29, 2023
c647a4a
Simplify logic to an "else" statement
mkavulich Aug 29, 2023
8316c7d
Restore temporarily removed tests now that EPIC accounts can access H…
mkavulich Sep 5, 2023
7b8b883
Replace "jet-epic" --> "jet" per Michael Leuken
mkavulich Sep 5, 2023
7 changes: 6 additions & 1 deletion tests/WE2E/machine_suites/comprehensive
@@ -1,7 +1,12 @@
2020_CAD
community
custom_ESGgrid
custom_ESGgrid_Central_Asia_3km
custom_ESGgrid_Great_Lakes_snow_8km
custom_ESGgrid_IndianOcean_6km
custom_ESGgrid_NewZealand_3km
custom_ESGgrid_Peru_12km
custom_ESGgrid_SF_1p1km
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
custom_GFDLgrid
deactivate_tasks
@@ -54,6 +59,7 @@ grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot
GST_release_public_v1
long_fcst
MET_ensemble_verification_only_vx
MET_ensemble_verification_only_vx_time_lag
MET_verification_only_vx
@@ -64,6 +70,5 @@ nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
pregen_grid_orog_sfc_climo
quilting_false
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS
specify_template_filenames
5 changes: 5 additions & 0 deletions tests/WE2E/machine_suites/comprehensive.cheyenne
@@ -1,5 +1,10 @@
community
custom_ESGgrid
custom_ESGgrid_Central_Asia_3km
custom_ESGgrid_IndianOcean_6km
custom_ESGgrid_NewZealand_3km
custom_ESGgrid_Peru_12km
custom_ESGgrid_SF_1p1km
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
custom_GFDLgrid
deactivate_tasks
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/comprehensive.gaea
5 changes: 5 additions & 0 deletions tests/WE2E/machine_suites/comprehensive.orion
@@ -1,6 +1,11 @@
2020_CAD
community
custom_ESGgrid
custom_ESGgrid_Central_Asia_3km
custom_ESGgrid_IndianOcean_6km
custom_ESGgrid_NewZealand_3km
custom_ESGgrid_Peru_12km
custom_ESGgrid_SF_1p1km
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
custom_GFDLgrid
deactivate_tasks
4 changes: 2 additions & 2 deletions tests/WE2E/machine_suites/coverage.cheyenne
@@ -1,9 +1,9 @@
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
custom_ESGgrid_IndianOcean_6km
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR
#nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
pregen_grid_orog_sfc_climo
Collaborator

@mkavulich Should we remove the "nco_" prefix from test names now that tests can be run in either NCO or community mode without having to specify the mode in the experiment config file?

Collaborator Author

@mkavulich mkavulich Aug 29, 2023

I think including "nco" in these test names is still necessary, since they are explicitly run in NCO mode unless "community" is specified on the command line (or if it's a community-mode coverage test). Now, whether or not these tests still need to exist is another question; it doesn't look like any of them are unique, so we could eliminate them altogether so long as all those capabilities are tested as part of the Hera NCO coverage suite. I think it's a good idea to remove the "nco_" tests entirely, so if you agree then I'll go ahead and do that. @MichaelLueken should probably approve of that plan as well.

Collaborator Author

I take back one minor point: there is one thing that is only tested in the "nco_" tests, and that is using pre-staged grid/orog/sfc_climo data. I can't remember, is that something that should be specific to NCO tasks, or can we roll that into another test?

Collaborator Author

Thinking about it this morning, I think we should defer this topic to a different PR, especially since @MichaelLueken is on leave this week. I opened an issue #895 to discuss this, feel free to chime in there.

Collaborator

@mkavulich A couple of things:

  1. I see 4 tests that start with "nco_" under ufs-srweather-app/tests/WE2E/test_configs/grids_extrn_mdls_suites_nco. I started checking whether these have counterparts under grids_extrn_mdls_suites_community, to see if the "nco_" tests can be eliminated, but decided not to finish because it's a pretty time-consuming task (at least a couple of hours). I opened issue #896, "Remove any WE2E tests for NCO mode (names starting with "nco_") that test features that are already tested in community mode". Feel free to comment there if I didn't capture the issue correctly.
  2. Use of pre-staged grid/orog/sfc_climo data is not specific to NCO mode, but I'd say it is more important for NCO mode because (at least as of a couple of years ago) EMC (who always seem to run in NCO mode) did not want to have the make_[grid|orog|sfc_climo] tasks as part of their workflow, i.e. they always used pre-staged data. That's why NCO mode always uses pre-staged files (or at least used to; not sure now). But these are also important in community mode because once community-mode users get their grid/orog/sfc_climo files set up, they tend to want to skip those tasks in later experiments.
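
A rough first pass at the counterpart check described in item 1 could be scripted by comparing names only. The sketch below is an illustration, not part of this PR: the function name and the sample test names are hypothetical, and a name match does not prove that two tests exercise the same features, so each match would still need a manual look at the config files.

```python
def find_nco_counterparts(nco_tests, community_tests):
    """Split "nco_" test names by whether a same-named community test exists.

    Assumption: a counterpart is a community test whose name equals the
    NCO test name with the leading "nco_" removed.
    """
    community = set(community_tests)
    matched, unmatched = [], []
    for name in nco_tests:
        base = name[len("nco_"):] if name.startswith("nco_") else name
        if base in community:
            matched.append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


# Hypothetical names, for illustration only
matched, unmatched = find_nco_counterparts(
    ["nco_example_a", "nco_example_b"],
    ["example_a", "example_c"],
)
print("have a counterpart:", matched)
print("need manual review:", unmatched)
```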

specify_template_filenames
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/coverage.cheyenne.gnu
@@ -1,3 +1,4 @@
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/coverage.gaea
@@ -1,4 +1,5 @@
community
custom_ESGgrid_NewZealand_3km
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR
3 changes: 2 additions & 1 deletion tests/WE2E/machine_suites/coverage.hera.gnu.com
Collaborator

Why does the file name have a .com extension? Was the intent to have a *.nco file with the list of tests?

Collaborator

@natalie-perlin - The .com extension is to ensure that all tests in the file run in the community environment. Within this file, there is an nco test, nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16. When this test is launched, the nco environment is overwritten so that it will run in community mode.

This is similar to the coverage.hera.intel.nco file. All of the tests in this file are normally run in community mode, but the .nco extension ensures that the tests use the nco environment instead.

There are no plans to include either a coverage.hera.intel.com or coverage.hera.gnu.nco file.
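
As a minimal sketch of the naming convention described in this reply: the last extension of the suite file name selects the run mode for every test in the file. The function name and the mapping below are assumptions made for illustration; the actual handling in run_WE2E_tests.py may differ.

```python
def mode_override_from_suite_file(filename):
    """Return the run mode forced by a suite file name, or None.

    Assumption: a trailing ".com" forces community mode and a trailing
    ".nco" forces NCO mode; any other name leaves each test's own mode.
    """
    suffix = filename.rsplit(".", 1)[-1] if "." in filename else ""
    return {"com": "community", "nco": "nco"}.get(suffix)


for name in ("coverage.hera.gnu.com", "coverage.hera.intel.nco", "coverage.orion"):
    print(name, "->", mode_override_from_suite_file(name))
```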

@@ -1,10 +1,11 @@
custom_ESGgrid_Peru_12km
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta
quilting_false
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0
GST_release_public_v1
long_fcst
MET_verification_only_vx
#MET_ensemble_verification_only_vx_time_lag Removed temporarily due to HPSS permissions issue
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
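
The suite files in this diff are plain lists with one test name per line, and a leading `#` disables a test (as on the MET_ensemble_verification_only_vx_time_lag line). A reader for that layout might look like the sketch below; the function name is hypothetical, and the rule that blank lines and `#` lines are skipped is an assumption about how the test script treats them.

```python
def read_suite(text):
    """Return the active test names from the contents of a suite file.

    Assumption: blank lines and lines starting with '#' are ignored.
    """
    tests = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        tests.append(line)
    return tests


sample = """\
long_fcst
MET_verification_only_vx
#MET_ensemble_verification_only_vx_time_lag Removed temporarily due to HPSS permissions issue
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
"""
print(read_suite(sample))
```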
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/coverage.hera.intel.nco
@@ -1,3 +1,4 @@
custom_ESGgrid_Central_Asia_3km
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems
get_from_HPSS_ics_HRRR_lbcs_RAP
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/coverage.orion
@@ -1,3 +1,4 @@
custom_ESGgrid_SF_1p1km
deactivate_tasks
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2mems
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta
26 changes: 21 additions & 5 deletions tests/WE2E/monitor_jobs.py
@@ -16,20 +16,23 @@
 from utils import calculate_core_hours, write_monitor_file, update_expt_status,\
                   update_expt_status_parallel, print_WE2E_summary

-def monitor_jobs(expts_dict: dict, monitor_file: str = '', procs: int = 1, debug: bool = False) -> str:
+def monitor_jobs(expts_dict: dict, monitor_file: str = '', procs: int = 1,
+                 mode: str = 'continuous', debug: bool = False) -> str:
     """Function to monitor and run jobs for the specified experiment using Rocoto

     Args:
         expts_dict (dict): A dictionary containing the information needed to run
                            one or more experiments. See example file monitor_jobs.yaml
         monitor_file (str): [optional]
+        mode (str): [optional] Mode of job monitoring
+                    continuous (default): monitor jobs continuously until complete
+                    advance: increment jobs once, then quit
         debug (bool): [optional] Enable extra output for debugging

     Returns:
-        str: The name of the file used for job monitoring (when script is finished, this
+        str: The name of the file used for job monitoring (when script is finished, this
              contains results/summary)
     """

     monitor_start = datetime.now()
     # Write monitor_file, which will contain information on each monitored experiment
     monitor_start_string = monitor_start.strftime("%Y%m%d%H%M%S")
@@ -52,6 +55,12 @@ def monitor_jobs(expts_dict: dict, monitor_file: str = '', procs: int = 1, debug

     write_monitor_file(monitor_file,expts_dict)

+    if mode != 'continuous':
+        logging.debug("All experiments have been updated")
+        return monitor_file
+    else:
+        logging.debug("Continuous mode: will monitor jobs until all are complete")
+
     logging.info(f'Setup complete; monitoring {len(expts_dict)} experiments')
     logging.info('Use ctrl-c to pause job submission/monitoring')
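
The `mode` check in this hunk returns right after the setup pass in "advance" mode, while "continuous" mode goes on into the monitoring loop. A stripped-down sketch of that control flow follows. Everything in it other than the two mode names is hypothetical: `advance_one_step` stands in for the real Rocoto status update, and `max_loops` is only a safety cap for the example.

```python
def monitor(expts, advance_one_step, mode="continuous", max_loops=1000):
    """Advance experiments once, then optionally keep looping until all finish.

    advance_one_step(name) is a hypothetical callback that advances one
    experiment and returns True once that experiment is complete.
    Returns the set of experiments that are still running.
    """
    # Setup pass: every experiment is advanced exactly once
    running = {name for name in expts if not advance_one_step(name)}

    if mode != "continuous":
        # "advance" mode: stop after the single pass
        return running

    # "continuous" mode: keep advancing whatever is still running
    loops = 0
    while running and loops < max_loops:
        running = {name for name in running if not advance_one_step(name)}
        loops += 1
    return running


# Fake experiments that need a fixed number of steps to finish
steps_left = {"expt_a": 2, "expt_b": 1}

def fake_step(name):
    steps_left[name] -= 1
    return steps_left[name] <= 0

print(monitor(["expt_a", "expt_b"], fake_step, mode="advance"))
```

Calling the "advance" form repeatedly (for example from cron) would move each experiment forward one step per call, which matches the option's description in the docstring.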

@@ -102,7 +111,8 @@ def monitor_jobs(expts_dict: dict, monitor_file: str = '', procs: int = 1, debug
         endtime = datetime.now()
         total_walltime = endtime - monitor_start

-        logging.debug(f"Finished loop {i}\nWalltime so far is {str(total_walltime)}")
+        logging.debug(f"Finished loop {i}")
+        logging.debug(f"Walltime so far is {str(total_walltime)}")
         #Slow things down just a tad between loops so experiments behave better
         time.sleep(5)

@@ -160,6 +170,11 @@ def setup_logging(logfile: str = "log.run_WE2E_tests", debug: bool = False) -> N
     parser.add_argument('-p', '--procs', type=int,
                         help='Run resource-heavy tasks (such as calls to rocotorun) in parallel, '\
                              'with provided number of parallel tasks', default=1)
+    parser.add_argument('-m', '--mode', type=str, default='continuous',
+                        choices=['continuous','advance'],
+                        help='continuous: script will run continuously until all experiments are'\
+                             'finished.'\
+                             'advance: will only advance each experiment one step')
     parser.add_argument('-d', '--debug', action='store_true',
                         help='Script will be run in debug mode with more verbose output')
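
One detail in the new `--mode` help text: Python joins adjacent string literals with nothing between them, so the three fragments as written appear to render as "...experiments arefinished.advance: ...". The sketch below shows the same option with the separators written into the literals. The wording of the help text and the `build_parser` wrapper are this example's own, not the PR's.

```python
import argparse


def build_parser():
    """Build a parser with only the --mode option, for illustration."""
    parser = argparse.ArgumentParser(description="Sketch of the --mode option")
    parser.add_argument('-m', '--mode', type=str, default='continuous',
                        choices=['continuous', 'advance'],
                        help='continuous: run until all experiments are finished; '
                             'advance: advance each experiment one step, then exit')
    return parser


# Adjacent literals concatenate with no separator
assert 'are' 'finished.' 'advance:' == 'arefinished.advance:'

args = build_parser().parse_args(['--mode', 'advance'])
print(args.mode)
```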

@@ -175,7 +190,8 @@ def setup_logging(logfile: str = "log.run_WE2E_tests", debug: bool = False) -> N
     #Call main function

     try:
-        monitor_jobs(expts_dict,args.yaml_file,args.procs,args.debug)
+        monitor_jobs(expts_dict=expts_dict,monitor_file=args.yaml_file,procs=args.procs,
+                     mode=args.mode,debug=args.debug)
     except KeyboardInterrupt:
         logging.info("\n\nUser interrupted monitor script; to resume monitoring jobs run:\n")
         logging.info(f"{__file__} -y={args.yaml_file} -p={args.procs}\n")