Overhaul and consolidate WE2E tests, identify needed additional tests #587
@mkavulich @EdwardSnyder-NOAA -
These tests can be found: If you would like, I can commit these two new tests to @EdwardSnyder-NOAA's feature/rrfs-ensemble-bc branch so that they will be available once this work has been merged.
@MichaelLueken I'm fine with adding those tests with my #526 PR.
I'll be posting mildly unstructured thoughts here in the comments so they can hopefully be rolled into a more coherent proposal later. This one is about domains. There are several points I'd like to make about testing domains:
Test all domains?
There are 23 "pre-defined" domains in
The other 9 are untested:
Should we test all of these or remove them? I don't think there's any middle ground: either they are important enough to be kept (and therefore tested), or they are not, and should be removed.
Custom domains?
We currently have only a single custom domain, for the ESG grid. I wonder whether there is a variety of parameters we'd like to test with more custom grids. At the very least, I would recommend adding a custom domain in the southern hemisphere, since there are many opportunities for hemisphere-specific errors (a sketch of what such a grid definition might look like is included at the end of this comment).
Removing domains
Aside from my recommendation above (test or delete untested domains), is there any reason we are still including "GFDLgrid" as a valid grid type? From my understanding, this is an obsolete grid that we no longer support and should not waste resources testing. That would remove the following tests:
And the following pre-defined grids:
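As a rough illustration of the southern-hemisphere suggestion above, here is what such a custom grid definition could look like, written as a Python dict of ESGgrid-style parameters. The names follow the ESGgrid_* convention, but every value is invented for the example; treat this as a sketch, not a tested configuration.

```python
# Illustrative southern-hemisphere custom ESG grid. All values are
# placeholders chosen for the example, not a validated domain.
southern_hemisphere_grid = {
    "GRID_GEN_METHOD": "ESGgrid",
    "ESGgrid_LON_CTR": -65.0,   # central longitude over South America
    "ESGgrid_LAT_CTR": -35.0,   # negative latitude: southern hemisphere
    "ESGgrid_DELX": 25000.0,    # 25-km grid spacing, in meters
    "ESGgrid_DELY": 25000.0,
    "ESGgrid_NX": 200,          # number of cells in x
    "ESGgrid_NY": 200,          # number of cells in y
}
```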
Is anyone in the community using the SUBCONUS or compact domains? Should they be supported?
I believe the SUBCONUS grids are used for tutorials, as they are very small domains and can run quickly even at high resolution. The "compact" domains were made to fit inside the HRRR CONUS domain so that they can use HRRR output as input BCs. I can see some utility in keeping them, though I'm open to other opinions.
Another easy test improvement
There are two tests that take almost twice as long to finish as the next longest: grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 and grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16. We should find a way to reduce these times, whether by increasing the number of cores used, decreasing the forecast length, or some other change.
Testing "suites"When originally discussing testing strategies (see #277), the decision was made to maintain two "suites" of tests, with the following intended purposes:
The idea being that in addition to targeted tests for their changes, a developer could run the fundamental suite as a quick, consistent check of basic functionality on any platform. While I do see the advantages of the current test strategy, in my opinion it degrades the original use case of the fundamental suite. As a solution to this problem, I would like to propose the following list of suites going forward, returning the fundamental suite to its original purpose:
Default wall times
The history of default wall times is a bit all over the place. Originally it was much harder to specify them on a case-by-case basis, so the defaults were made very long to ensure jobs wouldn't time out. We now have a lot more flexibility, so we should revisit the topic.
Why do we care?
Theoretically, jobs are only charged for the amount of time actually used (at least, that is the case on Cheyenne and the NOAA RDHPCS platforms), so there is seemingly no harm in specifying extra-long wall times for tests. However, there are scenarios where unnecessarily long wall times can be inconvenient or even harmful. First, one of the failure modes of the UFS Weather Model is that when it experiences a fault it does not actually exit, but hangs, so the job wastes core hours on what is usually the most expensive task until the wall time expires. Second, paring down submission wall times should help jobs get into and out of the queue faster, since on most if not all machines, jobs with shorter wall times are prioritized.
What should we do?
Most tasks have relatively reasonable default wall times of 30 minutes or less. However, the RUN_FCST step has a very long default wall time of 4.5 hours, far longer than we need for testing basic functionality. As a starting point, I propose we modify all WE2E test files to reduce this wall time as much as possible: most tests should complete this step in less than 30 minutes, and certainly within an hour (a scripted sketch of this bulk change follows). This would have the added benefit of capping the theoretical wall time of the full set of WE2E tests. Many tests could have other steps shortened as well, such as WTIME_GET_EXTRN_LBCS and WTIME_GET_EXTRN_ICS for tests where the data is stored locally.
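A minimal sketch of how such a bulk edit could be scripted, assuming each WE2E test config is a YAML file with a top-level WTIME_RUN_FCST key; the key name, file glob, and directory path are assumptions for illustration, not the repository's actual layout:

```python
# Hypothetical helper: cap the RUN_FCST wall time across WE2E test configs.
from pathlib import Path
import yaml

MAX_WALLTIME = "01:00:00"  # proposed one-hour cap

def cap_walltimes(config_dir: str) -> None:
    for path in Path(config_dir).glob("config.*.yaml"):
        cfg = yaml.safe_load(path.read_text()) or {}
        # Zero-padded HH:MM:SS strings compare correctly as plain strings.
        current = str(cfg.get("WTIME_RUN_FCST", ""))
        if current > MAX_WALLTIME:
            cfg["WTIME_RUN_FCST"] = MAX_WALLTIME
            path.write_text(yaml.safe_dump(cfg, sort_keys=False))

cap_walltimes("tests/WE2E/test_configs")
```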
@mkavulich Here are some randomly arranged thoughts/questions on the issues raised above:
@gsketefian When adding new features, I agree it would be best practice to either add a new WE2E test so that the new functionality is exercised, or modify an existing WE2E test so that it also covers the new feature. Since the goal of this issue is ultimately to reduce the number of WE2E tests, I would like to stress that modifying existing WE2E tests to cover new features would be the best path.
Underutilized nodes
This has been brought up a time or two before, but it is worth revisiting because we now have the ability to do something about it. Currently, for all jobs except run_fcst, the number of cores per node is hard-coded rather than scaled to the number of cores available per node. Filling whole nodes (except on machines like Hera, where specifying a single core results in only partial use of a node) would be an easy way to slightly reduce cost; a sketch of the idea follows.
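For illustration only, this is the kind of scaling logic meant above; the CORES_PER_NODE table is a made-up stand-in for whatever each platform's machine file actually reports:

```python
# Sketch: derive processes-per-node from the platform's core count instead
# of hard-coding it. The CORES_PER_NODE values here are illustrative.
CORES_PER_NODE = {"hera": 40, "jet": 24, "cheyenne": 36}

def procs_per_node(machine: str, tasks_needed: int) -> int:
    """Fill a whole node where possible, without requesting idle slots."""
    return min(tasks_needed, CORES_PER_NODE[machine])

# A 12-task job on Cheyenne requests 12 slots; a 100-task job fills nodes.
assert procs_per_node("cheyenne", 12) == 12
assert procs_per_node("cheyenne", 100) == 36
```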
Cost and state of existing tests
The core-hour cost of all tests from example runs can be found in this spreadsheet: https://docs.google.com/spreadsheets/d/1npubj78GW1a3htk8Ksh_rDPiIRkjlaWNXo-gEQgpxXE/edit
Hera
Jet
Cheyenne
Plan
Here is an outline of the plan to overhaul, consolidate, improve, and expand the coverage of the WE2E test suites.
Stage 1: Consolidate existing tests
This step will involve combining tests of different parts of the system into a single test. For example, a test designed to exercise a particular grid and physics suite can have inline post added to it, rather than keeping a stand-alone inline-post test.
Stage 2: Pare down long/expensive tests
Ideally a WE2E test should take 30 minutes or less. Some tests are much longer than this and should be shortened as much as possible.
Stage 3: Overhaul suites
As per this comment above, restore the old behavior of the fundamental and comprehensive suites, and add a "coverage" test type. In addition, modify the "comprehensive" suite so that known failures for a given platform are omitted.
Stage 4: Add new capabilities/tests
Add untested options to existing tests where possible, and add new tests where necessary.
Unit tests
We are now at the point where we could be testing some actual workflow capabilities (not just workflow generation) using unit tests: for example, tests for retrieving specific data (such as the retrieve_data.py tests). As efforts to pythonize the workflow continue, we should lean more and more heavily on lightweight unit tests like these rather than bulky WE2E tests. A sketch of the pattern follows.
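As a rough, self-contained illustration of that pattern (not the repository's actual test code): fetch_files() below is a hypothetical stand-in for whatever entry point retrieve_data.py exposes, implemented as a simple file copy so the test runs on its own under pytest.

```python
# Sketch of a lightweight unit test for data retrieval. fetch_files() is a
# hypothetical stand-in for the real retrieval entry point; here it simply
# copies staged files matching a cycle so the test is runnable as-is.
import shutil
import tempfile
from pathlib import Path

def fetch_files(cycle: str, src: Path, dest: Path) -> list[Path]:
    """Hypothetical retrieval function: copy all files matching the cycle."""
    return [Path(shutil.copy(f, dest)) for f in src.glob(f"*{cycle}*")]

def test_retrieval_places_files_in_destination():
    with tempfile.TemporaryDirectory() as s, tempfile.TemporaryDirectory() as d:
        src, dest = Path(s), Path(d)
        (src / "gfs.t12z.atmf000.2022060112.nc").touch()  # fake staged input
        files = fetch_files("2022060112", src, dest)
        assert files, "retrieval returned no files"
        assert all(f.parent == dest for f in files)
```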
Update: rebasing the latest changes from develop onto the feature branch, with plans to open the PR this afternoon. Looks like significant cost savings for the comprehensive tests -- another 30% off the top.
Major takeaways from first PR (#686):
Test changes
Removed grid
Removed tests
Combined tests
In the cases where one test was absorbed into another, a symlink remains for the old test
Changed tests
Two new tests that had not previously been in the comprehensive suite are now added:
A number of tests still do not exist in the comprehensive suite simply because they have not been added yet. Most of these are tests known to fail on one or more platforms; in particular, many source their input data from HPSS. I will have these handled more gracefully in the second round of changes, so they can be included only on platforms where they are known to succeed. And there remains one test that is still left out of the comprehensive suite:
Several tests were modified to run quicker and more efficiently:
Additionally, the
This is the first set of changes overhauling the WE2E test suites. It represents stages 1 and 2 of the overhaul process as described in issue #587. In addition, some quality-of-life improvements to the WE2E test scripts are included.
#732 accomplishes most of the points mentioned in this issue. Still remaining are:
…rovements!) (#732)
This PR continues the overhaul of WE2E test suites as described in Issue #587 (specifically stage 3 and parts of stage 4 in this comment). The changes are summarized below, roughly in order of importance.
* "fundamental" tests are replaced by "coverage" test suites. "fundamental" tests are returned to their original purpose: a lightweight set of tests to be run the same on all platforms. "coverage" tests now evenly distribute all comprehensive tests across all platform/compiler combinations for use in Jenkins testing.
* The "comprehensive" test list is updated to include all tests (except current known failures). For platforms that have known failures (for example, HPSS tests on platforms without HPSS access), comprehensive.<platform>[.<compiler>] files are included to automatically run only the tests expected to succeed.
* Fix several existing failures:
  * Use the correct date format in grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
  * In config_parser.py, when populating a Jinja template, keep dates in string format rather than converting to a datetime object (this fixes a problem with get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS)
  * Fix the unit tests for retrieve_data.py; a bug caused all tests to run in nested subdirectories, which eventually led to failure when running all tests, including HPSS retrieval
* Remove several "get_from_HPSS" tests in favor of new unit tests for HPSS data in test_retrieve_data.yaml
* Add several more dates and data sources to the unit tests in test_retrieve_data.yaml
* The example config files in the ush/ directory (config.community.yaml and config.nco.yaml) are now included as WE2E tests (symbolically linked in the tests/WE2E/test_configs/default_configs/ directory)
* Remove the long-known failing test grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 (WE2E test "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16" fails with segmentation fault at run_fcst step #359). This is now an old capability with only legacy support (the global spectral model was retired in 2019), and there are no immediate plans to fix the bug.
* WE2E_summary*.txt files are now written to the experiment directory rather than to tests/WE2E
* Update data_locations.yaml for the latest RAP files on HPSS
* Reduce timeouts and delays between calls to wget to speed up remote data retrieval
* Increase maxtries to 2 for run_MET_GridStat_vx_APCP tasks, which occasionally fail at random; this mitigates the problem
* Swap the test of the restart capability from grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8 for coverage reasons
* Convert some print_info_msg messages to logging.debug calls to allow suppression of superfluous output if desired
* A few miscellaneous minor fixes to log messages
* Make docstring formatting more consistent
* Remove some outdated documentation on validating config.yaml
Description
It has been a while since we revisited the suite of WE2E tests. In that time, a lot of development has happened, and the way the various test suites are used has changed. We should assess the current state of the WE2E tests and discuss ways in which they can be improved.
This issue will be modified and expanded as the effort continues.
Examples to justify further work
Duplication
There are many examples of this, but to pick out just one: the following nine tests have only a single difference among them, the specific output grid being used:
These tests could be modified to test other capabilities, or be rolled into existing tests of other capabilities.
Missing capabilities
There are important capabilities either already present or being introduced soon that are not currently covered by any WE2E test:
Retrieving data from AWS (added in "[develop] Add GDAS and GEFS datasets to the workflow" #526)
Using GDAS and GEFS input data (added in "[develop] Add GDAS and GEFS datasets to the workflow" #526)
Some others may or may not be appropriate for WE2E testing:
Other changes
Solution
In this issue I will describe a strategy for consolidating the WE2E tests into (ideally) fewer tests with wider coverage. This will involve a more in-depth assessment of the issues mentioned above, proposing changes to the set of tests, and strategizing the best way to maintain the WE2E tests going forward. This issue will eventually contain a proposal for changes to be made. @mkavulich will take the lead on this effort, but others should feel free to add to this discussion and make modifications.
Related issues/PRs
Issues
specify_template_filenames does not actually test the specifying of template filenames #569
PRs