
[develop] Update WM and UPP hashes and minor rearrangement of WE2E coverage tests that fail on certain platforms #799

Merged

Conversation

@MichaelLueken (Collaborator) commented May 15, 2023

DESCRIPTION OF CHANGES:

  • The WM and UPP hashes in the Externals.cfg file were updated to the current versions (see the illustrative sketch after this list).
  • The nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km test, which consistently fails on Cheyenne GNU, and the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test, which fails on Hera Intel, have been moved to platforms where they run successfully.
  • Minor changes were made to DT_ATMOS for the RRFS_CONUS_25km grid to allow the WE2E tests to run after updating to the current UFS-WM hash. Please note that there is still an issue with the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test on Jet.
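
As context for the Externals.cfg bullet above (this is not a literal diff from this PR): Externals.cfg is an INI-style file consumed by manage_externals' checkout_externals, so "updating a hash" means changing the hash field of the corresponding section. The sketch below parses an illustrative entry with Python's configparser; the section layout follows the manage_externals convention, while the URL, path, and hash shown are placeholders rather than the commits actually pinned here.

```python
# Hedged sketch: parse an illustrative Externals.cfg-style entry.
# The layout follows the manage_externals convention; repo_url, local_path,
# and hash are placeholders, not the commits pinned by this PR.
import configparser

example_entry = """
[ufs-weather-model]
protocol = git
repo_url = https://github.com/ufs-community/ufs-weather-model
hash = 0123abc
local_path = sorc/ufs-weather-model
required = True
"""

cfg = configparser.ConfigParser()
cfg.read_string(example_entry)

# checkout_externals clones repo_url into local_path and checks out this commit.
print(cfg["ufs-weather-model"]["hash"])   # -> 0123abc (placeholder)
```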

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

TESTS CONDUCTED:

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental (coverage) test suite
  • comprehensive tests (specify which if a subset was used)

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes

@danielabdi-noaa (Collaborator) left a comment

LGTM

@mkavulich (Collaborator) left a comment

@MichaelLueken Sorry I am not very up-to-date on current development due to other obligations: what commit(s) resulted in the failure for the nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km and grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta tests? We shouldn't be allowing existing failures unless there is a very good reason for this.

This also reminds me of the issue around the known failure of grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta (#731). I didn't have time to attempt a fix for this, but I was at least able to point to the change that resulted in the failure. We need to have someone assigned to fix these failures prior to the release; it shouldn't be something we just sweep under the rug.

@MichaelLueken (Collaborator, Author)

> @MichaelLueken Sorry I am not very up-to-date on current development due to other obligations: what commit(s) resulted in the failure for the nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km and grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta tests? We shouldn't be allowing existing failures unless there is a very good reason for this.
>
> This also reminds me of the issue around the known failure of grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta (#731). I didn't have time to attempt a fix for this, but I was at least able to point to the change that resulted in the failure. We need to have someone assigned to fix these failures prior to the release; it shouldn't be something we just sweep under the rug.

@mkavulich No worries. The failures in nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km and grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta were first seen for the PRs that came in directly after PR #732 was merged - PRs #745 and #729, respectively.

The nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km test passes on any machine that isn't Cheyenne, so it might be a similar issue to what was encountered in PR #732 with the grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test. The failure is also very reminiscent - only failing intermittently in the run_post tasks.

The grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test is failing in the run_fcst step, similar to what you noted in issue #731 for the grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test.

Since this PR is fairly open-ended, I can certainly attempt to work through these failures as part of this PR, rather than comment them out in the coverage tests.

For instance, it's possible that reverting the modifications made to the model_configure file will allow both the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta and grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta tests to pass.

@mkavulich (Collaborator)

Thanks for the reply @MichaelLueken. I think it is okay to re-allocate tests to different machines if they are failing only on one or two (likely machine-specific memory/node problems), but these all-platform run_fcst failures are more serious. I apologize for bringing this up as a roadblock on your PR since you didn't introduce these failures, but I feel like it is very important to avoid setting the precedent of simply kicking the can down the road on test failures. At the very least, any known failure should be documented as a high-priority issue.

I'm curious to hear your thoughts on the best way forward here. I am willing to help investigate and open issues for existing failures if that would be helpful, but I do not have time to try to debug model segfaults and related mysterious errors caused by namelist/configure/OpenMP changes that I'm not very familiar with. If there is a solid plan with people assigned to solve the problem, I would be okay with temporarily removing these failing tests and merging this PR, but there must be a specific plan to resolve these failures soon, prior to the release, especially since the grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta domain is specifically used for the RRFS.

@MichaelLueken (Collaborator, Author)

@mkavulich It looks like the failure with the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test is due to changes from a separate PR in the ufs-weather-model. I'm currently testing the hash from #1720, and this test is still failing. This might be something that we can correct in the SRW App, but I still need to find which hash is causing the failure first.

… to pass with current UFS-WM hash, revert modulefiles/build_gaea_intel.lua to current develop, and comment out Gaea in Jenkinsfile.
@MichaelLueken (Collaborator, Author)

The modulefiles/build_gaea_intel.lua modulefile has been reverted to what is currently in develop. Once the weather model has merged @natalie-perlin's PR to use the latest hpc-stack location, a PR will be opened in the SRW App to make the same update. In the interim, Gaea has been deactivated in the Jenkinsfile (the SRW App is unable to build without an updated modulefile). I applied @mkavulich's fix to DT_ATMOS (reducing the value from 180 to 150) for the RRFS_CONUS_25km grid. Currently, the only test that is failing is the Jet test - grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR. I will continue to try to find the hash that causes this test to fail and will hopefully be able to correct it at that time.

@MichaelLueken MichaelLueken changed the title [develop] Update WM, UPP, and UFS_UTILS hashes, comment out WE2E coverage tests that consistently fail, and update Gaea modulefile [develop] Update WM and UPP hashes and minor rearrangement of WE2E coverage tests that fail on certain platforms May 26, 2023
@MichaelLueken (Collaborator, Author)

@mkavulich I have gone through all of the preceding hashes and it looks like the issue with the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test was brought about by the changes in UFS-WM hash 889254a and is associated with PR #1658. I will now look into what changes might be causing this test to fail. If you see anything in this PR that looks like the culprit, please let me know. Thank you very much!

@mkavulich (Collaborator)

Thanks @MichaelLueken for your continued work on this. I have some welcome (if weird) news! I made a run of the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR case on Jet with an increased timestep of DT_ATMOS=20 to see if I could get the failure to occur closer to the first output forecast hour (to see if I could directly observe some of the developing numerical instability). And yet when I did so, the case ran to completion! I also tried DT_ATMOS=30 and DT_ATMOS=45 and they both succeeded as well...so I think we can be comfortable increasing the default DT_ATMOS for that domain to either of those values! 30 is probably safer.

My strong suspicion is that the changes in ufs-community/ccpp-physics#43 by Joe Olson are responsible for the failures, since HRRR includes the MYNN-EDMF scheme that was heavily modified in that PR. It is very curious that increasing the time step would resolve this failure, I may drop Joe a note to see if this behavior makes any sense to him.
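
As a purely back-of-envelope aside (not from this thread), the advective CFL relation dt_max ≈ C·dx/u_max only bounds the timestep from above, which is part of why a larger DT_ATMOS succeeding where a smaller one fails reads as curious. The sketch below assumes a worst-case wind of 100 m/s and a 0.8 safety factor, and ignores FV3's acoustic sub-stepping and physics constraints, so treat it as order-of-magnitude context only.

```python
# Rough, non-authoritative advective CFL estimate: dt_max ~= C * dx / u_max.
# Assumptions (not from this PR): u_max = 100 m/s, safety factor C = 0.8.
# FV3's real stability limit also involves acoustic sub-steps and physics coupling.
U_MAX = 100.0    # m/s, assumed worst-case wind speed
C_SAFE = 0.8     # assumed safety factor

for name, dx_km in [("RRFS_CONUS_25km", 25.0),
                    ("RRFS_CONUS_13km", 13.0),
                    ("RRFS_AK_3km", 3.0)]:
    dt_max = C_SAFE * dx_km * 1000.0 / U_MAX
    print(f"{name}: dx = {dx_km:4.0f} km -> dt_max ~ {dt_max:.0f} s")

# Roughly 200 s at 25 km and ~25 s at 3 km: the same order as the DT_ATMOS
# values (150-180 s and 20-45 s) being discussed in this thread.
```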

@MichaelLueken (Collaborator, Author)

@mkavulich Very welcome news indeed! Thank you so much for assisting with trying to figure out what was causing issues with the updated WM hash! I've set the DT_ATMOS to 30 for this grid and have resubmitted the test. Once it is complete, I will push the change. Have a great weekend!

…llow grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test to run to completion on Jet.
@MichaelLueken (Collaborator, Author)

@mkavulich The change made to DT_ATMOS to allow the specify_template_filenames, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2, and GST_release_public_v1 tests to pass with the updated WM hash (decreasing the value from 180 to 150) is now causing the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test to fail on Cheyenne Intel. Values of 160 and 170 were also tested for this case, but both ultimately resulted in failure. Just wanted to let you know why this work hasn't been merged yet.

@mkavulich (Collaborator)

@MichaelLueken It doesn't make any sense at all that changing the time step would cause plotting to fail (even less sense than increasing time steps to solve a CFL violation; at least that is all related to the dynamical core!). If valid output is produced, then the plotting scripts should be able to plot it.

Does the test succeed if you manually change DT_ATMOS back to 180 for that plotting test?

@MichaelLueken (Collaborator, Author)

> @MichaelLueken It doesn't make any sense at all that changing the time step would cause plotting to fail (even less sense than increasing time steps to solve a CFL violation; at least that is all related to the dynamical core!). If valid output is produced, then the plotting scripts should be able to plot it.
>
> Does the test succeed if you manually change DT_ATMOS back to 180 for that plotting test?

@mkavulich I completely agree. Using DT_ATMOS of 150, everything successfully passes, except for the plotting task. If the rest of the tasks successfully complete, then the plotting script should be able to successfully plot. However, this isn't the case. Yes, changing DT_ATMOS back to 180 allows the plotting test to successfully run. When DT_ATMOS was set to 160 and 170, there were failures in the post tasks. Further decreasing DT_ATMOS to 140 causes the run_fcst job to fail. The failure in the plotting script using DT_ATMOS 150 appears to be related to a self-intersection point:

ERROR: TopologyException: side location conflict at -97.608377902731846 34.009777791998118
INFO: Self-intersection at or near point -97.878776042255282 34.00918460186481

Again, setting DT_ATMOS to 180 solves this failure. Of course, using this value also causes three tests to fail.
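
For readers unfamiliar with this class of error: "TopologyException: side location conflict" and "Self-intersection" are GEOS geometry-validity errors raised when a polygon ring crosses itself. The sketch below reproduces and repairs one with shapely purely as an illustration; it is not taken from the SRW plotting scripts and does not claim that this is where or how the fix should be applied.

```python
# Illustrative only: reproduce and repair a self-intersecting polygon with shapely.
# This is not the SRW plotting code; it just demonstrates the class of GEOS error above.
from shapely.geometry import Polygon
from shapely.validation import explain_validity, make_valid

# A "bowtie" ring that crosses itself at (1, 1).
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

print(bowtie.is_valid)            # False
print(explain_validity(bowtie))   # e.g. "Self-intersection[1 1]"

# make_valid() (shapely >= 1.8) splits the ring into valid pieces;
# buffer(0) is an older workaround with a similar effect.
repaired = make_valid(bowtie)
print(repaired.is_valid, repaired.geom_type)   # True MultiPolygon
```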

@mkavulich (Collaborator)

> Further decreasing DT_ATMOS to 140 causes the run_fcst job to fail.

This is problematic. It's similar to the AK domain problem, but in this case the suite isn't using the MYNN-EDMF scheme, so your guess is as good as mine.

The hacky solution would be to change back to DT_ATMOS=180 just for the plotting test. But I am worried that we are just masking potential problems here that users will start to run into if they make any modifications. Do you know if there is anyone on the weather model team more familiar with the dynamical core that can offer us any assistance here?
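
A purely speculative editorial aside, not raised anywhere in this thread: of the DT_ATMOS values tried above, only 150 and 180 divide evenly into an hourly (3600 s) output interval, so a model step never lands exactly on the hour for 140, 160, or 170. Whether that has anything to do with the run_fcst/run_post failures is unverified; the arithmetic below is only a cheap thing to rule out.

```python
# Speculative check (not from the thread): which candidate DT_ATMOS values divide
# evenly into an assumed hourly (3600 s) output interval?
OUTPUT_INTERVAL_S = 3600  # assumed hourly history/post output

for dt_atmos in (140, 150, 160, 170, 180):
    steps_per_hour = OUTPUT_INTERVAL_S / dt_atmos
    on_the_hour = OUTPUT_INTERVAL_S % dt_atmos == 0
    print(f"DT_ATMOS={dt_atmos:3d} s -> {steps_per_hour:5.2f} steps/hour, "
          f"exact hourly step: {on_the_hour}")

# Only 150 s and 180 s give an integer number of steps per hour.
```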

@MichaelLueken (Collaborator, Author)

@mkavulich I've reached out to Jong to see if he can recommend someone that can hopefully assist with this issue.

michelleharrold pushed a commit to michelleharrold/ufs-srweather-app that referenced this pull request Jun 7, 2023
* add metplus paths

* add run_vx.local for jet

Co-authored-by: Edward Snyder <Edward.Snyder@noaa.com>
@MichaelLueken (Collaborator, Author)

Following the discussion during the SRW CM meeting, I will update the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test config file to use DT_ATMOS = 180 and then open an issue documenting the failure.

@MichaelLueken (Collaborator, Author)

The Jenkins test on Cheyenne GNU successfully passed. Moving forward with merging this PR now. For more details on the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test failure, please see issue #827.

Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests