[develop] MET_verification: use modules met, metplus #826
@natalie-perlin Thanks for switching MET/METplus to the official software stacks. Couple of questions:
Thanks. |
@natalie-perlin Since this PR will correct the failure of the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 fundamental WE2E test on Cheyenne GNU, I will go ahead and close PR #820 at this time. |
The verification WE2E tests were run on Orion and all tests successfully passed:
The Cheyenne Intel WE2E coverage tests were manually run on Hera and all tests successfully passed:
And the Cheyenne GNU WE2E coverage tests were manually run on Hera and all tests successfully passed:
Awaiting completion of Jenkins tests now. |
The Hera GNU WE2E coverage test,
While changing from static libraries to HPC-stack modules, are you aware of any reason why these two tasks seem to now run longer for Hera GNU? I'm seeing this for both manual and automated Jenkins testing. Please see |
@MichaelLueken - |
I was able to successfully run all verification tests on Hera GNU, with the exception of
Since the rest of the tests run without issue, I'm inclined to agree that there might be an issue with the current |
@MichaelLueken - the issue with Hera gnu was with the installation of these older modules. Everything seems to be resolved after I reinstalled met/10.1.2 and metplus/4.1.3.
The last test, MET_ensemble_verification_only_vx_time_lag, is still running, but everything has run successfully so far:
On Cheyenne, the previous installations of met and metplus for GNU compilers were fully complete for met/11.0.2 and metplus/5.0.2; thus I reinstalled met/10.1.2 and metplus/4.1.3 for Cheyenne GNU:
Tests cannot be run, though, because the account is overspent. Please feel free to test it if needed. |
I'm resubmitting the Please let me know if you the two tasks that were still running for your test - |
@MichaelLueken - yes, these two tasks did indeed time out, even though the log files suggest they were running OK. I'm resubmitting these tasks and increasing the walltime to 2:00:00 from the original 1:00:00, modifying it directly in FV3LAM_wflow.xml. Comparing similar tasks between runs with the Intel compiler and runs with the GNU compiler does show a significant runtime increase, 2-fold and more than 4-fold. Note, for example, the tasks run_MET_GenEnsProd_vx_REFC, run_MET_EnsembleStat_vx_REFC, run_MET_GenEnsProd_vx_RETOP, run_MET_EnsembleStat_vx_RETOP:
|
The tasks run_MET_GenEnsProd_vx_SFC and run_MET_GenEnsProd_vx_UPA had the following runtimes for the Hera Intel compilers:
Anticipating a possible 4-fold increase, we would need ~6300 s, i.e. 1.75 hours; for a 4.6-fold increase, slightly over 7200 s (2 hours). |
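As a rough check of this estimate, the arithmetic can be sketched in Python. The Intel baseline of about 1575 s is not stated in this thread; it is an assumption back-calculated from the quoted figure (6300 s / 4), and the function names are hypothetical.

```python
def scaled_walltime_s(baseline_s, slowdown):
    """Return the expected runtime in seconds after applying a slowdown factor."""
    return baseline_s * slowdown


def to_hms(seconds):
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Assumption: baseline back-calculated from the quoted 6300 s at a 4-fold slowdown.
BASELINE_S = 6300 / 4

for factor in (4.0, 4.6):
    estimate = scaled_walltime_s(BASELINE_S, factor)
    print(f"{factor}x -> {estimate:.0f} s ({to_hms(estimate)})")
```

Under that assumed baseline, the 4-fold case fits within two hours, while the 4.6-fold case lands just above 7200 s, so a walltime limit somewhat above 2:00:00 leaves a margin.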
@MichaelLueken -
How should we proceed with this? |
@mkavulich - The tasks run_MET_GenEnsProd_vx_SFC and run_MET_GenEnsProd_vx_UPA running on Hera gnu/9.2 take well over the default one hour of walltime. |
It looks like going into
will increase the walltime for the The final section would look like:
Additionally, you will need to add the following to line 156:
Otherwise, the updated walltime will be used in the
Please see Please note that I'm still in the process of testing these changes, so I would recommend running the test with the modified |
My test has successfully completed:
The two tasks that were failing are now passing with the increased walltime from
Once you have verified that the modification works, please commit the change, push it to your branch, then I will relaunch the Hera Jenkins tests. |
…Hera+gnu/9.2 compilers
Merged the recent changes from develop into my branch and added the changes to ./parm/wflow/verify_ens.yaml. The workflow generated FV3LAM_wflow.xml with "wallclock 2:30" correctly, and I'm waiting for the test to finish. |
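One way to confirm that a generated FV3LAM_wflow.xml carries the intended limits is to list the walltime of each task. The sketch below assumes the Rocoto layout of `<task name="...">` elements containing a `<walltime>` child; the sample XML, its walltime values, and the helper name `task_walltimes` are hypothetical and stand in for the real generated file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a generated workflow file.
SAMPLE_XML = """<workflow>
  <task name="run_MET_GenEnsProd_vx_SFC">
    <walltime>02:30:00</walltime>
  </task>
  <task name="run_MET_GenEnsProd_vx_UPA">
    <walltime>02:30:00</walltime>
  </task>
  <task name="run_MET_EnsembleStat_vx_REFC">
    <walltime>01:00:00</walltime>
  </task>
</workflow>"""


def task_walltimes(xml_text):
    """Map each task name to the text of its <walltime> element."""
    root = ET.fromstring(xml_text)
    walltimes = {}
    # iter() also finds tasks nested inside other elements such as metatasks.
    for task in root.iter("task"):
        node = task.find("walltime")
        if node is not None and node.text:
            walltimes[task.get("name")] = node.text.strip()
    return walltimes


for name, limit in task_walltimes(SAMPLE_XML).items():
    print(f"{name}: {limit}")
```

For a real file, the text could be read from disk and passed to the same function; tasks without a `<walltime>` child are simply skipped.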
The Hera GNU WE2E coverage tests have been manually run and all tests have successfully passed:
Will requeue the automated tests on Hera to make sure they pass now as well. |
All the tasks succeeded using the updated branch:
So if no more questions, I guess this PR is ready to be merged. |
The automated Jenkins tests on Hera successfully passed GNU:
and Intel:
Moving forward with merging this work now. |
@gsketefian - The binaries that return errors are from ./met/10.1.2/bin/, and are grid_stat, pb2nc, pcp_combine (maybe more), so this must have something to do with the NetCDF files. |
@natalie-perlin As far as I know, in the SRW App the vx that MET/METplus does uses grib2, netcdf, and text formats. But I'm not an expert on METplus inner workings. What installation of MET/METplus on Derecho are you using? If it wasn't installed by someone on the METplus team, it might be worthwhile to contact them. They'll have more insight than me. I think Julie Prestopnik is usually the one who does installations. |
* Mods to METplus conf files: TCDC specifications, correction to level specifications in point-stat mean and prob files, and added functionality to make METplus output dirs in ex scripts.
* Updated comments in MET ex-scripts for creating output directories.
* Fixed minor formatting issue in exregional_run_gridstatvx.sh
DESCRIPTION OF CHANGES:
Use modules met and metplus installed in a software stack for MET verification tasks. Do not use explicitly set paths in machine files or config.yaml files. Use env. variables METPLUS_PATH, MET_INSTALL_DIR from the modulefiles. Retire MET_BIN_EXEC variables, use standard "bin". Modules met/10.1.2 and metplus/4.1.3 are used. Newer versions, met/11.0.2 and metplus/5.0.2 require additional code changes.
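As an illustration of the environment-variable approach described above, the following sketch resolves a MET executable from MET_INSTALL_DIR and the standard "bin" subdirectory. The helper name `met_executable` and the example install path are hypothetical; only the variable name and the "bin" convention come from this description.

```python
import os
from pathlib import Path


def met_executable(tool, env=None):
    """Build the path to a MET tool from MET_INSTALL_DIR and the standard bin dir."""
    env = os.environ if env is None else env
    install_dir = env.get("MET_INSTALL_DIR")
    if not install_dir:
        # The variable is expected to be set by loading the met module.
        raise RuntimeError("MET_INSTALL_DIR is not set; load the met module first")
    return Path(install_dir) / "bin" / tool


# Hypothetical install location, used only to show the resulting path.
example_env = {"MET_INSTALL_DIR": "/opt/met/10.1.2"}
print(met_executable("grid_stat", example_env))
```

Passing the environment as an argument keeps the helper testable without loading any modules; in a real job it would default to the process environment.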
Updated Orion stack location due to a mandatory transition to a new role account and space, under /work/noaa/epic/role-epic/. The new stack is under /work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/. SRW fundamental tests have been run successfully with the updated stack, as well as GSI regression tests and UFS-WM regression tests. PR-1846 to the weather model repo has been submitted: "Update orion stack path to use a new role-epic account location" (ufs-weather-model#1846).
Miniconda3 with all the environments has been installed in a new Orion location, under /work/noaa/epic/role-epic/contrib/orion/miniconda3/ and updated in modulefiles.
Type of change
TESTS CONDUCTED:
DEPENDENCIES:
The current PR resolves issue #863.
DOCUMENTATION:
Preliminary changes to documentation in files:
./docs/UsersGuide/source/ConfigWorkflow.rst
./docs/UsersGuide/source/RunSRW.rst
./docs/UsersGuide/source/VXCases.rst
ISSUE:
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR:
CONTRIBUTORS (optional):