[develop] Update modulefiles to use hdf/1.14.0 and netcdf/4.9.2-based software stacks on Tier 1 systems #889

Merged: 25 commits, Sep 15, 2023

Collaborator

@natalie-perlin natalie-perlin commented Aug 16, 2023

DESCRIPTION OF CHANGES:

  • Update SRW modulefiles to use hpc-stacks with higher versions of software modules on Tier 1 platforms, similar to those currently used for UFS-WM.
  • Update UFS_SRW_data location on Orion to use new dedicated role-epic space

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

UPDATE: Fundamental tests pass on all EPIC-accessible Tier-1 platforms (Hera, Gaea Intel/Gnu, Jet, Orion).
To mitigate data transfer task failures, the walltime request was increased and maxtries was set to 2.

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

This PR follows the UFS-WM advance to higher-version software modules, which allows fully coupled runs (S2SWA): ufs-community/ufs-weather-model#1745

DOCUMENTATION:

ISSUE:

This PR follows the UFS-WM advance to higher-version software modules, which allows fully coupled runs (S2SWA): ufs-community/ufs-weather-model#1745

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
    Subsequent spack-stack updates that will require more documentation updates are expected soon.
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test (not tested for Cheyenne - no allocation)
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@MichaelLueken MichaelLueken changed the title from "Update modulefiles to use hdf/1.14.0 and netcdf/4.9.2-based software stacks on Tier 1 systems" to "[develop] Update modulefiles to use hdf/1.14.0 and netcdf/4.9.2-based software stacks on Tier 1 systems" Aug 17, 2023
@natalie-perlin
Collaborator Author

Fundamental tests ran on Hera (intel, gnu), Orion, Jet, Gaea.
Attached are the summaries.
Hera, Gaea: Some workflow tasks failed after hitting the time limit. This was mitigated by increasing the walltime of the data transfer tasks (get_extrn_ics, get_extrn_lbcs) from 45:00 to 1:30:00 and setting maxtries=2.
For Gaea, a MET verification task failed after hitting the time limit, but completed successfully after being rewound and rerun.
The failed task is from the test grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16:

202105121200    run_MET_PcpCombine_fcst_APCP01h_mem001                   269326221           SUCCEEDED                   0         1          33.0
202105121200    run_MET_PcpCombine_fcst_APCP01h_mem002                   269326238                **DEAD**                 255         2         309.0
202105121200    run_MET_PcpCombine_fcst_APCP03h_mem001                   269326222           SUCCEEDED                   0         1          32.0

After rerunning, it completed successfully:

202105121200    run_MET_PcpCombine_fcst_APCP01h_mem001                   269326221           SUCCEEDED                   0         1          33.0
202105121200    run_MET_PcpCombine_fcst_APCP01h_mem002                   269326273           **SUCCEEDED**                   0         1          48.0
202105121200    run_MET_PcpCombine_fcst_APCP03h_mem001                   269326222           SUCCEEDED                   0         1          32.0
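
Spotting a DEAD task in a long status listing by eye is error-prone. A minimal sketch of how such a listing could be scanned, assuming each row has the seven whitespace-separated columns shown above (cycle, task, job id, state, exit status, tries, duration); the function name `find_unsuccessful_tasks` is hypothetical and not part of the SRW App or Rocoto:

```python
def find_unsuccessful_tasks(status_text):
    """Return (cycle, task, state) for every row whose state is not SUCCEEDED."""
    problems = []
    for line in status_text.splitlines():
        fields = line.split()
        # Assumed columns: cycle, task, job id, state, exit status, tries, duration.
        # Rows with a different field count (headers, separators) are skipped.
        if len(fields) != 7:
            continue
        cycle, task, _jobid, state, _exit_status, _tries, _duration = fields
        state = state.strip("*")  # drop the ** emphasis markers used in the comment
        if state != "SUCCEEDED":
            problems.append((cycle, task, state))
    return problems


sample = """\
202105121200  run_MET_PcpCombine_fcst_APCP01h_mem001  269326221  SUCCEEDED  0    1  33.0
202105121200  run_MET_PcpCombine_fcst_APCP01h_mem002  269326238  **DEAD**   255  2  309.0
202105121200  run_MET_PcpCombine_fcst_APCP03h_mem001  269326222  SUCCEEDED  0    1  32.0
"""

for cycle, task, state in find_unsuccessful_tasks(sample):
    print(f"{cycle} {task}: {state}")
```

Note that this sketch reports any state other than SUCCEEDED, so tasks that are still queued or running would be listed as well.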

Still looking into random failures in Jet testing.

Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin - Unfortunately, the Jenkins tests all failed at the Build phase when the Jenkins tests were submitted on Friday. The failure is due to the Jenkins build still building the GSI and rrfs_utils. Without ncdiag, neither the GSI nor rrfs_utils can build, which is what caused the failure. Please uncomment line 30 for now and ncdiag will be removed as part of @christinaholtNOAA's PR #893. Thanks!


load("ncdiag/1.1.1")
--load("ncdiag/1.1.1")
Collaborator

All of the Jenkins tests failed Friday night at the Build phase because GSI and rrfs_utils are still included in the Jenkins build. They will be removed as part of Christina's PR #893. Please reintroduce ncdiag until it is removed in PR #893.

keep ncdiag/1.1.1
update miniconda3 location in new role-epic space
update miniconda3 location in new role-epic space
@MichaelLueken
Collaborator

Thanks for updating the modulefiles, @natalie-perlin! Resubmitting the Jenkins tests now.

Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin - The Gaea tests are failing to build the GSI due to lines 19-24 in build_gaea_intel.lua having been removed. Please add these lines back so that the GSI and rrfs_utils can build on the machine. I'll queue up the Gaea tests for this PR once completed (they will run once the rest of the tests complete). Thanks!

modulefiles/build_gaea_intel.lua (outdated)
@MichaelLueken
Collaborator

Thanks, @natalie-perlin! I've added the Gaea test to the queue. Once the current tests complete, the Gaea test will be resubmitted.

@MichaelLueken
Collaborator

@natalie-perlin - The Orion automated WE2E coverage test, grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot, is failing with the following error message:

Traceback (most recent call last):
  File "/work/noaa/epic-ps/mlueken/ufs-srweather-app/scripts/exregional_plot_allvars.py", line 1419, in <module>
    main()
  File "/work/noaa/epic-ps/mlueken/ufs-srweather-app/scripts/exregional_plot_allvars.py", line 519, in main
    plot_all(dom)
  File "/work/noaa/epic-ps/mlueken/ufs-srweather-app/scripts/exregional_plot_allvars.py", line 554, in plot_all
    myproj = ccrs.LambertConformal(
TypeError: __init__() got an unexpected keyword argument 'secant_latitudes'

This failure occurs both in the Jenkins tests and when manually running the WE2E coverage tests on Orion. For the failure in the Jenkins tests, please see:
/work/noaa/epic/role-epic/jenkins/workspace/fs-srweather-app_pipeline_PR-889/expt_dirs/grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot
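
The TypeError above indicates that the installed projection class no longer accepts the `secant_latitudes` keyword. Assuming `ccrs` is the conventional alias for `cartopy.crs`, cartopy documents `standard_parallels` as the replacement for that keyword. The sketch below shows one way a script could pick whichever keyword the installed constructor accepts. It is an illustration only and not the change made in this PR; `parallels_kwarg`, `OldStyleProjection`, and `NewStyleProjection` are hypothetical stand-ins so that the example runs without cartopy installed.

```python
import inspect


def parallels_kwarg(projection_cls, parallels):
    """Pick the keyword this projection class accepts for its standard parallels."""
    accepted = inspect.signature(projection_cls).parameters
    # Prefer the newer keyword when both are available.
    for name in ("standard_parallels", "secant_latitudes"):
        if name in accepted:
            return {name: tuple(parallels)}
    raise TypeError(f"{projection_cls.__name__} accepts neither keyword")


# Hypothetical stand-ins for an older and a newer projection constructor.
class OldStyleProjection:
    def __init__(self, central_longitude=0.0, secant_latitudes=None):
        self.parallels = secant_latitudes


class NewStyleProjection:
    def __init__(self, central_longitude=0.0, standard_parallels=None):
        self.parallels = standard_parallels


for cls in (OldStyleProjection, NewStyleProjection):
    kwargs = parallels_kwarg(cls, (38.5, 38.5))
    projection = cls(central_longitude=-97.5, **kwargs)
    print(cls.__name__, kwargs, projection.parallels)
```

In a real script, the projection class passed in would be `ccrs.LambertConformal`; the latitude and longitude values here are placeholders.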

@MichaelLueken
Collaborator

Manual runs of the WE2E coverage tests have successfully passed on Hera Intel:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              23.40
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               5.45
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             761.96
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              13.75
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               5.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              12.46
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE               9.64
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               5.80
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             225.42
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             301.39
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             328.90
pregen_grid_orog_sfc_climo                                         COMPLETE               6.83
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1700.74
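
As a quick consistency check, the per-experiment core hours in a summary like the one above can be added up and compared with the reported total. A minimal sketch, assuming each data row has exactly three whitespace-separated fields (experiment name, status, core hours); `check_core_hours` is a hypothetical helper, not part of the WE2E tooling:

```python
def check_core_hours(table_text):
    """Return (sum of per-experiment core hours, reported total or None)."""
    hours = {}
    reported_total = None
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skips the header and the dashed separator lines
        name, _status, value = fields
        try:
            value = float(value)
        except ValueError:
            continue
        if name == "Total":
            reported_total = value
        else:
            hours[name] = value
    return round(sum(hours.values()), 2), reported_total


hera_summary = """\
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              23.40
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               5.45
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             761.96
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              13.75
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               5.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              12.46
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE               9.64
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               5.80
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             225.42
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             301.39
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             328.90
pregen_grid_orog_sfc_climo                                         COMPLETE               6.83
Total                                                              COMPLETE            1700.74
"""

computed, reported = check_core_hours(hera_summary)
print(f"computed={computed:.2f} reported={reported:.2f}")
```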

@natalie-perlin
Collaborator Author

natalie-perlin commented Sep 12, 2023

@MichaelLueken -
Oh, OK, the regional_workflow environment in miniconda3 had packages that were too recent (matplotlib/3.7.2) for the plotting scripts, which were written for matplotlib/3.5.2. Miniconda3 in the old role-epic-ps space on Orion had the older matplotlib/3.5.2. I corrected the remaining task files for Orion today, and the issue popped up.
(This is a known issue with the plotting scripts not working with newer matplotlib.)
Reinstalled regional_workflow with the exact same package list as in earlier installation. Previou

miniconda3.regional_workflow.install.log.txt
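
A version mismatch like this one can be caught before the plotting task runs by comparing the installed package version with the version the scripts were tested against. A minimal sketch using only the Python standard library; `version_tuple` and `check_pin` are hypothetical helpers, and the comparison only looks at the leading number in each dotted component:

```python
import re
from importlib import metadata


def version_tuple(version):
    """Convert a string such as '3.5.2' to (3, 5, 2) using leading digits only."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_pin(package, tested, installed=None):
    """Describe how the installed version compares with the tested one."""
    if installed is None:
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            return f"{package}: not installed"
    if version_tuple(installed) > version_tuple(tested):
        return f"{package}: {installed} is newer than the tested {tested}"
    return f"{package}: {installed} is not newer than the tested {tested}"


# Versions taken from the comment above; pass installed=None to query the
# active environment instead.
print(check_pin("matplotlib", "3.5.2", installed="3.7.2"))
```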

@MichaelLueken
Collaborator

@natalie-perlin - Thanks! I've queued up the Orion tests in Jenkins.

The Gaea tests continue to fail for the Build phase while linking the GSI executable:

ld: /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a(dtrti2.o): in function `mkl_lapack_dtrti2':
dtrti2_gen.f:(.text+0x25b): undefined reference to `mkl_blas_dscal'
ld: /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a(dsyev.o): in function `mkl_lapack_dsyev':
dsyev_gen.f:(.text+0x258): undefined reference to `mkl_lapack_dsytrd'
ld: /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a(dsteqr.o): in function `mkl_lapack_dsteqr':
dsteqr_gen.f:(.text+0xd1b): undefined reference to `mkl_lapack_dlasr3'

You can see the cmake output log - /lustre/f2/dev/wpo/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-889/build_intel/log.make.
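
When a link step fails with many messages like these, it can help to reduce the log to the distinct unresolved symbols. A minimal sketch, assuming the linker messages follow the "undefined reference to" format shown above; `undefined_symbols` is a hypothetical helper and the sample text is copied from the excerpt:

```python
import re

# Matches the symbol name quoted by the linker, e.g. `mkl_blas_dscal'
UNDEFINED_REF = re.compile(r"undefined reference to [`'](\w+)'")


def undefined_symbols(log_text):
    """Return the distinct unresolved symbol names found in a linker log."""
    return sorted(set(UNDEFINED_REF.findall(log_text)))


sample_log = """\
dtrti2_gen.f:(.text+0x25b): undefined reference to `mkl_blas_dscal'
dsyev_gen.f:(.text+0x258): undefined reference to `mkl_lapack_dsytrd'
dsteqr_gen.f:(.text+0xd1b): undefined reference to `mkl_lapack_dlasr3'
"""

for symbol in undefined_symbols(sample_log):
    print(symbol)
```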

@MichaelLueken
Collaborator

I've completed a quick run of the WE2E coverage tests on Orion and they all successfully pass now:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_SF_1p1km                                            COMPLETE             181.45
deactivate_tasks                                                   COMPLETE               1.89
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             753.93
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta   COMPLETE             286.54
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot        COMPLETE             140.53
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta            COMPLETE              15.46
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR              COMPLETE             487.46
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              33.02
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16    COMPLETE             274.88
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0         COMPLETE              15.08
nco                                                                COMPLETE               7.60
2020_CAD                                                           COMPLETE              32.32
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            2230.16

We'll see if the Jenkins tests also succeed.

@MichaelLueken
Collaborator

@natalie-perlin - The Orion Jenkins tests successfully passed as well! The only issue with this PR now is that the GSI and rrfs_utils are failing to build on Gaea using the updated intel/2023.1.0 stack. I believe that these will need to be added to build_gaea_intel.lua, rather than wflow_gaea.lua. I'll try testing this setup on Gaea and will let you know.

@mkavulich
Collaborator

@MichaelLueken @natalie-perlin Why are we spending time and resources to fix a problem with GSI on a single platform, given that PR #893 is removing that capability entirely?

@MichaelLueken
Collaborator

@mkavulich Fair point. Making this PR dependent on PR #893 makes sense (especially since the only failure now is with the GSI/rrfs_utils on Gaea) and relaunching the Gaea test once #893 has been merged.

@natalie-perlin No changes are required now. Once PR #893 has been merged, you will likely need to merge develop into your branch, but then I should be able to resubmit the Gaea tests without issue.

@mkavulich
Collaborator

Thanks @MichaelLueken, and sorry for the terse tone in my original message, this was posted before I had finished my coffee 😄

@natalie-perlin
Collaborator Author

@MichaelLueken @mkavulich -
My apologies for the mistake of not placing things in the same modulefile for Gaea!
This needs to be changed in either case.

@MichaelLueken
Collaborator

@natalie-perlin - PR #893 has been successfully merged. As expected, there is now a conflict in modulefiles/build_gaea_intel.lua. Please merge the latest develop branch into your update_modulefiles branch, address the noted conflict, then I can kick off the Gaea Jenkins test for the last time for this PR.

Additionally, should lines 16-19 (possibly 20) also be removed from modulefiles/wflow_gaea.lua? These lines appear to be similar to the lines that had been removed in modulefiles/build_gaea_intel.lua.

Thanks!

@natalie-perlin
Collaborator Author

@MichaelLueken - updated wflow_gaea.lua and merged with the updated develop branch

@MichaelLueken
Collaborator

@natalie-perlin - Thanks! The Gaea tests have been submitted. Once complete, I will merge this PR.

@MichaelLueken
Collaborator

The Jenkins WE2E coverage tests on Gaea have successfully passed! Merging now.

@MichaelLueken MichaelLueken merged commit c00ca73 into ufs-community:develop Sep 15, 2023
@natalie-perlin natalie-perlin deleted the update_modulefiles branch October 13, 2023 03:59
Labels
  • enhancement (New feature or request)
  • Priority: HIGH
  • run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Update SRW software stack versions to those used in UFS-WM
4 participants