Add Hercules support #1733

Merged

Conversation

ulmononian
Collaborator

@ulmononian ulmononian commented May 2, 2023

Description

MSU Hercules was recently made available for NOAA R&D use. Though it shares a file system with Orion, its system specs and software stack are significantly different. Therefore, to enable running the UFS-WM on Hercules, several of the RT configs/scripts need to be updated. A new ufs_hercules.intel.lua file is also required.

UPDATE: rocoto has been installed under contrib; cron services are available on hercules-login-1. Before that update, rocoto was not available on Hercules (at least not as a system default module), though some software was still being installed. An inquiry was sent as to whether there was a plan to install rocoto in the near term. Thus, for initial testing, the RTs could be run directly (without a workflow manager) or with ecflow.

The spack-stack/1.4.0 unified-env is already installed adjacent to the Orion spack-stack installations (/work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/envs/unified-env-v2/install/), so testing on Hercules can parallel spack-stack testing in PR #1707.

Note that windfall is currently the only QOS available, so SBATCH settings will need to be adjusted once all QOS options are opened.

Top of commit queue on: TBD

Input data additions/changes

  • No changes are expected to input data.
  • There will be new input data.
  • Input data will be updated.

Anticipated changes to regression tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:
    There are not currently standardized RTs for Hercules to compare against.

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

#1707
Will close #1732

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@ulmononian
Collaborator Author

err and out files from /work/noaa/stmp/cbook/stmp/cbook/FV3_RT/rt_838794/cpld_control_p8:

hercules_cpld_control_p8_err.txt
hercules_cpld_control_p8_out.txt

@ulmononian
Collaborator Author

based on efforts from @BijuThomas-NOAA and @christopherwharrop-noaa, it has been identified that rocoto is currently only functional on login node 1 (see #1732 for initial bug report). a ticket has been submitted by @christopherwharrop-noaa to install missing package dependencies on the other nodes.

@ulmononian
Collaborator Author

ulmononian commented May 11, 2023

some updates:

non-coupled test case control_c48 compiles and runs successfully with the following changes:

  • --mpi=pmi2 arg. addition to the srun command in the fv3_slurm.IN.hercules
  • using esmf/8.4.2 and mapl/2.35.2 (esmf/8.3.0b09 causes segfaults); ufs_common will probably need to be updated, but for now the newer esmf and mapl versions are just loaded in ufs_hercules.intel
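
As a rough sketch of the first change, a Slurm job card with the adjusted launch line might look like the following. Only `--mpi=pmi2` comes from this comment and the `windfall` QOS from the PR description. The helper name `write_job_card`, the task count, and the executable name are placeholders assumed for illustration, not the contents of fv3_slurm.IN.hercules.

```shell
#!/usr/bin/env bash
# Illustrative only: prints a minimal Slurm job card to stdout.
# write_job_card, the task count, and the executable path are hypothetical.
write_job_card() {
  local ntasks="$1" exe="$2"
  cat <<EOF
#!/bin/bash
#SBATCH --qos=windfall
#SBATCH --ntasks=${ntasks}
srun --mpi=pmi2 -n ${ntasks} ${exe}
EOF
}
```

For example, `write_job_card 4 ./fv3.exe > job_card` would write a four-task card to a file named `job_card`.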

cpld_control_noaero_p8 succeeds: /work2/noaa/epic-ps/cbook/stmp/cbook/FV3_RT/rt_358769/cpld_control_noaero_p8

cpld_control_p8 is currently failing midway through the run: /work2/noaa/epic-ps/cbook/stmp/cbook/FV3_RT/rt_354763/cpld_control_p8 (see screencap below). mapl and esmf debug versions will be added to spack-stack/1.3.1 on hercules for debugging purposes, as the failure immediately follows a gocart/mapl step.

Screen Shot 2023-05-11 at 3 50 56 PM

@ulmononian
Collaborator Author

ulmononian commented May 12, 2023

112/126 RT's pass w/ intel-based spack-stack/1.3.1. the failed tests were:

cpld_control_p8_mixedmode
cpld_control_p8
cpld_control_ciceC_p8
cpld_control_c192_p8
cpld_bmark_p8
cpld_debug_p8
cpld_control_p8_faster
hafs_regional_storm_following_1nest_atm_ocn_debug
datm_cdeps_mx025_cfsr
datm_cdeps_mx025_gefs
atmaero_control_p8
atmaero_control_p8_rad
atmaero_control_p8_rad_micro
regional_atmaq_debug

suite rundir: /work2/noaa/epic-ps/cbook/hercules_wm/FV3_RT/rt_1108733
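
A list like the one above could be pulled from a log with a short filter. This is only a sketch: it assumes failures are recorded one per line in the form `<test> <number> failed in <stage>` (the form echoed in the trace later in this conversation) and that those lines have been captured in a text file. The helper name `list_failed_tests` and the log path are hypothetical.

```shell
#!/usr/bin/env bash
# list_failed_tests is a hypothetical helper; pass it whatever file
# captured the "<test> <number> failed in <stage>" lines.
# Prints each failed test name once, sorted.
list_failed_tests() {
  awk '$3 == "failed" && $4 == "in" { print $1 }' "$1" | sort -u
}
```

Piping the result through `wc -l` gives the failure count to compare against the suite total.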

i installed mapl/esmf debug versions on hercules and re-ran cpld_control_p8 w/ them: /work2/noaa/epic-ps/cbook/hercules_wm/FV3_RT/rt_1126653. err backtrace snippet looks like:

Screen Shot 2023-05-11 at 10 31 54 PM

@jkbk2004

@christopherwharrop-noaa

FYI - I got a response from the Hercules team about installation of Ruby packages on other nodes. The team will take a look early next week. I don't anticipate any issues; I think they just forgot to follow up with the final step of propagating the packages to other nodes after I had verified my testing of Rocoto was successful.

@ulmononian ulmononian marked this pull request as ready for review May 31, 2023 15:25
@zach1221
Collaborator

I tried running cpld_control_nowave_noaero_p8 and cpld_debug_p8_intel at commit 80a2995. They failed with

+ cp '/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/input-data-20221101/FV3_fix/*.txt' .
cp: cannot stat '/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/input-data-20221101/FV3_fix/*.txt': No such file or directory
+ '[' 1 -eq 0 ']'
+ write_fail_test
+ echo 'cpld_control_nowave_noaero_p8_intel 001 failed in run_test'
+ exit 1

I got the same. Trying again now after Cameron's adjustment.
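
The trace above shows `cp` receiving the pattern unexpanded, which is consistent with the shell finding no match for `*.txt` (for instance because the input-data path did not resolve). A minimal pre-flight check along these lines could surface that before the copy. The helper name `has_txt_files` is hypothetical and not part of the RT scripts.

```shell
#!/usr/bin/env bash
# has_txt_files is a hypothetical helper, not part of the RT scripts.
# Succeeds only if DIR resolves (symlinks are followed) and holds
# at least one *.txt entry.
has_txt_files() {
  local dir="$1" f
  [ -d "$dir" ] || return 1
  for f in "$dir"/*.txt; do
    # With no match the pattern stays literal, so -e fails here.
    [ -e "$f" ] && return 0
  done
  return 1
}
```

A caller might use it as `has_txt_files "$dir" || echo "no fix files under $dir"`, where `$dir` is the FV3_fix directory from the trace.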

@DeniseWorthen
Collaborator

@ulmononian Can you fix the BM_IC-20220207 sym-link also?

@ulmononian
Collaborator Author

done.

@DeniseWorthen
Collaborator

@ulmononian Thanks for fixing the input directories. I was able to run two tests and they passed against the baselines in
/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230825

I'm assuming that was created on Hercules yesterday, not copied over from Orion.

@ulmononian
Collaborator Author

/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230825

yes, as far as i am aware, those are the baselines created on hercules.

@zach1221
Collaborator

/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230825

yes, as far as i am aware, those are the baselines created on hercules.

Yes, they were created on Hercules.

@zach1221
Collaborator

@ulmononian I realized that when I ran the cpld_control ORTs manually I used intel. I tried again with gnu and received the error below.
image

However, after following your recommendation (removing the line list(APPEND CDEPS_SHARE_DEFS "CPRGNU") in CDEPS-interface/CMakeLists.txt and CMEPS-interface/CMakeLists.txt and re-adding it immediately below if(CMAKE_Fortran_COMPILER_ID MATCHES "GNU")), the tests pass. Can you update this PR to include the change?

I think Fernando has had some regression tests fail with gnu on Hera as well, so this may resolve those too.
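
One way to confirm that the CPRGNU line ended up where the comment above describes is a small awk check. This is a sketch under assumptions: the helper name `cprgnu_is_guarded` is hypothetical, and "immediately below" is read strictly, so a blank line between the `if(...)` and the `list(...)` counts as a failure.

```shell
#!/usr/bin/env bash
# cprgnu_is_guarded is a hypothetical helper for checking the edit.
# Succeeds if FILE has at least one CPRGNU append and every one of them
# sits on the line right after the GNU compiler test.
cprgnu_is_guarded() {
  awk '
    /if\(CMAKE_Fortran_COMPILER_ID MATCHES "GNU"\)/ { after_if = 1; next }
    /list\(APPEND CDEPS_SHARE_DEFS "CPRGNU"\)/     { if (after_if) ok++; else bad++ }
    { after_if = 0 }
    END { exit !(ok > 0 && bad == 0) }
  ' "$1"
}
```

It could be run once per file, for example `cprgnu_is_guarded CDEPS-interface/CMakeLists.txt && echo guarded`.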

@zach1221
Collaborator

Looks like we're mostly done. I have Gaea running right now, since it was down yesterday. Should be finished soon.

@zach1221
Collaborator

Hey, @ulmononian. Can you update the PR template (.github/pull_request_template.md) to add Hercules as a machine name? I can make the change as well if you're ok with it.

@zach1221
Collaborator

@ulmononian Thank you! We're finished with testing. If you can please resolve the two conversations above, then we can begin the final review/merge process.

@ulmononian
Collaborator Author

ulmononian commented Sep 20, 2023

@ulmononian Thank you! We're finished with testing. If you can please resolve the two conversations above, then we can begin the final review/merge process.

done. thanks for testing, everyone!

Labels

  • jenkins-ci (Jenkins CI: ORT build/test on docker container)
  • No Baseline Change
  • Ready for Commit Queue (The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.)
Development

Successfully merging this pull request may close these issues.

Add support for MSU Hercules