Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Hercules Support #880

Merged

Conversation

DavidHuber-NOAA
Copy link
Collaborator

@DavidHuber-NOAA DavidHuber-NOAA commented Dec 5, 2023

DESCRIPTION OF CHANGES:

This ports the UFS_Utils repository to Hercules, including regression tests.

TESTS CONDUCTED:

If there are changes to the build or source code, the tests below must be conducted. Contact a repository manager if you need assistance.

  • Compile branch on all Tier 1 machines using Intel (Orion, Jet, Hera, Hercules, and WCOSS2). Done using c82d834.
  • Compile branch on Hera using GNU. Done using c82d834.
  • Compile branch in 'Debug' mode on WCOSS2. Done using c82d834.
  • Run unit tests locally on any Tier 1 machine. Done on Hercules using c82d834.
  • Run relevant consistency tests locally on Hercules. Done using c82d834. See below for details.

Regression test baseline data was copied from Orion. All tests passed except all cpld_gridgen tests and some chgres_cube tests. All differences from the Orion baseline were small and did not indicate a problem with the port to Hercules. New baseline will be created for Hercules.

DOCUMENTATION:

Updated documentation to include Hercules in the list of supported tier-1 platforms.

ISSUE:

Fixes #878

CONTRIBUTORS:

@BinLiu-NOAA @BijuThomas-NOAA

@DavidHuber-NOAA
Copy link
Collaborator Author

@GeorgeGayno-NOAA I have completed regression tests for this branch on Hercules against the Orion baseline. The only tests that failed belonged to chgres_cube and cpld_gridgen. This may be due to different operating systems and compilers between the two machines. Looking through the consistency logs, I didn't see any glaringly different values. Would you mind taking a look? They are in /work/noaa/global/dhuber/para/stmp/ufs_utils_rt/UFS_UTILS/reg_tests.

If you are happy with them, then could I ask that the new baseline and fix files be placed in /work/noaa/nems/role-nems/hercules/ufs_utils. I will then update the driver scripts and rerun RTs to verify consistency.

#-----------------------------------------------------------------------------
# Initialize C96 using spectral GFS sigio/sfcio files.
#-----------------------------------------------------------------------------

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test almost timed out for me. Can you boost the wall clock time by 5 minutes?

This test runs 5 minutes longer on Hercules compared to Orion. The slow down is in the read of the input atmospheric file. That file contains spectral coefficients that are converted to grid point space using the SP library. Using threads speeds up this conversion. In my experience, when this test runs slow it is because the SP library was not compiled with threads. May want to alert the libraries team.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted, thanks George.

@AlexanderRichert-NOAA would you be able to verify if the SP library was built with threading in spack-stack/1.5.0 on Hercules?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI Hercules is offline today for maintenance and I am off tomorrow. I will return to this on Monday.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this-- judging by the output at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/sp-2.3.3-f4hmlri/.spack/spack-build-out.txt, it appears it was built with openmp support (-DOPENMP=ON, and the -DOPENMP definition is in the compiler calls).

@DavidHuber-NOAA
Copy link
Collaborator Author

@AlexanderRichert-NOAA I am seeing some odd behavior from spack during CI concretization. First, I received an error from spack that sp with -precision specified. This returned the following error:

Cannot set variant 'precision' for package 'sp' because the variant condition cannot be satisfied for the given spec. You could consider setting `concretizer:unify` to `when_possible` or `false` to allow multiple versions of some packages.

I tried the suggestion and changed to unify=when_possible. This allow concretization to complete and the stack build to start, but eventually failed as both sp/2.5.0 and 2.3.3 were concretized, eventually resulting in the following errors:

2023-12-18T17:40:48.1432892Z ==> esmf: Successfully installed esmf-8.4.2-hz5hbcumsfmh63egqjvzvpo2zyadue7n
2023-12-18T17:40:48.1434340Z   Stage: 0.53s.  Edit: 0.00s.  Build: 8m 23.95s.  Install: 8.35s.  Post-install: 0.13s.  Total: 8m 33.05s
2023-12-18T17:40:48.1436579Z [+] /home/runner/work/UFS_UTILS/UFS_UTILS/spack/opt/spack/linux-ubuntu22.04-zen2/gcc-10.5.0/esmf-8.4.2-hz5hbcumsfmh63egqjvzvpo2zyadue7n
2023-12-18T17:40:48.3233603Z ==> Error: The environment view in /home/runner/work/UFS_UTILS/UFS_UTILS/spack/var/spack/environments/ufs_utils-env/.spack-env/view could not be created, because the following two specs project to the same prefix:
2023-12-18T17:40:48.3237261Z ==> Updating view at /home/runner/work/UFS_UTILS/UFS_UTILS/spack/var/spack/environments/ufs_utils-env/.spack-env/view
2023-12-18T17:40:48.3239697Z     sp@=2.3.3%gcc@=10.5.0~ipo~openmp+pic build_system=cmake build_type=Release generator=make arch=linux-ubuntu22.04-zen2, and
2023-12-18T17:40:48.3242427Z     sp@=2.5.0%gcc@=10.5.0~ipo~openmp+pic~shared build_system=cmake build_type=Release generator=make precision=4,d arch=linux-ubuntu22.04-zen2.
2023-12-18T17:40:48.3244005Z     To resolve this issue:
2023-12-18T17:40:48.3245432Z         a. use `concretization:unify:true` to ensure there is only one package per spec in the environment, or
2023-12-18T17:40:48.3246731Z         b. disable views with `view:false`, or
2023-12-18T17:40:48.3249899Z         c. create custom view projections.

Do you have any suggestions on how to resolve this?

@AlexanderRichert-NOAA
Copy link
Collaborator

Would it work to upgrade sp to 2.5.0 in ci/spack.yaml? Or does it need to be 2.3.3?

@DavidHuber-NOAA
Copy link
Collaborator Author

@AlexanderRichert-NOAA I will give that a try. Thanks.

@DavidHuber-NOAA
Copy link
Collaborator Author

Looks like that worked. Thanks again @AlexanderRichert-NOAA.

@DavidHuber-NOAA DavidHuber-NOAA marked this pull request as ready for review December 18, 2023 20:21
@DavidHuber-NOAA
Copy link
Collaborator Author

@GeorgeGayno-NOAA I have rerun the regression tests on Hercules against the baseline. A couple of issues were noted:

  1. The chgres_cube test case c192_gfs_grib2 did not have baseline data available
  2. Seemingly large errors seem to be present in the c96_fv3_netcdf2wam chgres_cube test case in variables zh, t, spo, and spo2
  3. Relatively small errors were noted in several other chgres_cube and cpld_gridgen tests, though I don't think these are concerning

Would you mind taking a look at the c96_fv3_netcdf2wam test case and seeing if differences are problematic? The log file is /work/noaa/global/dhuber/para/stmp/ufs_utils_rt/UFS_UTILS/reg_tests/chgres_cube/consistency.log07. If everything is OK, could you update the baselines for chgres_cube and cpld_gridgen?

Lastly, I have marked this PR ready for review now that the CI issues have been resolved.

@GeorgeGayno-NOAA
Copy link
Collaborator

@GeorgeGayno-NOAA I have rerun the regression tests on Hercules against the baseline. A couple of issues were noted:

  1. The chgres_cube test case c192_gfs_grib2 did not have baseline data available

Sorry, I just added the baseline data for this test. Try again.

  1. Seemingly large errors seem to be present in the c96_fv3_netcdf2wam chgres_cube test case in variables zh, t, spo, and spo2
  2. Relatively small errors were noted in several other chgres_cube and cpld_gridgen tests, though I don't think these are concerning

Would you mind taking a look at the c96_fv3_netcdf2wam test case and seeing if differences are problematic? The log file is /work/noaa/global/dhuber/para/stmp/ufs_utils_rt/UFS_UTILS/reg_tests/chgres_cube/consistency.log07. If everything is OK, could you update the baselines for chgres_cube and cpld_gridgen?

Yes, I can take a look at that test.

Lastly, I have marked this PR ready for review now that the CI issues have been resolved.

@GeorgeGayno-NOAA
Copy link
Collaborator

Here are the results of the c96_fv3_netcdf2wam test for tile 5. Other tiles showed similar differences. Check to see if these are reasonable.

+ nccmp -dmfqS out.atm.tile5.nc /work/noaa/nems/role-nems/ufs_utils.hercules/reg_tests/chgres_cube/baseline_data/c96_fv3_netcdf2wam/out.atm.tile5.nc
Variable Group Count          Sum      AbsSum          Min         Max       Range         Mean      StdDev
zh       /       130     0.015625     2.45312     -0.03125      0.0625     0.09375  0.000120192   0.0208178
t        /       116 -0.000305176   0.0107422 -0.000549316 0.000244141 0.000793457 -2.63083e-06 0.000123296
spo      /       683 -1.16251e-05  4.3446e-05 -5.78165e-06  8.9407e-07 6.67572e-06 -1.70206e-08 3.52063e-07
spo2     /      1761 -1.19589e-06 2.89326e-05 -4.09782e-07 4.91738e-07  9.0152e-07 -6.79098e-10 3.25827e-08

@GeorgeGayno-NOAA
Copy link
Collaborator

Here are the results of the c96_fv3_netcdf2wam test for tile 5. Other tiles showed similar differences. Check to see if these are reasonable.

+ nccmp -dmfqS out.atm.tile5.nc /work/noaa/nems/role-nems/ufs_utils.hercules/reg_tests/chgres_cube/baseline_data/c96_fv3_netcdf2wam/out.atm.tile5.nc
Variable Group Count          Sum      AbsSum          Min         Max       Range         Mean      StdDev
zh       /       130     0.015625     2.45312     -0.03125      0.0625     0.09375  0.000120192   0.0208178
t        /       116 -0.000305176   0.0107422 -0.000549316 0.000244141 0.000793457 -2.63083e-06 0.000123296
spo      /       683 -1.16251e-05  4.3446e-05 -5.78165e-06  8.9407e-07 6.67572e-06 -1.70206e-08 3.52063e-07
spo2     /      1761 -1.19589e-06 2.89326e-05 -4.09782e-07 4.91738e-07  9.0152e-07 -6.79098e-10 3.25827e-08

These differences are small and don't indicate a problem with the port. First, the number of points that have non-zero differences is a very small part of the total (each tile has 99x96x150 - or 1.3 million - points). The 'zh' differences appear to be 'large' but the differences are largest near the top of the atmosphere, where 'zh' is on order of 520,000 meters. So, these differences are relatively small.

The chgres_cube program is running correctly on Hercules.

@DavidHuber-NOAA
Copy link
Collaborator Author

Great, thanks for checking @GeorgeGayno-NOAA and for pushing the baselines to an official location. I just did one more sweep of my changes and cleaned up a little on top of your suggested ulimit. Since this doesn't change any code, does it need to be RT'd or unit tested on the other HPCs?

@GeorgeGayno-NOAA
Copy link
Collaborator

Great, thanks for checking @GeorgeGayno-NOAA and for pushing the baselines to an official location. I just did one more sweep of my changes and cleaned up a little on top of your suggested ulimit. Since this doesn't change any code, does it need to be RT'd or unit tested on the other HPCs?

No, we can just run the testing on Hercules. However, I will do a quick check to make sure it compiles on all machines.

@GeorgeGayno-NOAA GeorgeGayno-NOAA self-requested a review December 19, 2023 20:48
@GeorgeGayno-NOAA GeorgeGayno-NOAA merged commit 9bba725 into ufs-community:develop Dec 19, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable support on Hercules
3 participants