
Update Orion build to Rocky 9 #754

Closed
RussTreadon-NOAA opened this issue Jun 10, 2024 · 31 comments · Fixed by #764

@RussTreadon-NOAA
Contributor

Received the following from RDHPCS Management

Orion’s Operating System (OS) and software stack are scheduled to be upgraded during a two-day downtime, starting on Wednesday, June 12th and going through Thursday, June 13th. The OS on Orion will be upgraded from CentOS 7 to Rocky 9, another derivative of Red Hat Linux.

This issue is opened to document the update of modulefiles/gsi_orion.intel.lua for Rocky 9.

@RussTreadon-NOAA
Contributor Author

@DavidHuber-NOAA , what do we know about the Orion Rocky 9 transition for spack-stack? Will Orion Rocky 9 builds use the same spack-stack as Hercules Rocky 9?

@DavidHuber-NOAA
Collaborator

@RussTreadon-NOAA Spack-stack will be rebuilt for Orion after the upgrade to Rocky 9. I'm not sure if we will need to update our paths in any of the module files, but we will need to update compiler version numbers. Also, I suspect that we will need to unset I_MPI_EXTRA_FILESYSTEM, just as we do for Hercules, in ush/sub_orion.
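
For reference, a minimal sketch of the Hercules-style workaround as it would appear in ush/sub_orion (the exact placement within that script is an assumption):

# Intel MPI's native filesystem support has caused problems on these systems;
# disable it, as is already done for Hercules.
unset I_MPI_EXTRA_FILESYSTEM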

@DavidHuber-NOAA
Collaborator

It looks like the spack-stack installs will be tracked by JCSDA/spack-stack#981.

@RussTreadon-NOAA
Contributor Author

Thank you @DavidHuber-NOAA for the detailed information. We will track spack-stack #981.

@emilyhcliu
Contributor

I updated the ORION modules for GSI
Here is the branch: https://github.com/emilyhcliu/GSI/tree/feature/orion_modules
develop...emilyhcliu:GSI:feature/orion_modules

The GSI compiled successfully with the updated modules.

RussTreadon-NOAA added a commit to RussTreadon-NOAA/GSI that referenced this issue Jun 20, 2024
@RussTreadon-NOAA
Contributor Author

Thank you @emilyhcliu for your note. We updated gsi_orion.intel.lua in PR #758 in order to run ctests on Orion. We opted for a simpler change than yours: no change was made to the bufr version. The bufr 12 update is the subject of issue #642.

RussTreadon-NOAA added a commit to RussTreadon-NOAA/GSI that referenced this issue Jun 20, 2024
@RussTreadon-NOAA
Contributor Author

WARNING
See comment in GSI PR #758.

Updating gsi_orion.intel.lua is not sufficient. The CRTM coefficient files in the CRTM_FIX associated with the Orion crtm-fix/2.4.0.1_emc differ from the coefficient files in the CRTM_FIX associated with the Hercules crtm-fix/2.4.0.1_emc. This is not correct.

@RussTreadon-NOAA
Contributor Author

@TingLei-NOAA , given the Hercules HelpDesk inquiry (RDHPCS#2024020854000112) would you be willing to update gsi_orion.intel.lua in a working copy of develop and then run the test they asked you to run?

Emily provides a link above to her branch with Orion Rocky 9 updates. The compare view (develop...emilyhcliu:GSI:feature/orion_modules) shows exactly what she changed.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA Thanks. I will follow the link you gave to Emily's branch to set up a working copy.

@TingLei-NOAA
Contributor

An update: I started from Emily's branch and first tested the modules in https://github.com/emilyhcliu/GSI/blob/feature/orion_modules/modulefiles/gsi_orion.intel.lua.
After changing python to a newer version, I found some issues, for example:

tlei$ module spider stack-intel/2021.9.0

-----------------------------------------------------------------------------------------------------------------------------------------------------------
  stack-intel: stack-intel/2021.9.0
-----------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      stack-intel compiler family and module access


    You will need to load all module(s) on any one of the lines below before the "stack-intel/2021.9.0" module is available to load.

      gsi_hercules.intel

    Help:

It seems something with spack-stack is not set up correctly somewhere. I have opened a ticket with the Orion helpdesk.
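
For reference: if the Rocky 9 hierarchy is not on MODULEPATH, Lmod resolves stack-intel through whatever modulefile last advertised it (here, gsi_hercules.intel). A minimal sketch of making the Rocky 9 tree visible first, using the path posted below by @aerorahul (the stack-intel version is assumed unchanged):

# Put the Rocky 9 gsi-addon environment on MODULEPATH before loading the
# compiler meta-module.
module use /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9/install/modulefiles/Core
module load stack-intel/2021.9.0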

@aerorahul
Contributor

SS-1.6.0 on Orion Rocky 9 w/ gsi-addon-env
/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9/install/modulefiles/Core

Any idea when the GSI will be updated to use this on Orion?

@DavidHuber-NOAA
Collaborator

One of the CRTM fix files is incorrect and needs to be updated on Orion. I have opened issue JCSDA/spack-stack#1158 to correct this.

@RussTreadon-NOAA
Contributor Author

Thank you @DavidHuber-NOAA for reporting the Orion Rocky 9 spack-stack 1.6.0 crtm issue.

@RussTreadon-NOAA
Contributor Author

@TingLei-NOAA , what is the status of your Orion Rocky 9 GSI tests?

@DavidHuber-NOAA
Collaborator

The CRTM-fix file has been corrected, so the GSI should be good to proceed.

@RussTreadon-NOAA
Contributor Author

I updated the ORION modules for GSI Here is the branch: https://github.com/emilyhcliu/GSI/tree/feature/orion_modules develop...emilyhcliu:GSI:feature/orion_modules

The GSI compiled successfully with the updated modules.

@emilyhcliu , after you built gsi.x and enkf.x did you run either on Orion? I'm finding that gsi.x runs much slower on Orion than it does on Hercules. In fact, the 10-minute wall clock limit for global_4denvar is not sufficient on Orion: gsi.x does not even reach the minimization for the first outer loop within that limit. The loproc global_4denvar run finishes in about 435 seconds, a bit more than 7 minutes, on Hercules.

@aerorahul
Contributor

I updated the ORION modules for GSI Here is the branch: https://github.com/emilyhcliu/GSI/tree/feature/orion_modules develop...emilyhcliu:GSI:feature/orion_modules
The GSI compiled successfully with the updated modules.

@emilyhcliu , after you built gsi.x and enkf.x did you run either on Orion? I'm finding that gsi.x runs much slower on Orion than it does on Hercules. In fact, the 10-minute wall clock limit for global_4denvar is not sufficient on Orion: gsi.x does not even reach the minimization for the first outer loop within that limit. The loproc global_4denvar run finishes in about 435 seconds, a bit more than 7 minutes, on Hercules.

I notice @emilyhcliu is also updating to bufr 12. Is that a source of the slowdown?

@DavidHuber-NOAA
Collaborator

Also, I note that @emilyhcliu is pointing to the CentOS spack-stack build. It should be

prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9/install/modulefiles/Core")

@RussTreadon-NOAA
Contributor Author

@aerorahul , thank you for noting this. I did not include @emilyhcliu's bufr/12 change since it would affect all machines on which gsi.x and enkf.x are built. I stuck with bufr/11.7.

I wonder if some of the Rocky 9 modules were built with debug options or lower levels of optimization. Alternatively, there could be environment variables we need to add, remove, or change to efficiently run gsi.x on Orion Rocky 9.
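
A few run-time knobs that could be tested along these lines; these are common Intel MPI tuning variables, not settings confirmed for Orion Rocky 9:

# Hypothetical diagnostics/workarounds for the slowdown; none are confirmed fixes.
unset I_MPI_EXTRA_FILESYSTEM      # the known Hercules workaround
export I_MPI_DEBUG=5              # print fabric/provider selection at startup
export I_MPI_FABRICS=shm:ofi      # pin the intra-/inter-node fabric explicitly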

@RussTreadon-NOAA
Contributor Author

Thank you @DavidHuber-NOAA . I checked my local copy of gsi_orion.intel.lua. It has

prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9/install/modulefiles/Core")

I am working in /work/noaa/da/rtreadon/git/gsi/develop.

@RussTreadon-NOAA
Contributor Author

Orion crtm-fix/2.4.0.1_emc sets CRTM_FIX as

setenv("CRTM_FIX","/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-qls55kd/fix")

whereas Hercules crtm-fix/2.4.0.1_emc sets CRTM_FIX as

setenv("CRTM_FIX","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-2os2hw2/fix")

The CRTM coefficients in these two CRTM_FIX directories should be identical. This is not the case. Files differ between the two directories.

orion-login-3:/work/noaa/da/rtreadon/git/gsi/develop$ diff -r /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-2os2hw2/fix/ /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-qls55kd/fix/ | grep differ | wc
    445    2670  158858
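
A variant of the same comparison that lists the differing filenames rather than just counting them (a sketch using the same two directories):

diff -rq /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-2os2hw2/fix/ /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-qls55kd/fix/ | awk '/differ/ {print $2}'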

For example, the Orion and Hercules abi_gr.TauCoeff.bin files are not the same size

orion-login-3:/work/noaa/da/rtreadon/git/gsi/develop$ ls -l /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-2os2hw2/fix/abi_gr.TauCoeff.bin /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-qls55kd/fix/abi_gr.TauCoeff.bin
-rw-r--r-- 1 role-epic epic  10972 Dec 20  2023 /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-2os2hw2/fix/abi_gr.TauCoeff.bin
-rw-r--r-- 1 role-epic epic 184588 Jun 14 16:06 /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-qls55kd/fix/abi_gr.TauCoeff.bin

The global_4denvar_loproc_updat test run on Hercules completes in 434.027436 seconds. The same test run on Orion takes 928.665578 seconds. Comparison of the initial penalties shows identical printed values for all observation types except radiances.

Orion initial penalties

     J term                                     J
excess   moisture            3.4379735437079659E-02
surface pressure             2.2060242227672843E+04
temperature                  1.0481390490646166E+05
wind                         2.4982063813539472E+05
moisture                     8.8567148061502121E+03
sst                          8.4144309801987820E+03
ozone                        7.4448335047401797E+03
gps bending angle            3.1654965732471156E+05
radiance                     2.7518804758077506E+04
tcp (tropic cyclone)         1.5579209582499374E+02
 -----------------------------------------------------
 J Global                    7.4563505311896792E+05

Hercules initial penalties

     J term                                     J
excess   moisture            3.4379735437079659E-02
surface pressure             2.2060242227672843E+04
temperature                  1.0481390490646166E+05
wind                         2.4982063813539472E+05
moisture                     8.8567148061502121E+03
sst                          8.4144309801987820E+03
ozone                        7.4448335047401797E+03
gps bending angle            3.1654965732471156E+05
radiance                     2.8224888807722826E+04
tcp (tropic cyclone)         1.5579209582499374E+02
 -----------------------------------------------------
 J Global                    7.4634113716861326E+05

Something isn't right on Orion.

@RussTreadon-NOAA
Contributor Author

Rerun the Orion case using the Hercules CRTM coefficients. With this change the initial radiance penalty matches the Hercules run. However, Orion gsi.x still runs about 2x slower.
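
Since the Hercules install tree is visible from Orion (as the diff above shows), the rerun can be set up by re-exporting CRTM_FIX after the module load. A minimal sketch; how this is wired into the run script is an assumption:

module load crtm-fix/2.4.0.1_emc
# Override the Orion module's setting with the Hercules fix directory.
export CRTM_FIX=/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-2os2hw2/fix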

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Jun 26, 2024

@RussTreadon-NOAA This is interesting/worrisome RE the CRTM fix files. Orion's unified-env (CentOS) crtm-fix file set matches the Orion unified-env-rocky9 crtm-fix file set (with one exception arising from a missing file in unified-env-rocky9). This implies that Orion has always had a different fix file set than Hercules.

> pwd
/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-qls55kd/fix
> for file in *; do
>    cmp $file /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env/install/intel/2022.0.2/crtm-fix-2.4.0.1_emc-ezbyzji/fix/$file
> done
>
> for file in /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env/install/intel/2022.0.2/crtm-fix-2.4.0.1_emc-ezbyzji/fix/*; do
>   f=$(basename $file)
>   cmp $f $file
> done
cmp: amsua_metop-c.SpcCoeff.noACC.bin: No such file or directory

The missing amsua_metop-c.SpcCoeff.noACC.bin file is the original amsua_metop-c.SpcCoeff.bin that was installed via spack-stack and replaced via JCSDA/spack-stack#1158. amsua_metop-c.SpcCoeff.bin should have been copied to amsua_metop-c.SpcCoeff.noACC.bin before it was replaced.
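
The preservation step described above would look roughly like this (paths from this thread; it requires write access to the role-epic installation):

cd /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-qls55kd/fix
cp amsua_metop-c.SpcCoeff.bin amsua_metop-c.SpcCoeff.noACC.bin   # keep the original before replacing it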

Performing the commands

> for file in /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env/install/intel/2022.0.2/crtm-fix-2.4.0.1_emc-ezbyzji/fix/*; do
>   f=$(basename $file)
>   cmp $file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/crtm-fix-2.4.0.1_emc-2os2hw2/fix/$f
> done

produces a list of 445 files that differ. I'll report these findings to JCSDA. I will suggest that they compare the fix files on Orion and Hercules with those in production on WCOSS2 to verify which are correct.
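
One way to run the suggested three-way comparison, since WCOSS2 does not share a filesystem with Orion or Hercules, is to compare checksum manifests. A sketch, assuming md5sum is available on each machine:

# On each platform, from its CRTM_FIX directory:
md5sum * | sort -k2 > crtm_fix.$(hostname -s).md5
# Gather the manifests in one place, then diff pairwise:
diff crtm_fix.orion.md5 crtm_fix.hercules.md5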

@RussTreadon-NOAA
Contributor Author

Build GSI develop at 59d7578 on Orion in /work/noaa/da/rtreadon/git/gsi/develop with the following local modifications

        modified:   modulefiles/gsi_orion.intel.lua
        modified:   regression/regression_param.sh
        modified:   regression/regression_var.sh
        modified:   ush/detect_machine.sh
        modified:   ush/sub_orion
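
The build itself follows the standard workflow; a minimal sketch (cmake options and batch submission details are omitted, and the parallel ctest invocation is an assumption):

cd /work/noaa/da/rtreadon/git/gsi/develop
./ush/build.sh
cd build
ctest -j 7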

Run ctests with the following results

Test project /work/noaa/da/rtreadon/git/gsi/develop/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #3: rrfs_3denvar_glbens ..............   Passed  729.75 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  847.45 sec
3/7 Test #2: rtma .............................   Passed  1631.39 sec
4/7 Test #7: global_enkf ......................   Passed  1772.05 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  2789.56 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  3090.37 sec
7/7 Test #1: global_4denvar ...................   Passed  3706.65 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 3707.13 sec

These run times are noticeably higher than what was observed when Orion ran CentOS 7. For example, PR #746 contains the following Orion ctest timings

Test project /work2/noaa/da/rtreadon/git/gsi/pr746/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  602.82 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  606.41 sec
3/7 Test #7: global_enkf ......................   Passed  788.65 sec
4/7 Test #2: rtma .............................   Passed  1088.10 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1161.37 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1461.75 sec
7/7 Test #1: global_4denvar ...................   Passed  1743.04 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 1743.05 sec

@KateFriedman-NOAA , @DavidHuber-NOAA , and @aerorahul : have any other g-w components reported increased wall times on Orion Rocky 9?

@DavidHuber-NOAA
Collaborator

@RussTreadon-NOAA The only other component that I know of that has run its tests is UFS_UTILS (ufs-community/UFS_UTILS#966). No wall times had to be increased for that PR.

@RussTreadon-NOAA
Contributor Author

Thank you @DavidHuber-NOAA for your reply. I don't know what to try next or who to contact. Does the spack-stack team run unit tests for their installations? It would be good to get confirmation that the Orion Rocky 9 modules run as fast as their CentOS 7 counterparts.

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Jun 26, 2024

Agreed.

@AlexanderRichert-NOAA we are seeing significant slowdowns in the GSI on Orion after the OS upgrade to Rocky-9 and also significantly slower runtimes compared to Hercules. Do you know if tests were run on the spack-stack libraries and/or if they are not as optimized as they were under CentOS?
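
One low-effort check along these lines: spack keeps the build log for each package under its install prefix, so the optimization flags actually used can be grepped out. A sketch; the log layout and the crtm prefix glob are assumptions:

prefix=/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9/install/intel/2021.9.0
# Count occurrences of each -O level recorded in the crtm build log.
grep -ho -- '-O[0-3s]' $prefix/crtm-*/.spack/spack-build-out.txt | sort | uniq -c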

@RussTreadon-NOAA
Contributor Author

Work for this issue will be done in RussTreadon-NOAA:feature/orion_rocky9

@RussTreadon-NOAA
Contributor Author

@TingLei-NOAA , have you built gsi.x or enkf.x and run either executable on Orion? I am trying to collect more data points regarding post Rocky 9 wall times.

@RussTreadon-NOAA
Contributor Author

@TingLei-NOAA , have you built gsi.x or enkf.x and run either executable on Orion? I am trying to collect more data points regarding post Rocky 9 wall times.

Learned that @TingLei-NOAA is on leave.

@RussTreadon-NOAA
Contributor Author

Open ticket RDHPCS#2024062754000098 with Orion Helpdesk to report gsi.x and enkf.x slowdown on Orion Rocky-9.
