Update wcoss2.intel.lua to spack-stack #1435

Merged: 7 commits merged on Jan 10, 2025

Conversation

RussTreadon-NOAA (Contributor) commented on Jan 8, 2025

Description

This PR updates wcoss2.intel.lua to use spack-stack/1.6.0.

Companion PRs

none

Issues

Resolves #1350
Resolves #1331
Resolves #1336
Resolves g-w #3100

Automated CI tests to run in Global Workflow

  • atm_jjob
  • C96C48_ufs_hybatmDA
  • C96C48_hybatmaerosnowDA
  • C48mx500_3DVarAOWCDA
  • C48mx500_hybAOWCDA
  • C96C48_hybatmDA

RussTreadon-NOAA self-assigned this on Jan 8, 2025
@RussTreadon-NOAA
Contributor Author

g-w CI has been successfully run on WCOSS2 (Cactus). All jobs in all tested g-w configurations completed without error. Details are found in issue #1350.

The hera-GW-RT label will be applied to this PR as a sanity check. The changes in this PR should not impact the building or execution of GDASApp components on NOAA RDHPCS machines. This PR only impacts WCOSS2 (Cactus and Dogwood).

RussTreadon-NOAA added the hera-GW-RT label (Queue for automated testing with global-workflow on Hera) on Jan 8, 2025
@RussTreadon-NOAA
Contributor Author

Assuming hera-GW-RT passes, this PR will be marked Ready for review.

emcbot added the hera-GW-RT-Running label (Automated testing with global-workflow running on Hera) and removed the hera-GW-RT label (Queue for automated testing with global-workflow on Hera) on Jan 8, 2025
@emcbot

emcbot commented Jan 8, 2025

Automated GW-GDASApp Testing Results:
Machine: hera

Start: Wed Jan  8 21:12:39 UTC 2025 on hfe03
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Wed Jan  8 22:03:21 UTC 2025
---------------------------------------------------
Tests: ctest -j12 -R gdasapp -E atm_jjob|C96C48_ufs_hybatmDA|C96C48_hybatmaerosnowDA|C48mx500_3DVarAOWCDA|C48mx500_hybAOWCDA|C96C48_hybatmDA
Tests:                                 *SUCCESS*
Tests: Completed at Wed Jan  8 22:04:22 UTC 2025
Tests: 100% tests passed, 0 tests failed out of 32
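
For readers comparing the Hera and Orion runs: the Hera command selects tests with `-R gdasapp` and then drops the g-w configurations with `-E`, which is why it reports far fewer tests. The sketch below mimics that include/exclude selection with `grep -E`. It is only an illustration: `select_tests` and the name `test_gdasapp_hypothetical_unit_check` are made up for this example, and ctest uses CMake regular expressions rather than POSIX ERE (the two agree for simple alternations like this one).

```shell
#!/usr/bin/env bash
# Sketch of ctest-style "-R include -E exclude" selection using grep.
# select_tests is a hypothetical helper, not part of ctest or GDASApp.
select_tests() {
    local include="$1" exclude="$2"
    shift 2
    # Keep names matching the include regex, then drop those matching the exclude regex.
    printf '%s\n' "$@" | grep -E -- "${include}" | grep -Ev -- "${exclude}"
}

# Two names from the CI logs in this thread plus two hypothetical ones.
tests=(
    "test_gdasapp_C96C48_hybatmDA_gdas_fcst_202112201800"
    "test_gdasapp_C48mx500_3DVarAOWCDA_gdas_marinebmat_202103250000"
    "test_gdasapp_hypothetical_unit_check"
    "test_other_component"
)

exclude='atm_jjob|C96C48_ufs_hybatmDA|C96C48_hybatmaerosnowDA|C48mx500_3DVarAOWCDA|C48mx500_hybAOWCDA|C96C48_hybatmDA'
select_tests 'gdasapp' "${exclude}" "${tests[@]}"
```

Only the hypothetical non-workflow gdasapp test survives both filters here.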

emcbot added the hera-GW-RT-Passed label (Automated testing with global-workflow successful on Hera) and removed the hera-GW-RT-Running label (Automated testing with global-workflow running on Hera) on Jan 8, 2025
RussTreadon-NOAA added the orion-GW-RT label (Queue for automated testing with global-workflow on Orion) on Jan 8, 2025
emcbot added the orion-GW-RT-Running label (Automated testing with global-workflow running on Orion) and removed the orion-GW-RT label (Queue for automated testing with global-workflow on Orion) on Jan 8, 2025
@emcbot

emcbot commented Jan 8, 2025

Automated GW-GDASApp Testing Results:
Machine: orion

Start: Wed Jan  8 04:22:07 PM CST 2025 on orion-login-1.hpc.msstate.edu
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Wed Jan  8 05:19:45 PM CST 2025
---------------------------------------------------
Tests: ctest -j12 -R gdasapp
Tests:                                  *Failed*
Tests: Failed at Wed Jan  8 05:43:31 PM CST 2025
Tests: 37% tests passed, 85 tests failed out of 134
	1959 - test_gdasapp_C96C48_hybatmDA_gdas_stage_ic_202112201800 (Failed)
	1960 - test_gdasapp_C96C48_hybatmDA_gdas_fcst_202112201800 (Failed)
	1961 - test_gdasapp_C96C48_hybatmDA_gdas_atmos_prod_202112201800 (Failed)
	1962 - test_gdasapp_C96C48_hybatmDA_enkfgdas_stage_ic_202112201800 (Failed)
	1963 - test_gdasapp_C96C48_hybatmDA_enkfgdas_fcst_202112201800 (Failed)
	1964 - test_gdasapp_C96C48_hybatmDA_enkfgdas_echgres_202112201800 (Failed)
	1965 - test_gdasapp_C96C48_hybatmDA_enkfgdas_epmn_202112201800 (Failed)
	1966 - test_gdasapp_C96C48_hybatmDA_gdas_prep_202112210000 (Failed)
	1967 - test_gdasapp_C96C48_hybatmDA_gdas_anal_202112210000 (Failed)
	1968 - test_gdasapp_C96C48_hybatmDA_gdas_sfcanl_202112210000 (Failed)
	1969 - test_gdasapp_C96C48_hybatmDA_gdas_analcalc_202112210000 (Failed)
	1970 - test_gdasapp_C96C48_hybatmDA_gdas_fcst_202112210000 (Failed)
	1971 - test_gdasapp_C96C48_hybatmDA_enkfgdas_eobs_202112210000 (Failed)
	1972 - test_gdasapp_C96C48_hybatmDA_enkfgdas_ediag_202112210000 (Failed)
	1973 - test_gdasapp_C96C48_hybatmDA_enkfgdas_eupd_202112210000 (Failed)
	1974 - test_gdasapp_C96C48_hybatmDA_enkfgdas_ecmn_202112210000 (Failed)
	1975 - test_gdasapp_C96C48_hybatmDA_enkfgdas_esfc_202112210000 (Failed)
	1976 - test_gdasapp_C96C48_hybatmDA_enkfgdas_fcst_202112210000 (Failed)
	1978 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_stage_ic_202402231800 (Failed)
	1979 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_fcst_202402231800 (Failed)
	1980 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_atmos_prod_202402231800 (Failed)
	1981 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_stage_ic_202402231800 (Failed)
	1982 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_fcst_202402231800 (Failed)
	1983 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_echgres_202402231800 (Failed)
	1984 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_epmn_202402231800 (Failed)
	1985 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_prep_202402240000 (Failed)
	1986 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_prepatmiodaobs_202402240000 (Failed)
	1987 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_atmanlinit_202402240000 (Failed)
	1988 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_atmanlvar_202402240000 (Failed)
	1989 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_atmanlfv3inc_202402240000 (Failed)
	1990 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_atmanlfinal_202402240000 (Failed)
	1991 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_sfcanl_202402240000 (Failed)
	1992 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_analcalc_202402240000 (Failed)
	1993 - test_gdasapp_C96C48_ufs_hybatmDA_gdas_fcst_202402240000 (Failed)
	1994 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_atmensanlinit_202402240000 (Failed)
	1995 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_atmensanlobs_202402240000 (Failed)
	1996 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_atmensanlsol_202402240000 (Failed)
	1997 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_atmensanlfv3inc_202402240000 (Failed)
	1998 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_atmensanlfinal_202402240000 (Failed)
	1999 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_ecmn_202402240000 (Failed)
	2000 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_esfc_202402240000 (Failed)
	2001 - test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_fcst_202402240000 (Failed)
	2003 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_stage_ic_202112201200 (Failed)
	2004 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_fcst_202112201200 (Failed)
	2005 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_atmos_prod_202112201200 (Failed)
	2006 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_aeroanlgenb_202112201200 (Failed)
	2007 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_stage_ic_202112201200 (Failed)
	2008 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_fcst_202112201200 (Failed)
	2009 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_echgres_202112201200 (Failed)
	2010 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_epmn_202112201200 (Failed)
	2011 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_prep_202112201800 (Failed)
	2012 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_anal_202112201800 (Failed)
	2013 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_aeroanlinit_202112201800 (Failed)
	2014 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_aeroanlvar_202112201800 (Failed)
	2015 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_aeroanlfinal_202112201800 (Failed)
	2016 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_snowanl_202112201800 (Failed)
	2017 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_sfcanl_202112201800 (Failed)
	2018 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_analcalc_202112201800 (Failed)
	2019 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_fcst_202112201800 (Failed)
	2020 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_eobs_202112201800 (Failed)
	2021 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_ediag_202112201800 (Failed)
	2022 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_eupd_202112201800 (Failed)
	2023 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_ecmn_202112201800 (Failed)
	2024 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_esnowanl_202112201800 (Failed)
	2025 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_esfc_202112201800 (Failed)
	2026 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_fcst_202112201800 (Failed)
	2028 - test_gdasapp_C48mx500_3DVarAOWCDA_gdas_stage_ic_202103241800 (Failed)
	2029 - test_gdasapp_C48mx500_3DVarAOWCDA_gdas_fcst_202103241800 (Failed)
	2030 - test_gdasapp_C48mx500_3DVarAOWCDA_gdas_prepoceanobs_202103250000 (Failed)
	2031 - test_gdasapp_C48mx500_3DVarAOWCDA_gdas_marinebmat_202103250000 (Failed)
	2032 - test_gdasapp_C48mx500_3DVarAOWCDA_gdas_marineanlinit_202103250000 (Failed)
	2033 - test_gdasapp_C48mx500_3DVarAOWCDA_gdas_marineanlvar_202103250000 (Failed)
	2034 - test_gdasapp_C48mx500_3DVarAOWCDA_gdas_marineanlchkpt_202103250000 (Failed)
	2035 - test_gdasapp_C48mx500_3DVarAOWCDA_gdas_marineanlfinal_202103250000 (Failed)
	2037 - test_gdasapp_C48mx500_hybAOWCDA_gdas_stage_ic_202103241800 (Failed)
	2038 - test_gdasapp_C48mx500_hybAOWCDA_gdas_fcst_202103241800 (Failed)
	2039 - test_gdasapp_C48mx500_hybAOWCDA_enkfgdas_stage_ic_202103241800 (Failed)
	2040 - test_gdasapp_C48mx500_hybAOWCDA_enkfgdas_fcst_202103241800 (Failed)
	2041 - test_gdasapp_C48mx500_hybAOWCDA_gdas_prepoceanobs_202103250000 (Failed)
	2042 - test_gdasapp_C48mx500_hybAOWCDA_gdas_marineanlletkf_202103250000 (Failed)
	2043 - test_gdasapp_C48mx500_hybAOWCDA_gdas_marinebmat_202103250000 (Failed)
	2044 - test_gdasapp_C48mx500_hybAOWCDA_gdas_marineanlinit_202103250000 (Failed)
	2045 - test_gdasapp_C48mx500_hybAOWCDA_gdas_marineanlvar_202103250000 (Failed)
	2046 - test_gdasapp_C48mx500_hybAOWCDA_gdas_marineanlchkpt_202103250000 (Failed)
	2047 - test_gdasapp_C48mx500_hybAOWCDA_gdas_marineanlfinal_202103250000 (Failed)
Tests: see output at /work2/noaa/da/role-da/CI/orion/GDASApp/workflow/PR/1435/global-workflow/sorc/gdas.cd/build/log.ctest

emcbot added the orion-GW-RT-Failed label (Automated testing with global-workflow failed on Orion) and removed the orion-GW-RT-Running label (Automated testing with global-workflow running on Orion) on Jan 8, 2025
@RussTreadon-NOAA
Contributor Author

@DavidNew-NOAA : The failure of the g-w based gdasapp ctests is due to role-da not belonging to the stmp group.

orion-login-2:~$ groups role-da
role-da : noaa-hpc da-cpu da rstprod

For example, /work2/noaa/da/role-da/CI/orion/GDASApp/workflow/PR/1435/global-workflow/sorc/gdas.cd/build/gdas/test/gw-ci/C96C48_hybatmDA/COMROOT/C96C48_hybatmDA/logs/2021122018/gdas_stage_ic.log failed with

++ jjob_header.sh[72]: mkdir -p /work/noaa/stmp/role-da/ORION/RUNDIRS/C96C48_hybatmDA/gdas.2021122018/stage_ic.3647470
mkdir: cannot create directory ‘/work/noaa/stmp/role-da’: Permission denied
+ jjob_header.sh[1]: postamble JGLOBAL_STAGE_IC 1736378490 1
+ preamble.sh[70]: set +x
End JGLOBAL_STAGE_IC at 23:21:31 with error code 1 (time elapsed: 00:00:01)
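
A quick pre-check along these lines could catch the problem before CI is triggered from a role account. This is a minimal sketch, not something in global-workflow: `can_create_under` is a hypothetical helper, and the example path is the one from the log above.

```shell
#!/usr/bin/env bash
# Sketch: report whether "mkdir -p <target>" could succeed for the current account.
# can_create_under is a hypothetical helper, not part of global-workflow.
can_create_under() {
    local dir="$1"
    # Walk up to the nearest existing ancestor of the target path.
    while [[ ! -e "${dir}" ]]; do
        dir="$(dirname "${dir}")"
    done
    # mkdir -p needs write and search permission on that ancestor directory.
    [[ -d "${dir}" && -w "${dir}" && -x "${dir}" ]]
}

# Example: the run directory root that failed in gdas_stage_ic.log.
if can_create_under "/work/noaa/stmp/${USER:-role-da}/ORION/RUNDIRS"; then
    echo "run directory can be created"
else
    echo "cannot create run directory: compare 'groups' output with the directory's group"
fi
```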

@RussTreadon-NOAA
Contributor Author

Ticket RDHPCS#2025010854000354 opened with MSU help desk requesting role-da be added to the stmp group.

Attempts to run g-w based GDASApp ctests from the Hercules role-da account will fail in a similar manner.

For the time being we should not trigger g-w based GDASApp ctests for the following configurations

 C96C48_ufs_hybatmDA
 C96C48_hybatmaerosnowDA
 C48mx500_3DVarAOWCDA
 C48mx500_hybAOWCDA
 C96C48_hybatmDA

using the role-da account on Hercules or Orion.

Developers belonging to the MSU stmp group can run these g-w based GDASApp ctests on Hercules and Orion.

FYI, g-w sets the default value of the STMP variable in machine-specific yaml files found in $HOMEgfs/workflow/hosts/:

workflow/hosts/gaea.yaml:STMP: '/gpfs/f5/ufs-ard/scratch/${USER}'
workflow/hosts/awspw.yaml:STMP: '/lustre/${USER}/stmp/'
workflow/hosts/googlepw.yaml:STMP: '/lustre/${USER}/stmp/'
workflow/hosts/orion.yaml:STMP: '/work/noaa/stmp/${USER}/ORION'
workflow/hosts/orion.yaml:PTMP: '/work/noaa/stmp/${USER}/ORION'
workflow/hosts/wcoss2.yaml:STMP: '/lfs/h2/emc/stmp/${USER}'
workflow/hosts/azurepw.yaml:STMP: '/lustre/${USER}/stmp/'
workflow/hosts/jet.yaml:STMP: '/lfs5/HFIP/hfv3gfs/${USER}/stmp'
workflow/hosts/container.yaml:STMP: '/home/${USER}'
workflow/hosts/s4.yaml:STMP: '/scratch/users/${USER}'
workflow/hosts/hera.yaml:STMP: '/scratch1/NCEPDEV/stmp2/${USER}'
workflow/hosts/hera.yaml:PTMP: '/scratch1/NCEPDEV/stmp4/${USER}'
workflow/hosts/hercules.yaml:STMP: '/work/noaa/stmp/${USER}/HERCULES'
workflow/hosts/hercules.yaml:PTMP: '/work/noaa/stmp/${USER}/HERCULES'
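
To see which scratch path a given account ends up with, the `${USER}` placeholder in these defaults can be expanded by hand. The snippet below is a sketch under assumptions: `stmp_for_host` is a hypothetical helper, not a g-w utility, it assumes the `STMP: '...'` line layout shown in the listing above, and it does not reproduce how g-w itself reads these files.

```shell
#!/usr/bin/env bash
# Sketch only: stmp_for_host is a hypothetical helper, not a g-w utility.
stmp_for_host() {
    local yaml="$1" user="$2"
    local placeholder='${USER}'
    local template
    # Take the first top-level "STMP: '...'" line and keep only the quoted value.
    template="$(sed -n "s/^STMP:[[:space:]]*'\(.*\)'[[:space:]]*$/\1/p" "${yaml}" | head -n 1)"
    # Replace the literal ${USER} placeholder with the account name.
    printf '%s\n' "${template//"${placeholder}"/${user}}"
}

# Example with a stand-in file holding the orion.yaml lines quoted above.
demo_yaml="$(mktemp)"
cat > "${demo_yaml}" <<'EOF'
STMP: '/work/noaa/stmp/${USER}/ORION'
PTMP: '/work/noaa/stmp/${USER}/ORION'
EOF
stmp_for_host "${demo_yaml}" role-da
rm -f "${demo_yaml}"
```

For role-da this prints /work/noaa/stmp/role-da/ORION, the root of the directory that mkdir could not create in the log above.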

RussTreadon-NOAA marked this pull request as ready for review on January 9, 2025 11:37
@RussTreadon-NOAA
Contributor Author

Marking this PR as Ready for review because g-w CI passes on WCOSS2 (Cactus). The focus of this PR is updating the GDASApp build to use spack-stack on WCOSS2, and that build works.

RussTreadon-NOAA added the orion-GW-RT label (Queue for automated testing with global-workflow on Orion) and removed the orion-GW-RT-Failed label (Automated testing with global-workflow failed on Orion) on Jan 10, 2025
emcbot added the orion-GW-RT-Running label (Automated testing with global-workflow running on Orion) and removed the orion-GW-RT label (Queue for automated testing with global-workflow on Orion) on Jan 10, 2025
CoryMartin-NOAA (Contributor) left a comment

Looks good. I was able to build on WCOSS on Wednesday with this branch.

@RussTreadon-NOAA RussTreadon-NOAA mentioned this pull request Jan 10, 2025
6 tasks
@RussTreadon-NOAA
Contributor Author

WCOSS2 issues

Two items to be aware of following the WCOSS2 upgrade:

  1. The WCOSS2 upgrade added the following options to Intel Fortran (ftn) builds when using craype/2.7.17, the recommended craype version following the upgrade:

     -static-libgcc -static-libstdc++ -Bstatic -lstdc++ -Bdynamic -lm -lpthread

     This change impacts GDASApp executables: they no longer find GDASApp libraries. We need to add the GDASApp build path to LD_LIBRARY_PATH.

  2. Some WCOSS2 system modules are not properly configured following the system upgrade. GDIT is aware of this and is working to fix the mismatches. This problem can be temporarily resolved by adding a cray-mpich path to LD_LIBRARY_PATH.

The following lines need to be added to g-w env/WCOSS2.env so that GDASApp executables properly find libraries:

@@ -13,6 +13,10 @@ step=$1
 export launcher="mpiexec -l"
 export mpmd_opt="--cpu-bind verbose,core cfp"
 
+# Add path to GDASApp libraries
+export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOMEgfs/sorc/gdas.cd/build/lib"
+export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/cray/pe/mpich/8.1.19/ofi/intel/19.0/lib"
+
 # Calculate common resource variables
 # Check first if the dependent variables are set
 if [[ -n "${ntasks:-}" && -n "${max_tasks_per_node:-}" && -n "${tasks_per_node:-}" ]]; then
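
One possible refinement of the exports above, offered as a sketch rather than a proposed change: if LD_LIBRARY_PATH is empty or unset, `"$LD_LIBRARY_PATH:<dir>"` leaves a leading empty entry, which the dynamic loader treats as the current directory, and sourcing the env file more than once repeats the entries. The hypothetical helper `append_ld_path` below avoids both. The fallback `/path/to/global-workflow` is a placeholder used only when HOMEgfs is not set.

```shell
#!/usr/bin/env bash
# Sketch: append a directory to LD_LIBRARY_PATH once, without empty entries.
# append_ld_path is a hypothetical helper, not part of g-w env/WCOSS2.env.
append_ld_path() {
    local dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":${dir}:"*) ;;  # already present, nothing to do
        *) export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${dir}" ;;
    esac
}

# Same two directories as in the diff above.
append_ld_path "${HOMEgfs:-/path/to/global-workflow}/sorc/gdas.cd/build/lib"
append_ld_path "/opt/cray/pe/mpich/8.1.19/ofi/intel/19.0/lib"
echo "${LD_LIBRARY_PATH}"
```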

It is also recommended that we add the following to g-w ush/module-setup.sh:

@@ -51,6 +51,8 @@ elif [[ ${MACHINE_ID} = s4* ]] ; then
 
 elif [[ ${MACHINE_ID} = wcoss2 ]]; then
     # We are on WCOSS2
+    # Ignore default modules of the same version lower in the search path (req'd by spack-stack)
+    export LMOD_TMOD_FIND_FIRST=yes
     module reset
 
 elif [[ ${MACHINE_ID} = cheyenne* ]] ; then

@emcbot

emcbot commented Jan 10, 2025

Automated GW-GDASApp Testing Results:
Machine: orion

Start: Fri Jan 10 01:48:07 PM UTC 2025 on orion-login-2.hpc.msstate.edu
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Fri Jan 10 02:47:43 PM UTC 2025
---------------------------------------------------
Tests: ctest -j12 -R gdasapp
Tests:                                 *SUCCESS*
Tests: Completed at Fri Jan 10 04:29:25 PM UTC 2025
Tests: 100% tests passed, 0 tests failed out of 134

emcbot added the orion-GW-RT-Passed label (Automated testing with global-workflow successful on Orion) and removed the orion-GW-RT-Running label (Automated testing with global-workflow running on Orion) on Jan 10, 2025
RussTreadon-NOAA merged commit d6277a4 into develop on Jan 10, 2025
7 checks passed
RussTreadon-NOAA deleted the feature/wcoss2_spack-stack branch on January 10, 2025 16:35
@aerorahul
Contributor

Are we cleared to use spack-stack on WCOSS2 for regular development?
AFAIK, this instance of the stack is stationary and will not be updated with newer versions of libraries.
This is just an FYI.

@RussTreadon-NOAA
Contributor Author

@aerorahul : Understood. NCO and GDIT staff are aware that GDASApp is using this version of WCOSS2 spack-stack. A developer GSI fork contains a branch using the same WCOSS2 spack-stack. We've been testing the NCO implementation, identifying missing modules, and pointing out other issues.
