g-w CI C96C48_hybatmaerosnowDA fails on WCOSS2 #1336

Closed
Tracked by #1342
RussTreadon-NOAA opened this issue Oct 17, 2024 · 22 comments · Fixed by #1435

Comments

@RussTreadon-NOAA
Contributor

When g-w CI C96C48_hybatmaerosnowDA is run using g-w PR #2978, the following jobs abort on WCOSS2 (Cactus)

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
202112201200        gdas_aeroanlgenb                   158495445                DEAD                 -29         2        1849.0
202112201800            gdas_snowanl                   158472339                DEAD                   1         2          58.0

gdas_aeroanlgenb aborts while executing gdas.x fv3jedi convertstate using chem_convertstate.yaml

nid003614.cactus.wcoss2.ncep.noaa.gov 0: Converting state 1 of 1
nid003614.cactus.wcoss2.ncep.noaa.gov 32:
FATAL from PE    32: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./bkg/20211220.180000.anlres.fv_tracer.res.tile3.nc variable:xaxis_1

nid003614.cactus.wcoss2.ncep.noaa.gov 64:
FATAL from PE    64: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./bkg/20211220.180000.anlres.fv_tracer.res.tile5.nc variable:xaxis_1

gdas_snowanl aborts while executing gdas.x fv3jedi localensembleda using letkfoi.yaml

nid001614.cactus.wcoss2.ncep.noaa.gov 0: Local solver completed.
OOPS_STATS LocalEnsembleDA after solver             - Runtime:     10.48 sec,  Memory: total:     8.43 Gb, per task: min =     1.40 Gb, max =     1.41 Gb
nid001614.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./anl/snowinc.20211220.180000.sfc_data.tile1.nc variable:xaxis_1

nid001614.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./anl/snowinc.20211220.180000.sfc_data.tile1.nc variable:xaxis_1

The error message is the same for both failures.
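
To make the "same error" observation easy to check across many log files, here is a minimal Python sketch that pulls the fields out of the FATAL lines quoted above and groups the affected files by error. It is not part of GDASApp or FMS; the function names and field labels (pe, reason, routine, file, variable) are hypothetical choices for this illustration, and the pattern is only assumed to fit the line layout shown in this issue.

```python
import re

# Pattern for the FATAL lines quoted in this issue. The group names are
# labels chosen for this sketch, not FMS terminology.
FATAL_RE = re.compile(
    r"FATAL from PE\s+(?P<pe>\d+):\s*(?P<reason>.*?):\s*"
    r"(?P<routine>\w+):\s*file:(?P<file>\S+)\s+variable:(?P<variable>\S+)"
)


def parse_fatal_lines(log_text):
    """Return one dict per FATAL line found in log_text."""
    return [m.groupdict() for m in FATAL_RE.finditer(log_text)]


def summarize(records):
    """Group the affected files by (reason, routine, variable)."""
    summary = {}
    for rec in records:
        key = (rec["reason"], rec["routine"], rec["variable"])
        summary.setdefault(key, set()).add(rec["file"])
    return summary
```

If the summary built from both job logs has a single key, the two jobs are failing in the same routine on the same variable name, which is what the logs above show.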

These jobs successfully run to completion on Hera, Hercules, and Orion. GDASApp is built with newer intel compilers and different modules on these machines. It is not clear if the older intel/19 compiler or modulefiles used on Cactus are the issue or if there is an actual bug in the JEDI code which needs to be fixed.

This issue is opened to document the WCOSS2 failure and its resolution.

@RussTreadon-NOAA
Contributor Author

10/18/2024 update

Examine code in sorc/fv3-jedi/src/fv3jedi/IO/FV3Restart. Add prints to IOFms.cc, IOFms.interface.F90, fv3jedi_io_fms2_mod.f90. Create stand-alone script to execute fv3jedi_convertstate.x using 20211220 12Z gdas_aeroanlgenb input. Reproduce the "illegal characters" failure above. Prints suggest code is working as intended. Nothing jumps out as being wrong in the code.

Cory found a unidata/netcdf issue reporting illegal characters which appeared to be related to the netcdf version. Beginning to think this may be the issue on WCOSS2. All other platforms build GDASApp with

load("parallel-netcdf/1.12.2")
load("netcdf-c/4.9.2")
load("netcdf-fortran/4.6.1")
load("netcdf-cxx4/4.3.1")

WCOSS2 uses

load("netcdf/4.7.4")

Find

load("netcdf-C/4.9.2")
load("pnetcdf-C/1.12.2")

on WCOSS2 but attempts to build with these have not yet been successful. Still working through various combinations of module versions to see if we can build GDASApp on WCOSS2 using newer netcdf versions.

It would be nice if WCOSS2 had available the same spack-stack used on NOAA RDHPCS machines.

@RussTreadon-NOAA
Contributor Author

10/20/2024 update

Unable to find combination of hpc-stack modules to successfully build and/or run gdas.x for either of the failed C96C48_hybatmaerosnowDA jobs. Log into Acorn. Find spack-stack versions 1.6.0, 1.7.0, and 1.8.0. Will use RDHPCS modulefiles to see if we can develop a spack-stack based acorn.intel.lua that allows the failed C96C48_hybatmaerosnowDA jobs to successfully run to completion.

In the interim modify fv3-jedi CMakeLists.txt to make the FMS2_IO build a configurable cmake option via the following changes

@@ -122,7 +122,13 @@ if (NOT FV3_FORECAST_MODEL MATCHES GEOS AND NOT FV3_FORECAST_MODEL MATCHES UFS)
 endif()
 
 # fms
-set(HAS_FMS2_IO TRUE) # Set to FALSE if FMS2 IO unavailable (should be removed eventually)
+option(BUILD_FMS2_IO "Build fv3-jedi with FMS2_IO" ON)
+set(HAS_FMS2_IO TRUE)
+if (NOT BUILD_FMS2_IO)
+   set(HAS_FMS2_IO FALSE)
+endif()
+message("FV3-JEDI built with HAS_FMS2_IO set to ${HAS_FMS2_IO}")
+
 find_package(FMS 2023.04 REQUIRED COMPONENTS R4 R8)
 if (FV3_PRECISION MATCHES DOUBLE OR NOT FV3_PRECISION)
   add_library(fms ALIAS FMS::fms_r8)

FMS2_IO is the default build option. Adding -DBUILD_FMS2_IO=OFF to the GDASApp cmake command results in an FMS_IO build.

Do the following:

  1. Make the above change to CMakeLists.txt in a working copy of g-w PR #2978 gdas.cd/sorc/fv3-jedi/CMakeLists.txt.

  2. Add wcoss2 section to GDASApp build.sh to toggle off the FMS2_IO build as shown below

@@ -112,6 +112,11 @@ if [[ $BUILD_TARGET == 'hera' ]]; then
   ln -sf $GDASAPP_TESTDATA/crtm $dir_root/bundle/test-data-release/crtm
 fi
 
+if [[ $BUILD_TARGET == 'wcoss2' ]]; then
+    export BUILD_FMS2_IO="OFF"
+    CMAKE_OPTS+=" -DBUILD_FMS2_IO=${BUILD_FMS2_IO}"
+fi
+
 # Configure
 echo "Configuring ..."
 set -x

  3. Rebuild GDASApp inside working copy of PR #2978 guillaumevernieres:feature/update_hashes

  4. rocotorewind and rocotoboot the failed C96C48_hybatmaerosnowDA jobs. As expected, both jobs successfully run to completion.
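
The way the build.sh and CMakeLists.txt patches interact can be sketched in a few lines of Python. This is only an illustration of the logic, not code from GDASApp or fv3-jedi: the function names are hypothetical, and the list of false values is my reading of CMake's documented rules for false constants in if().

```python
import re

# Values CMake's if() treats as false (compared case-insensitively).
_CMAKE_FALSE = {"0", "OFF", "NO", "FALSE", "N", "IGNORE", "NOTFOUND", ""}


def cmake_is_false(value):
    """Approximate CMake's false-constant test for a cache value."""
    upper = value.upper()
    return upper in _CMAKE_FALSE or upper.endswith("-NOTFOUND")


def cmake_opts_for(build_target, base_opts=""):
    """Mirror the build.sh patch: only wcoss2 turns FMS2_IO off."""
    opts = base_opts
    if build_target == "wcoss2":
        opts += " -DBUILD_FMS2_IO=OFF"
    return opts


def has_fms2_io(cmake_opts):
    """Mirror the CMakeLists.txt patch: TRUE unless BUILD_FMS2_IO is false."""
    matches = re.findall(r"-DBUILD_FMS2_IO(?::\w+)?=(\S*)", cmake_opts)
    # option(BUILD_FMS2_IO ... ON) supplies the default when nothing is passed.
    build_fms2_io = matches[-1] if matches else "ON"
    return not cmake_is_false(build_fms2_io)
```

With this sketch, has_fms2_io(cmake_opts_for("hera")) is True and has_fms2_io(cmake_opts_for("wcoss2")) is False, matching the intent that every machine except WCOSS2 keeps the FMS2_IO build.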

Modified CMakeLists.txt committed to NOAA_EMC fv3-jedi branch patch/fv3-jedi at 96dff77 .

If we are OK with the modified CMakeLists.txt approach as a short-term patch, I will update GDASApp branch patch/gwci to point at NOAA-EMC:fv3-jedi at 96dff77. Once this is done the sorc/gdas.cd hash in guillaumevernieres:feature/update_hashes can be updated to pull in this change and g-w C96C48_hybatmaerosnowDA reactivated on WCOSS2.

@RussTreadon-NOAA
Contributor Author

WCOSS2 test

Install guillaumevernieres:feature/update_hashes at e9fa90c on Cactus. Use GDASApp branch patch/gwci at 0325836 with sorc/fv3-jedi pointing at patch/fv3-jedi at 96dff77.

Run g-w CI on Cactus for

  • C96C48_hybatmDA - PSLOT = prgsi_pr2978
  • C96C48_ufs_hybatmDA - PSLOT = prjedi_pr2978
  • C96C48_hybatmaerosnowDA - PSLOT = praero_pr2978
  • C48mx500_3DVarAOWCDA - PSLOT = prwcda_pr2978

with results as follows

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prgsi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201800        Done    Oct 21 2024 00:26:50    Oct 21 2024 00:40:16
202112210000        Done    Oct 21 2024 00:26:50    Oct 21 2024 02:50:12
202112210600        Done    Oct 21 2024 00:26:50    Oct 21 2024 02:30:18
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prjedi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202402231800        Done    Oct 21 2024 00:26:52    Oct 21 2024 00:40:20
202402240000        Done    Oct 21 2024 00:26:52    Oct 21 2024 03:10:08
202402240600        Done    Oct 21 2024 00:26:52    Oct 21 2024 03:15:16
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201200        Done    Oct 21 2024 00:26:53    Oct 21 2024 00:45:17
202112201800        Done    Oct 21 2024 00:26:53    Oct 21 2024 01:45:16
202112210000        Done    Oct 21 2024 00:26:53    Oct 21 2024 03:40:16
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prwcda_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103241200        Done    Oct 21 2024 00:26:55    Oct 21 2024 00:40:26
202103241800      Active    Oct 21 2024 00:26:55             -          

The WCDA failure is the same as before

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prwcda_pr2978
202103241800         gdas_marinebmat                   158838013                DEAD                 -29         2        1852.0

This failure is not related to changes in g-w PR . This failure also occurs when using GDASApp develop. GDASApp issue #1331 is tracking the WCOSS2 WCDA failure.
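
For anyone scanning several of these cycle summaries at once, the following is a minimal Python sketch that reads text laid out like the tables in this comment and reports which cycles are not yet Done. It assumes the layout shown here (an EXPDIR path line followed by CYCLE / STATE / ACTIVATED / DEACTIVATED rows); the function names and the elapsed_min field are hypothetical and not part of rocoto or g-w.

```python
import re
from datetime import datetime

# One data row of the summary table, e.g.
# 202112201800        Done    Oct 21 2024 00:26:50    Oct 21 2024 00:40:16
ROW_RE = re.compile(
    r"^\s*(?P<cycle>\d{12})\s+(?P<state>\w+)\s+"
    r"(?P<activated>\w{3} \d{1,2} \d{4} [\d:]{8})\s+"
    r"(?P<deactivated>\w{3} \d{1,2} \d{4} [\d:]{8}|-)\s*$"
)
STAMP = "%b %d %Y %H:%M:%S"


def parse_cycle_summary(text):
    """Map each PSLOT (last component of the EXPDIR path) to its rows."""
    results = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("/"):
            current = stripped.rsplit("/", 1)[-1]
            results[current] = []
            continue
        match = ROW_RE.match(line)
        if match and current is not None:
            row = match.groupdict()
            if row["deactivated"] != "-":
                start = datetime.strptime(row["activated"], STAMP)
                end = datetime.strptime(row["deactivated"], STAMP)
                row["elapsed_min"] = (end - start).total_seconds() / 60
            else:
                row["elapsed_min"] = None  # cycle still active
            results[current].append(row)
    return results


def unfinished(results):
    """List (pslot, cycle) pairs whose state is anything other than Done."""
    return [
        (pslot, row["cycle"])
        for pslot, rows in results.items()
        for row in rows
        if row["state"] != "Done"
    ]
```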

@RussTreadon-NOAA
Contributor Author

spack-stack update

Commit acorn.intel.lua to GDASApp branch feature/build at f3fe406. acorn.intel.lua began as a copy of hera.intel.lua with paths and modulefiles updated to work on Acorn. Developer queues have been turned off on Acorn due to system work so I cannot test and confirm that the executables work. I'll do so after Acorn returns to service.

spack-stack is not yet on WCOSS2 but EIB is in discussions to see this happen.

@RussTreadon-NOAA
Contributor Author

@CoryMartin-NOAA , @guillaumevernieres , @danholdaway , @DavidNew-NOAA : Are we OK with the following incremental approach?

First,

  1. modify NOAA-EMC:fv3-jedi CMakeLists.txt to make FMS2_IO a cmake configurable option. FMS2_IO is active by default
  2. modify build.sh so that on WCOSS2 we toggle off FMS2_IO and use FMS_IO. This allows C96C48_hybatmaerosnowDA to run to completion.
  3. Update the gdas.cd hash in g-w PR #2978 to bring in the above two changes
  4. Activate C96C48_hybatmaerosnowDA on WCOSS2 in PR #2978

Second,
Once spack-stack is installed on WCOSS2, use the acorn.intel.lua in GDASApp branch feature/build to update wcoss2.intel.lua. Using spack-stack on WCOSS2 will hopefully allow us to run C96C48_hybatmaerosnowDA with FMS2_IO.

If we are OK with the items under First, I'll get to work and make it so.

@DavidNew-NOAA
Collaborator

DavidNew-NOAA commented Oct 21, 2024

@RussTreadon-NOAA Fine by me, but FYI NOAA-EMC/global-workflow#2949 will not work on WCOSS when that PR is merged. The FMS2 IO module in FV3-JEDI also includes non-restart read/write capability which is needed for native grid increments in that PR. Hopefully we sort the FMS2 IO issue out before it goes into review.

This PR won't hold that up, because FMS2 IO isn't working anyway on WCOSS. Like I said, just an FYI.

@RussTreadon-NOAA
Contributor Author

@DavidNew-NOAA , does your comment

because FMS2 IO isn't working anyway on WCOSS

refer to the fact that ...

  1. select C96C48_hybatmaerosnowDA jobs in g-w PR #2978 fail with FMS2_IO, or
  2. g-w PR #2949 has been tested on WCOSS2 and found to not work

@RussTreadon-NOAA
Contributor Author

I don't have a WCOSS2 spack-stack implementation time line, but my guess is that it will not be available on WCOSS2 before g-w PR #2949 is reviewed and merged.

We face a decision for WCOSS2 GDASApp builds:

  1. accept for the time being that at least parts of JEDI aerosol and snow DA do not work on WCOSS2, or
  2. restore JEDI aerosol and snow DA functionality at the expense of breaking the functionality added by g-w PR #2949

Of course, if we can find a combination of existing WCOSS2 modules that work with FMS2_IO, choices 1 and 2 become moot. Thus far, I have not been able to find this combination.

@CoryMartin-NOAA
Contributor

My preference, while not ideal, is 2, as we have relatively near-term deadlines for aero/snow and not for atm cycling. Do we know for sure it's a library issue?

@RussTreadon-NOAA
Contributor Author

Can't say for sure but I studied the fv3-jedi fms2 code in depth on Thu-Fri with lots of prints added. Nothing jumps out as being wrong. The code as is works fine on Hera, Hercules, and Orion. These machines build GDASApp with newer intel compilers and spack-stack. Hence the hypothesis that the Cactus failures are due to the older intel compiler and/or the hpc-stack modules we load.

Once Acorn queues are opened I can run a build of g-w PR #2978 with GDASApp using spack-stack/1.6.0 (same version we use on NOAA RDHPCS) and see if the failing Cactus jobs run OK.

@DavidNew-NOAA
Collaborator

@DavidNew-NOAA , does your comment

because FMS2 IO isn't working anyway on WCOSS

refer to the fact that ...

  1. select C96C48_hybatmaerosnowDA jobs in g-w PR #2978 fail with FMS2_IO, or

  2. g-w PR #2949 has been tested on WCOSS2 and found to not work

@RussTreadon-NOAA I assume atmospheric cycling will not work on WCOSS, because g-w PR #2949 will reintroduce FMS (2). Currently atmospheric cycling uses cubed sphere histories to write increments.

@CoryMartin-NOAA
Contributor

@RussTreadon-NOAA I was able to get the convertstate job to run to completion using develop but with the changes in this branch: https://github.com/NOAA-EMC/GDASApp/tree/bugfix/wcoss2
I manually submitted a job using a staged directory from your RUNDIRS/

Please note that the compile/build step will fail on the IMS Snow Proc linking, not sure why yet, but I think this is progress.

@RussTreadon-NOAA
Contributor Author

RussTreadon-NOAA commented Oct 22, 2024

@CoryMartin-NOAA , this is good news.

How did you get the convertstate executable if build.sh aborted when building land-imsproc?

I traveled down this road earlier and like you got a bunch of undefined netcdf references

cd /lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/pr2978/sorc/gdas.cd/build/land-imsproc && /apps/spack/cmake/3.20.2/intel/19.1.3.304/utnbptm3hrf7gppztidueu4jogfgemut/bin/cmake -E cmake_link_script CMakeFiles/calcfIMS.exe.dir/link.txt --verbose=NO
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: ../lib/libimsproc.so: undefined reference to `nc_inq_unlimdim'
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: ../lib/libimsproc.so: undefined reference to `nc_def_var_fill'
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: ../lib/libimsproc.so: undefined reference to `nc_get_vara_schar'
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: ../lib/libimsproc.so: undefined reference to `nc_def_var_szip'

...

/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: ../lib/libimsproc.so: undefined reference to `nc_inq_user_type'
make[2]: *** [land-imsproc/CMakeFiles/calcfIMS.exe.dir/build.make:100: bin/calcfIMS.exe] Error 1
make[2]: Leaving directory '/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/pr2978/sorc/gdas.cd/build'
make[1]: *** [CMakeFiles/Makefile2:28825: land-imsproc/CMakeFiles/calcfIMS.exe.dir/all] Error 2
make[1]: Leaving directory '/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/pr2978/sorc/gdas.cd/build'
make: *** [Makefile:166: all] Error 2

I found a way to get past this by adding netcdf to the modulefile but then the executable failed with run-time errors.

When my build above failed, I didn't find fv3jedi_convertstate.x or gdas.x in build/bin
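
When triaging link failures like the one above, it can help to collect the undefined symbols per library rather than read the log line by line. The following is a minimal Python sketch under the assumption that the log uses the GNU ld message format shown here; the function names are hypothetical, and the nc_ prefix test is only a heuristic for spotting netCDF-C entry points.

```python
import re
from collections import defaultdict

# GNU ld style message: "<object>: undefined reference to `<symbol>'"
UNDEF_RE = re.compile(
    r"(?P<object>[^\s:]+): undefined reference to `(?P<symbol>[^']+)'"
)


def undefined_symbols(build_log):
    """Map each object or library to the set of symbols ld could not resolve."""
    missing = defaultdict(set)
    for match in UNDEF_RE.finditer(build_log):
        missing[match.group("object")].add(match.group("symbol"))
    return dict(missing)


def looks_like_netcdf_c(symbol):
    """Heuristic: netCDF-C API functions are prefixed with nc_."""
    return symbol.startswith("nc_")
```

Applied to the excerpt above, every unresolved symbol reported for ../lib/libimsproc.so passes the nc_ heuristic, which points at the netCDF-C library missing from the link line.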

@CoryMartin-NOAA
Contributor

My build included gdas.x before it failed, perhaps that is/was the luck of the draw? I'll dig in to see why the IMS code is not linking to netCDF properly.

@CoryMartin-NOAA
Contributor

@RussTreadon-NOAA try the latest commit to that GDASApp branch. I'm able to get IMS to compile now within the DA-Utils repo. If this works, I can clean it all up. I'll try it from scratch now.

@RussTreadon-NOAA
Contributor Author

Clone GDASApp bugfix/wcoss2 at 0675212. Build GDASApp using g-w build_gdas.sh. Build successful.

Rewind and reboot 20211220 12Z gdas_aeroanlgenb. All executables in this job run to completion.

Rewind and reboot 20211220 18Z gdas_snowanl. The job failed with

nid001781.cactus.wcoss2.ncep.noaa.gov 0: Exception:     Reason: The Bufr engine is disabled.
        source_column:  0
        source_filename:        /lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/pr2978/sorc/gdas.cd/bundle/ioda/src/engines/ioda/src/ioda/Engines/Bufr/Bufr.cpp
        source_function:        ioda::ObsGroup ioda::Engines::Bufr::openFile(const ioda::Engines::Bufr::Bufr_Parameters &, ioda::Group)
        source_line:    144
        stacktrace:

This is not surprising. bugfix/wcoss2 at 0675212 disabled bufr-query, land-imsproc, & land-jediincr.

@CoryMartin-NOAA
Contributor

@RussTreadon-NOAA this is encouraging. I'm looking into the CMake, I think there's an issue with the static libraries and it is only linking the netCDF fortran (or in bufr-query's case, netCDF-C++), and not the other required dependencies.

@RussTreadon-NOAA
Contributor Author

@CoryMartin-NOAA : Do we need to add a missing CXX, C, or Fortran to CMakeLists.txt line find_package( NetCDF REQUIRED COMPONENTS in sorc/bufr-query, sorc/land-imsproc, and sorc/land-jediincr?

@CoryMartin-NOAA
Contributor

I think all of the above + HDF5 , but I'm looking into seeing if there is a simpler fix. I thought nc-config was supposed to include these flags

@RussTreadon-NOAA
Contributor Author

Cactus spack-stack/1.6.0 test

@DavidHuber-NOAA pointed me at a test spack-stack/1.6.0 installation on Cactus. After a bit of trial and error, found a combination of modules and versions which build GDASApp on Cactus using spack-stack/1.6.0. Reruns of failed gdas_aeroanlgenb and gdas_snowanl jobs successfully ran to completion.

Given this success, set up g-w CI for four DA configurations

  • C96C48_hybatmDA - PSLOT = prgsi_pr2978
  • C96C48_ufs_hybatmDA - PSLOT = prjedi_pr2978
  • C96C48_hybatmaerosnowDA - PSLOT = praero_pr2978
  • C48mx500_3DVarAOWCDA - PSLOT = prwcda_pr2978

Note: g-w used for these tests is feature/update_hashes at 17cf1eb8. This was merged into g-w develop on 10/26/2024 at 2fdfaa5. GDASApp modulefile wcoss2.intel.lua was modified to use spack-stack/1.6.0.

All four g-w CI streams are still running. JEDI based DA streams have completed first cycle JEDI DA jobs. Of particular interest is the successful completion of these praero_pr2978 jobs

202112201200        gdas_aeroanlgenb                   159596067           SUCCEEDED                   0         1          80.0
202112201800            gdas_snowanl                   159596883           SUCCEEDED                   0         1          49.0
202112201800             gfs_snowanl                   159596885           SUCCEEDED                   0         1          46.0

An unexpected bonus is that prwcda_pr2978 completed all JEDI based DA jobs

202103241800       gdas_prepoceanobs                   159595816           SUCCEEDED                   0         1         183.0
202103241800      gdas_marineanlinit                   159596100           SUCCEEDED                   0         1          30.0
202103241800         gdas_marinebmat                   159595817           SUCCEEDED                   0         1          53.0
202103241800       gdas_marineanlvar                   159596602           SUCCEEDED                   0         1          80.0
202103241800     gdas_marineanlchkpt                   159596888           SUCCEEDED                   0         1          42.0
202103241800     gdas_marineanlfinal                   159597178           SUCCEEDED                   0         1          33.0

This issue will be updated once all four g-w CI DA streams complete.
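
The per-job status rows quoted throughout this issue can be checked mechanically. Below is a minimal Python sketch that assumes each row has the seven whitespace-separated fields seen above, which I take to be cycle, task, job id, state, exit status, tries and duration in seconds; that column interpretation, the JobRow type and the function names are assumptions made for this illustration, not something stated in the issue.

```python
from typing import NamedTuple


class JobRow(NamedTuple):
    cycle: str
    task: str
    jobid: str
    state: str
    exit_status: int
    tries: int
    duration: float


def parse_job_rows(text):
    """Parse rows such as
    '202112201200  gdas_aeroanlgenb  158495445  DEAD  -29  2  1849.0'.
    Lines that do not have exactly seven fields are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7 or not parts[0].isdigit():
            continue
        cycle, task, jobid, state, exit_status, tries, duration = parts
        try:
            rows.append(
                JobRow(cycle, task, jobid, state,
                       int(exit_status), int(tries), float(duration))
            )
        except ValueError:
            continue  # not a status row after all
    return rows


def dead_jobs(rows):
    """Return the rows whose state is DEAD."""
    return [row for row in rows if row.state == "DEAD"]
```

Comparing dead_jobs() output from before and after a rebuild gives a quick view of which tasks a module change actually fixed.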

@RussTreadon-NOAA
Contributor Author

Cactus g-w CI
g-w CI for DA streams completed on Cactus with all jobs successful

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prgsi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201800        Done    Oct 28 2024 17:21:54    Oct 28 2024 17:35:07
202112210000        Done    Oct 28 2024 17:21:54    Oct 28 2024 20:00:10
202112210600        Done    Oct 28 2024 17:21:54    Oct 28 2024 20:00:10
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prjedi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202402231800        Done    Oct 28 2024 17:21:57    Oct 28 2024 17:37:24
202402240000        Done    Oct 28 2024 17:21:57    Oct 28 2024 20:28:46
202402240600        Done    Oct 28 2024 17:21:57    Oct 28 2024 20:29:10
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201200        Done    Oct 28 2024 17:22:01    Oct 28 2024 17:38:30
202112201800        Done    Oct 28 2024 17:22:01    Oct 28 2024 20:22:16
202112210000        Done    Oct 28 2024 17:22:01    Oct 28 2024 20:18:10
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prwcda_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103241200        Done    Oct 28 2024 17:22:03    Oct 28 2024 17:37:32
202103241800      Active    Oct 28 2024 17:22:03             -          

WCDA CI remains active because 20210324 18Z gdas_prep is waiting for /lfs/h2/emc/dump/noscrub/dump/gdas.20210324/18/atmos/gdas.t18z.updated.status.tm00.bufr_d. g-w issue #3039 requests that this and other dump files required for WCDA g-w CI be copied from Hera to WCOSS2.

@RussTreadon-NOAA
Contributor Author

This issue can be closed once spack-stack/1.6.0 is officially installed on WCOSS2.
