Testing ESMF 8.5.0 release #1854

Closed
junwang-noaa opened this issue Aug 3, 2023 · 79 comments · Fixed by #2013
@junwang-noaa

Description

ESMF 8.5.0 has been released. We need to test it in the ufs-weather-model.


@junwang-noaa junwang-noaa added the enhancement New feature or request label Aug 3, 2023
@uturuncoglu

@junwang-noaa JFYI, the PIO library also needs to be updated to 2.5.10 to be consistent with the ESMF internal PIO version. If I use 2.5.7 along with ESMF 8.5.0, the cpld_control_p8 build fails with the following error:

/work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/pio/2.5.7/lib/libpioc.a(pioc.c.o): In function `PIOc_iosystem_is_active':
/work/noaa/epic/role-epic/contrib/orion/hpc-stack/src-intel-2022.1.2/pkg/pio-2.5.7/src/clib/pioc.c:97: multiple definition of `PIOc_iosystem_is_active'
/work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/esmf/8.5.0/lib/libpioc.a(pioc.c.o):/work/noaa/epic/role-epic/contrib/orion/hpc-stack/src-intel-2022.1.2/pkg/v8.5.0/src/Infrastructure/IO/PIO/ParallelIO/src/clib/pioc.c:97: first defined here
/work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/pio/2.5.7/lib/libpioc.a(pioc.c.o): In function `PIOc_File_is_Open':
/work/noaa/epic/role-epic/contrib/orion/hpc-stack/src-intel-2022.1.2/pkg/pio-2.5.7/src/clib/pioc.c:111: multiple definition of `PIOc_File_is_Open'
/work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/esmf/8.5.0/lib/libpioc.a(pioc.c.o):/work/noaa/epic/role-epic/contrib/orion/hpc-stack/src-intel-2022.1.2/pkg/v8.5.0/src/Infrastructure/IO/PIO/ParallelIO/src/clib/pioc.c:111: first defined here

@uturuncoglu

@junwang-noaa I think the ESMF build uses the internal PIO at this point, right? If so, the other option is building ESMF against an external PIO installation, but we would still need to update PIO to 2.5.10.
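If building ESMF against an external PIO is the route taken, recent ESMF 8.x releases expose this through build-time environment variables. A sketch follows; the paths are placeholders, and the exact variable set should be verified against the ESMF build documentation for the release in use.

```shell
# Point the ESMF build at an external PIO instead of the bundled internal copy.
# Paths are illustrative, not real installs.
export ESMF_PIO=external
export ESMF_PIO_INCLUDE=/path/to/pio/2.5.10/include
export ESMF_PIO_LIBPATH=/path/to/pio/2.5.10/lib
```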

@junwang-noaa

@natalie-perlin We are updating the model to use pio 2.5.10 / hdf5 1.14.0 / netcdf 4.9.2 today or early next week. Would you please install ESMF 8.5.0 with those updated libraries on Orion and Hera so we can test the ufs weather model? Thanks

@natalie-perlin

@junwang-noaa -

ESMF/8.5.0 libraries and the dependent mapl/2.xx.x-esmf-8.5.0 have been installed on Orion and Hera. Please note that due to a mandatory transition to a new role account (role-epic) on Orion and new space for that account, all the stacks were rebuilt in a new location under /work/noaa/epic/role-epic/contrib/orion/.

Orion stack with hdf5/1.14.0, netcdf/4.9.2, pio/2.5.10, (also includes esmf/8.4.0 and mapl built with it)
/work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2_ncdf492

Hera intel:
/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2_ncdf492/
Hera gnu:
/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_ncdf492

@uturuncoglu

uturuncoglu commented Aug 4, 2023

JFYI, I was able to run the cpld_control_p8 case on Orion with the *_ncdf492 modules and ESMF 8.5.0, but the aerosol component is giving an error like the following:

20230804 121636.886 INFO             PET000 UFS Aerosols: Advancing from 2021-03-22T06:00:00 to 2021-03-22T06:12:00
20230804 121654.086 ERROR            PET000 /work/noaa/epic/role-epic/contrib/orion/hpc-stack/src-intel-2022.1.2_ncdf492/pkg/v8.5.0/src/Infrastructure/Base/src/ESMCI_Info.C:830 Info::get(): Key not found (JSON trace will follow): /ESMF/General/GridType
20230804 121654.086 ERROR            PET000 ESMCI_Info.C:832 Info::get() Attribute not set  - [json.exception.out_of_range.403] key 'ESMF' not found
20230804 121654.086 ERROR            PET000 ESMCI_Info.C:836 Info::get() Attribute not set  - Internal subroutine call returned Error
20230804 121654.086 ERROR            PET000 ESMC_InfoCDef.C:601 ESMC_InfoGetCH() Attribute not set  - Internal subroutine call returned Error
20230804 121654.086 ERROR            PET000 ESMF_Info.F90:986 ESMF_InfoGetCH() Attribute not set  - Internal subroutine call returned Error
20230804 121654.086 ERROR            PET000 src/Superstructure/AttributeAPI/interface/ESMF_Attribute.F90:27682 ESMF_AttributeGetObjGridCH() Attribute not set  - Internal subroutine call returned Error
20230804 121654.087 ERROR            PET000 Aerosol_Cap.F90:464 Invalid object  - Passing error in return code
20230804 121654.087 ERROR            PET000 CHM:src/addon/NUOPC/src/NUOPC_ModelBase.F90:2218 Invalid object  - Passing error in return code
20230804 121654.087 ERROR            PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3702 Invalid object  - Phase 'RunPhase1' Run for modelComp 3 did not return ESMF_SUCCESS
20230804 121654.087 ERROR            PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3940 Invalid object  - Passing error in return code
20230804 121654.087 ERROR            PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3617 Invalid object  - Passing error in return code
20230804 121654.087 ERROR            PET000 UFS.F90:403 Invalid object  - Aborting UFS

@uturuncoglu

@junwang-noaa Let me know if you want me to look at it.

@junwang-noaa

@uturuncoglu The gocart is updated in PR #1745. I ran cpld_control_gfsv17 on Hera with ESMF 8.5.0 using the branch from PR #1745, and the test ran successfully.

@uturuncoglu

@junwang-noaa Okay. Sorry for the false alarm. Maybe my fork is not up to date with those changes. It is hard to keep up with the development; everything is changing so fast.

@junwang-noaa

We are committing PR#1745 at this time.

@uturuncoglu

@junwang-noaa Okay. Then, I'll update my fork after that.

@uturuncoglu

@junwang-noaa I just wonder if there is any update about testing 8.5.0.

@junwang-noaa

@uturuncoglu I ran the full RT suite with the ufs WM develop branch (cd8535b). All tests ran to completion, but 31 out of 249 failed in results comparison: 4 failed in the post file comparison against the baseline, while the other 27 failed in the history file comparison. My code is on Hera at:

/scratch1/NCEPDEV/nems/Jun.Wang/nems/vlab/20230621/esmf850/repo_20230816/ufs-weather-model/

You can see the log file at:
/scratch1/NCEPDEV/nems/Jun.Wang/nems/vlab/20230621/esmf850/repo_20230816/ufs-weather-model/tests/logs/RegressionTests_hera.log

@natalie-perlin @theurich FYI.

@uturuncoglu

@junwang-noaa I have no access to Hera, so it would be hard for me to check the logs. Were the tests that failed in the post file comparison only the 3D files? If so, I have seen that issue before in my tests. There could be some kind of interpolation, specific to 3D files (on pressure levels, I think), in the post-processing tool that leads to an answer change. In my case, the surface files were fine. We know that 8.5.0 has fixes/improvements in some interpolation combinations that could lead to answer changes; see the release notes: https://github.com/esmf-org/esmf/releases/tag/v8.5.0

@climbfuji

Now that the UFS has moved to spack-stack, this testing needs to happen there. We can create a test installation of spack-stack develop in anticipation of the next spack-stack release, 1.5.0 (September 8 +/-), which has ESMF 8.5.0 and the mapl version you want to go with (please specify).

Note that although we have parallelio-2.5.10 in spack-stack-1.5.0, the exact version doesn't matter because ESMF is built against the external parallelio provided by spack-stack, not its internal parallelio (that is one of the major improvements, I think).

@junwang-noaa

@climbfuji Currently the UFS WM is still using hpc-stack on Cheyenne and WCOSS2, and spack-stack on Hera/Orion/Gaea/Jet. If you can install ESMF 8.5.0 in spack-stack on Hera, we can test it in the UFS WM.

@climbfuji

@junwang-noaa I have a test stack on Hera with ESMF 8.5.0 and MAPL 2.40.3. I am getting errors from mapl in the cmake step that I am currently debugging with @mathomp4.

@mathomp4

> @junwang-noaa I have a test stack on Hera with ESMF 8.5.0 and MAPL 2.40.3. I am getting errors from mapl in the cmake step that I am currently debugging with @mathomp4.

I have thoughts about this. My fear is that we need to fix up how MAPL handles being installed as a CMake project. Mainly, we aren't propagating the dependencies down, so find_package(MAPL) isn't then doing the find_dependency() calls needed.

I have a test branch of MAPL on hotfix/mathomp4/fix-mapl-cmake-package that might fix this... but I'm not sure exactly how you build GOCART in UFS.

I think the equivalent is if you changed GOCART's main CMakeLists.txt file from:

if (UFS_GOCART)
  find_package (GFTL_SHARED REQUIRED)
  find_package (MAPL REQUIRED)
  include(mapl_acg)

to:

if (UFS_GOCART)
  find_package (GFTL_SHARED REQUIRED)
  find_package (FARGPARSE REQUIRED)
  find_package (PFLOGGER REQUIRED)
  find_package (MAPL REQUIRED)
  include(mapl_acg) 

then it might work?

@climbfuji

Thanks @mathomp4! I will try this for fargparse (this particular version doesn't have pflogger).

@climbfuji

> FARGPARSE

That seems to work, @mathomp4. I suggest using this, however:

if (UFS_GOCART)
  find_package (GFTL_SHARED REQUIRED)
  find_package (FARGPARSE QUIET)
  find_package (PFLOGGER QUIET)
  find_package (MAPL REQUIRED)
  include(mapl_acg) 

because it will also work if mapl was compiled without fargparse or pflogger. If GOCART ends up using those libraries in the future, then QUIET would need to become REQUIRED.

@climbfuji

@junwang-noaa I have a ufs-weather-model branch ready for you with changes in the ufs-weather-model repo itself, as well as in the GOCART submodule (submodule pointer and .gitmodules updated as usual).

git clone -b feature/esmf_850_mapl_2403 --recurse-submodules https://github.com/climbfuji/ufs-weather-model

The branch for GOCART is in my fork (https://github.com/climbfuji/GOCART - bugfix/mapl_cmake). @mathomp4 FYI

I didn't create any PRs yet, and the only test I ran was compiling the S2SWA application as per rt.conf on Hera with Intel:

> ./compile.sh hera "-DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8" "" intel YES NO 2>&1 | tee log.compile.hera.intel
...
Elapsed time 723 seconds. Compiling -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DMOM6SOLO=ON finished

@mathomp4

@climbfuji I was just quickly copy-pasting. QUIET is probably smarter.

I suppose my hope is that GEOS-ESM/MAPL#2320 would make it unneeded, though I guess belt-and-suspenders in GOCART is good (since MAPL versions might not match up).

@junwang-noaa

@climbfuji @mathomp4 Thanks for looking into the issue and providing a solution. I will run a full RT on hera.

@climbfuji

Note that scotch is also updated to 7.0.4 (from 7.0.3), which is a bug fix release only; in particular, it includes the MPI bug fix that Alex Richert tracked down so meticulously over many weeks.

@junwang-noaa

@climbfuji All the coupled tests with aerosol failed even though the compile jobs ran fine. Is there any configuration variable change required?

/scratch1/NCEPDEV/nems/Jun.Wang/nems/vlab/20230621/esmf850/dom/ufs-weather-model/tests/RegressionTests_hera.log
/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_80583/cpld_control_p8_intel

171: *** Error in `/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_80583/cpld_control_p8_intel/./fv3.exe': double free or corruption (fasttop): 0x00000000212a64c0 ***
172: *** Error in `/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_80583/cpld_control_p8_intel/./fv3.exe': double free or corruption (fasttop): 0x00000000227694c0 ***
173: *** Error in `/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_80583/cpld_control_p8_intel/./fv3.exe': double free or corruption (fasttop): 0x0000000021afb4c0 ***
174: *** Error in `/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_80583/cpld_control_p8_intel/./fv3.exe': double free or corruption (fasttop): 0x00000000211b04c0 ***

@climbfuji

This looks like a bug introduced by the new libraries (MAPL, if only the aerosol tests fail?). I am not aware of having to set any additional variables. But it would be good to rerun those tests without OpenMP (in case they use OpenMP).

@mathomp4 FYI

@junwang-noaa

The cpld_control_p8_intel test uses a single thread, so it might not be an OpenMP issue.

@climbfuji

Hmm. It's probably a good idea to rebuild the ufs in debug mode and see what happens there. If that doesn't reveal anything, I need to build esmf and mapl in debug mode in spack-stack.
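A debug rebuild should only require adding `-DDEBUG=ON` to the compile.sh suites argument used earlier in this thread. A hypothetical sketch (shown but not executed here, since it needs the Hera environment):

```shell
# Hypothetical debug-mode variant of the compile.sh call quoted earlier in this
# thread; -DDEBUG=ON is the ufs-weather-model CMake switch for debug builds.
debug_compile='./compile.sh hera "-DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 -DDEBUG=ON" "" intel YES NO'
echo "$debug_compile"
```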

@climbfuji

climbfuji commented Sep 13, 2023

@junwang-noaa I just updated my branch feature/esmf_850_mapl_2403. I was able to run rt.sh against the existing baseline, and all tests ran to completion. The following tests "failed" because of b4b differences in the results:

079 control_csawmg_intel failed in check_result
080 control_csawmgt_intel failed in check_result
081 control_ras_intel failed in check_result
085 control_CubedSphereGrid_debug_intel failed in check_result
086 control_wrtGauss_netcdf_parallel_debug_intel failed in check_result
087 control_stochy_debug_intel failed in check_result
088 control_lndp_debug_intel failed in check_result
089 control_csawmg_debug_intel failed in check_result
090 control_csawmgt_debug_intel failed in check_result
091 control_ras_debug_intel failed in check_result
092 control_diag_debug_intel failed in check_result
095 rap_control_debug_intel failed in check_result
096 hrrr_control_debug_intel failed in check_result
097 rap_unified_drag_suite_debug_intel failed in check_result
098 rap_diag_debug_intel failed in check_result
099 rap_cires_ugwp_debug_intel failed in check_result
100 rap_unified_ugwp_debug_intel failed in check_result
101 rap_lndp_debug_intel failed in check_result
102 rap_progcld_thompson_debug_intel failed in check_result
103 rap_noah_debug_intel failed in check_result
104 rap_sfcdiff_debug_intel failed in check_result
105 rap_noah_sfcdiff_cires_ugwp_debug_intel failed in check_result
106 rrfs_v1beta_debug_intel failed in check_result
107 rap_clm_lake_debug_intel failed in check_result
108 rap_flake_debug_intel failed in check_result
109 control_wam_debug_intel failed in check_result
123 rap_control_dyn64_phy32_intel failed in check_result

This is not surprising, given the version updates for several packages. See the diff for modulefiles/ufs_common.lua. I also removed the unnecessary paths to miniconda, which are not needed for running rt.sh or compiling the model.

My branch uses a temporary/test install of spack-stack that must never make it into the submitted code. The order of operations should be as follows, in my opinion:

  1. After spack-stack-1.5.0 is released, update the ufs-weather-model to it. This will bump versions for ip, scotch, and maybe one other NCEP library, but not yet esmf/mapl (spack-stack-1.5.0 uses 8.4.2 / 2.35.2). I suspect that there will be no b4b differences, but we will find out. As part of this PR, also remove the unnecessary miniconda paths from the environments on all platforms (where possible).
  2. As a follow-up, we roll out esmf 8.5.0 and mapl-2.40.3 as updates to spack-stack-1.5.0 and then the UFS moves to those versions in a separate PR. This should show the same b4b differences that I am seeing here.

Update. My local ufs-weather-model dir on hera is /scratch1/NCEPDEV/da/Dom.Heinzeller/ufs-weather-model-esmf850/

@junwang-noaa

The plan is good. How about fms 2023.02? Will it be included in spack-stack 1.5.0?

@climbfuji

Yes, that is already part of spack-stack-1.5.0. That should be a third PR, I suppose?

@junwang-noaa

I assume it does not change results. Maybe we can combine it with another non-results-changing PR.

@Hang-Lei-NOAA FYI, would you please make requests to NCO for the updated libraries in spack-stack 1.5.0 on wcoss2? I understand they may have to be in hpc-stack before the spack-stack transition, but it would be good if we have the same versions in hpc-stack on wcoss2.

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Sep 13, 2023 via email

@climbfuji

@junwang-noaa @mathomp4 Please see PR #1920

@uturuncoglu

@junwang-noaa Do we have any time estimate for updating ESMF at this point? I was thinking that spack-stack-1.5.0 would include ESMF 8.5.0, but maybe I missed something. Thanks.

@junwang-noaa

My understanding is that ESMF 8.5.0, MAPL 2.40.0, and FMS 2023.02.01 will be in spack-stack 1.5.1. @climbfuji Do you have a timeline for when spack-stack 1.5.1 will be out? I assume EPIC will then build it on all the platforms and create a PR to the ufs weather model.

@climbfuji

> My understanding is that the ESMF 8.5.0, MAPL 2.40.0 and FMS 2023.02.01 will be in spack-stack 1.5.1. @climbfuji Do you have a timeline when spack-stack 1.5.1 will be out? I assume EPIC will then build it on all the platforms and create a PR to the ufs weather model.

Yes, it's getting installed as we "speak". The progress is recorded here: JCSDA/spack-stack#819

@uturuncoglu

@climbfuji @junwang-noaa Thanks for the information. It would be great to have it, since both the land PR and the land DA work that aims to use the component model depend on it. Once 1.5.1 is available, I'll sync the land model PR. Thanks again.

@uturuncoglu

@junwang-noaa @climbfuji It seems that the 1.5.1 release has been created on the spack-stack side, but I could not see any PR on the model side to update ESMF. Any update about it?

@climbfuji

We are ready to do that anytime. But there was a desire to combine this with an update of FMS, even though that's not necessary, and we haven't heard back about the progress of the FMS testing.

@uturuncoglu

@climbfuji I prefer not to delay this any more, since the land PR is still waiting for it, but if there is an intention to include an updated version of FMS too, that is fine. I think we could wait a little bit longer if it won't add another month of delay. Are you planning to release 1.5.2 with the updated FMS and then open a PR on the model side? Thanks for your help.

@climbfuji

climbfuji commented Nov 13, 2023 via email

@climbfuji

@uturuncoglu @junwang-noaa @jkbk2004 I just talked with Jun; the fms library is compiled correctly, so we can go ahead and update to spack-stack@1.5.1 with the associated version updates of esmf to 8.5.0, mapl to 2.40.3, and fms to 2023.03 as soon as we have it on all platforms (see JCSDA/spack-stack#860). It's my understanding that EPIC will issue the ufs-weather-model PR after EPIC/JCSDA/EMC complete the fms@2023.03 install on all the platforms, correct?

@jkbk2004

Yes, a PR should be prepared on the EPIC side to follow up on the library update. Note the dependent PR #1845 and the WCOSS2 dependency as well. @BrianCurtis-NOAA @zach1221 @FernandoAndrade-NOAA FYI

@uturuncoglu

@jkbk2004 It is great to know that we are going forward and updating the model to the recent ESMF version. Please let me know if you need anything from my side. Once that is in, I'll sync the land PR, and I hope we can start working on it.

@uturuncoglu

@jkbk2004 @BrianCurtis-NOAA @zach1221 @FernandoAndrade-NOAA I just wonder if there is any update on this? Thanks for your help. Best.

@climbfuji

FYI, fms@2023.03 is available on all platforms.

@uturuncoglu

@jkbk2004 I wonder if you have any update on the ESMF upgrade? Do we have any timeframe for it? As far as I know, the libraries are already installed. Thanks for your help.

@jkbk2004

> @jkbk2004 I wonder if you have any update about the ESMF upgrade? Do we have any timeframe for it? As I know the libraries are already installed. Thanks for your help.

@uturuncoglu Sorry for the delay. We plan to start testing the spack-stack 1.5.1 update this week. But note that we are being a bit cautious to avoid any chance of divergence with the wcoss2 side.

@uturuncoglu

@jkbk2004 No worries. Thanks again for all of your help. I know you are all busy with different things, but this has really delayed the land PR too much. I am not sure about the exact issue on wcoss2, but are you expecting any further delay for the update?
