C384 P7 coupled tests fail on cheyenne #698

Closed
DeniseWorthen opened this issue Jul 20, 2021 · 16 comments · Fixed by #765, NOAA-EMC/fv3atm#398, NCAR/ccpp-physics#743 or #831

@DeniseWorthen
Collaborator

Description

This job fails to run at startup with an error: MPT: shepherd terminated: r5i4n4.ib0.cheyenne.ucar.edu - job aborting

To Reproduce:

Cheyenne.intel

Additional context

Found during testing for PR #639. The test was turned off for cheyenne.intel until the issue can be resolved.

Since the wave model cannot be compiled in debug mode, an equivalent test was created for a non-wave bmark_p7b configuration. The test can be accessed using rt.test in this branch.

When running in debug mode, the model fails with the following:

2:MPT:     header=header@entry=0x7ffdcfe45c10 "MPT ERROR: Rank 32(g:32) received signal SIGFPE(8).\n\tProcess ID: 62280, Host: r7i7n1, Program: /glade/scratch/worthen/FV3_RT/rt_13027/cpld_bmark_v16_p7b/fv3.exe\n\tMPT Version: HPE MPT 2.22  03/31/20 15"...) at sig.c:340
32:MPT: #3  0x00002b177c55e4ff in first_arriver_handler (signo=signo@entry=8,
32:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2b1786c00080) at sig.c:489
32:MPT: #4  0x00002b177c55e793 in slave_sig_handler (signo=8, siginfo=<optimized out>,
32:MPT:     extra=<optimized out>) at sig.c:565
32:MPT: #5  <signal handler called>
32:MPT: #6  0x0000000008047c67 in module_sf_noahmplsm::energy (parameters=..., ice=0,
32:MPT:     vegtyp=17, ist=1, nsnow=3, nsoil=4, isnow=0, dt=300,
32:MPT:     rhoair=1.1787577636261917, sfcprs=101003.09822185335,
32:MPT:     qair=0.011852218429786931, sfctmp=296.38129134030879,
32:MPT:     thair=296.38129134030879, lwdn=345.52102043841046, uu=3.4295500034750659,
32:MPT:     vv=-3.4672281739002062, zref=10.812501434103201,
32:MPT:     co2air=39.896223797632075, o2air=21109.64752836735, solad=..., solai=...,
32:MPT:     cosz=-0.91289500682162039, igs=1, eair=1910.8519305175637,
32:MPT:     tbot=299.26776123046875, zsnso=..., zsoil=..., elai=0, esai=0, fwet=0,
32:MPT:     foln=1, fveg=0, pahv=0, pahg=0, pahb=0, qsnow=0, dzsnso=...,
32:MPT:     lat=0.21661703129464016, canliq=0, canice=0, iloc=6, jloc=-9999,
32:MPT:     z0wrf=nan(0x7baddadbaddad), imelt=..., snicev=..., snliqv=..., epore=...,
32:MPT:     t2m=nan(0x7baddadbaddad), fsno=0, sav=0, sag=0,
32:MPT:     qmelt=nan(0x7baddadbaddad), fsa=0, fsr=0, taux=nan(0x7baddadbaddad),
32:MPT:     tauy=nan(0x7baddadbaddad), fira=nan(0x7baddadbaddad),
32:MPT:     fsh=nan(0x7baddadbaddad), fcev=nan(0x7baddadbaddad),
32:MPT:     fgev=nan(0x7baddadbaddad), fctr=nan(0x7baddadbaddad),
32:MPT:     trad=nan(0x7baddadbaddad), psn=nan(0x7baddadbaddad),
32:MPT:     apar=nan(0x7baddadbaddad), ssoil=nan(0x7baddadbaddad), btrani=...,
32:MPT:     btran=9.9999999999999995e-07, ponding=nan(0x7baddadbaddad),
32:MPT:     ts=nan(0x7baddadbaddad), latheav=nan(0x7baddadbaddad),
32:MPT:     latheag=nan(0x7baddadbaddad), frozen_canopy=3435973836,
32:MPT:     frozen_ground=3435973836, tv=291.71072387695312, tg=291.71072387695312,
32:MPT:     stc=..., snowh=0, eah=2000, tah=291.71072387695312, sneqvo=0, sneqv=0,
32:MPT:     sh2o=..., smc=..., snice=..., snliq=..., albold=0.65000000000000002, cm=0,
32:MPT:     ch=0, dx=-9999, dz8w=-9999, q2=0.011852218429786931, tauss=0, laisun=0,
32:MPT:     laisha=0, rb=0, errmsg=..., errflg=0, qc=-9999, qsfc=9.99e+20,
32:MPT:     psfc=101128.33749723693, t2mv=0, t2mb=nan(0x7baddadbaddad), fsrv=0,
32:MPT:     fsrg=0, rssun=nan(0x7baddadbaddad), rssha=nan(0x7baddadbaddad), albd=...,
32:MPT:     albi=..., albsnd=..., albsni=..., bgap=0, wgap=0,
32:MPT:     tgv=nan(0x7baddadbaddad), tgb=nan(0x7baddadbaddad),
32:MPT:     q1=nan(0x7baddadbaddad), q2v=0, q2b=nan(0x7baddadbaddad),
32:MPT:     q2e=nan(0x7baddadbaddad), chv=0, chb=nan(0x7baddadbaddad),
32:MPT:     emissi=nan(0x7baddadbaddad), pah=0, shg=0, shc=0,
32:MPT:     shb=nan(0x7baddadbaddad), evg=0, evb=nan(0x7baddadbaddad), ghv=0,
32:MPT:     ghb=nan(0x7baddadbaddad), irg=0, irc=0, irb=nan(0x7baddadbaddad), tr=0,
32:MPT:     evc=0, chleaf=0, chuc=0, chv2=0, chb2=nan(0x7baddadbaddad),
32:MPT:     .tmp.ERRMSG.len_V$698=512)
@DeniseWorthen DeniseWorthen added the bug Something isn't working label Jul 20, 2021
@climbfuji
Collaborator

There is an awful number of NaNs in that stack trace. Let's try to figure it out today between my meetings.

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Aug 20, 2021

While testing restarts (i.e., no waves) for the upcoming RTs for the coupled model with the P7 configuration, I have been able to run the c192 coupled P7 test (non-wave) on cheyenne.intel. However, the c384 P7 case (non-wave) still fails with the MPT error.

@DeniseWorthen
Collaborator Author

The C384 P7 coupled cases still fail on Cheyenne for PR #765. We now have a standalone P7 test so I will make a C384 version of this to see if the error is reproducible.

@DeniseWorthen DeniseWorthen changed the title from "cpld_bmark_wave_v16_p7b fails on cheyenne.intel" to "C384 P7 coupled tests fail on cheyenne" Sep 30, 2021
@DeniseWorthen
Collaborator Author

I created a control_c384_p7 test and it runs to completion on Cheyenne.intel.

@DeniseWorthen
Collaborator Author

Re-opening.

@DeniseWorthen DeniseWorthen reopened this Oct 7, 2021
@DeniseWorthen
Collaborator Author

I've been able to run the cpld_control_c384_p7 test on cheyenne.intel by turning off merra2. In input.nml:

iaer = 1011 -> iaer = 5111
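
For anyone repeating this experiment across several run directories, the switch can be scripted. The following is only a minimal sketch and not part of the UFS tooling: `set_iaer` is a hypothetical helper, and it assumes `iaer` appears in the namelist as a plain `iaer = <integer>` assignment at the start of a line.

```python
import re


def set_iaer(namelist_text: str, new_value: int) -> str:
    """Return namelist_text with each `iaer = <int>` assignment set to new_value.

    Hypothetical helper for illustration only. It assumes `iaer` is a plain
    integer assignment at the start of a line, as in the change above.
    """
    pattern = re.compile(r"^(\s*iaer\s*=\s*)\d+", re.IGNORECASE | re.MULTILINE)
    # Keep the original "iaer = " prefix (including spacing), swap the digits.
    updated, count = pattern.subn(lambda m: f"{m.group(1)}{new_value}", namelist_text)
    if count == 0:
        raise ValueError("no iaer assignment found in namelist text")
    return updated


if __name__ == "__main__":
    sample = "  iaer = 1011\n"
    print(set_iaer(sample, 5111), end="")
```

Reading `input.nml` into a string, passing it through `set_iaer`, and writing it back would apply the same edit described above; a dedicated namelist parser would be more robust for anything beyond a one-off test.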

@junwang-noaa
Collaborator

junwang-noaa commented Oct 7, 2021 via email

@DeniseWorthen
Collaborator Author

Yes, I have the aeroclim.m*.nc files in the run directory.

@climbfuji
Collaborator

climbfuji commented Oct 7, 2021 via email

@DeniseWorthen
Collaborator Author

I went back and checked the commit history. The test first failed after we converted the v16 test to a v16_p7b test.

@climbfuji
Collaborator

climbfuji commented Oct 7, 2021 via email

@DeniseWorthen
Collaborator Author

I first found that turning it off w/ gnu-debug worked. I will test w/ the non-debug version.

@DeniseWorthen
Collaborator Author

I used both intel and gnu in non-debug mode. First I ran the standard cpld_control_c384_p7 test and they both failed with the MPT error. I copied the run directories and changed iaer to 5111. Both cases ran. The run directories are on cheyenne:

/glade/scratch/worthen/FV3_RT/c384_test_gnu and c384_test_intel

The gnu case timed out but did run all the way to fh=6.

@junwang-noaa
Collaborator

junwang-noaa commented Oct 8, 2021 via email

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Oct 8, 2021

It appears that increasing the memory available on Cheyenne allows the cpld_control_c384_p7 to run with MERRA2 turned on (iaer=1011).

I made this change in the job_card:

< #PBS -l select=27:ncpus=18:mpiprocs=18
---
> #PBS -l select=14:ncpus=36:mpiprocs=36

...

mpiexec_mpt -p %g: -np 480 ./fv3.exe

I'm not sure that is entirely the right way to do it, but the run does complete the 6 hours for intel.
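
A rough way to compare the two `select` lines in the job_card diff above is to compute how many MPI slots each layout provides and how much node memory each rank can claim. This is a back-of-the-envelope sketch: `layout` is a hypothetical helper, and the 64 GB node memory is an assumed nominal value that is not stated anywhere in this thread.

```python
NODE_MEM_GB = 64.0  # ASSUMPTION: nominal memory per node; not given in the thread
RANKS = 480         # from "mpiexec_mpt ... -np 480" in the job card


def layout(select: int, mpiprocs: int, ranks: int = RANKS,
           node_mem_gb: float = NODE_MEM_GB) -> dict:
    """Summarise a PBS 'select=<select>:...:mpiprocs=<mpiprocs>' request."""
    slots = select * mpiprocs
    if slots < ranks:
        raise ValueError(f"{slots} slots cannot hold {ranks} ranks")
    return {
        "slots": slots,
        "idle_slots": slots - ranks,
        # Memory share per rank when a node is fully populated with mpiprocs ranks.
        "mem_per_rank_gb": node_mem_gb / mpiprocs,
    }


if __name__ == "__main__":
    for select, mpiprocs in [(27, 18), (14, 36)]:
        print(f"select={select}:mpiprocs={mpiprocs} ->", layout(select, mpiprocs))
```

Under these assumptions, both layouts provide at least the 480 slots requested, and the 18-ranks-per-node layout gives each rank twice the memory share of the 36-ranks-per-node layout, whatever the actual node memory is.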

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Oct 9, 2021

I believe the issue is the memory footprint of the ingested MERRA2 data. The values are stored in the files as float (r4), but when read in they are immediately promoted to double precision. They are then interpolated in time and space for use by the ATM. If the ingested values are instead kept as single precision and promoted to double precision only when they are interpolated, the model runs on Cheyenne with the default resources.
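
To get a feel for the saving, the size of the ingested array can be estimated for both precisions. This is a sketch under stated assumptions: the constants `levsaer=72`, `ntrcaerm=15`, and `timeaer=12` are the ones declared in `aerclm_def.F` as quoted in this issue, the array is assumed to be dimensioned by those three constants plus a horizontal patch, and the patch size used here (`nlon`, `nlat`) is purely hypothetical.

```python
from array import array

# Declared in aerclm_def.F (see the diff in this issue).
LEVSAER, NTRCAERM, TIMEAER = 72, 15, 12


def aerin_bytes(nlon: int, nlat: int, typecode: str) -> int:
    """Bytes needed for an (nlon, nlat, levsaer, ntrcaerm, timeaer) array.

    typecode 'f' is a C float (single precision), 'd' a C double.
    """
    itemsize = array(typecode).itemsize
    return nlon * nlat * LEVSAER * NTRCAERM * TIMEAER * itemsize


if __name__ == "__main__":
    nlon, nlat = 100, 50  # HYPOTHETICAL per-rank patch, for illustration only
    for label, code in [("single (r4)", "f"), ("double (r8)", "d")]:
        mib = aerin_bytes(nlon, nlat, code) / 2**20
        print(f"{label}: {mib:.1f} MiB per rank")
```

Whatever the real patch size is, storing the values as single precision halves this footprint on every rank, which is consistent with the behaviour described above.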

I have tested the following change on both cheyenne.intel and hera.intel. On hera.intel all baselines pass against develop-20211006.

diff --git a/physics/aerclm_def.F b/physics/aerclm_def.F
index 157c7b96..e6682527 100644
--- a/physics/aerclm_def.F
+++ b/physics/aerclm_def.F
@@ -1,5 +1,5 @@
       module aerclm_def
-      use machine , only : kind_phys
+      use machine , only : kind_phys, kind_io4
       implicit none

       integer, parameter   :: levsaer=72, ntrcaerm=15, timeaer=12
@@ -10,8 +10,8 @@

       real (kind=kind_phys), allocatable, dimension(:) :: aer_lat
       real (kind=kind_phys), allocatable, dimension(:) :: aer_lon
-      real (kind=kind_phys), allocatable, dimension(:,:,:,:) :: aer_pres
-      real (kind=kind_phys), allocatable, dimension(:,:,:,:,:) :: aerin
+      real (kind=kind_io4),  allocatable, dimension(:,:,:,:) :: aer_pres
+      real (kind=kind_io4),  allocatable, dimension(:,:,:,:,:) :: aerin

       data aer_time/15.5, 45.,  74.5,  105., 135.5, 166., 196.5,
      &             227.5, 258., 288.5, 319., 349.5, 380.5/
diff --git a/physics/aerinterp.F90 b/physics/aerinterp.F90
index dbcf7360..4b3232ab 100644
--- a/physics/aerinterp.F90
+++ b/physics/aerinterp.F90
@@ -181,7 +181,7 @@ contains
              endif
              do i = iamin, iamax
                aerin(i,j,k,ii,imon) = 1.d0*buffx(i,j,klev,1)
-               if(aerin(i,j,k,ii,imon) < 0 .or. aerin(i,j,k,ii,imon) > 1.)  then
+               if(aerin(i,j,k,ii,imon) < 0. .or. aerin(i,j,k,ii,imon) > 1.)  then
                  aerin(i,j,k,ii,imon) = 1.e-15
                endif
              enddo   !i-loop (lon)

In testing, I found that the diag_table_template for the coupled model was not updated correctly in PR #765. Fixing the diag_template is expected to break the P7 tests because of added fields.

epic-cicd-jenkins pushed a commit that referenced this issue Apr 17, 2023
* Workflow in python starting to work.

* Use new python_utils package structure.

* Some bug fixes.

* Use uppercase TRUE/FALSE in var_dfns

* Use config.sh by default.

* Minor bug fixes.

* Remove config.yaml

* Update to the latest develop

* Remove quotes from numbers in predef grid.

* Minor bug fix.

* Move validity checker to the bottom of setup

* Add more unit tests.

* Update with python_utils changes.

* Update to latest develop additions (Need to re-run regression test)

* Use set_namelist and fill_jinja_template as python functions.

* Replace sed regex searches with python re.

* Use python realpath.

* Construct settings as dictionary before passing to fill_jinja and set_namelist

* Use yaml for setting predefined grid parameters.

* Use xml parser for ccpp phys suite definition file.

* Remove more run_command calls.

* Simplify some func argument processing.

* Move different config format parsers to same file.

* Use os.path.join for the sake of macosx

* Remove remaining func argument processing via os.environ.

* Minor bug fix in set_extrn_mdl_params.sh

* Add suite defn in test_data.

* Minor fixes on unittest on jet.

* Simplify boolean condition checks.

* Include old in renaming of old directories

* Fix conflicting yaml !join tag for paths and strings.

* Bug fix with setting sfcperst dict.

* Imitate "readlink -m" with os.path.realpath instead of os.readlink

* Don't use /tmp as that is shared by multiple users.

* Bug fix with cron line, maintain quotes around TRUE/FALSE.

* Update to latest develop (untested)

* Bug fix with existing cron line and quotes.

* Bug fix with case-sensitive MACHINE name, and empty EXPT_DIR.

* Update to latest develop

* More updates.

* Bug fix thanks to @willmayfield! Check both starting/ending
characters are brackets for shell variable to be considered an array.

* Make empty EXPT_BASEDIR workable.

* Update to latest develop

* Update in predef grid.

* Check f90nml as well.

Co-authored-by: Daniel Abdi <dabdi@Orion-login-2.HPC.MsState.Edu>