Error handling in analcalc does not abort script #241

CatherineThomas-NOAA · 2021-10-21T11:48:34Z

There have been a few instances in the low resolution parallels for v16.x on Hera where one of the parts of analcalc fails, but the job continues and ends "successfully". For example, the log files at /scratch1/NCEPDEV/da/Catherine.Thomas/v16x/analcalc show calc_analysis.x failing for the deterministic resolution:

srun: error: Unable to create step for job 24003915: Requested node configuration is not available
Error with calc_analysis.x for deterministic resolution, exit code=1
...
20.093 + err=1
20.093 + export err
20.093 + err_chk

*************************************************************
** FATAL ERROR: Job anal.307018 failed RETURN CODE 1
** ABNORMAL EXIT at Sat Oct  9 00:41:04 UTC 2021 on h4c05
*************************************************************

However, the script does not exit. Instead, it continues on with the surface analysis:

20.267 + [ YES '=' YES ]
20.267 + APRUNSFC='srun --export=ALL -n 1'
20.267 + export APRUNSFC
20.267 + OMP_NUM_THREADS_SFC=1
20.267 + export OMP_NUM_THREADS_SFC
20.267 + /scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201222_v161/ush/gaussian_sfcanl.sh
Sat Oct 9 00:41:04 UTC 2021 EXECUTING /scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201222_v161/ush/gaussian_sfcanl.sh

These gdasanalcalc failures resolve upon a rerun of the job and are not reproducible.

I don't see anything different in the error handling for the calc_analysis part compared to other parts of the code. It uses err_chk, added when addressing NCO's bugzillas in v16.1 (#137).


RussTreadon-NOAA · 2021-10-21T16:02:40Z

gdasanalcalc.log.0 loads prod_util/1.1.0 from /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles.

Module prod_util/1.1.0 defines the path, /scratch2/NCEPDEV/nwprod/NCEPLIBS/utils/prod_util.v1.1.0/ush/, to err_chk and other production ush scripts.

Printout in gdasanalcalc.log.0 shows that err_chk is working as designed on Hera. The problem is in err_exit.

The err -ne 0 block of err_chk ends with err_exit $@. As err_exit executes, it generates the traceback information printed to gdasanalcalc.log.0. This is OK. The following logical test in err_exit is problematic on Hera:

if [ -n "$LSB_JOBID" ]; then
 if [ "$SENDECF" = "YES" ]; then
    timeout 30 ecflow_client --msg "$ECF_NAM
...
 fi
#  bkill $LSB_JOBID
  scancel $LSB_JOBID
  sleep 60  # Wait for LSF to kill the job
fi

According to online documentation, the -n test checks whether $LSB_JOBID has nonzero length. Variable LSB_JOBID is not defined in gdasanalcalc.log.0. Thus, the logical test returns false and execution does not enter the block, so the scancel command is not executed. Even if execution entered the block, the scancel line would not kill the job, because LSB_JOBID does not contain the job ID on Hera. err_exit should reference SLURM_JOBID on Hera.
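To illustrate the point about the -n test, here is a minimal sketch (not the actual err_exit code). The function name jobid_block_entered is hypothetical and exists only for this demonstration; it reports whether a guard of the form used in err_exit would be entered.

```shell
# Hypothetical helper for illustration only; not part of prod_util.
# Mirrors the guard `if [ -n "$LSB_JOBID" ]` quoted from err_exit.
jobid_block_entered() {
  # ${LSB_JOBID:-} expands to an empty string when the variable is unset,
  # so the -n (non-empty) test is false in that case.
  if [ -n "${LSB_JOBID:-}" ]; then
    echo "yes"
  else
    echo "no"
  fi
}
```

With LSB_JOBID unset, as it would be under Slurm, the helper reports that the block is skipped, which matches the behavior described above.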

The job kill line in err_exit is unique to each machine and scheduler. WCOSS2 uses qdel $PBS_JOBID. WCOSS_D uses bkill $LSB_JOBID. Hera should use scancel $SLURM_JOBID.
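One way the per-scheduler kill line could be chosen is sketched below. This is an assumption-laden illustration, not the prod_util implementation: the function name select_kill_cmd is hypothetical, it assumes the PBS job ID is in the scheduler's standard PBS_JOBID variable, and it only prints the command it would run rather than executing it.

```shell
# Hypothetical sketch; not the actual err_exit logic.
# Prints the kill command matching whichever scheduler job ID variable is set.
select_kill_cmd() {
  if [ -n "${PBS_JOBID:-}" ]; then
    # PBS (WCOSS2 in the comment above); PBS_JOBID is assumed here
    echo "qdel ${PBS_JOBID}"
  elif [ -n "${LSB_JOBID:-}" ]; then
    # LSF (WCOSS_D)
    echo "bkill ${LSB_JOBID}"
  elif [ -n "${SLURM_JOBID:-}" ]; then
    # Slurm (Hera)
    echo "scancel ${SLURM_JOBID}"
  else
    # No known scheduler variable is set; signal failure to the caller
    return 1
  fi
}

# Example use (commented out so that sourcing this sketch has no side effects):
# kill_cmd=$(select_kill_cmd) && $kill_cmd
```

Returning a nonzero status when no job ID is found lets the caller decide on a fallback, rather than silently skipping the kill as the quoted guard does.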

To see if exglobal_atmos_analysis_calc.sh properly handles error conditions (at least one type of error condition), the following test was run on Venus using operational input and the operational workflow.

gdasanalcalc was submitted for 2021101900 with all required input present. The job ran to completion and reproduced operations.

File gdas.t00z.atminc.nc was removed and the job resubmitted. calcanl_gfs.py aborted upon detection of the missing input file:

  File "netCDF4/_netCDF4.pyx", line 1731, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'siginc.nc'
0.591 + err=1
0.591 + export err

err_chk was executed

0.591 + err_chk

*************************************************************
** FATAL ERROR: Job v16ops_gdasanalcalc_00.71198453 failed RETURN CODE 1
** ABNORMAL EXIT at Thu Oct 21 15:46:36 UTC 2021 on v100c43f
*************************************************************


Currently Loaded Modules:

followed by err_exit

drwxr-xr-x 2 emc.glopara g01 512 Oct 21 15:46 calcanl_ensres_06
cat: OUTPUT.74149: No such file or directory
Job <71198453> is being terminated

------------------------------------------------------------
Sender: LSF System <lsfadmin@v100c43f>
Subject: Job 71198453: <v16ops_gdasanalcalc_00> in cluster <venus> Exited

The job was correctly terminated upon encountering a non-zero return code.

aerorahul · 2022-07-29T14:03:55Z

@CatherineThomas-NOAA @RussTreadon-NOAA
I imagine this is still an issue. Can we move this to the global-workflow since the jobs and scripts now live there?

RussTreadon-NOAA · 2022-07-29T14:10:32Z

@aerorahul , this is ancient history to me. We should confirm that this is still a problem. If it isn't, we close this GSI issue and we're done. If it remains a problem, we still close this GSI issue and open a g-w issue as you note.

aerorahul · 2022-07-29T14:12:14Z

Thanks @RussTreadon-NOAA
I will close this issue and if this problem persists, please open one in global-workflow.

MichaelLueken assigned CatherineThomas-NOAA Oct 21, 2021

DavidHuber-NOAA mentioned this issue Jan 26, 2022

exglobal_atmos_analysis.sh fails to exit on crash of global_gsi.x #293

Closed

aerorahul closed this as completed Jul 29, 2022
