Error handling in analcalc does not abort script #241

Closed
CatherineThomas-NOAA opened this issue Oct 21, 2021 · 4 comments

@CatherineThomas-NOAA
Collaborator

There have been a few instances in the low-resolution parallels for v16.x on Hera where one part of analcalc fails, but the job continues and ends "successfully". For example, the log files at /scratch1/NCEPDEV/da/Catherine.Thomas/v16x/analcalc show a failure of calc_analysis.x for the deterministic resolution:

srun: error: Unable to create step for job 24003915: Requested node configuration is not available
Error with calc_analysis.x for deterministic resolution, exit code=1
...
20.093 + err=1
20.093 + export err
20.093 + err_chk

*************************************************************
** FATAL ERROR: Job anal.307018 failed RETURN CODE 1
** ABNORMAL EXIT at Sat Oct  9 00:41:04 UTC 2021 on h4c05
*************************************************************

However, the script does not exit; it continues on with the surface analysis:

20.267 + [ YES '=' YES ]
20.267 + APRUNSFC='srun --export=ALL -n 1'
20.267 + export APRUNSFC
20.267 + OMP_NUM_THREADS_SFC=1
20.267 + export OMP_NUM_THREADS_SFC
20.267 + /scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201222_v161/ush/gaussian_sfcanl.sh
Sat Oct 9 00:41:04 UTC 2021 EXECUTING /scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201222_v161/ush/gaussian_sfcanl.sh

These gdasanalcalc failures resolve upon a rerun of the job and are not reproducible.

I don't see anything different in the error handling for the calc_analysis part compared to other parts of the code. It uses err_chk, added when addressing NCO's bugzillas in v16.1 (#137).

@RussTreadon-NOAA
Contributor

gdasanalcalc.log.0 loads prod_util/1.1.0 from /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles.

Module prod_util/1.1.0 defines the path, /scratch2/NCEPDEV/nwprod/NCEPLIBS/utils/prod_util.v1.1.0/ush/, to err_chk and other production ush scripts.
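
Since err_chk is located through a path rather than defined as a shell function, it presumably runs as a separate process, in which case an exit inside it cannot by itself stop the calling script. A minimal sketch of that behavior follows; fake_err_chk and demo_caller are hypothetical stand-ins for illustration, not the prod_util scripts:

```shell
#!/bin/sh
# fake_err_chk is a hypothetical stand-in, NOT the prod_util err_chk script.
# The ( ... ) body runs in a subshell, mimicking a helper run as a child process.
fake_err_chk() (
    if [ "${err:-0}" -ne 0 ]; then
        echo "FATAL ERROR: failed RETURN CODE $err"
        exit "$err"    # terminates only the subshell, not the caller
    fi
)

demo_caller() {
    err=0
    false || err=$?    # simulate a failed executable; err becomes 1
    export err
    status=0
    fake_err_chk || status=$?    # helper reports the failure and exits
    echo "caller still running after fake_err_chk (status $status)"
}

demo_caller
```

Under that assumption, the caller only stops if something outside the helper ends it, which is the role the scheduler kill command plays.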

Printout in gdasanalcalc.log.0 shows that err_chk is working as designed on Hera. The problem is in err_exit.

The err -ne 0 block of err_chk ends with err_exit $@. As err_exit is executed, it generates the traceback information printed to gdasanalcalc.log.0. This is OK. The following logical test in err_exit is problematic on Hera:

if [ -n "$LSB_JOBID" ]; then
 if [ "$SENDECF" = "YES" ]; then
    timeout 30 ecflow_client --msg "$ECF_NAM
...
 fi
#  bkill $LSB_JOBID
  scancel $LSB_JOBID
  sleep 60  # Wait for LSF to kill the job
fi 

According to online documentation, the -n test checks whether $LSB_JOBID has nonzero length. Variable LSB_JOBID is not defined in gdasanalcalc.log.0. Thus, the logical test returns false and execution does not enter the block, so the scancel command is not executed. Even if execution entered the block, the scancel line would not kill the job, because LSB_JOBID does not contain the job ID on Hera. err_exit should reference SLURM_JOBID on Hera.
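
The behavior of the -n test on an unset variable can be checked in isolation. This is a small sketch; jobid_state is a hypothetical helper written for this illustration and is not part of err_exit:

```shell
#!/bin/sh
# jobid_state is a hypothetical helper, not part of prod_util.
# It echoes "set" when its argument is a non-empty string, "empty" otherwise.
jobid_state() {
    if [ -n "$1" ]; then
        echo "set"
    else
        echo "empty"
    fi
}

unset LSB_JOBID
# An unset variable expands to the empty string, so -n is false
# and a block guarded by this test is skipped.
echo "LSB_JOBID is $(jobid_state "${LSB_JOBID:-}")"
```

With LSB_JOBID unset, as on Hera, the helper reports "empty", which corresponds to the kill block being skipped.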

The job kill line in err_exit is unique to each machine and scheduler. WCOSS2 uses qdel $PBS_JOBID. WCOSS_D uses bkill $LSB_JOBID. Hera should use scancel $SLURM_JOBID.
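
One way to cover the schedulers in a single script is to select the kill command from whichever job ID variable is set. The following is only a sketch under that assumption, not the actual err_exit; job_kill_cmd is a hypothetical name, and PBS_JOBID is the job ID variable set by PBS:

```shell
#!/bin/sh
# job_kill_cmd is a hypothetical helper, not part of prod_util.
# It prints the kill command for whichever scheduler job ID variable is set,
# or prints nothing and returns 1 when none of them is defined.
job_kill_cmd() {
    if [ -n "${SLURM_JOBID:-}" ]; then
        echo "scancel $SLURM_JOBID"      # Slurm (e.g. Hera)
    elif [ -n "${PBS_JOBID:-}" ]; then
        echo "qdel $PBS_JOBID"           # PBS (e.g. WCOSS2)
    elif [ -n "${LSB_JOBID:-}" ]; then
        echo "bkill $LSB_JOBID"          # LSF (e.g. WCOSS_D)
    else
        return 1
    fi
}
```

The function only prints the command, so the choice can be inspected before anything is killed; a caller could run the result with something like `cmd=$(job_kill_cmd) && $cmd`.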

To see if exglobal_atmos_analysis_calc.sh properly handles error conditions (at least one type of error condition), the following test was run on Venus using operational input and the operational workflow.

gdasanalcalc was submitted for 2021101900 with all required input present. The job ran to completion and reproduced operations.

File gdas.t00z.atminc.nc was removed and the job was resubmitted. calcanl_gfs.py aborted upon detection of the missing input file:

  File "netCDF4/_netCDF4.pyx", line 1731, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'siginc.nc'
0.591 + err=1
0.591 + export err

err_chk was executed:

0.591 + err_chk

*************************************************************
** FATAL ERROR: Job v16ops_gdasanalcalc_00.71198453 failed RETURN CODE 1
** ABNORMAL EXIT at Thu Oct 21 15:46:36 UTC 2021 on v100c43f
*************************************************************


Currently Loaded Modules:

followed by err_exit:

drwxr-xr-x 2 emc.glopara g01 512 Oct 21 15:46 calcanl_ensres_06
cat: OUTPUT.74149: No such file or directory
Job <71198453> is being terminated

------------------------------------------------------------
Sender: LSF System <lsfadmin@v100c43f>
Subject: Job 71198453: <v16ops_gdasanalcalc_00> in cluster <venus> Exited

The job was correctly terminated upon encountering a non-zero return code.

@aerorahul
Contributor

@CatherineThomas-NOAA @RussTreadon-NOAA
I imagine this is still an issue. Can we move this to the global-workflow since the jobs and scripts now live there?

@RussTreadon-NOAA
Contributor

@aerorahul, this is ancient history to me. We should confirm that this is still a problem. If it isn't, we close this GSI issue and we're done. If it remains a problem, we still close this GSI issue and open a g-w issue as you note.

@aerorahul
Contributor

Thanks @RussTreadon-NOAA
I will close this issue and if this problem persists, please open one in global-workflow.
