Error handling in analcalc does not abort script #241
gdasanalcalc.log.0 loads prod_util/1.1.0 from /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles. Module prod_util/1.1.0 defines the path, /scratch2/NCEPDEV/nwprod/NCEPLIBS/utils/prod_util.v1.1.0/ush/, to err_chk and other production ush scripts. Printout in gdasanalcalc.log.0 shows that err_chk is working as designed on Hera. The problem is in err_exit. The
According to online documentation, the -n test checks whether $LSB_JOBID has nonzero length. The variable LSB_JOBID is not defined in gdasanalcalc.log.0. Thus, the test returns false, execution does not enter the block, and the scancel command is not executed. Even if execution entered the block, the scancel line would not kill the job, because LSB_JOBID does not contain the job ID on Hera. err_exit should reference SLURM_JOBID on Hera. The job kill line in err_exit is unique to each machine and scheduler. WCOSS2 uses
To see if exglobal_atmos_analysis_calc.sh properly handles error conditions (at least one type of error condition), the following test was run on Venus using operational input and the operational workflow. gdasanalcalc was submitted for 2021101900 with all required input present. The job ran to completion and reproduced operations. File gdas.t00z.atminc.nc was then removed and the job resubmitted. calcanl_gfs.py aborted upon detection of the missing input file
err_chk was executed
followed by err_exit
The job was correctly terminated upon encountering a non-zero return code.
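The scheduler dependence described above (LSB_JOBID is unset on Hera, so a block guarded by a -n test on it is skipped) can be illustrated with a minimal sketch. This is not the prod_util err_exit script. The function name kill_command_for_scheduler is hypothetical, and pairing bkill with LSB_JOBID assumes a standard LSF setup. The function only prints the command it would run, so it is safe to try outside a batch job.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of prod_util): choose a job-kill command
# based on which scheduler job-ID variable is set in the environment.
kill_command_for_scheduler() {
  if [ -n "${SLURM_JOBID:-}" ]; then
    # Slurm (e.g. Hera): the job ID is exported as SLURM_JOBID.
    echo "scancel ${SLURM_JOBID}"
  elif [ -n "${LSB_JOBID:-}" ]; then
    # LSF: the job ID is exported as LSB_JOBID (bkill assumed here).
    echo "bkill ${LSB_JOBID}"
  else
    # Neither variable is set: -n is false for both, so nothing is printed.
    echo ""
  fi
}
```

Under Slurm with LSB_JOBID unset, a test on LSB_JOBID alone never fires, which matches the behavior reported on Hera; checking SLURM_JOBID first avoids that.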
@CatherineThomas-NOAA @RussTreadon-NOAA
@aerorahul, this is ancient history to me. We should confirm that this is still a problem. If it isn't, we close this GSI issue and we're done. If it remains a problem, we still close this GSI issue and open a g-w issue as you note.
Thanks @RussTreadon-NOAA
There have been a few instances in the low resolution parallels for v16.x on Hera where one of the parts of analcalc fails, but the job continues and ends "successfully". For example, the log files at /scratch1/NCEPDEV/da/Catherine.Thomas/v16x/analcalc show failure of calc_analysis.x for the deterministic:
However, the script does not exit; it continues on with the surface analysis:
These gdasanalcalc failures resolve upon a rerun of the job and are not reproducible.
I don't see anything different in the error handling for the calc_analysis part compared to other parts of the code. It uses err_chk, added when addressing NCO's bugzillas in v16.1 (#137).
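For reference, the behavior expected here (stop the script as soon as a step returns non-zero) can be sketched as below. This is an illustration only, not how err_chk is implemented; run_step is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not err_chk): run a command and abort the calling
# script with the same return code if the command fails.
run_step() {
  "$@"
  local rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "FATAL: '$*' failed with return code $rc" >&2
    exit "$rc"
  fi
}
```

With a wrapper like this, a call such as `run_step ./calc_analysis.x` would end the script on failure instead of letting it proceed to the surface analysis.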