rt.sh does not report on success/failure of compile jobs #368

Closed
climbfuji opened this issue Jan 13, 2021 · 3 comments · Fixed by #378
Labels: bug (Something isn't working)

Comments

@climbfuji
Collaborator

Description

When rt.sh is used to run the regression tests, compile jobs that fail go unnoticed unless a subsequent regression test depends on the executable produced by that compile job.

To Reproduce:

With the current hash of develop (6daad90), run the GNU regression tests on hera.

The compile jobs for the coupled model fail due to recent code changes in CMEPS (that are incompatible with GNU but work with Intel). These errors go unnoticed, because no coupled model tests are run.
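
As a complement to the scheduler status, a script could also look for `make` error lines in the compile output. The following is a minimal sketch and not how rt.sh works: the function name `log_has_make_error` and the demo log are hypothetical, and it assumes the compile job's output is captured in a plain-text log file.

```shell
#!/bin/bash
# Hypothetical helper, not part of rt.sh or rt_utils.sh.
# Returns 0 if the given log file contains a GNU make error line such as
#   make[2]: *** [path/to/file.o] Error 1
log_has_make_error() {
  local logfile=$1
  grep -Eq '^make(\[[0-9]+\])?: \*\*\* .* Error [0-9]+' "${logfile}"
}

# Demonstration with a temporary log (contents are illustrative only).
demo_log=$(mktemp)
printf '%s\n' \
  'Error: Different CHARACTER lengths (7/6) in array constructor at (1)' \
  'make[2]: *** [cmeps.dir/esmFldsExchange_nems_mod.F90.o] Error 1' > "${demo_log}"

if log_has_make_error "${demo_log}"; then
  compile_status='FAIL'
else
  compile_status='PASS'
fi
echo "compile_status=${compile_status}"
rm -f "${demo_log}"
```

A check like this only catches failures that `make` reports in the log; it would not notice a job that was killed by the scheduler before writing any error.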

Additional context

The error is (thanks to @pjpegion for reporting this):

/scratch2/BMC/gsienkf/Philip.Pegion/ufs-weather-model-develop/CMEPS-interface/CMEPS/mediator/esmFldsExchange_nems_mod.F90:87:46:

   87 |                'Sa_u10m','Sa_v10m', 'Sa_t2m ', 'Sa_q2m'/)
      |                                              1
Error: Different CHARACTER lengths (7/6) in array constructor at (1)
make[2]: *** [CMEPS-interface/CMakeFiles/cmeps.dir/CMEPS/mediator/esmFldsExchange_nems_mod.F90.o] Error 1

The solution for this particular problem is to pad the entries with trailing whitespace so that they all have the same length, or alternatively to give the array constructor an explicit type specification, for example:

attrList=(/character(11)::"fhzero", "ncld", "nsoil", "imp_physics", "dtp"/), rc=rc)

climbfuji added the bug label on Jan 13, 2021

@DusanJovic-NOAA
Collaborator

When a job fails, Slurm reports the job status as '-' (unknown):

Job id 15319645
TEST 10 compile is waiting to enter the queue
TEST 10 compile is submitted 
1 min. TEST 10 compile is pending,  status: PD jobid 15319645
2 min. TEST 10 compile is running,  status: R jobid 15319645
Slurm unknown status -. Check sacct ...
15319645                   FAILED           compile_10 
15319645.ba+               FAILED                batch 
15319645.ex+            COMPLETED               extern 
3 min. TEST 10 compile is FAILED,  status: - jobid 15319645

In such cases we must check the status_label that sacct returns for the given job ID and set test_status manually.

This change will do the trick:

$ git diff rt_utils.sh
diff --git a/tests/rt_utils.sh b/tests/rt_utils.sh
index acb1b87..c58d775 100755
--- a/tests/rt_utils.sh
+++ b/tests/rt_utils.sh
@@ -154,6 +154,9 @@ submit_and_wait() {
         echo "Slurm unknown status ${status}. Check sacct ..."
         sacct -n -j ${slurm_id} --format=JobID,state%20,Jobname%20
         status_label=$( sacct -n -j ${slurm_id} --format=JobID,state%20,Jobname%20 | grep "^${slurm_id}" | grep ${JBNME} | awk '{print $2}' )
+        if [[ $status_label = 'FAILED' ]]; then
+            test_status='FAIL'
+        fi
       fi
 
     elif [[ $SCHEDULER = 'lsf' ]]; then
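
The check in the diff above covers the `FAILED` state only. A minimal sketch of the same idea, extended to a few other terminal Slurm job states, might look like the following. The function names `extract_job_state` and `sacct_state_to_test_status` are hypothetical and not part of rt_utils.sh. The sample text is the sacct output quoted earlier in this comment. Treating states other than `FAILED` as failures is an assumption that the thread does not settle.

```shell
#!/bin/bash
# Hypothetical helpers, not part of rt_utils.sh.

# Read sacct-style lines (JobID, State, JobName) on stdin and print the state
# of the top-level job line. $NF is used for the job name so that a state
# printed as several words (e.g. "CANCELLED by <uid>") still matches.
extract_job_state() {
  local slurm_id=$1 jobname=$2
  awk -v id="${slurm_id}" -v name="${jobname}" '$1 == id && $NF == name { print $2 }'
}

# Map a Slurm job state to a test status. Prints FAIL for the listed terminal
# failure states and an empty string otherwise.
# Assumption: every state listed here should count as a failed test.
sacct_state_to_test_status() {
  case "$1" in
    FAILED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY|CANCELLED*) echo "FAIL" ;;
    *) echo "" ;;
  esac
}

# Sample sacct output, copied from the log shown earlier in this comment.
sample_output='15319645                   FAILED           compile_10 
15319645.ba+               FAILED                batch 
15319645.ex+            COMPLETED               extern '

status_label=$(printf '%s\n' "${sample_output}" | extract_job_state 15319645 compile_10)
test_status=$(sacct_state_to_test_status "${status_label}")
echo "status_label=${status_label} test_status=${test_status}"
```

Matching the job ID exactly in awk, instead of using `grep "^${slurm_id}"`, skips the `.ba+` and `.ex+` step lines without needing a second filter on the job name, although the name is still checked here to mirror the original pipeline.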

@climbfuji
Collaborator Author

That looks good! We should add this to one of the next PRs.

@DusanJovic-NOAA
Collaborator

This is not the first time (and certainly will not be the last) that this kind of error has caused the GNU build to fail. So I suggest we make the Intel compiler fail on this particular error as well, by adding:
-std -diag-error=8208
to all CMAKE_Fortran_FLAGS in all the components we control.
