inspect_failed_jobs.py
helper script for batch-diagnosing job failures
#147
janosh
started this conversation in
Show and tell
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
in case it helps anyone else, here's a Python script that fetches the last
--lines
ofstd_err.txt
,queue.err
andcustodian.json
for the--njobs
most recently failed jobs and renders them to the terminable or more searchable HTML (when passing--html
). the intended use case is to get a quick idea of the prevalence of different kinds of failure modes.has to be run on the cluster where the jobs were ran in order to read the above files.
Example invocation
Example output
Next Steps
not sure yet, having a GUI like
fireworks
would be nice but this very far from that and uses just static HTML (i.e. a very different approach though much more than this 1st step could already be done with that)inspect_failed_jobs.py
Beta Was this translation helpful? Give feedback.
All reactions