Taxonomy of failure modes #24

albgar · 2019-05-29T10:27:27Z

It would be good to have a list of possible "failure modes" in a Siesta calculation, classified according to their severity and potential for recovery. This information can then be encoded in the plugin/parser and used by the workflows.

We have already identified a broad-brush classification scheme: those errors which occur before a CML file is produced should result in an "Excepted" state. Others might be given a "Failed" state with an appropriate exit code. (Note: we should catch the former before attempting to parse the CML file, and provide a proper error message.)
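
The broad-brush scheme above can be sketched as a small decision function. This is only an illustration: the names `Outcome` and `classify` are hypothetical and are not part of the plugin or of AiiDA.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the coarse classification discussed above."""
    EXCEPTED = "excepted"        # nothing usable was produced (no CML file)
    FAILED = "failed"            # CML file exists, but a non-zero exit code applies
    FINISHED_OK = "finished_ok"  # CML file exists and no problem was detected


def classify(cml_produced: bool, exit_code: int = 0) -> Outcome:
    """Map the two pieces of information onto one coarse outcome."""
    # Check for the CML file first, before any attempt to parse it.
    if not cml_produced:
        return Outcome.EXCEPTED
    return Outcome.FINISHED_OK if exit_code == 0 else Outcome.FAILED
```

The point of the ordering is that the "no CML" case is caught before parsing starts, so it can carry a proper error message.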

vdikan · 2019-06-03T09:28:18Z

I think it's already partly there: if no CML is produced, the calculation will have the Excepted status, and verdi process logshow shows the contents of Siesta's Message file. What could be improved:

The Python stack trace in the no-CML case will always say that, well, there is no CML. Instead we could raise an exception carrying the contents of the Message file, e.g. in the try/except block.
Implement some unrecognized_by_parser error-handling procedure, like the one at the bottom of aiida_quantumespresso/workflows/pw/base.py under exit_code 100, for the case where a CML file is produced but the final result is not there. We do not handle this scenario yet.
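
A minimal sketch of the two suggestions, written without any AiiDA dependency. `MissingCMLError`, `require_cml` and `fallback_exit_code` are hypothetical names, and the value 100 simply mirrors the exit code mentioned above; the real parser would work on retrieved files rather than on strings.

```python
from typing import Optional

# Assumption: reuse the number quoted above for the "unrecognized" fallback.
ERROR_UNRECOGNIZED_BY_PARSER = 100


class MissingCMLError(Exception):
    """Raised when no CML file was produced (hypothetical exception class)."""


def require_cml(cml_text: Optional[str], messages_text: str) -> str:
    """Return the CML text, or raise with the messages-file contents attached."""
    if cml_text is None:
        raise MissingCMLError(
            "No CML file was produced. Contents of the Siesta messages file:\n"
            + messages_text.strip()
        )
    return cml_text


def fallback_exit_code(has_final_result: bool) -> int:
    """CML exists: return 0 if the final result is there, else the fallback code."""
    return 0 if has_final_result else ERROR_UNRECOGNIZED_BY_PARSER
```

With this shape the stack trace shows what Siesta actually reported, not just the fact that the CML file is missing.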

vdikan · 2019-06-03T09:52:16Z

As a bookmark, here is my previous email, where I sketched the taxonomy [edited]:

Remember that WorkChains are now also processes that can finish with an exit code.
Given that:

We might want to mark as Excepted the Calculations that failed badly, with no
reasonable information produced [I mean the CML file first of all].

~~We might want to mark as Excepted the WorkChains that were not scripted properly with python~~
~~(analog of badly assembled Siesta executable).~~
Also, if a WorkChain contains an Excepted calculation, it should also become Excepted at once.
~~Perhaps it works already this way, but I'm not sure.~~
On the contrary, I'm sure that sometimes it doesn't:
e.g. now, the bug in checkpoints leads to shouting serialization exceptions in the shell, but is non-lethal to the WorkChain.
I think it's because aiida's job submissions involve concurrency mechanisms that are harder to control than serial code.
[Important: Does a drop of poison infect the tun of wine?
It depends on the size. Small "atomic" workflows that fail should not corrupt a huge chain that takes many days to compute. At the same time, they themselves should be designated as Excepted.
In other words, there needs to be a mechanism to Except a workchain from within, that we may selectively use.]

We might want to mark as Finished with non-zero the Calculations that failed controllably
and can be partly parsed/restarted, relying on the information produced. Actually, we do this now.

We might want to mark as Finished with non-zero the WorkChains that, during execution,
contain Finished_with_non_zero Calculations that they cannot handle further. I also show how [in the pre-workshop example of the Siesta-restart workchain].
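
The propagation rules sketched in this email could look roughly like the following. Everything here is hypothetical (`ChildResult`, `parent_outcome`, the `critical` flag and the value 400); it only illustrates the idea of selectively excepting a workchain from within, and it does not use AiiDA's own process classes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple

# Hypothetical code: "a child failed and the workchain cannot handle it".
ERROR_UNHANDLED_CHILD_FAILURE = 400


@dataclass
class ChildResult:
    label: str
    excepted: bool = False
    exit_code: int = 0
    critical: bool = True  # False: a "small atomic" step that must not poison the chain


def parent_outcome(
    children: Iterable[ChildResult],
    handled_codes: FrozenSet[int] = frozenset(),
) -> Tuple[str, Optional[int]]:
    """Return (state, exit_code) for the parent workchain."""
    children = list(children)
    # An excepted critical child excepts the parent at once.
    if any(c.excepted and c.critical for c in children):
        return ("excepted", None)
    # A child that finished with a non-zero code the parent cannot handle
    # makes the parent finish with a non-zero code as well.
    for c in children:
        if not c.excepted and c.exit_code != 0 and c.exit_code not in handled_codes:
            return ("finished", ERROR_UNHANDLED_CHILD_FAILURE)
    return ("finished", 0)
```

The `critical` flag is one possible answer to the "drop of poison" question: the child is still recorded as excepted, but the parent decides whether that is fatal.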

bosonie · 2019-06-04T23:10:17Z

I'd say we all agree on marking "Excepted" the calculations where not even the CML file is created.
I think that if a calculation is excepted inside a workchain, the workchain should be excepted as well.
It should be good practice for any user to write workchains with checkpoints from which one can restart. An excepted workchain should mark a situation where human intervention is needed before restarting.

The non-zero exit codes I would implement (those that come to mind now) are:
SCF not converged
geometry not converged
problem in the basis set specifications (too small a split norm, for example)
failure to parse information in the .xml file (maybe two different codes: one for parameters and one for forces/stress)
failure to parse the .bands file (in case bandkpoints is set)

Anytime I face a new problem I'll post it here.
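
For reference, the list above could be written down as an enumeration. The names and the numeric values are placeholders chosen for this sketch, not codes that the plugin defines.

```python
from enum import IntEnum
from typing import Dict


class SiestaExitCode(IntEnum):
    """Hypothetical exit codes; the numbers are arbitrary placeholders."""
    SUCCESS = 0
    SCF_NOT_CONVERGED = 450
    GEOMETRY_NOT_CONVERGED = 451
    BASIS_SPECIFICATION_PROBLEM = 452
    XML_PARAMETERS_PARSING_FAILED = 453
    XML_FORCES_STRESS_PARSING_FAILED = 454
    BANDS_PARSING_FAILED = 455


def first_failure(problems: Dict[SiestaExitCode, bool]) -> SiestaExitCode:
    """Return the first code whose problem flag is set, or SUCCESS if none is."""
    for code, detected in problems.items():
        if detected:
            return code
    return SiestaExitCode.SUCCESS
```

Because dictionaries preserve insertion order, the caller decides which problem takes priority when several are detected at once.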

1) A few error-handling modifications in the parser. The modifications introduced are sufficient to avoid a crash of the code, but are probably still not the best option. Discussion open in issue "Taxonomy of failure modes #24". 2) Remove the files in workflows/workfunction, as they were not migrated to aiida 1.0.

bosonie · 2020-01-29T10:29:42Z

Few more situations I encountered are the following:

The Siesta calculation crashes, leaving the .xml file incomplete (it does not end with the closing "cml" tag). In this case "minidom" from "xml.dom" raises an error. To avoid crashing the Siesta parser, we now raise an OutputParsingError. However, something clearer could be implemented.
Cases where all the files to retrieve are produced, but the Siesta calculation returns an error and the MESSAGES file reports "FATAL:". This happens, for instance, when there are problems with the basis or the pseudos. At the moment, in this case, the parser does not raise any error. In fact, it gathers from the .xml file the little information available about the Siesta version and then exits with code 0. So far we have no minimum requirement on the information to be retrieved. The "FATAL:" entry in MESSAGES is parsed, but no action is implemented for it, so the calculation exits with code 0. In my opinion this needs to be changed.
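
A possible shape for handling both situations, using only the standard library. The helper names and the two numeric codes are hypothetical; the only library behaviour relied upon is that `minidom.parseString` raises `ExpatError` on a truncated document.

```python
from typing import List
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical exit codes, chosen only for this sketch.
ERROR_INCOMPLETE_XML = 350
ERROR_FATAL_IN_MESSAGES = 351


def is_complete_xml(xml_text: str) -> bool:
    """True if the text parses as a well-formed XML document."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True


def fatal_lines(messages_text: str) -> List[str]:
    """Return the lines of the MESSAGES contents that report "FATAL:"."""
    return [line.strip() for line in messages_text.splitlines() if "FATAL:" in line]


def parser_exit_code(xml_text: str, messages_text: str) -> int:
    """Pick an exit code: incomplete XML first, then FATAL messages, else 0."""
    if not is_complete_xml(xml_text):
        return ERROR_INCOMPLETE_XML
    if fatal_lines(messages_text):
        return ERROR_FATAL_IN_MESSAGES
    return 0
```

With a check like this, a run whose MESSAGES file contains "FATAL:" would no longer exit with code 0 just because the .xml file happens to be well formed.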

bosonie added the "discussion required" and "enhancement" labels · 2019-11-15
