Taxonomy of failure modes #24

Open
albgar opened this issue May 29, 2019 · 4 comments
Labels
discussion required Issues that require a bit of discussion enhancement

Comments

@albgar
Member

albgar commented May 29, 2019

It would be good to have a list of possible "failure modes" in a Siesta calculation, classified according to their severity and potential for recovery. This information can then be encoded in the plugin/parser and used by the workflows.

We have already identified a broad-brush classification scheme: those errors which occur before a CML file is produced should result in an "Excepted" state. Others might be given a "Failed" state with an appropriate exit code. (Note: we should catch the former before attempting to parse the CML file, and provide a proper error message.)

@vdikan
Collaborator

vdikan commented Jun 3, 2019

I think it's already partly there: if no CML is produced, the calculation will have Excepted status with verdi process logshow showing contents of Siesta's Message file. What could be improved:

  • the Python stack trace in case of no CML will always say that, well, there is no CML. Instead we can raise an exception carrying the contents of the Message file, e.g. from a try/except block.
  • implement some unrecognized_by_parser error handling procedure, like what is done at the bottom of aiida_quantumespresso/workflows/pw/base.py under exit_code 100: for the case where CML is produced but the final result is not there, a scenario we do not handle yet.
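
The first point above could be sketched roughly as follows. This is plain Python, not the plugin's actual parser code: the exception class `NoCMLProducedError`, the function `require_cml`, and the default CML file name `aiida.xml` are all hypothetical names chosen for illustration. The idea is only that the error raised when the CML file is missing carries the text of Siesta's MESSAGES file.

```python
from pathlib import Path


class NoCMLProducedError(Exception):
    """Hypothetical exception for a run that left no CML file behind."""


def require_cml(folder, cml_name="aiida.xml", messages_name="MESSAGES"):
    """Return the path of the CML file, or raise with the MESSAGES contents.

    The file names are assumptions; adapt them to the real retrieved files.
    """
    folder = Path(folder)
    cml_path = folder / cml_name
    if cml_path.is_file():
        return cml_path

    # No CML: try to report what Siesta itself said before stopping.
    try:
        details = (folder / messages_name).read_text(errors="replace").strip()
    except OSError:
        details = ""
    if not details:
        details = "(no MESSAGES content available)"

    raise NoCMLProducedError(
        f"Siesta produced no CML file; MESSAGES reports:\n{details}"
    )
```

With this shape, whatever catches the exception can log the Siesta message directly instead of a generic "no CML" trace.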

@vdikan
Collaborator

vdikan commented Jun 3, 2019

As a bookmark, my previous email where I could sketch the taxonomy [edited]:

Remember that WorkChains are now also processes that can finish with an exit code.
Given that:

  • We might want to mark as Excepted the Calculations that failed badly, with no
    reasonable information produced [I mean the CML file first of all].
  • We might want to mark as Excepted the WorkChains that were not scripted properly with python
    (analog of badly assembled Siesta executable).
    Also, if a WorkChain contains an Excepted calculation, it should also become Excepted at once.
    Perhaps it works already this way, but I'm not sure.
    On the contrary, I'm sure that sometimes it doesn't:
    e.g. now, the bug in checkpoints leads to shouting serialization exceptions in the shell, but is non-lethal to the WorkChain.
    I think it's because aiida's job submissions involve concurrency mechanisms that are harder to control than serial code.
    [Important: Does a drop of poison infect the tun of wine?
    Depends on the size. Small "atomic" workflows that fail should not corrupt a huge many-days-to-compute chain. At the same time, they themselves should be designated as Excepted.
    In other words, there needs to be a mechanism to Except a workchain from within, that we may selectively use.]
  • We might want to mark as Finished with non-zero the Calculations that failed controllably
    and can be partly parsed/restarted, relying on the information produced. Actually, we do it now.
  • We might want to mark as Finished with non-zero the WorkChains that, during execution,
    contain the Finished_with_non_zero Calculations that they cannot handle further. I also show how. [In the pre-workshop example of Siesta-restart workchain]
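
The four rules above can be summarized as a small decision function. This is a sketch in plain Python under stated assumptions, not AiiDA code: the `Outcome` enum, the function `workchain_outcome`, and its parameters `tolerate_excepted` and `handled_codes` are hypothetical. The `tolerate_excepted` flag stands in for the "selective" mechanism mentioned in the bracketed note, where a small failed sub-workflow stays Excepted itself but does not take down the parent chain.

```python
from enum import Enum


class Outcome(Enum):
    FINISHED_OK = "finished_ok"
    FINISHED_NONZERO = "finished_nonzero"
    EXCEPTED = "excepted"


def workchain_outcome(children, tolerate_excepted=False, handled_codes=frozenset()):
    """Derive a parent outcome from (Outcome, exit_code) pairs of its children.

    tolerate_excepted: if True, an Excepted child does not except the parent.
    handled_codes: non-zero exit codes the parent knows how to recover from.
    """
    result = Outcome.FINISHED_OK
    for outcome, exit_code in children:
        if outcome is Outcome.EXCEPTED:
            if not tolerate_excepted:
                # Default rule: an Excepted child excepts the parent at once.
                return Outcome.EXCEPTED
        elif outcome is Outcome.FINISHED_NONZERO and exit_code not in handled_codes:
            # A failure the parent cannot handle makes it finish with non-zero.
            result = Outcome.FINISHED_NONZERO
    return result
```

For example, `workchain_outcome([(Outcome.FINISHED_NONZERO, 450)], handled_codes={450})` yields `Outcome.FINISHED_OK`, because the parent declared that it can handle that code (the number 450 is arbitrary here).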

@bosonie
Member

bosonie commented Jun 4, 2019

I'd say we all agree on marking "Excepted" the calculations where not even the CML file is created.
I think that if a calculation is excepted inside a workchain, the workchain should be excepted as well.
It should be good practice for any user to write workchains with checkpoints from which one can restart. An excepted workchain should mark a situation where human intervention is needed before a restart.

The non-zero exit codes I would implement (off the top of my head) are:
SCF not converged
geometry not converged
problem in the basis set specifications (too small split norm, for example)
parsing failure of info in .xml (maybe two different codes, one for parameters and one for forces/stress)
parsing failure of .bands (in case bandkpoints is set)

Anytime I face a new problem I'll post it here.
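
As an illustration of the list above, the proposed failure modes could be collected in one enumeration. This is only a sketch: the numeric values, the names in `SiestaExitCode`, the check names, and the priority order in `choose_exit_code` are all assumptions and are not taken from the plugin.

```python
from enum import IntEnum


class SiestaExitCode(IntEnum):
    """Hypothetical exit codes; the numbers are placeholders."""
    OK = 0
    SCF_NOT_CONVERGED = 450
    GEOM_NOT_CONVERGED = 451
    BASIS_SPECIFICATION_PROBLEM = 452
    XML_PARAMETERS_PARSING_FAILED = 453
    XML_FORCES_STRESS_PARSING_FAILED = 454
    BANDS_PARSING_FAILED = 455


# Order in which failed checks are reported (an assumption: problems that
# make the output unusable are reported before convergence problems).
_PRIORITY = [
    ("basis_ok", SiestaExitCode.BASIS_SPECIFICATION_PROBLEM),
    ("xml_parameters_parsed", SiestaExitCode.XML_PARAMETERS_PARSING_FAILED),
    ("xml_forces_stress_parsed", SiestaExitCode.XML_FORCES_STRESS_PARSING_FAILED),
    ("bands_parsed", SiestaExitCode.BANDS_PARSING_FAILED),
    ("scf_converged", SiestaExitCode.SCF_NOT_CONVERGED),
    ("geometry_converged", SiestaExitCode.GEOM_NOT_CONVERGED),
]


def choose_exit_code(checks):
    """Return the first failing code; checks maps check name -> bool.

    A check that is absent counts as passed (e.g. bands were not requested).
    """
    for name, code in _PRIORITY:
        if not checks.get(name, True):
            return code
    return SiestaExitCode.OK
```

Treating a missing check as passed keeps the bands code from firing when no band k-points were set.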

@bosonie bosonie added discussion required Issues that require a bit of discussion enhancement labels Nov 15, 2019
bosonie added a commit that referenced this issue Jan 29, 2020
1) A few error-handling modifications in the parser. The modifications
introduced are sufficient to avoid the crash of the code, but are
probably still not the best option. Discussion open in issue "Taxonomy of
failure modes #24"

2) Remove files in workflows/workfunction as they were not migrated
to aiida 1.0
@bosonie
Member

bosonie commented Jan 29, 2020

A few more situations I encountered are the following:

  1. The Siesta calculation crashes, leaving the .xml file incomplete (it doesn't end with "cml"). In this case "minidom" from "xml.dom" raises an error. To avoid crashing the Siesta parser, we now use an OutputParsingError. However, something clearer could be implemented.
  2. Cases where the files to retrieve are all produced, but the Siesta calculation returns an error and the MESSAGES file reports "FATAL:". This happens, for instance, when there are problems with the basis or the pseudos. At the moment, in this case, the parser doesn't raise any error. In fact, it gathers from the .xml file the little information available about the Siesta version and then exits with code 0. So far we don't have any minimum requirement on the info to be retrieved. The "FATAL:" info in MESSAGES is parsed, but no action is implemented for it, so the calculation exits with code 0. This needs to be changed, in my opinion.
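
Both situations can be detected with the standard library alone. The sketch below is not the plugin's parser: the function names `cml_is_complete`, `fatal_lines`, and `classify_run`, and the string labels returned, are hypothetical. It assumes that an incomplete .xml file is simply malformed XML (so `minidom` raises `ExpatError`) and that fatal errors appear in MESSAGES on lines containing "FATAL:".

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def cml_is_complete(path):
    """True if the file exists and parses as well-formed XML."""
    try:
        minidom.parse(str(path))
    except (ExpatError, OSError):
        # Truncated/malformed XML, or the file is missing or unreadable.
        return False
    return True


def fatal_lines(messages_text):
    """Return the lines of the MESSAGES text that report a fatal error."""
    return [
        line.strip()
        for line in messages_text.splitlines()
        if "FATAL:" in line
    ]


def classify_run(cml_path, messages_text):
    """Label a run; the labels are placeholders for real exit codes."""
    if not cml_is_complete(cml_path):
        return "incomplete_cml"
    if fatal_lines(messages_text):
        return "fatal_in_messages"
    return "ok"
```

A parser built along these lines would return a non-zero exit code for the "fatal_in_messages" case instead of exiting with code 0.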


3 participants