
Simple calculation (Er diamond) crashing #355

Closed · giovannipizzi opened this issue Feb 21, 2023 · 6 comments
Labels: bug (Something isn't working)

@giovannipizzi (Member)
Describe the bug
When running an erbium diamond structure, the NSCF crashes.

To Reproduce
Run this input file: Er-Diamond.xsf.txt, renaming it from .txt to .xsf, with "structure as is", "metal", "non-magnetic", bands + PDOS, moderate protocol. I run with 2 MPI processes and 2 pools.
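For reference, a rough command-line equivalent of the bands part of this run can be sketched with the aiida-quantumespresso protocol builder. This is only a sketch: the code label and the structure node identifier are placeholders, and the "CMDLINE" settings key used to pass the number of pools is an assumption about the plugin's input settings.

```python
# Sketch of a non-GUI reproduction of the bands part of this run.
# Placeholders/assumptions: the code label "pw@localhost", the structure pk,
# and the "CMDLINE" settings key used to pass "-nk 2" to pw.x.
from aiida import load_profile, orm
from aiida.engine import submit
from aiida.plugins import WorkflowFactory

load_profile()

PwBandsWorkChain = WorkflowFactory("quantumespresso.pw.bands")

structure = orm.load_node("<pk-of-the-Er-diamond-structure>")  # imported from Er-Diamond.xsf
code = orm.load_code("pw@localhost")                           # placeholder code label

builder = PwBandsWorkChain.get_builder_from_protocol(
    code=code,
    structure=structure,
    protocol="moderate",
)

# 2 MPI processes and 2 k-point pools, mirroring the settings reported above.
for namespace in (builder.scf, builder.bands):
    namespace.pw.metadata.options.resources = {
        "num_machines": 1,
        "num_mpiprocs_per_machine": 2,
    }
    namespace.pw.settings = orm.Dict(dict={"CMDLINE": ["-nk", "2"]})

submit(builder)
```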

Expected behavior
I get the bands and PDOS :-)

Screenshots
Top part of my screenshot (Screenshot 2023-02-21 at 19 00 34, showing the problem).

Version (if known)
App version: v23.02.0

Additional context
The calculation crashed while/after computing the 2nd k-point; the last lines in the aiida.out file are:

  Starting wfcs are   64 randomized atomic wfcs
     Checking if some PAW data can be deallocated... 

     Band Structure Calculation
     Davidson diagonalization with overlap

     Computing kpt #:     1  of   550 on this pool
     c_bands:  2 eigenvalues not converged
     total cpu time spent up to now is        8.0 secs

     Computing kpt #:     2  of   550 on this pool
     c_bands:  3 eigenvalues not converged

and the stderr shows:

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

If I check in the terminal, the CRASH file says:


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task #         1
     from c_bands : error #         1
     too many bands are not converged
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Probably this is the actual cause of the error, and it is not shown in the stdout because it happens on the second pool, which cannot print.
@mbercx are we also parsing this file? Probably not? If it were parsed (or if we ran with only one pool), would the workflow be able to recover? Any suggestions on how to fix this?
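One possible quick workaround, until the parser handles this, would be to ask AiiDA to retrieve the CRASH file as well, so it is at least available for inspection after the failure. A minimal sketch, assuming aiida-core's generic `additional_retrieve_list` metadata option on the underlying PwCalculation inputs (the builder path may differ depending on how the work chain is launched):

```python
# Sketch of a user-side workaround (not the plugin's own behaviour): request
# retrieval of the CRASH file via aiida-core's generic metadata option.
from aiida import load_profile
from aiida.plugins import WorkflowFactory

load_profile()

PwBaseWorkChain = WorkflowFactory("quantumespresso.pw.base")

builder = PwBaseWorkChain.get_builder()
builder.pw.metadata.options.additional_retrieve_list = ["CRASH"]
# ... set the remaining inputs (code, structure, parameters, kpoints) as usual.
```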

If useful, here is the workflow report:

2023-02-21 17:04:33 [64 | REPORT]: [461|QeAppWorkChain|run_bands]: launching PwBandsWorkChain<464>
2023-02-21 17:04:34 [65 | REPORT]:     [464|PwBandsWorkChain|run_scf]: launching PwBaseWorkChain<472> in scf mode
2023-02-21 17:04:35 [66 | REPORT]:         [472|PwBaseWorkChain|run_process]: launching PwCalculation<477> iteration #1
2023-02-21 17:08:36 [67 | REPORT]:         [472|PwBaseWorkChain|sanity_check_insufficient_bands]: PwCalculation<477> run with smearing and highest band is occupied
2023-02-21 17:08:36 [68 | REPORT]:         [472|PwBaseWorkChain|sanity_check_insufficient_bands]: BandsData<480> has invalid occupations: Occupation of 0.02419823770664454 at last band lkn<0,32,26>
2023-02-21 17:08:36 [69 | REPORT]:         [472|PwBaseWorkChain|sanity_check_insufficient_bands]: PwCalculation<477> had insufficient bands
2023-02-21 17:08:36 [70 | REPORT]:         [472|PwBaseWorkChain|sanity_check_insufficient_bands]: Action taken: increased number of bands to 30 and restarting from the previous charge density.
2023-02-21 17:08:36 [71 | REPORT]:         [472|PwBaseWorkChain|inspect_process]: PwCalculation<477> finished successfully but a handler was triggered, restarting
2023-02-21 17:08:36 [72 | REPORT]:         [472|PwBaseWorkChain|run_process]: launching PwCalculation<485> iteration #2
2023-02-21 17:09:52 [73 | REPORT]:         [472|PwBaseWorkChain|results]: work chain completed after 2 iterations
2023-02-21 17:09:52 [74 | REPORT]:         [472|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:09:52 [75 | REPORT]:     [464|PwBandsWorkChain|run_bands]: launching PwBaseWorkChain<493> in bands mode
2023-02-21 17:09:53 [76 | REPORT]:         [493|PwBaseWorkChain|run_process]: launching PwCalculation<496> iteration #1
2023-02-21 17:29:21 [78 | REPORT]:         [493|PwBaseWorkChain|results]: work chain completed after 1 iterations
2023-02-21 17:29:21 [79 | REPORT]:         [493|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:29:21 [80 | REPORT]:     [464|PwBandsWorkChain|results]: workchain succesfully completed
2023-02-21 17:29:21 [81 | REPORT]:     [464|PwBandsWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:29:22 [82 | REPORT]: [461|QeAppWorkChain|run_pdos]: launching PdosWorkChain<504>
2023-02-21 17:29:22 [83 | REPORT]:     [504|PdosWorkChain|run_nscf]: launching NSCF PwBaseWorkChain<506>
2023-02-21 17:29:23 [84 | REPORT]:         [506|PwBaseWorkChain|run_process]: launching PwCalculation<511> iteration #1
2023-02-21 17:29:39 [89 | REPORT]:         [506|PwBaseWorkChain|report_error_handled]: PwCalculation<511> failed with exit status 312: The stdout output file was incomplete probably because the calculation got interrupted.
2023-02-21 17:29:39 [90 | REPORT]:         [506|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2023-02-21 17:29:39 [91 | REPORT]:         [506|PwBaseWorkChain|inspect_process]: PwCalculation<511> failed but a handler detected an unrecoverable problem, aborting
2023-02-21 17:29:39 [92 | REPORT]:         [506|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:29:39 [93 | REPORT]:     [504|PdosWorkChain|inspect_nscf]: NSCF PwBaseWorkChain failed with exit status 300
2023-02-21 17:29:39 [94 | REPORT]:     [504|PdosWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:29:40 [95 | REPORT]: [461|QeAppWorkChain|inspect_pdos]: PdosWorkChain failed with exit status 402
2023-02-21 17:29:40 [96 | REPORT]: [461|QeAppWorkChain|on_terminated]: remote folders will not be cleaned
@giovannipizzi giovannipizzi added the bug Something isn't working label Feb 21, 2023
@giovannipizzi giovannipizzi changed the title Simple calculation crashing Simple calculation (Er diamond) crashing Feb 21, 2023
@giovannipizzi (Member, Author)

As a follow-up, if I run with a single pool, the error is printed in the stdout, and indeed the parser detects it and sets the exit code of the PwCalculation to 463 with the following message:

2023-02-21 18:48:04 [117 | REPORT]: [588|PwBaseWorkChain|run_process]: launching PwCalculation<593> iteration #1
2023-02-21 18:48:27 [123 | REPORT]: [588|PwBaseWorkChain|report_error_handled]: PwCalculation<593> failed with exit status 463: Too many bands failed to converge during the diagonalization.
2023-02-21 18:48:27 [124 | REPORT]: [588|PwBaseWorkChain|report_error_handled]: Action taken: found diagonalization issues, switching to conjugate gradient diagonalization.
2023-02-21 18:48:27 [125 | REPORT]: [588|PwBaseWorkChain|inspect_process]: PwCalculation<593> failed but a handler dealt with the problem, restarting

(and it's now running, I think it's almost finished).
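For context, this is roughly what such an error handler looks like with aiida-core's BaseRestartWorkChain machinery. The sketch below is illustrative only, not the actual aiida-quantumespresso implementation: it checks the numeric exit status 463 directly and assumes, as in PwBaseWorkChain's own handlers, that `self.ctx.inputs.parameters` is the plain parameters dictionary.

```python
# Illustrative only: a minimal process handler in the spirit of the one that
# fired here. The real handler in aiida-quantumespresso differs in detail.
from aiida.engine import ProcessHandlerReport, process_handler
from aiida_quantumespresso.workflows.pw.base import PwBaseWorkChain


class PatchedPwBaseWorkChain(PwBaseWorkChain):
    """PwBaseWorkChain with a sketched handler for 'too many bands not converged'."""

    @process_handler(priority=410)
    def handle_too_many_bands_not_converged(self, calculation):
        """On exit status 463, switch to conjugate-gradient diagonalization and restart."""
        if calculation.exit_status != 463:
            return None  # not our error, let other handlers act

        # Assumes self.ctx.inputs.parameters is the plain parameters dict,
        # as set up by PwBaseWorkChain before launching each iteration.
        parameters = self.ctx.inputs.parameters
        parameters.setdefault('ELECTRONS', {})['diagonalization'] = 'cg'
        self.report('found diagonalization issues, switching to cg diagonalization')
        return ProcessHandlerReport(True)
```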

So I think that indeed we need to also check the CRASH file if available (retrieve it and add it to the parsing), since when multiple pools are used the error might not be printed by the main pool that does the I/O.

This is the part of the code where the CRASH file is written, for reference.
https://gitlab.com/QEF/q-e/-/blob/068d2d5b9360ed7f21e629dccb1021689aabc507/UtilXlib/error_handler.f90#L93

I think multiple kinds of errors could appear in that file, so we should apply the same logic we currently use for the main output file (or even parse that file at the end, so that the same handlers are triggered); see the sketch below.
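To make the idea concrete, here is a minimal sketch of what scanning a retrieved CRASH file for known messages could look like. The message-to-label mapping only contains the error from this report and the label name is illustrative; the repository access pattern assumes the aiida-core 2.x FolderData interface. The result could then feed into the same exit-code logic the parser already applies to the main output file.

```python
# Sketch of the proposed CRASH-file check: look for the same known error
# messages the stdout parser already handles. Only the message from this
# report is listed; the label name is just illustrative.
KNOWN_CRASH_MESSAGES = {
    'too many bands are not converged': 'ERROR_DIAGONALIZATION_TOO_MANY_BANDS_NOT_CONVERGED',
}


def scan_crash_file(retrieved):
    """Return the labels of known errors found in the retrieved CRASH file, if any."""
    repository = retrieved.base.repository
    if 'CRASH' not in repository.list_object_names():
        return []

    content = repository.get_object_content('CRASH')
    if isinstance(content, bytes):  # be tolerant of text/bytes return defaults
        content = content.decode('utf-8', errors='ignore')

    return [label for message, label in KNOWN_CRASH_MESSAGES.items() if message in content]
```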

@mbercx what do you think?
Feel free to transfer this issue to aiida-quantumespresso; it is probably a more appropriate place.

@unkcpz (Member) commented Apr 11, 2023

Fixed in aiidateam/aiida-quantumespresso#890. @mbercx which version of aiida-quantumespresso will include the fix?

@mbercx (Member) commented Apr 11, 2023

I still have to make a release! I just wanted to sneak in aiidateam/aiida-quantumespresso#902 first, since it would also remove the warnings for the PwBandsWorkChain that you reported. I am travelling to Paris for a conference today, but I can hopefully work on this again on Thursday.

@unkcpz (Member) commented Apr 11, 2023

Thanks @mbercx. I want to have a look at aiidateam/aiida-quantumespresso#902; I will do it by tomorrow.

@unkcpz (Member) commented Oct 22, 2023

Hi @mbercx, can you check whether this issue is fixed? We use aiida-quantumespresso==4.3.0 in the app.
To test with the latest version of the QE app, I made a beta release today, so you can simply run docker run --rm -it -p 8888:8888 aiidalab/qe:latest and directly start preparing the simulation.
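A quick way to double-check, inside the container, that the app really ships a plugin version that includes the fix (standard library only):

```python
# Verify the installed plugin version inside the container (stdlib only).
from importlib.metadata import version

print(version("aiida-quantumespresso"))  # expected: 4.3.0 or later
```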

@mbercx (Member) commented Oct 26, 2023

I ran with 2 CPUs/pools; the error was caught properly and the calculation restarted with a different diagonalisation approach, as expected. So I think this is fixed!

@mbercx mbercx closed this as completed Oct 26, 2023