
Simple calculation (Er diamond) crashing #355

Closed · giovannipizzi opened this issue Feb 21, 2023 · 6 comments
Labels: bug (Something isn't working)

@giovannipizzi (Member)
Describe the bug
When running an erbium diamond structure, the NSCF crashes.

To Reproduce
Run this input file: Er-Diamond.xsf.txt, renaming it from .txt to .xsf, with "structure as is", "metal", "non-magnetic", bands + PDOS, moderate protocol. I run with 2 MPI processes and 2 pools.
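For reference, a rough command-line equivalent of the bands part of this run can be sketched with the aiida-quantumespresso protocol builder. This is only a sketch: the code label and the structure node identifier are placeholders, and the "CMDLINE" settings key used to pass the number of pools is an assumption about the plugin's input settings.

```python
# Sketch of a non-GUI reproduction of the bands part of this run.
# Placeholders/assumptions: the code label "pw@localhost", the structure pk,
# and the "CMDLINE" settings key used to pass "-nk 2" to pw.x.
from aiida import load_profile, orm
from aiida.engine import submit
from aiida.plugins import WorkflowFactory

load_profile()

PwBandsWorkChain = WorkflowFactory("quantumespresso.pw.bands")

structure = orm.load_node("<pk-of-the-Er-diamond-structure>")  # imported from Er-Diamond.xsf
code = orm.load_code("pw@localhost")                           # placeholder code label

builder = PwBandsWorkChain.get_builder_from_protocol(
    code=code,
    structure=structure,
    protocol="moderate",
)

# 2 MPI processes and 2 k-point pools, mirroring the settings reported above.
for namespace in (builder.scf, builder.bands):
    namespace.pw.metadata.options.resources = {
        "num_machines": 1,
        "num_mpiprocs_per_machine": 2,
    }
    namespace.pw.settings = orm.Dict(dict={"CMDLINE": ["-nk", "2"]})

submit(builder)
```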

Expected behavior
I get the bands and PDOS :-)

Screenshots
Top part of my screenshot (Screenshot 2023-02-21 at 19 00 34, showing the problem).

Version (if known)
App version: v23.02.0

Additional context
The calculation crashed while/after computing the 2nd k-point; the last lines in the aiida.out file are:

  Starting wfcs are   64 randomized atomic wfcs
     Checking if some PAW data can be deallocated... 

     Band Structure Calculation
     Davidson diagonalization with overlap

     Computing kpt #:     1  of   550 on this pool
     c_bands:  2 eigenvalues not converged
     total cpu time spent up to now is        8.0 secs

     Computing kpt #:     2  of   550 on this pool
     c_bands:  3 eigenvalues not converged

and the stderr shows:

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

If I check in the terminal, the CRASH file says:


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task #         1
     from c_bands : error #         1
     too many bands are not converged
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Probably this is the actual cause of the error, and it is not shown in the stdout because it happens on the second pool, which cannot print.
@mbercx are we also parsing this file? Probably not? If it were parsed (or if we ran with only one pool), would the workflow be able to recover? Any suggestions on how to fix this?
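One possible quick workaround, until the parser handles this, would be to ask AiiDA to retrieve the CRASH file as well, so it is at least available for inspection after the failure. A minimal sketch, assuming aiida-core's generic `additional_retrieve_list` metadata option on the underlying PwCalculation inputs (the builder path may differ depending on how the work chain is launched):

```python
# Sketch of a user-side workaround (not the plugin's own behaviour): request
# retrieval of the CRASH file via aiida-core's generic metadata option.
from aiida import load_profile
from aiida.plugins import WorkflowFactory

load_profile()

PwBaseWorkChain = WorkflowFactory("quantumespresso.pw.base")

builder = PwBaseWorkChain.get_builder()
builder.pw.metadata.options.additional_retrieve_list = ["CRASH"]
# ... set the remaining inputs (code, structure, parameters, kpoints) as usual.
```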

If useful, here is the workflow report:

2023-02-21 17:04:33 [64 | REPORT]: [461|QeAppWorkChain|run_bands]: launching PwBandsWorkChain<464>
2023-02-21 17:04:34 [65 | REPORT]:     [464|PwBandsWorkChain|run_scf]: launching PwBaseWorkChain<472> in scf mode
2023-02-21 17:04:35 [66 | REPORT]:         [472|PwBaseWorkChain|run_process]: launching PwCalculation<477> iteration #1
2023-02-21 17:08:36 [67 | REPORT]:         [472|PwBaseWorkChain|sanity_check_insufficient_bands]: PwCalculation<477> run with smearing and highest band is occupied
2023-02-21 17:08:36 [68 | REPORT]:         [472|PwBaseWorkChain|sanity_check_insufficient_bands]: BandsData<480> has invalid occupations: Occupation of 0.02419823770664454 at last band lkn<0,32,26>
2023-02-21 17:08:36 [69 | REPORT]:         [472|PwBaseWorkChain|sanity_check_insufficient_bands]: PwCalculation<477> had insufficient bands
2023-02-21 17:08:36 [70 | REPORT]:         [472|PwBaseWorkChain|sanity_check_insufficient_bands]: Action taken: increased number of bands to 30 and restarting from the previous charge density.
2023-02-21 17:08:36 [71 | REPORT]:         [472|PwBaseWorkChain|inspect_process]: PwCalculation<477> finished successfully but a handler was triggered, restarting
2023-02-21 17:08:36 [72 | REPORT]:         [472|PwBaseWorkChain|run_process]: launching PwCalculation<485> iteration #2
2023-02-21 17:09:52 [73 | REPORT]:         [472|PwBaseWorkChain|results]: work chain completed after 2 iterations
2023-02-21 17:09:52 [74 | REPORT]:         [472|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:09:52 [75 | REPORT]:     [464|PwBandsWorkChain|run_bands]: launching PwBaseWorkChain<493> in bands mode
2023-02-21 17:09:53 [76 | REPORT]:         [493|PwBaseWorkChain|run_process]: launching PwCalculation<496> iteration #1
2023-02-21 17:29:21 [78 | REPORT]:         [493|PwBaseWorkChain|results]: work chain completed after 1 iterations
2023-02-21 17:29:21 [79 | REPORT]:         [493|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:29:21 [80 | REPORT]:     [464|PwBandsWorkChain|results]: workchain succesfully completed
2023-02-21 17:29:21 [81 | REPORT]:     [464|PwBandsWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:29:22 [82 | REPORT]: [461|QeAppWorkChain|run_pdos]: launching PdosWorkChain<504>
2023-02-21 17:29:22 [83 | REPORT]:     [504|PdosWorkChain|run_nscf]: launching NSCF PwBaseWorkChain<506>
2023-02-21 17:29:23 [84 | REPORT]:         [506|PwBaseWorkChain|run_process]: launching PwCalculation<511> iteration #1
2023-02-21 17:29:39 [89 | REPORT]:         [506|PwBaseWorkChain|report_error_handled]: PwCalculation<511> failed with exit status 312: The stdout output file was incomplete probably because the calculation got interrupted.
2023-02-21 17:29:39 [90 | REPORT]:         [506|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2023-02-21 17:29:39 [91 | REPORT]:         [506|PwBaseWorkChain|inspect_process]: PwCalculation<511> failed but a handler detected an unrecoverable problem, aborting
2023-02-21 17:29:39 [92 | REPORT]:         [506|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:29:39 [93 | REPORT]:     [504|PdosWorkChain|inspect_nscf]: NSCF PwBaseWorkChain failed with exit status 300
2023-02-21 17:29:39 [94 | REPORT]:     [504|PdosWorkChain|on_terminated]: remote folders will not be cleaned
2023-02-21 17:29:40 [95 | REPORT]: [461|QeAppWorkChain|inspect_pdos]: PdosWorkChain failed with exit status 402
2023-02-21 17:29:40 [96 | REPORT]: [461|QeAppWorkChain|on_terminated]: remote folders will not be cleaned
@giovannipizzi giovannipizzi added the bug Something isn't working label Feb 21, 2023
@giovannipizzi giovannipizzi changed the title Simple calculation crashing Simple calculation (Er diamond) crashing Feb 21, 2023
@giovannipizzi (Member, Author)

As a follow-up, if I run with a single pool, the error is printed in the stdout, and indeed the parser detects it and sets the exit code of the PwCalculation to 463 with the following message:

2023-02-21 18:48:04 [117 | REPORT]: [588|PwBaseWorkChain|run_process]: launching PwCalculation<593> iteration #1
2023-02-21 18:48:27 [123 | REPORT]: [588|PwBaseWorkChain|report_error_handled]: PwCalculation<593> failed with exit status 463: Too many bands failed to converge during the diagonalization.
2023-02-21 18:48:27 [124 | REPORT]: [588|PwBaseWorkChain|report_error_handled]: Action taken: found diagonalization issues, switching to conjugate gradient diagonalization.
2023-02-21 18:48:27 [125 | REPORT]: [588|PwBaseWorkChain|inspect_process]: PwCalculation<593> failed but a handler dealt with the problem, restarting

(and it's now running, I think it's almost finished).
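For context, this is roughly what such an error handler looks like with aiida-core's BaseRestartWorkChain machinery. The sketch below is illustrative only, not the actual aiida-quantumespresso implementation: it checks the numeric exit status 463 directly and assumes, as in PwBaseWorkChain's own handlers, that `self.ctx.inputs.parameters` is the plain parameters dictionary.

```python
# Illustrative only: a minimal process handler in the spirit of the one that
# fired here. The real handler in aiida-quantumespresso differs in detail.
from aiida.engine import ProcessHandlerReport, process_handler
from aiida_quantumespresso.workflows.pw.base import PwBaseWorkChain


class PatchedPwBaseWorkChain(PwBaseWorkChain):
    """PwBaseWorkChain with a sketched handler for 'too many bands not converged'."""

    @process_handler(priority=410)
    def handle_too_many_bands_not_converged(self, calculation):
        """On exit status 463, switch to conjugate-gradient diagonalization and restart."""
        if calculation.exit_status != 463:
            return None  # not our error, let other handlers act

        # Assumes self.ctx.inputs.parameters is the plain parameters dict,
        # as set up by PwBaseWorkChain before launching each iteration.
        parameters = self.ctx.inputs.parameters
        parameters.setdefault('ELECTRONS', {})['diagonalization'] = 'cg'
        self.report('found diagonalization issues, switching to cg diagonalization')
        return ProcessHandlerReport(True)
```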

So I think that indeed we need to also check the CRASH file if available (retrieve it and add it to the parsing), since when multiple pools are used the error might not be printed by the main pool that does the I/O.

This is the part of the code where the CRASH file is written, for reference.
https://gitlab.com/QEF/q-e/-/blob/068d2d5b9360ed7f21e629dccb1021689aabc507/UtilXlib/error_handler.f90#L93

I think multiple kinds of errors could appear in that file, so we should apply the same logic we currently use for the main output file (or even parse that file at the end, so that the same handlers are triggered); see the sketch below.
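To make the idea concrete, here is a minimal sketch of what scanning a retrieved CRASH file for known messages could look like. The message-to-label mapping only contains the error from this report and the label name is illustrative; the repository access pattern assumes the aiida-core 2.x FolderData interface. The result could then feed into the same exit-code logic the parser already applies to the main output file.

```python
# Sketch of the proposed CRASH-file check: look for the same known error
# messages the stdout parser already handles. Only the message from this
# report is listed; the label name is just illustrative.
KNOWN_CRASH_MESSAGES = {
    'too many bands are not converged': 'ERROR_DIAGONALIZATION_TOO_MANY_BANDS_NOT_CONVERGED',
}


def scan_crash_file(retrieved):
    """Return the labels of known errors found in the retrieved CRASH file, if any."""
    repository = retrieved.base.repository
    if 'CRASH' not in repository.list_object_names():
        return []

    content = repository.get_object_content('CRASH')
    if isinstance(content, bytes):  # be tolerant of text/bytes return defaults
        content = content.decode('utf-8', errors='ignore')

    return [label for message, label in KNOWN_CRASH_MESSAGES.items() if message in content]
```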

@mbercx what do you think?
Feel free to transfer this issue to aiida-quantumespresso; it is probably a more appropriate place.

@unkcpz (Member) commented Apr 11, 2023

Fixed in aiidateam/aiida-quantumespresso#890. @mbercx which version of aiida-quantumespresso will include the fix?

@mbercx (Member) commented Apr 11, 2023

I still have to make a release! I just wanted to sneak in aiidateam/aiida-quantumespresso#902 first, since it would also remove the warnings for the PwBandsWorkChain that you reported. I am travelling to Paris for a conference today, but I can hopefully work on this again on Thursday.

@unkcpz (Member) commented Apr 11, 2023

Thanks @mbercx. I want to have a look at aiidateam/aiida-quantumespresso#902; I will do it by tomorrow.

@unkcpz (Member) commented Oct 22, 2023

Hi @mbercx, can you check whether this issue is fixed? We use aiida-quantumespresso==4.3.0 in the app.
To test with the latest version of the QE app, I made a beta release today, so you can simply run docker run --rm -it -p 8888:8888 aiidalab/qe:latest and directly start preparing the simulation.
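A quick way to double-check, inside the container, that the app really ships a plugin version that includes the fix (standard library only):

```python
# Verify the installed plugin version inside the container (stdlib only).
from importlib.metadata import version

print(version("aiida-quantumespresso"))  # expected: 4.3.0 or later
```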

@mbercx (Member) commented Oct 26, 2023

I ran with 2 CPUs/pools; the error was caught properly and the calculation restarted with a different diagonalisation approach, as expected. So I think this is fixed!

@mbercx mbercx closed this as completed Oct 26, 2023