`PwBaseWorkChain`: revisit restarting from calculations #691

mbercx · 2021-05-19T06:56:16Z

Currently the PwBaseWorkChain keeps track of a parent calculation to restart from in the restart_calc variable in the context. For example, when the user provides a parent_folder input to restart the calculation from, its creator is assigned to the restart_calc context variable here:

aiida-quantumespresso/aiida_quantumespresso/workflows/pw/base.py

Lines 245 to 246 in 9b083e7

    
           if 'parent_folder' in self.ctx.inputs: 
        
               self.ctx.restart_calc = self.ctx.inputs.parent_folder.creator

This context variable is later used to set the CONTROL.restart_mode input tag of the pw.x calculation:

aiida-quantumespresso/aiida_quantumespresso/workflows/pw/base.py

Lines 415 to 420 in 9b083e7

    
           if self.ctx.restart_calc: 
        
               self.ctx.inputs.parameters['CONTROL']['restart_mode'] = 'restart' 
        
               self.ctx.inputs.parent_folder = self.ctx.restart_calc.outputs.remote_folder 
        
           else: 
        
               self.ctx.inputs.parameters['CONTROL']['restart_mode'] = 'from_scratch' 
        
               self.ctx.inputs.pop('parent_folder', None)

However, there is a problem in case the parent_folder has no creator, for example when the user has stashed some calculation outputs like the charge density from which she/he later wants to restart.

I think we should adapt the code so the creator is no longer necessary.

The `restart_mode` input tag

EDIT: I've edited the message below because I had made a mistake during my tests on this issue.

The explanation on the restart_mode is a little confusing, and gives the impression that it can only be used to restart from an interrupted run. I've done a bit of testing on a simple SCF calculation on Si to see what happens when you set this to 'restart', compared to using startingpot or startingwfc:

restart_mode = 'restart': You find the following lines in the output file:
```
   The initial density is read from file :
   ./out/aiida.save/charge-density

   Starting wfcs from file
```
The calculation finishes in 1 SCF step, and the total CPU time is 6.09s.
startingpot = 'file': You find the following lines in the output file:
```
   The initial density is read from file :
   ./out/aiida.save/charge-density

   Starting wfcs are    8 randomized atomic wfcs
```
The calculation finishes in 1 SCF step, and the total CPU time is 11.99s.

startingwfc = 'file': You find the following lines in the output file:

   Initial potential from superposition of free atoms

   starting charge    7.99888, renormalised to    8.00000
   Starting wfcs from file

   total cpu time spent up to now is        0.5 secs

   Self-consistent Calculation

The calculation finished in 7 SCF steps, and the total CPU time is 17.99s.

startingpot = 'file' AND startingwfc = 'file': Same as restart_mode = 'restart'

So, it seems that:

setting restart_mode to 'restart' does restart both interrupted and completed runs properly. In this case both the wave functions and charge density are read from file, which gives a similar result as setting both startingpot and startingwfcto'file'`.
For startingpot = 'file', the calculation only requires 1 SCF step, but it takes longer because the initial wave function coefficients are random. Which is to be expected.
For startingwfc = 'file', it seems that the initial potential (or charge density) is still taken from the superposition of atomic charge densities. Because of this, the calculation still needs 7 SCF steps to converge and this restart method takes the longest. I suppose the question here is why the potential is not constructed from the wave functions before starting the SCF.

TL;DR

In the end, it seems that using restart_mode = 'restart' is indeed the way to go for restarting pw.x calculations for both interrupted and completed runs.

The text was updated successfully, but these errors were encountered:

mbercx · 2021-05-19T07:10:13Z

Thanks to @MackeEric for first drawing my attention to this! Also pinging @ramirezfranciscof and @qiaojunfeng in case they have comments.

mbercx · 2021-05-19T11:18:53Z

So, I was a little rushed in running these tests this morning, and messed up the one for restart_mode = 'restart'. I've updated the OP to correct this.

@MackeEric: didn't you mention during one of our meetings that QE doesn't properly restart from the charge density / wave functions when restart_mode = 'restart'? Based my little test runs for Si, it seems QE v6.7 does properly restart also completed runs with this input by loading both the charge density and the wave functions. In fact, it seems almost identical to setting both startingpot = 'file' and startingwfc = 'file'.

mbercx · 2021-05-19T12:28:54Z

A caveat to using restart_mode = 'restart' is in case e.g. the number of bands has been changed. Then trying to load the wave functions will fail:

     The initial density is read from file :
     ./out/aiida.save/charge-density


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     Error in routine pw_restart - read_collected_wfc (1):
     The number of bands for this run is    12, but only     8 bands were read from file
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this case, it is interesting to use startingpot = 'file', since only 1 SCF will be needed. We should probably use this setting in the sanity_check_insufficient_bands error handler, which claims to restart at the moment:

aiida-quantumespresso/aiida_quantumespresso/workflows/pw/base.py

Lines 434 to 441 in 9b083e7

    
               @process_handler(exit_codes=ExitCode(0)) 
        
               def sanity_check_insufficient_bands(self, calculation): 
        
                   """Perform a sanity check on the band occupations of a  successfully converged calculation. 
        
                   Verify that the occupation of the last band is below a certain threshold, unless `occupations` was explicitly 
        
                   set to `fixed` in the input parameters. If this is violated, the calculation used too few bands and cannot be 
        
                   trusted. The number of bands is increased and the calculation is restarted, starting from the last. 
        
                   """

But as far as I can see it doesn't 😅 , which is what is also reported:

aiida-quantumespresso/aiida_quantumespresso/workflows/pw/base.py

Line 476 in 9b083e7

    
           self.report(f'Action taken: increased number of bands to {nbnd_new} and restarting from scratch')

So we probably should still distinguish between different restart modes to deal with cases like this.

MackeEric · 2021-05-20T15:19:52Z

@MackeEric: didn't you mention during one of our meetings that QE doesn't properly restart from the charge density / wave functions when restart_mode = 'restart'? Based my little test runs for Si, it seems QE v6.7 does properly restart also completed runs with this input by loading both the charge density and the wave functions. In fact, it seems almost identical to setting both startingpot = 'file' and startingwfc = 'file'.

So in my experience, I have seen a somehow inconsistent behaviour, at least with QE v6.6. In a test scenario I designed on my own, I also found that startingpot = 'file' and startingwfc = 'file' are basically equivalent to setting restart_mode = 'restart'.

However, I have frequently seen cases where QE crashed when I was restarting from a calulation where I had set occupations = 'smearing' and then used occupations = 'fixed' for the restarted run (even with nbnd and tot_magnetization set to the same values as outputted by the smearing run!). In principle, this should not cause any trouble but in reality it often does, yielding PW errors such as too many bands are not converged directly after the first iteration.

On the other hand, the very same restarts worked perfectly fine if I only used startingpot = 'file' while leaving restart_mode = 'restart'.

mbercx · 2021-05-20T16:47:57Z

Thanks @MackeEric!

I also found that startingpot = 'file' and startingwfc = 'file' are basically equivalent to setting restart_mode = 'restart'.

One difference I realised while doing some more testing is that restart_mode = 'restart' also reads the structure from the previous calculation:

     Atomic positions and unit cell read from directory:
     ./out/aiida.save/
     Atomic positions from file used, from input discarded

So, when e.g. restarting from a previous relax calculation, using restart_mode = 'restart' just needs one SCF and ionic step, whereas setting startingpot = 'file' and startingwfc = 'file' will use of course use the initial structure.

However, I have frequently seen cases where QE crashed when I was restarting from a calulation where I had set occupations = 'smearing' and then used occupations = 'fixed' for the restarted run

Right, you weren't just restarting the same run, but adapting inputs. I can imagine using the wave functions from the 'smearing' run for a 'fixed' run causes troubles, since suddenly there will be states with partial occupations. I imagine similar errors will occur when using startingwfc = 'file' in this case. (Coindidentally, I did just try this for my simple Si SCF case and there I didn't run into any issues.)

So, using restart_mode = 'restart' is fine in case the run has completed, provided you don't change any calculation settings that cause trouble when loading the wave functions of the previous run.

bastonero · 2021-06-02T13:49:46Z

Hi everyone,

This days I did find @MackeEric very same problem ("too many bands are not converged"), while trying to "restart" a pw calculation in the hubbard workflow. When you want to start from a density of an other pw calculation, but not use the wfcs (as in, eg, the hubbard workflows), it is then necessary to be able to start the calculation using restart_mode='from_scratch' and startingpot='file', since the "smeared" wfcs could be too much different.

Could a solution be having a restart_folder and a parent_folder?
The former is specifically used for restarting, while the latter is used to just copy the out folder (where one may want also to copy only specific part of it as an option for particular purposes? ).

Anyway, I do believe that a solution is needed, especially when I think about the hubbard workflow.

mbercx · 2021-06-02T14:46:24Z

Thanks for the comment @bastonero! I indeed think a possible path forward could be keeping track of a restart_folder instead of a restart_calc and updating the logic where needed to use startingpot = file instead of restart_mode = 'restart'. I just haven't looked into this in much detail, so I'm not sure if it leads to issues somewhere.

I have a meeting on Friday at 2pm with @timrov on discussing these restart issues. Maybe you also want to join?

bastonero · 2021-06-02T15:01:35Z

I have a meeting on Friday at 2pm with @timrov on discussing these restart issues. Maybe you also want to join?
Yes, I think it would be great, thanks!

mbercx · 2021-06-04T13:30:44Z

Notes from meeting with @ramirezfranciscof, @timrov, @MackeEric and @bastonero:

We should consider only copying some files when restarting from a calculation. This should be fixable at the plugin level, and could help with reducing restart times when e.g. the wave functions are very big.
Question: When you try restart_mode = 'restart' and some of the outputs (e.g. wfc) are missing, does it crash the calculation?
Question: Is the indeed output structure is read from xml (is this the restart file the documentation mentions?) by default even if restart_mode = 'from_scratch'? See ion_positions input tag.
TODO: write document on all this for other QE users, and as a guideline for the discussion on making possible changes to QE (see below). Also check recent QE school for material on the topic.

Issues to raise on QE GitLab:

The charge density (potential) should be constructed from the wave functions in case startingwfc = 'file'.
Is there a problem with restarting with restart_mode = 'restart' outside of two interrupt cases suggested in the documentation? For example, if you restart from a calculation that has hit the maximum number of ionic iterations, is this ok?
Change the way restarting works in QE:
1. restart_mode = 'restart' should not fail if something is missing. I.e. it should just check what outputs are there and use them to restart. (check question above)
2. The individual settings for restarting (startingpot, startingwfc and ion_positions) should override the outputs that restart_mode uses, not the other way around.
3. For consistency, it might actually be nice to rename ion_positions to e.g. startingion or something like that. Now it's not super clear that this is regarding restarting, and it would make it more in line with startingpot and startingwfc.

Pinging @giovannipizzi for comments, considering his considerable QE experience. 😁

mbercx added type/bug topic/workflows priority/important labels May 19, 2021

mbercx self-assigned this May 19, 2021

mbercx added type/enhancement and removed type/bug labels Jun 6, 2021

ramirezfranciscof mentioned this issue Jul 21, 2021

Add transfer calculation for density restart #706

Open

mbercx mentioned this issue Aug 31, 2021

PwBaseWorkChain: improve restart from parent_folder #722

Merged

This was referenced Apr 9, 2023

Protocols: Move all static work chain inputs to protocol #902

Merged

Docs: Add documentation on restart usage and logic #916

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`PwBaseWorkChain`: revisit restarting from calculations #691

`PwBaseWorkChain`: revisit restarting from calculations #691

mbercx commented May 19, 2021 •

edited

Loading

mbercx commented May 19, 2021

mbercx commented May 19, 2021 •

edited

Loading

mbercx commented May 19, 2021

MackeEric commented May 20, 2021

mbercx commented May 20, 2021

bastonero commented Jun 2, 2021

mbercx commented Jun 2, 2021

bastonero commented Jun 2, 2021

mbercx commented Jun 4, 2021 •

edited

Loading

PwBaseWorkChain: revisit restarting from calculations #691

PwBaseWorkChain: revisit restarting from calculations #691

Comments

mbercx commented May 19, 2021 • edited Loading

The restart_mode input tag

TL;DR

mbercx commented May 19, 2021

mbercx commented May 19, 2021 • edited Loading

mbercx commented May 19, 2021

MackeEric commented May 20, 2021

mbercx commented May 20, 2021

bastonero commented Jun 2, 2021

mbercx commented Jun 2, 2021

bastonero commented Jun 2, 2021

mbercx commented Jun 4, 2021 • edited Loading

Issues to raise on QE GitLab:

`PwBaseWorkChain`: revisit restarting from calculations #691

`PwBaseWorkChain`: revisit restarting from calculations #691

mbercx commented May 19, 2021 •

edited

Loading

The `restart_mode` input tag

mbercx commented May 19, 2021 •

edited

Loading

mbercx commented Jun 4, 2021 •

edited

Loading