Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PwRelaxWorkChain: Revisit meta iterations and final SCF #705

Open
3 tasks
Tracked by #406
mbercx opened this issue Jul 20, 2021 · 9 comments · May be fixed by #952
Open
3 tasks
Tracked by #406

PwRelaxWorkChain: Revisit meta iterations and final SCF #705

mbercx opened this issue Jul 20, 2021 · 9 comments · May be fixed by #952

Comments

@mbercx
Copy link
Member

mbercx commented Jul 20, 2021

Currently, the PwRelaxWorkChain performs several relaxations (meta iterations) in case the change in volume of the relaxation is above 1%:

# After first iteration, simply set the cell volume and restart the next base workchain
if not prev_cell_volume:
self.ctx.current_cell_volume = curr_cell_volume
# If meta convergence is switched off we are done
if not self.ctx.meta_convergence:
self.ctx.is_converged = True
return
# Check whether the cell volume is converged
volume_threshold = self.inputs.volume_convergence.value
volume_difference = abs(prev_cell_volume - curr_cell_volume) / prev_cell_volume
if volume_difference < volume_threshold:
self.ctx.is_converged = True
self.report(
'relative cell volume difference {} smaller than convergence threshold {}'.format(
volume_difference, volume_threshold
)
)
else:
self.report(
'current relative cell volume difference {} larger than convergence threshold {}'.format(
volume_difference, volume_threshold
)
)

One possible motivation for this could be the (isotropic?) Pulay stress: checking the volume change and repeating the calculation in case it is above a certain threshold would be a viable fix for this problem. There could be other reasons though, so we have to carefully check for some runs that this is indeed the case.

If the only motivation is the Pulay stress, I think we can remove this meta-convergence and simply rely on the error handler in the PwBaseWorkChain that checks if the pressure obtained from the final SCF with a regenerated basis set is still below the threshold. In case the pressure is converged to the threshold in the vc-relax, the resulting pressure from this final SCF should be exactly the Pulay stress (plus-minus the threshold, of course).

Regarding the final SCF: an important historical use case for this was the fact that some structures came from databases where the license has restrictions on making the structures publicly available. Since the original structures are stored in the provenance, the extra SCF is important to cut off the published provenance at this point.

Conclusion - TODO

  • Add an initial (optional) relaxation with its very own namespace.
  • The "meta-convergence", i.e. the multiple relaxations performed in the PwRelaxWorkChain, should not depend on the change in volume. Rather, it should check for two things:
    • Did the PwBaseWorkChain finish with exit code 501? If so, there are still Pulay stresses in the system and another vc-relax should be run.
    • If the k-points were provided as a density (i.e. through kpoints_distance) and we regenerate the k-points mesh, did it change? If so, the initially generated mesh does not correspond to the one requested by the user, and we should rerun the relaxation with the correct k-points mesh.
  • Remove the final SCF.
@giovannipizzi
Copy link
Member

Additional (historical) note: the final SCF (with printing the stress and forces turned on) would allow to have an indication, in the provenance, of the forces and stresses (if they are below a certain threshold, we are allowed by the license to publish the data).
Without the final SCF, one could still cut the whole provenance, and use the final structure as the "initial" node, but then one would not know if the stress and forces are below the threshold that was decided.

@mbercx
Copy link
Member Author

mbercx commented Jan 20, 2022

Some more investigations on this, looking at 2510 work chains from the MC3D that ran 3 meta-convergence vc-relax:

First some statistics. Out of 2510 work chains:

  • 1886 have 3 meta-convergence steps
  • 456 have 4 meta-convergence steps
  • 168 have 5 meta-convergence steps

I'll focus on looking at the ones with 3 meta-convergence steps, then we can simply look at step 2. A couple of reasons why a third step could be necessary:

  1. Change in k-points grid: If the unit cell changes significantly, it's possible that the original k-points mesh no longer represents the kpoints density defined by kpoints_distance. This happened on the second step in 1227 cases, and might be a legitimate reason for keeping the meta-convergence.
  2. Forces weren't converged for the final SCF at the end of the pw.x run. So, basically the PwBaseWorkChain has exit code 501. Unless I'm missing something, I would simply fix this with an error handler for the PwBaseWorkChain, since this is most likely due to the Pulay stresses discussed above. Note that there is some overlap with the runs in [1].
  3. The highest band was occupied in the first PwCalculation of the second PwBaseWorkChain (328 cases). Seems these almost always follow after PwBaseWorkChain which exited with status 501. This is most likely an indication that in the first PwBaseWorkChain, the number of bands was also insufficient, but this is not handled since the sanity_check_insufficient_bands handler is only called for exit status 0. Since the forces in the first PwBaseWorkChain were most likely erroneous, it makes sense that another meta convergence step is needed since this error is corrected in the second PwBaseWorkChain. If exit status 501 is handled by the PwBaseWorkChain, this should no longer be an issue.

To proceed, I would have exit code 501 restart the calculation with RestartType.FULL. I guess I don't fully agree with this explanation:

if self.is_ionically_converged(trajectory, except_final_scf=True):
# The forces and stresses of ionic cycle are below threshold, but those of the final SCF exceed them.
# This is not necessarily a problem since the calculation starts from scratch after the variable cell
# relaxation and the forces and stresses can be slightly different. Still it is useful to distinguish
# these calculations so we return a special exit code.
return self.exit_codes.ERROR_IONIC_CONVERGENCE_REACHED_EXCEPT_IN_FINAL_SCF

Since the reason why the forces/stresses can be different is because the basis set is incomplete for the new unit cell, which is caught by the final SCF since afaik the basis set is regenerated before this static run.

The meta convergence can remain, but I would add a feature as discussed in #733, i.e. do a first run with much looser precision settings.

@sphuber
Copy link
Contributor

sphuber commented Jan 21, 2022

Thanks for the analysis @mbercx . Good to know that the meta-convergence still has a true raison d'être.

I don't think I understand why you think the comment in the parser you highlight is incorrect. The basis sets are recomputed and so the computed forces/stress may indeed very well be different if the basis set changes a lot with respect to the one used in the vc-relax cycle.

At the time I decided to not handle the 501 in the PwBaseWorkChain itself, because I thought that in certain situations, this may very well be sufficient for the caller and forcing a full restart may not always be necessary. That is why I decided to just move the responsibility of handling it to the PwRelaxWorkChain which would anyway rerun it through the meta-convergence loop. If you think this should always be handle in the base, I would be ok with this I guess

@mbercx
Copy link
Member Author

mbercx commented Jan 24, 2022

I don't think I understand why you think the comment in the parser you highlight is incorrect.

I'm not 100% sure that "this is not necessarily a problem". If the basis set turned out to be incomplete due to the change in unit cell, isn't the final structure incorrect?

The only case which I can think of where 501 would be acceptable is if the user is going to re-run the vc-relax anyways, which to be fair is true when running inside the meta-convergence loop of the PwRelaxWorkChain. This has the additional benefit that doing so will also regenerate the k-point grid. So I suppose there is a benefit in efficiency in allowing this to pass at the PwBaseWorkChain level and maybe this is the best approach after all, assuming users always rely on the PwRelaxWorkChain. ^^

@sphuber
Copy link
Contributor

sphuber commented Jan 24, 2022

If the basis set turned out to be incomplete due to the change in unit cell, isn't the final structure incorrect?

Define "incorrect". Yes, if you were to rerun the calculation with same settings, but just do SCF (instead of a relax) you would find that the stress and or forces actually exceed the requested thresholds.

But it seems that we agree. I also can see this error just being handled in the base by restarting. Since the restart should be fast, maybe there is no harm in doing it always. As you say, my solution relies on going through the PwRelaxWorkChain which might not always be the case.

@mbercx
Copy link
Member Author

mbercx commented May 5, 2023

@sphuber since I'm planning to (re)run a bunch of relaxation for the MC3D again shortly, I wanted to finally tackle the logic of the PwRelaxWorkChain as described in this issue. In summary:

  • I would add an initial (optional) relaxation with its very own namespace. The use case is described in PwRelaxWorkChain: Use multiple relaxations with increasing precision #733, i.e. it allows the user to set loose precision settings for this initial run, making the work chain more robust and efficient in a lot of cases. Moreover, we can use this work chain to deal with the issue of database licenses by setting the thresholds to those specified in e.g. the MPDS license.
  • The "meta-convergence", i.e. the multiple relaxations performed in the PwRelaxWorkChain, should not depend on the change in volume. Rather, it should check for two things:
    • Did the PwBaseWorkChain finish with exit code 501? If so, there are still Pulay stresses in the system and another vc-relax should be run. Note that I suggest the PwRelaxWorkChain should rerun the relaxation, not the PwBaseWorkChain. I agree there might be use cases where the Pulay stresses should not be corrected at the PwBaseWorkChain level. In fact, [1] is such a use case.
    • If the k-points were provided as a density (i.e. through kpoints_distance) and we regenerate the k-points mesh, did it change? If so, the initially generated mesh does not correspond to the one requested by the user, and we should rerun the relaxation with the correct k-points mesh.
  • Regarding the "final SCF", I'm not sure if it should remain. I suppose one could think of properties the user might want to calculate with this extra SCF and different settings, but then they can just run another PwBaseWorkChain on top. I think the purpose of the PwRelaxWorkChain should be well-defined: "To find the correct geometric ground state of the input structure". So I would vote to remove it.

Happy to hear your thoughts! Also pinging @mkotiuga @giovannipizzi @unkcpz @qiaojunfeng @bastonero to hear what they think.

To still come back to this:

Define "incorrect". Yes, if you were to rerun the calculation with same settings, but just do SCF (instead of a relax) you would find that the stress and or forces actually exceed the requested thresholds.

Exactly. And in case this is true for the final meta-iteration, the PwRelaxWorkChain will not have fully relaxed the structure, e.g.:

PwRelaxWorkChain<527353> Finished [0] [3:results]
    ├── PwBaseWorkChain<527360> Finished [501] [2:while_(should_run_process)(2:inspect_process)]
    │   ├── create_kpoints_from_distance<527388> Finished [0]
    │   └── PwCalculation<527407> Finished [501]
    └── PwBaseWorkChain<527925> Finished [501] [2:while_(should_run_process)(2:inspect_process)]
        ├── create_kpoints_from_distance<527931> Finished [0]
        └── PwCalculation<527935> Finished [501]

The final structure here is not converged within the thresholds specified by the user. The main purpose of the PwRelaxWorkChain is to take care of this, else the user might as well just run a PwBaseWorkChain. I'll admit this case is rather rare, a lot of times the second meta-iteration does finish with exit code zero. But not always, and in a lot of cases the second iteration is run when it is not necessary.

@sphuber
Copy link
Contributor

sphuber commented May 21, 2023

I would add an initial (optional) relaxation with its very own namespace. The use case is described in #705, i.e. it allows the user to set loose precision settings for this initial run, making the work chain more robust and efficient in a lot of cases. Moreover, we can use this work chain to deal with the issue of database licenses by setting the thresholds to those specified in e.g. the MPDS license.

I think a pre-relax could make sense as long as it is optional. Were you thinking of just a single PwBaseWorkChain with some loose convergence parameters? And as long as it is not a complete failure, go to the meta-convergence loop with the strict parameters?

The "meta-convergence", i.e. the multiple relaxations performed in the PwRelaxWorkChain, should not depend on the change in volume. Rather, it should check for two things:
Did the PwBaseWorkChain finish with exit code 501? If so, there are still Pulay stresses in the system and another vc-relax should be run. Note that I suggest the PwRelaxWorkChain should rerun the relaxation, not the PwBaseWorkChain. I agree there might be use cases where the Pulay stresses should not be corrected at the PwBaseWorkChain level. In fact, [1] is such a use case.

This is already the case, I believe. A handler exists for ERROR_IONIC_CONVERGENCE_REACHED_EXCEPT_IN_FINAL_SCF that will simply attach the results and terminate the PwBaseWorkChain. And so I believe the current logic of the PwRelaxWorkChain is correct as it also checks for that exit code and restart once more.

If the k-points were provided as a density (i.e. through kpoints_distance) and we regenerate the k-points mesh, did it change? If so, the initially generated mesh does not correspond to the one requested by the user, and we should rerun the relaxation with the correct k-points mesh.

Agreed.

Regarding the "final SCF", I'm not sure if it should remain. I suppose one could think of properties the user might want to calculate with this extra SCF and different settings, but then they can just run another PwBaseWorkChain on top. I think the purpose of the PwRelaxWorkChain should be well-defined: "To find the correct geometric ground state of the input structure". So I would vote to remove it.

Historically, there were two reasons for this I believe:

  1. In older versions of QE (and here I am talking pre v6.0), pw.x would not perform a final scf within the calculation itself. Therefore, the original workflow even written in the old AiiDA workflow system of v0.x used to perform this manually. Now this final scf is done by pw.x itself, where it recalculates the grid after the cell has changed in a vc-relax, so it should no longer be necessary.
  2. At the start of MC3D, we decided to keep the final scf such that we could use that as the calculation to share publicly, since at that point the structure should be converged and we would be allowed to share it according to the MPDS license. So the same argument as you made for having the pre-relax step at loose tolerances.

If this a complete list of the reasons (I might be forgetting some arguments), then it would indeed make sense to get rid of the final scf in the PwRelaxWorkChain.

The final structure here is not converged within the thresholds specified by the user. The main purpose of the PwRelaxWorkChain is to take care of this, else the user might as well just run a PwBaseWorkChain. I'll admit this case is rather rare, a lot of times the second meta-iteration does finish with exit code zero. But not always, and in a lot of cases the second iteration is run when it is not necessary.

This is actually strange isn't it? If the first 501 is most likely because the initial structure changed a lot during the vc-relax and so the final scf exceeded the tolerances. When restarting at that relaxed structure, it should be close to the minimum and so the second PwBaseWorkChain should be quick to minimize, and we definitely shouldn't expect a large cell change, and therefore no 501. Does the second 501 have another reason? Did the second workchain indeed see a large cell change?

@mbercx
Copy link
Member Author

mbercx commented May 21, 2023

I think a pre-relax could make sense as long as it is optional. Were you thinking of just a single PwBaseWorkChain with some loose convergence parameters? And as long as it is not a complete failure, go to the meta-convergence loop with the strict parameters?

Yes, exactly that. Just a single (optional) run with looser convergence parameters. I'm still wondering if it should be run by default. If not, we'd still want a protocol for this initial relaxation, I think.

And so I believe the current logic of the PwRelaxWorkChain is correct as it also checks for that exit code and restart once more.

Does it? I can only find a reference to the ERROR_IONIC_CONVERGENCE_REACHED_EXCEPT_IN_FINAL_SCF exit code here:

acceptable_statuses = ['ERROR_IONIC_CONVERGENCE_REACHED_EXCEPT_IN_FINAL_SCF']
if workchain.is_excepted or workchain.is_killed:
self.report('relax PwBaseWorkChain was excepted or killed')
return self.exit_codes.ERROR_SUB_PROCESS_FAILED_RELAX
if workchain.is_failed and workchain.exit_status not in PwBaseWorkChain.get_exit_statuses(acceptable_statuses):
self.report(f'relax PwBaseWorkChain failed with exit status {workchain.exit_status}')
return self.exit_codes.ERROR_SUB_PROCESS_FAILED_RELAX

And there it is simply used to allow this exit code, i.e. it doesn't make the work chain exit with ERROR_SUB_PROCESS_FAILED_RELAX.

The context variable is_converged is set in three locations in the inspect_relax method:

  1. To allow the absence of an output_structure in case calculation is scf:

try:
structure = workchain.outputs.output_structure
except exceptions.NotExistent:
# If the calculation is set to 'scf', this is expected, so we are done
if self.ctx.relax_inputs.pw.parameters['CONTROL']['calculation'] == 'scf':
self.ctx.is_converged = True
return

  1. To finish the meta convergence loop in case it is disabled:

# If meta convergence is switched off we are done
if not self.ctx.meta_convergence:
self.ctx.is_converged = True
return

  1. The actual check for a default PwRelaxWorkChain run with meta convergence enabled. Here convergence of the meta iterations is only decided upon using the (relative) change in volume:

# Check whether the cell volume is converged
volume_threshold = self.inputs.volume_convergence.value
volume_difference = abs(prev_cell_volume - curr_cell_volume) / prev_cell_volume
if volume_difference < volume_threshold:
self.ctx.is_converged = True
self.report(
f'relative cell volume difference {volume_difference} smaller than threshold {volume_threshold}'
)
else:
self.report(
f'current relative cell volume difference {volume_difference} larger than threshold {volume_threshold}'
)
self.ctx.current_cell_volume = curr_cell_volume
return

If this a complete list of the reasons (I might be forgetting some arguments), then it would indeed make sense to get rid of the final scf in the PwRelaxWorkChain.

There is still the comment from @giovannipizzi above. I.e. we want to have "proof" that the license conditions are met. This is a bit tricky with the pre-relax approach: continuing from the "pre-relaxed" structure, there is no guarantee that on the first SCF from the "proper" the forces and stresses will be lower than these thresholds. I guess it does point to the somewhat arbitrary nature of these conditions, since they will obviously depend on the chosen computational parameters.

Frankly, I would not let the structure of the PwRelaxWorkChain depend too much of what we need for the MC3D. The pre-relax' main purpose is to increase performance and robustness. If we need proof that the structures we relaxed in this step agree with the license conditions for publishing, we can run a script to extract the final SCF run from the pre-relax and add it as a SingleFileData to the archive.

This is actually strange isn't it? If the first 501 is most likely because the initial structure changed a lot during the vc-relax and so the final scf exceeded the tolerances. When restarting at that relaxed structure, it should be close to the minimum and so the second PwBaseWorkChain should be quick to minimize, and we definitely shouldn't expect a large cell change, and therefore no 501. Does the second 501 have another reason? Did the second workchain indeed see a large cell change?

I'd have to find it again (I tried, but too many projects 😅). To be clear, these instances are rare. In the vast majority of the cases, the volume change is a good indicator that you might have Pulay stresses. But now that QE runs a final SCF for each vc-relax anyways, looking at the final stresses is a more natural one. You don't have to think about how much the volume can change before it might be problematic. In case the final SCF indicates either the forces or stresses are above the threshold, you run another relaxation. It's straightforward and ever so slightly more precise.

@mbercx
Copy link
Member Author

mbercx commented May 22, 2023

Found the example PwRelaxWorkChain from above again. Basically the calculation had barely met the pressure thresholds in the second relaxation, to then just fail to meet the threshold in the final SCF:

❯ verdi calcjob outputcat 527935| grep P=
          total   stress  (Ry/bohr**3)                   (kbar)     P=       -0.76
          total   stress  (Ry/bohr**3)                   (kbar)     P=       -0.48
          total   stress  (Ry/bohr**3)                   (kbar)     P=       -0.51

There were only a few ionic steps, and the volume didn't change much:

❯ verdi calcjob outputcat 527935| grep volume
     unit-cell volume          =     605.4533 (a.u.)^3
     new unit-cell volume =    605.21985 a.u.^3 (    89.68433 Ang^3 )
     new unit-cell volume =    605.21985 a.u.^3 (    89.68433 Ang^3 )
     unit-cell volume          =     605.2199 (a.u.)^3

So it technically hasn't converged, but only by a hair's breadth. ^^ I still think using the 501 exit code is the way to go, but it's fair that in this case it would run another relaxation for only the tiniest improvement in convergence. But as I've said, these cases are rare.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants