Sudden DMC energy crash with batched drivers and T-moves #3608
Comments
I got confused. Are these runs from July or recent? Could you list the code commit hash?
I have the same request as @ye-luo: Please add the version (code commit hash, date) of the code used and how it was compiled/where obtained.
Apart from the run failure due to the trapped walker/bug (etc.) does the run otherwise appear to be "correct"? I am wondering if this is only a population control problem or if there are other issues.
These runs are from July. We had talked at length offline, but without an official record of the problem. I will try to track down version numbers, etc.
I saw #3082 (comment) so got confused. It seems the current issue is not energy but population. Could you clarify the up-to-date status?
I will provide more information when I can obtain all the original files. These were Tomohiro's runs. I generally assume an issue still exists until it has been explicitly addressed and confirmed as a non-problem.
See if the runs are in the INCITE project directory/can be put there.
Tomohiro is moving the files to Summit at: /gpfs/alpine/mat151/proj-shared/bugs/01_perovskite_dmc_crash_issue_3608 Given the short file lifetime there, these should be backed up to ALCF. (ALCF backup: /projects/PSFMat_2/jtkrogel/summit_reruns)
git version info:
Base and full precision are both double.
Could you rerun with develop-20211012? Several known bugs have been fixed in develop, so we need to re-assess the issue.
Do you have the old build as well to rerun with? With rare events, absence of evidence is not evidence of absence.
Including lightweight inputs and outputs: in_out.zip
Unfortunately, the Summit OS upgrade over the summer required rebuilding all the executables. The old ones no longer work.
OK. I'm not sure what we will learn from a single rerun then.
Re-run to see if the issue remains on the latest code.
See my evidence comment above. We will have to do a number of runs to have any confidence of a fix.
Even if I make a fix, as long as it changes the trajectory, there is a chance it is not a real fix but simply avoids the bad configuration. That is why we are moving to unit tests as a way of confirming fixes.
I will perform several reruns unless someone objects on cost grounds.
No objections.
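The point above about rare events and the number of reruns can be made roughly quantitative. The sketch below is not from the thread: it assumes each rerun crashes independently with the same unknown probability, which may not hold for real DMC trajectories, and the function names are hypothetical. Under that assumption it estimates how many crash-free reruns are needed before a given per-run crash probability can be ruled out.

```python
import math


def prob_all_clean(p_crash, n_runs):
    """Probability that n_runs independent runs all finish without a crash,
    if each run crashes with probability p_crash (an assumed model)."""
    return (1.0 - p_crash) ** n_runs


def clean_runs_needed(p_max, confidence=0.95):
    """Smallest number of crash-free runs that rules out a per-run crash
    probability of p_max or more at the given confidence level."""
    if not 0.0 < p_max < 1.0:
        raise ValueError("p_max must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p_max)**n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_max))


if __name__ == "__main__":
    # A single clean rerun is still quite likely even if half of all runs crash.
    print(prob_all_clean(0.5, 1))
    for p_max in (0.5, 0.1):
        print(p_max, clean_runs_needed(p_max))
```

In this simple model, one clean rerun says little, which matches the concern raised about a single rerun, and the required number of runs grows quickly as the crash rate to be excluded gets smaller.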
Describe the bug
DMC batched code (T-moves) can experience sudden crashes in the local energy even with good wavefunctions and within runs that have been running stably for a long time.
See graph below for an energy trace of LFO bulk with the batched code:
Similar runs with the legacy CPU code showed no such issue. Other closely related runs on Summit also manifested this issue. The problem likely relates to the handling of rare events in the branching, hence the utility of a feature like #3292.
Basic energy/variance information:
To Reproduce
Need to check on provenance of the original data as these runs were performed in July.
Expected behavior
DMC T-moves should be stable in the batched code.
System:
Summit
Cell size: ~80 atoms
QMCPACK code used: complex batched/offload code
Walker counts: 140 per V100, 58800 in total
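For a sense of the run scale, the walker counts above imply a GPU count. The short sketch below only does that arithmetic. It assumes every GPU carries the same number of walkers and that all six V100s of each Summit node were used; neither is stated in the report, and the function names are hypothetical.

```python
def gpus_needed(total_walkers, walkers_per_gpu):
    """Number of GPUs implied by an even split of walkers across GPUs."""
    gpus, remainder = divmod(total_walkers, walkers_per_gpu)
    if remainder != 0:
        raise ValueError("walkers do not divide evenly across GPUs")
    return gpus


def nodes_needed(gpus, gpus_per_node):
    """Number of nodes if every GPU on each node is used (assumption)."""
    nodes, remainder = divmod(gpus, gpus_per_node)
    if remainder != 0:
        raise ValueError("GPU count does not fill whole nodes")
    return nodes


if __name__ == "__main__":
    gpus = gpus_needed(58800, 140)   # values from the System list
    nodes = nodes_needed(gpus, 6)    # Summit nodes have six V100 GPUs
    print(gpus, nodes)
```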