
Sudden DMC energy crash with batched drivers and T-moves #3608

Open · jtkrogel opened this issue Nov 12, 2021 · 21 comments

@jtkrogel (Contributor)

Describe the bug

The batched DMC code with T-moves can experience sudden crashes in the local energy, even with good wavefunctions and in runs that have been stable for a long time.

See graph below for an energy trace of LFO bulk with the batched code:

[Figure: lfo_bulk_E_trace, local energy trace for bulk LFO with the batched code]

Similar runs with the legacy CPU code showed no such issue. Other closely related runs on Summit also manifested this issue. The problem likely relates to the handling of rare events in the branching, hence the utility of a feature like #3292.
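
The rare events referenced here occur in the branching step, where a walker's weight sets how many copies it spawns. As a purely schematic illustration (not QMCPACK's implementation, and not the exact scheme proposed in #3292), capping the branching weight is one way such events can be tamed:

```python
import random

def branch_multiplicity(weight, cap=2.0, rng=random.random):
    """Schematic DMC branching: a walker of weight w spawns int(w + xi) copies.

    The cap on the weight is an illustrative guard against rare, very large
    weights (e.g. from a local-energy spike); it is not QMCPACK's scheme.
    """
    w = min(weight, cap)
    return int(w + rng())

# A pathological weight of 50 would spawn ~50 copies without the cap,
# but only 2 with the cap at 2.0.
print(branch_multiplicity(50.0))
```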

Basic energy/variance information:

>qmca -a -q ev lfo_bulk/06_mk_cef/*scalar*
                LocalEnergy                Variance                 ratio 
avg  series 0  -3257.644330 +/- 0.001689   67.354323 +/- 0.055550   0.0207 
avg  series 1  -3260.942260 +/- 0.117516   69.132976 +/- 0.051298   0.0212 
avg  series 2  -3260.899774 +/- 0.001210   68.434296 +/- 0.015824   0.0210 
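
As a quick triage aid (not part of the original report), a minimal sketch that flags sudden LocalEnergy excursions in a *.scalar.dat trace. The '#'-prefixed header line, the column name, the equilibration cut, and the 5-sigma threshold are assumptions; qmca remains the authoritative analysis tool.

```python
import sys
import numpy as np

def load_column(path, name="LocalEnergy"):
    """Read one named column from a QMCPACK *.scalar.dat file (assumes a '#' header line)."""
    with open(path) as f:
        header = f.readline().lstrip("#").split()
    data = np.loadtxt(path)  # '#' lines are skipped as comments
    return data[:, header.index(name)]

def flag_spikes(energies, equil=100, nsigma=5.0):
    """Return block indices deviating from the post-equilibration mean by more than nsigma sigma."""
    ref = energies[equil:]
    mean, sigma = ref.mean(), ref.std()
    return np.where(np.abs(energies - mean) > nsigma * sigma)[0]

if __name__ == "__main__":
    for path in sys.argv[1:]:
        e = load_column(path)
        bad = flag_spikes(e)
        print(path, ":", len(bad), "outlier blocks", bad[:10])
```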

To Reproduce

The provenance of the original data needs to be checked, as these runs were performed in July.

Expected behavior

DMC T-moves should be stable in the batched code.

System:

Machine: Summit
Cell size: ~80 atoms
QMCPACK code used: complex batched/offload code
Walker counts: 140 per V100, 58800 in total (see the back-of-envelope check below)
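
For orientation only (an added note, not from the report), a back-of-envelope check of what those walker counts imply, assuming Summit's 6 V100s per node:

```python
# Back-of-envelope check of the stated walker counts (assumes 6 V100s per Summit node).
walkers_per_gpu = 140
total_walkers = 58800
gpus = total_walkers // walkers_per_gpu   # 420 V100s
nodes = gpus // 6                         # 70 Summit nodes
print(gpus, "GPUs across", nodes, "nodes")
```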

@ye-luo (Contributor) commented Nov 12, 2021

I'm confused: are these runs from July, or recent? Could you list the code commit hash?

@prckent (Contributor) commented Nov 12, 2021

I have the same request as @ye-luo : Please add the version (code commit hash, date) of the code used and how it was compiled/where obtained.

@prckent (Contributor) commented Nov 12, 2021

Apart from the run failure due to the trapped walker/bug (etc.), does the run otherwise appear to be "correct"? I am wondering whether this is only a population control problem or whether there are other issues.

@jtkrogel (Contributor, Author)

These runs are from July. We had talked at length offline, but without an official record of the problem. I will try to track down version numbers, etc.

@ye-luo (Contributor) commented Nov 12, 2021

I saw #3082 (comment) and got confused. It seems the current issue is not the energy but the population. Could you clarify the up-to-date status?

@jtkrogel (Contributor, Author)

I will provide more information when I can obtain all the original files. These were Tomohiro's runs. I generally assume an issue still exists until it has been explicitly addressed and confirmed as a non-problem.

@prckent (Contributor) commented Nov 12, 2021

Please check whether the runs are in the INCITE project directory, or whether they can be put there.

@ye-luo (Contributor) commented Nov 12, 2021

Between July and October, two fixes related to T-moves went in: #3507 and #3524. If the remaining issue is only the population and not the energy, it may be related to the population control rather than to T-moves.

@jtkrogel (Contributor, Author) commented Nov 12, 2021

Tomohiro is moving the files to Summit at:

/gpfs/alpine/mat151/proj-shared/bugs/01_perovskite_dmc_crash_issue_3608

Given the short file lifetime there, these should be backed up to ALCF.

(ALCF backup: /projects/PSFMat_2/jtkrogel/summit_reruns)

@jtkrogel (Contributor, Author) commented Nov 12, 2021

git version info:

  Git branch: develop
  Last git commit: 60edba4e71afddaeaf2dc910ffc652b4df6f0f0f-dirty
  Last git commit date: Sat Mar 27 21:04:23 2021 -0500
  Last git commit subject: Merge pull request #3055 from PDoakORNL/addCheckMatrix_for_testing

Base/full precision are both double.

@jtkrogel (Contributor, Author)

Spikes in the Kinetic, LocalECP, and NonLocalECP energies accompany the total energy crash (local energy and trial energy) and the population explosion; a sketch for locating the onset follows the traces below.

Kinetic: [Figure: lfo_bulk_kinetic]

LocalECP: [Figure: lfo_bulk_localecp]

NonLocalECP: [Figure: lfo_bulk_nonlocalecp]
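
A sketch (not part of the thread) of how the crash onset could be located by scanning these observables in the same *.scalar.dat file; the column names follow the traces above, while the header format, the example filename, and the z-score threshold are assumptions:

```python
import numpy as np

OBSERVABLES = ["LocalEnergy", "Kinetic", "LocalECP", "NonLocalECP"]

def first_spike_block(path, equil=100, nsigma=8.0):
    """Return, per observable, the first block whose z-score exceeds nsigma (None if never)."""
    with open(path) as f:
        header = f.readline().lstrip("#").split()
    data = np.loadtxt(path)
    onsets = {}
    for name in OBSERVABLES:
        col = data[:, header.index(name)]
        ref = col[equil:]
        z = np.abs(col - ref.mean()) / ref.std()
        hits = np.where(z > nsigma)[0]
        onsets[name] = int(hits[0]) if len(hits) else None
    return onsets

# Hypothetical filename for illustration:
# print(first_spike_block("lfo_bulk.s002.scalar.dat"))
```

Comparing the onset blocks would show whether the kinetic and pseudopotential spikes start on the same block as the local-energy crash or lead it.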

@ye-luo (Contributor) commented Nov 12, 2021

Could you rerun with develop-20211012? Known bugs have been fixed in develop since then, so the issue needs to be re-assessed.

@jtkrogel (Contributor, Author)

Do you have the old build as well to rerun with? With rare events, absence of evidence is not evidence of absence.

@jtkrogel (Contributor, Author)

Including lightweight inputs and outputs: in_out.zip

@ye-luo (Contributor) commented Nov 12, 2021

Unfortunately, the Summit OS upgrade over the summer requires rebuilding all the executables. The old ones no longer work.

@jtkrogel (Contributor, Author)

OK. I'm not sure what we will learn from a single rerun then.

@ye-luo (Contributor) commented Nov 12, 2021

Re-run to see whether the issue remains with the latest code. If we cannot reproduce the issue with the current code, we assume it is fixed. I think #3524 is the relevant fix: without it, the coordinates on the GPU were not updated properly and the T-moves were not effective.

@jtkrogel (Contributor, Author)

See my comment about evidence above. We will have to do a number of runs to have any confidence in a fix.
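
To make "a number of runs" concrete, a small sketch (an added aside, not from the thread) of the standard rule-of-three bound: if N independent reruns all complete cleanly, the per-run failure probability is below roughly 3/N at 95% confidence.

```python
def max_failure_rate(n_clean_runs, confidence=0.95):
    """Upper bound on the per-run failure probability after n clean runs.

    Solves (1 - p)**n = 1 - confidence, the exact form of the 'rule of three'
    (p is roughly 3/n at 95% confidence for large n).
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_clean_runs)

for n in (3, 10, 30):
    print(n, "clean runs -> failure rate below", round(max_failure_rate(n), 3))
```

So even ten clean reruns only bound the per-run crash probability to about 26%, which is the quantitative version of "absence of evidence is not evidence of absence" above.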

@ye-luo (Contributor) commented Nov 12, 2021

Even if I make a fix, as long as it changes the trajectory there is a chance that the fix is not a real fix but simply avoids the bad configuration. That is why we are moving toward unit tests as the way to confirm fixes.

@jtkrogel (Contributor, Author)

I will perform several reruns unless someone objects here about cost.

@prckent (Contributor) commented Nov 12, 2021

No objections.
