
Modify DMC branch cutoff and population control #242

Merged (11 commits) Jun 5, 2017

Conversation

@ye-luo (Contributor) commented May 31, 2017

All the short DMC tests pass except short-bccH_1x1x1_ae-dmc_sdj-1-16. See issue #241

A few significant changes:

  1. The branch cutoff is now based on the variance instead of the standard deviation, to be size-consistent. This change affects warmup; after warmup, the cutoff was ALREADY determined by the variance.
  2. Fuse the multiple definitions of the branch cutoff formula.
  3. Rename vParam[B_SIGMA] to vParam[B_SIGMA2] and assign the variance all the time.
  4. Fuse the multiple definitions of the min and max number of walkers per node.
  5. Use tight population control (E_trial with the 1/tau term) during DMC warmup to prevent population explosion or extinction.
  6. Use E_now instead of E_ref as the initial sample in the energy histogram.

Population control (E_trial) in DMC after warm up has NOT been changed.
Whether to use tight control will be decided with more study on the population bias.
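The size-consistency argument behind change 1 can be illustrated with a short sketch (hypothetical Python, not QMCPACK code): for N independent particles the local-energy variance grows as O(N), while the standard deviation grows only as O(sqrt(N)).

```python
import random

# Illustrative sketch (not QMCPACK code): the local-energy variance of a
# system of N independent particles scales as O(N), while its standard
# deviation (sigma) scales only as O(sqrt(N)).
def total_energy_samples(n_particles, n_samples=10000, seed=7):
    rng = random.Random(seed)
    # each particle contributes unit-variance noise to the total local energy
    return [sum(rng.gauss(0.0, 1.0) for _ in range(n_particles))
            for _ in range(n_samples)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

v_small = variance(total_energy_samples(8))    # ~8
v_large = variance(total_energy_samples(64))   # ~64: 8x larger (size-consistent)
# sigma grows only by ~sqrt(8) ~ 2.8x between the two system sizes
print(v_small, v_large, (v_large / v_small) ** 0.5)
```

A cutoff proportional to the variance therefore keeps the same relative effect as the system is scaled up, which is the sense in which item 1 calls it size-consistent.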

Clean up the inconsistent definition of B_SIGMA in SimpleFixedNodeBranch. Renamed to B_SIGMA2 instead.
Variance (sigma2) is size-consistent but standard deviation (sigma) is not.
The inconsistency caused a bug: the initial branch cutoff calculated from sigma severely limits the branching and biases the warmup.
After the warmup, the first data point put in the histogram was the histogram mean of the warmup part, which biases the statistics-collecting stage.
Instead, use the ensemble average of energy from the last iteration, E_now, as the first data point.
1. In E_trial, add back the 1/tau prefactor in the population control term, only during warmup. The population fluctuation is now much smaller. After warmup, keep the old scheme.
2. Modify doNotBranch to return the new population after the first iteration for adjusting E_trial, in order to prevent the initial population explosion in DMC.
The lines in WalkerControlFactory are replaced by calling setMinMax in WalkerControlBase.
Clean up WalkerControlFactory.
@qmc-robot

Can one of the maintainers verify this patch?

1 similar comment from @qmc-robot

@prckent (Contributor) commented Jun 1, 2017

ok to test

@prckent prckent removed the request for review from berrill June 1, 2017 06:23
@prckent (Contributor) commented Jun 1, 2017

This is a potentially significant change and improvement to the DMC algorithm.

  1. Have you run the long DMC tests with this patch?
  2. Verified that a large production run is unbiased with this change? e.g. One of the TiO2 ones. Please put the qmca output here.

@prckent prckent requested review from jtkrogel and lshulen June 1, 2017 06:26
@jtkrogel (Contributor) commented Jun 1, 2017

I am all for more daylight on the projector (particularly its behavior with varying N and/or Z). We particularly need detailed documentation on exactly what the core DMC algorithm is doing, including the "special sauce".

Testing in the large N limit is warranted; sadly it is in exactly this limit that we will have a hard time telling what is correct (the old version, the new version, or neither?). Since we are moving strongly in this direction (very large N) on ECP, it is probably time to have some strict correctness checks of the algorithm in this limit.

One possibility would be to solve the NI HEG Hamiltonian (KE only, exact nodes) using a bogus Jastrow factor and let DMC project it away. This could provide insight into the old vs new algorithm as a function of N (and also timestep).

@jtkrogel (Contributor) commented Jun 1, 2017

@ye-luo Could you explain a little more about the change to variance rather than standard deviation? The standard deviation is not size consistent (i.e. does not scale as N) as you say, but I thought the point of the branch cutoff was to filter out the troublesome rare event/heavy tails of the local energy distribution from causing instabilities in the branching weights.

From this point of view, it would make sense to have the cutoff be a meaningful representation of distance relative to the width of the local energy distribution. The standard deviation has this property (at least for non-heavy tailed distributions!) while the variance does not. I could see potentially replacing the standard deviation with a percentile based definition instead (avoids heavy tail problem), but I don't yet see how defining the local energy branch cutoff by the variance would give an improvement.

@ye-luo (Contributor, Author) commented Jun 1, 2017

Some measurements:

Current code with the bad cutoff scheme and loose population control. There is a kink after warmup due to a bad initial value added to the energy histogram.
[plot: start]

Old branch cutoff scheme, tight population control both during and after warmup. The kink is gone but the plateau in the warmup part is still apparent.
[plot: old]

New branch cutoff scheme, tight population control both during and after warmup. The plateau is gone.
[plot: new]

Note that in this experiment I used tight population control after warmup as well, but that part is not added in this commit. So you will still see a large population fluctuation, but a smooth energy, after warmup.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel Copied from my slides.
The current scheme used the standard deviation during warmup but the variance after warmup.
[slide]
New branch cutoff choice:
[slide]

The cutoff is there to safeguard the simulation while remaining effective. The 1/tau cutoff dominates when tau is large.

@jtkrogel (Contributor) commented Jun 1, 2017

I agree that pre- and post-warmup should use the same branch cutoff scheme (with the same N scaling). I just don't understand why it should be sigma^2 since the width of the local energy distribution scales like sqrt(N).

On the population control front, I wouldn't expect normal ("looser") control to be a problem so long as discontinuous changes from warmup to production are avoided. Do you have tests that use normal control throughout while including the other changes? I would be hesitant about tight control (1/(g*tau) with g=1 insists that the population return to the target each generation, on average) due to increased bias.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel I have updated the description. The branch cutoff scheme after warmup was the same; you have been using it for years. The problem is in the warmup, and it leads to more equilibration (a waste of time) after warmup.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel I had the same hesitation about tight population control because of the population bias. For this reason it is only turned on during warmup, not in the main stage, leaving more investigation for later.
Actually, I didn't see an energy difference between using normal and tight population control throughout. I prefer tight control during warmup because of a much more stable starting population; the other changes in this PR don't help with that. I noticed that when switching from VMC to DMC the population can vary drastically, often causing too many walkers (out of memory) or too few walkers (some nodes have no walker and die). Even if the population after warmup is not too bad, it takes a very long time for loose population control to recover the target population, causing large inefficiency: either a long time to solution or idle threads.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel I cannot explain why. I would naturally choose the standard deviation (sigma), like you. I think using sigma is fine after equilibration, because sigma is only valid for an equilibrated ensemble. On the other hand, I still have the 1/tau term, which is also not size-consistent, and I don't have a better solution now.

@jtkrogel (Contributor) commented Jun 1, 2017

@ye-luo So right now we take the min of two quantities, one O(N) and the other O(1), to bound a distribution that grows as O(sqrt(N)). I would think this would lead to either stability problems or bias in the large-N limit, which makes me more eager to try something like the HEG tests I mentioned above.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel I agree that there is clearly a question about the optimal scheme. In my particular test runs, the O(sqrt(N)) cutoff introduced a significant bias but the O(1) cutoff didn't, due to its large value. Do you think my change will cause something bad, or is it generally an improvement even though it doesn't solve the problem fully?

@jtkrogel (Contributor) commented Jun 1, 2017

@ye-luo I do not think your changes will affect things adversely, since, as I understand, they are limited mostly to issues of continuity from warmup to production and stability in warmup itself. The rest of my comments should probably be relegated to a discussion beyond this PR.

+1 on the changes provided all tests pass. In that light, can you post here the output from qmca -q ev -e XXX on the DMC files pre- and post-change?

Separately, is it worth making a github issue noting needed exploration of N-scaling behavior of the projector?

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel It is definitely worth creating a separate issue to discuss and track this topic.

Before change
NiO-fcc-S16-dmc series 2 -5941.024936 +/- 0.018722 209.987836 +/- 0.174539 0.0353
After change, even with tight pop control after warmup
NiO-fcc-S16-dmc series 2 -5941.016523 +/- 0.017340 210.205282 +/- 0.182502 0.0354

I couldn't run S32 long; it is too expensive. But I saw that the equilibration is much shorter after the change.

@jtkrogel (Contributor) commented Jun 1, 2017

S16 is the 64-atom cell, right? In that case there is no statistically significant change down to 1 mHa per NiO formula unit, which is good.

All ctests pass?

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel Yes. 64 atom cell.
All the tests pass except one. That test was too short; see issue #241. After making it 4x longer, it passes.

@jtkrogel (Contributor) commented Jun 1, 2017

+1 on the changes for me.

@lshulen (Contributor) commented Jun 1, 2017

I'm fine with the changes. It would be good to point out how they relate to the ideas in Zen et al., Phys. Rev. B 93, 241118(R).

@ye-luo (Contributor, Author) commented Jun 1, 2017

hold on

1) Fix an inconsistency in the branch cutoff in resetRun.
2) Adjust the cutoff definition back to min(targetSigma*sigma, 2.5/tau).
The old code used sqrt(sigma) assuming sigma is the variance, but the definition was switching back and forth between standard deviation (sigma) and variance.
@ye-luo (Contributor, Author) commented Jun 2, 2017

UPDATE:
I carefully read the code again and found that the warmup plateau was caused by a wrong initial branch cutoff from sqrt(vParam[B_SIGMA]).
The old code sometimes assigned the variance to vParam[B_SIGMA] and sometimes assigned sqrt(variance).
So after cleaning up and aligning all vParam[B_SIGMA] to vParam[B_SIGMA2]=variance, I decided to set the branch cutoff to min( max(sqrt(variance)*targetSigma, 50), 2.5/tau ).

I repeated the NiO calculation and got
NiO-fcc-S16-dmc series 2 -5941.034585 +/- 0.019889 209.422210 +/- 0.330107 0.0353
After warmup, the cutoff is 144, which is smaller than the 250 (2.5/0.01) in the old code.
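The cutoff just described can be written as a one-line sketch (hypothetical Python; the names are illustrative, not the actual QMCPACK identifiers):

```python
# Sketch of the cutoff above: min( max(sqrt(variance)*targetSigma, 50), 2.5/tau )
def branch_cutoff(variance, target_sigma=10.0, tau=0.01):
    # the lower bound of 50 protects small / low-variance systems;
    # the 2.5/tau term caps the cutoff for large time steps
    return min(max(variance ** 0.5 * target_sigma, 50.0), 2.5 / tau)

# NiO-fcc-S16 numbers from this comment: variance ~ 209.4, tau = 0.01
# sqrt(209.4)*10 ~ 144.7, below the 2.5/0.01 = 250 cap of the old code
print(branch_cutoff(209.4))
```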

@prckent (Contributor) commented Jun 2, 2017

@lshulen We need a switch between the different schemes.
@ye-luo I suggest updating the manual to document this scheme.

@prckent prckent changed the title from "Branch cutoff" to "Modify DMC branch cutoff and population control" Jun 2, 2017
The scheme is cutoff=sqrt(variance)*min(targetSigma, sqrt(1.0/tau))
@ye-luo (Contributor, Author) commented Jun 2, 2017

@lshulen @jtkrogel I read the paper and think they are going in the right direction but not using the optimal scheme.

The paper suggests alpha*sqrt(Nelec/tau) as a cutoff. Based on my estimation, it is too small for the warmup but may be fine for the main stage. alpha=0.2 is empirical, and Nelec is not a good reference quantity.

I think mine is better. The equation is
cutoff = sqrt(variance)*min(targetSigma, sqrt(1.0/tau))
targetSigma has a default value of 10.0, i.e. a cutoff applied beyond 10 sigma.
We know that when we scale up a system the variance is proportional to the energy (and also to Nelec), so it captures the feature from that paper. Since the energy is more native to the calculation and can distinguish the different scales of all-electron and pseudopotential calculations, it is a better order parameter.

With tau=0.1 you have a control of sqrt(10) sigma. With tau=0.01, you get a 10 sigma bound.
When tau<0.01, it remains 10 sigma, as targetSigma=10 specifies.
That paper shows a smaller impact from the time step. From what I understood, this is the time-step bias from the branching filter, not from the integral. So in my case, if M sigma is good enough, you should just choose tau smaller than (1/M)^2 to avoid branch cutoff bias. Very simple to calculate.
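The scheme in this comment can be sketched as follows (hypothetical Python, illustrative names):

```python
# cutoff = sqrt(variance) * min(targetSigma, sqrt(1/tau))
def branch_cutoff(variance, tau, target_sigma=10.0):
    return variance ** 0.5 * min(target_sigma, (1.0 / tau) ** 0.5)

# per-sigma bound as a function of tau (unit variance isolates the prefactor):
# tau = 0.1   -> sqrt(1/0.1) ~ 3.16 sigma (the sqrt(10) control above)
# tau = 0.01  -> exactly the 10 sigma targetSigma bound
# tau = 0.001 -> still 10 sigma; targetSigma wins once tau < (1/targetSigma)^2
for tau in (0.1, 0.01, 0.001):
    print(tau, branch_cutoff(1.0, tau))
```

This also makes the rule of thumb at the end of the comment explicit: to keep an M sigma cutoff unclipped, choose tau below (1/M)^2.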

@prckent (Contributor) commented Jun 2, 2017

@lshulen , @jtkrogel I think that @ye-luo 's discussion points should be a topic of a research paper and not a patch to our production code. I am not keen to have an unpublished algorithm in the production code, but we do need to solve the equilibration issue. What do you prefer to do in the interim?

@ye-luo Can we split these issues?

@prckent (Contributor) commented Jun 2, 2017

I am also not keen that we cannot run the old algorithm after this patch.

I think that this algorithmic change will have to be switchable.

@lshulen (Contributor) commented Jun 2, 2017 via email

@ye-luo (Contributor, Author) commented Jun 2, 2017

@prckent I prefer to have the old default scheme (some fixes are necessary), Andrea's, and mine switchable, so we can investigate more. I have too many code branches to maintain; better to just get this in.

@jtkrogel (Contributor) commented Jun 2, 2017

I am very much in favor of switchability with the current branch cutoff scheme as default (production part, I understand the warmup part is being modified here to match production).

I think switching the default requires broader research, perhaps rising to the level of a paper. We could pool ideas and pursue this if there is interest.

@prckent (Contributor) commented Jun 2, 2017

Decision: make this switchable to preserve the "classic" QMCPACK algorithm for DMC. Add new schemes as non-default options, whether research or published (Zen/Alfe). The algorithm switch can encode both population control and Green's-function related changes.

@ye-luo I suggest you update this PR to keep the changes consistent with this discussion.

BranchingCutoffScheme = classic (default), UNR, ZSGMA, YL
@ye-luo (Contributor, Author) commented Jun 2, 2017

@prckent I added a switch allowing changing of the branching cutoff scheme. The default is classic. No manual update at this point; it should be added when we have more experience with the schemes.
Population control is not switchable. The warmup algorithm is changed but the post-warmup part is not; its behaviour can be controlled by the already existing feedback option.

@prckent (Contributor) commented Jun 4, 2017

Ye - After all these updates, can you please clearly summarize what has or has not been changed?

  1. If no options are set, what is done in both warmup and production? From the comments above, changes in the equilibration protocol are OK, but the production algorithm should not be changed.
  2. If any of the new BranchingCutoffScheme parameters ("classic","YL","ZSGMA","UNR") to DMC are set, what is done in equilibration and what in production?

Please also add a comment to each section of setBranchCutoff documenting where the formulae are from - UNR paper equation #, you, classic algorithm implemented in QMCPACK, ...

I am prepared to merge these changes as a development feature provided no one objects on Monday. Research is needed in these different settings, and we need a Zen/Alfe implementation regardless.

@ye-luo (Contributor, Author) commented Jun 5, 2017

Final changes:

  1. Use E_now instead of E_ref as the initial sample in the energy histogram after warmup. E_ref measured during warmup introduces a bias.
  2. Fuse the multiple definitions of the min and max number of walkers per node. Use a single function.
  3. The old code sometimes assigned the variance (most of the time) and sometimes sqrt(variance) to vParam[B_SIGMA]. Now all vParam[B_SIGMA] are replaced by vParam[B_SIGMA2]=variance.
  4. Due to 3), the formula to compute the branch cutoff was not clearly defined. Now the code consistently uses one formula before and after warmup.
  5. More branch cutoff formulas are added, accessed via the BranchingCutoffScheme ("classic","YL","ZSGMA","UNR") parameter. The default "classic" remains the old production-stage formula. Using this option, you might see a larger cutoff during warmup, but the main-stage cutoff should remain the same as before.
  6. Change the formula for the trial energy during warmup
    from: E_trial=E_now+feedback*(log(pop_target/pop_now))
    to: E_trial=E_now+1/tau*(log(pop_target/pop_now))
    Warmup now uses tight population control to prevent population explosion or extinction.
    The main-stage formula remains untouched as
    E_trial=E_now+feedback*(log(pop_target/pop_now))

In summary, using the new code with an input file from an old DMC calculation, you will see:
During warmup, the branch cutoff is equal or larger. Equilibration goes faster if the branch cutoff is larger and your system is large. The population fluctuates much less.
After warmup, the branch cutoff remains the same as in the old code and the population fluctuates the same as in the old code.
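The trial-energy update in item 6 can be sketched as follows (hypothetical Python; the names are illustrative, not the actual QMCPACK identifiers):

```python
import math

# Item 6 above: during warmup the feedback prefactor is replaced by 1/tau,
# which pulls the population back toward its target much more aggressively.
def e_trial(e_now, pop_target, pop_now, tau, feedback, warmup):
    prefactor = 1.0 / tau if warmup else feedback
    return e_now + prefactor * math.log(pop_target / pop_now)

# With tau = 0.01 the warmup prefactor is 100: a ~7% population overshoot
# shifts E_trial by ~7 Ha, versus ~0.07 Ha with a feedback of 1.0.
warm = e_trial(-5941.0, 1024, 1100, tau=0.01, feedback=1.0, warmup=True)
prod = e_trial(-5941.0, 1024, 1100, tau=0.01, feedback=1.0, warmup=False)
print(warm, prod)
```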

@prckent (Contributor) commented Jun 5, 2017

OK, I will summarize this in the release changelog as

  • Updated equilibration algorithm for DMC with tighter population control
  • Updated initial energy at start of production DMC to reduce bias
  • BranchingCutoffScheme flag added to DMC for experimental changes of branching algorithm
  • No changes to final values of DMC observables expected, but statistics should be improved for large runs.

@prckent prckent self-requested a review June 5, 2017 16:44
@prckent (Contributor) left a review comment

+1

@prckent prckent merged commit f374e18 into QMCPACK:develop Jun 5, 2017
@jtkrogel (Contributor) commented Jun 5, 2017

I know this is closed now, but I missed the flag name. Camel case primarily appeared in QMCPACK's input in its early history (e.g. timeStep) and was largely replaced in favor of flat text (e.g. timestep) later on, as it is more forgiving for users (closer to natural language than the camel-case C++ convention).

@ye-luo Can we continue this trend here in the interest of avoiding user input errors, e.g. use "branch_cutoff" instead?

@prckent (Contributor) commented Jun 5, 2017

OK, agree. Ye, can you make a new PR? Let's use branch_cutoff.

@ye-luo (Contributor, Author) commented Jun 5, 2017

Sure. I will do branching_cutoff_scheme which matches the internal variable.
branch_cutoff sounds like setting the cutoff.
