
Modify DMC branch cutoff and population control #242

Merged (11 commits) Jun 5, 2017

Conversation

@ye-luo (Contributor) commented May 31, 2017

All the short DMC tests pass except short-bccH_1x1x1_ae-dmc_sdj-1-16. See issue #241

A few significant changes:

  1. The branch cutoff is now based on the variance instead of the standard deviation, to be size-consistent. This change affects warmup; after warmup, the cutoff was ALREADY determined by the variance.
  2. Fuse the multiple definitions of the branch cutoff formula.
  3. Rename vParam[B_SIGMA] to vParam[B_SIGMA2] and assign the variance all the time.
  4. Fuse the multiple definitions of the min and max number of walkers per node.
  5. Use tight population control (E_trial with the 1/tau term) during DMC warmup to prevent population explosion or extinction.
  6. Use E_now instead of E_ref as the initial sample in the energy histogram.

Population control (E_trial) in DMC after warm up has NOT been changed.
Whether to use tight control will be decided with more study on the population bias.
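The size-consistency argument behind change 1 can be illustrated with a short sketch (hypothetical Python, not QMCPACK code): for N independent particles the local-energy variance grows as O(N), while the standard deviation grows only as O(sqrt(N)).

```python
import random

# Illustrative sketch (not QMCPACK code): the local-energy variance of a
# system of N independent particles scales as O(N), while its standard
# deviation (sigma) scales only as O(sqrt(N)).
def total_energy_samples(n_particles, n_samples=10000, seed=7):
    rng = random.Random(seed)
    # each particle contributes unit-variance noise to the total local energy
    return [sum(rng.gauss(0.0, 1.0) for _ in range(n_particles))
            for _ in range(n_samples)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

v_small = variance(total_energy_samples(8))    # ~8
v_large = variance(total_energy_samples(64))   # ~64: 8x larger (size-consistent)
# sigma grows only by ~sqrt(8) ~ 2.8x between the two system sizes
print(v_small, v_large, (v_large / v_small) ** 0.5)
```

A cutoff proportional to the variance therefore keeps the same relative effect as the system is scaled up, which is the sense in which item 1 calls it size-consistent.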

Clean up the inconsistent definition of B_SIGMA in SimpleFixedNodeBranch. Renamed to B_SIGMA2 instead.
Variance (sigma2) is size-consistent but standard deviation (sigma) is not.
The inconsistency caused a bug: the initial branch cutoff calculated from sigma severely limits the branching and biases the warmup.
After the warmup, the first data point put in the histogram was the histogram mean of the warmup part, which biases the statistics-collecting stage.
Instead, use the ensemble average of energy from the last iteration, E_now, as the first data point.
1. In E_trial, add back the 1/tau prefactor in the population control term, only during warmup. The population fluctuation is now much smaller. After warmup, keep the old scheme.
2. Modify doNotBranch to return the new population after the first iteration for adjusting E_trial, in order to prevent the initial population explosion in DMC.
The lines in WalkerControlFactory are replaced by calling setMinMax in WalkerControlBase.
Clean up WalkerControlFactory.
@qmc-robot

Can one of the maintainers verify this patch?

1 similar comment from @qmc-robot

@prckent (Contributor) commented Jun 1, 2017

ok to test

@prckent prckent removed the request for review from berrill June 1, 2017 06:23
@prckent (Contributor) commented Jun 1, 2017

This is a potentially significant change and improvement to the DMC algorithm.

  1. Have you run the long DMC tests with this patch?
  2. Verified that a large production run is unbiased with this change? e.g. One of the TiO2 ones. Please put the qmca output here.

@prckent prckent requested review from jtkrogel and lshulen June 1, 2017 06:26
@jtkrogel (Contributor) commented Jun 1, 2017

I am all for more daylight on the projector (particularly its behavior with varying N and/or Z). We particularly need detailed documentation on exactly what the core DMC algorithm is doing, including the "special sauce".

Testing in the large N limit is warranted; sadly it is in exactly this limit that we will have a hard time telling what is correct (the old version, the new version, or neither?). Since we are moving strongly in this direction (very large N) on ECP, it is probably time to have some strict correctness checks of the algorithm in this limit.

One possibility would be to solve the NI HEG Hamiltonian (KE only, exact nodes) using a bogus Jastrow factor and let DMC project it away. This could provide insight into the old vs new algorithm as a function of N (and also timestep).

@jtkrogel (Contributor) commented Jun 1, 2017

@ye-luo Could you explain a little more about the change to variance rather than standard deviation? The standard deviation is not size consistent (i.e. does not scale as N) as you say, but I thought the point of the branch cutoff was to filter out the troublesome rare event/heavy tails of the local energy distribution from causing instabilities in the branching weights.

From this point of view, it would make sense to have the cutoff be a meaningful representation of distance relative to the width of the local energy distribution. The standard deviation has this property (at least for non-heavy tailed distributions!) while the variance does not. I could see potentially replacing the standard deviation with a percentile based definition instead (avoids heavy tail problem), but I don't yet see how defining the local energy branch cutoff by the variance would give an improvement.

@ye-luo (Contributor, Author) commented Jun 1, 2017

Some measurements:

Current code with the bad cutoff scheme and loose population control. There is a kink after warmup due to a bad initial value added to the energy histogram.
[plot: start]

Old branch cutoff scheme, tight population control both during and after warmup. The kink is gone but the plateau in the warmup part is still apparent.
[plot: old]

New branch cutoff scheme, tight population control both during and after warmup. The plateau is gone.
[plot: new]

Note that in this experiment I used tight population control after warmup as well, but that part is not added in this commit. So you will still see a large population fluctuation, but a smooth energy, after warmup.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel Copied from my slides.
The current scheme used the standard deviation during warmup but the variance after warmup.
[slide]
New branch cutoff choice:
[slide]

The cutoff is there to safeguard the simulation while remaining effective. The 1/tau cutoff dominates when tau is large.

@jtkrogel (Contributor) commented Jun 1, 2017

I agree that pre- and post-warmup should use the same branch cutoff scheme (with the same N scaling). I just don't understand why it should be sigma^2 since the width of the local energy distribution scales like sqrt(N).

On the population control front, I wouldn't expect normal ("looser") control to be a problem so long as discontinuous changes from warmup to production are avoided. Do you have tests that use normal control throughout while including the other changes? I would be hesitant about tight control (1/(g*tau) with g=1 insists that the population return to the target each generation, on average) due to increased bias.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel I have updated the description. The branch cutoff scheme after warmup was the same; you have been using it for years. The problem is in the warmup, and it leads to more equilibration (a waste of time) after warmup.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel I had the same hesitation about tight population control because of the population bias. For this reason it is only turned on during warmup, not in the main stage, leaving more investigation for later.
Actually, I didn't see an energy difference between using normal and tight population control throughout. I prefer tight control during warmup because of a much more stable starting population; the other changes in this PR don't help with that. I noticed that when switching from VMC to DMC the population can vary drastically, often causing too many walkers (out of memory) or too few walkers (some nodes have no walker and die). Even if the population after warmup is not too bad, it takes a very long time for loose population control to recover the target population, causing large inefficiency: either a long time to solution or idle threads.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel I cannot explain why. I would naturally choose the standard deviation (sigma), like you. I think using sigma is fine after equilibration, because sigma is only valid for an equilibrated ensemble. On the other hand, I still have the 1/tau term, which is also not size-consistent, and I don't have a better solution now.

@jtkrogel (Contributor) commented Jun 1, 2017

@ye-luo So right now we take the min of two quantities, one O(N) and the other O(1), to bound a distribution that grows as O(sqrt(N)). I would think this would lead to either stability problems or bias in the large-N limit, which makes me more eager to try something like the HEG tests I mentioned above.

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel I agree that there is clearly a question about the optimal scheme. In my particular test runs, the O(sqrt(N)) cutoff introduced a significant bias but the O(1) cutoff didn't, due to its large value. Do you think my change will cause something bad, or is it generally an improvement even though it doesn't solve the problem fully?

@jtkrogel (Contributor) commented Jun 1, 2017

@ye-luo I do not think your changes will affect things adversely, since, as I understand, they are limited mostly to issues of continuity from warmup to production and stability in warmup itself. The rest of my comments should probably be relegated to a discussion beyond this PR.

+1 on the changes provided all tests pass. In that light, can you post here the output from qmca -q ev -e XXX on the DMC files pre- and post-change?

Separately, is it worth making a github issue noting needed exploration of N-scaling behavior of the projector?

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel It is definitely worth creating a separate issue to discuss and track this topic.

Before change
NiO-fcc-S16-dmc series 2 -5941.024936 +/- 0.018722 209.987836 +/- 0.174539 0.0353
After change, even with tight pop control after warmup
NiO-fcc-S16-dmc series 2 -5941.016523 +/- 0.017340 210.205282 +/- 0.182502 0.0354

I couldn't run S32 long; it is too expensive. But I saw that the equilibration is much shorter after the change.

@jtkrogel (Contributor) commented Jun 1, 2017

S16 is the 64-atom cell, right? In that case there is no statistically significant change down to 1 mHa per NiO formula unit, which is good.

All ctests pass?

@ye-luo (Contributor, Author) commented Jun 1, 2017

@jtkrogel Yes. 64 atom cell.
All the tests pass except one. That test was too short; see issue #241. After making it 4x longer, it passes.

@jtkrogel (Contributor) commented Jun 1, 2017

+1 on the changes for me.

@lshulen (Contributor) commented Jun 1, 2017

I'm fine with the changes. It would be good to point out how they relate to the ideas in Zen et al., Phys. Rev. B 93, 241118(R).

@ye-luo (Contributor, Author) commented Jun 1, 2017

hold on

1) Fix an inconsistency in the branch cutoff in resetRun.
2) Adjust the cutoff definition back to min(targetSigma*sigma, 2.5/tau).
The old code used sqrt(sigma) assuming sigma is the variance, but the definition was switching back and forth between standard deviation (sigma) and variance.
@ye-luo (Contributor, Author) commented Jun 2, 2017

UPDATE:
I carefully read the code again and found that the warmup plateau was caused by a wrong initial branch cutoff from sqrt(vParam[B_SIGMA]).
The old code sometimes assigned the variance to vParam[B_SIGMA] and sometimes assigned sqrt(variance).
So after cleaning up and aligning all vParam[B_SIGMA] to vParam[B_SIGMA2]=variance, I decided to set the branch cutoff to min( max(sqrt(variance)*targetSigma, 50), 2.5/tau ).

I repeated the NiO calculation and got
NiO-fcc-S16-dmc series 2 -5941.034585 +/- 0.019889 209.422210 +/- 0.330107 0.0353
After warmup, the cutoff is 144, which is smaller than the 250 (2.5/0.01) in the old code.
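The cutoff just described can be written as a one-line sketch (hypothetical Python; the names are illustrative, not the actual QMCPACK identifiers):

```python
# Sketch of the cutoff above: min( max(sqrt(variance)*targetSigma, 50), 2.5/tau )
def branch_cutoff(variance, target_sigma=10.0, tau=0.01):
    # the lower bound of 50 protects small / low-variance systems;
    # the 2.5/tau term caps the cutoff for large time steps
    return min(max(variance ** 0.5 * target_sigma, 50.0), 2.5 / tau)

# NiO-fcc-S16 numbers from this comment: variance ~ 209.4, tau = 0.01
# sqrt(209.4)*10 ~ 144.7, below the 2.5/0.01 = 250 cap of the old code
print(branch_cutoff(209.4))
```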

@prckent (Contributor) commented Jun 2, 2017

@lshulen We need a switch between the different schemes.
@ye-luo I suggest updating the manual to document this scheme.

@prckent prckent changed the title from "Branch cutoff" to "Modify DMC branch cutoff and population control" Jun 2, 2017
The scheme is cutoff=sqrt(variance)*min(targetSigma, sqrt(1.0/tau))
@ye-luo (Contributor, Author) commented Jun 2, 2017

@lshulen @jtkrogel I read the paper and think they are going in the right direction but not using the optimal scheme.

The paper suggests alpha*sqrt(Nelec/tau) as a cutoff. Based on my estimation, it is too small for the warmup but may be fine for the main stage. alpha=0.2 is empirical, and Nelec is not a good reference quantity.

I think mine is better. The equation is
cutoff = sqrt(variance)*min(targetSigma, sqrt(1.0/tau))
targetSigma has a default value of 10.0, i.e. a cutoff applied beyond 10 sigma.
We know that when we scale up a system the variance is proportional to the energy (and also to Nelec), so it captures the feature from that paper. Since the energy is more native to the calculation and can distinguish the different scales of all-electron and pseudopotential calculations, it is a better order parameter.

With tau=0.1 you have a control of sqrt(10) sigma. With tau=0.01, you get a 10 sigma bound.
When tau<0.01, it remains 10 sigma, as targetSigma=10 specifies.
That paper shows a smaller impact from the time step. From what I understood, this is the time-step bias from the branching filter, not from the integral. So in my case, if M sigma is good enough, you should just choose tau smaller than (1/M)^2 to avoid branch cutoff bias. Very simple to calculate.
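The scheme in this comment can be sketched as follows (hypothetical Python, illustrative names):

```python
# cutoff = sqrt(variance) * min(targetSigma, sqrt(1/tau))
def branch_cutoff(variance, tau, target_sigma=10.0):
    return variance ** 0.5 * min(target_sigma, (1.0 / tau) ** 0.5)

# per-sigma bound as a function of tau (unit variance isolates the prefactor):
# tau = 0.1   -> sqrt(1/0.1) ~ 3.16 sigma (the sqrt(10) control above)
# tau = 0.01  -> exactly the 10 sigma targetSigma bound
# tau = 0.001 -> still 10 sigma; targetSigma wins once tau < (1/targetSigma)^2
for tau in (0.1, 0.01, 0.001):
    print(tau, branch_cutoff(1.0, tau))
```

This also makes the rule of thumb at the end of the comment explicit: to keep an M sigma cutoff unclipped, choose tau below (1/M)^2.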

@prckent (Contributor) commented Jun 2, 2017

@lshulen , @jtkrogel I think that @ye-luo 's discussion points should be a topic of a research paper and not a patch to our production code. I am not keen to have an unpublished algorithm in the production code, but we do need to solve the equilibration issue. What do you prefer to do in the interim?

@ye-luo Can we split these issues?

@prckent (Contributor) commented Jun 2, 2017

I am also not keen that we cannot run the old algorithm after this patch.

I think that this algorithmic change will have to be switchable.

@lshulen (Contributor) commented Jun 2, 2017 via email

@ye-luo (Contributor, Author) commented Jun 2, 2017

@prckent I prefer to have the old default scheme (some fixes are necessary), Andrea's, and mine switchable, so we can investigate more. I have too many code branches to maintain; better to just get this in.

@jtkrogel (Contributor) commented Jun 2, 2017

I am very much in favor of switchability with the current branch cutoff scheme as default (production part, I understand the warmup part is being modified here to match production).

I think switching the default requires broader research, perhaps rising to the level of a paper. We could pool ideas and pursue this if there is interest.

@prckent (Contributor) commented Jun 2, 2017

Decision: make this switchable to preserve the "classic" QMCPACK algorithm for DMC. Add new schemes as non-default options, whether research or published (Zen/Alfe). The algorithm switch can encode both population control and Green's-function related changes.

@ye-luo I suggest you update this PR to keep the changes consistent with this discussion.

BranchingCutoffScheme = classic (default), UNR, ZSGMA, YL
@ye-luo (Contributor, Author) commented Jun 2, 2017

@prckent I added a switch allowing changing of the branching cutoff scheme. The default is classic. No manual update at this point; it should be added when we have more experience with the schemes.
Population control is not switchable. The warmup algorithm is changed but the post-warmup part is not; its behaviour can be controlled by the already existing feedback option.

@prckent (Contributor) commented Jun 4, 2017

Ye - After all these updates, can you please clearly summarize what has or has not been changed?

  1. If no options are set, what is done in both warmup and production? From the comments above, changes in the equilibration protocol are OK, but the production algorithm should not be changed.
  2. If any of the new BranchingCutoffScheme parameters ("classic","YL","ZSGMA","UNR") to DMC are set, what is done in equilibration and what in production?

Please also add a comment to each section of setBranchCutoff documenting where the formulae are from - UNR paper equation #, you, classic algorithm implemented in QMCPACK, ...

I am prepared to merge these changes as a development feature provided no one objects on Monday. Research is needed in these different settings, and we need a Zen/Alfe implementation regardless.

@ye-luo (Contributor, Author) commented Jun 5, 2017

Final changes:

  1. Use E_now instead of E_ref as the initial sample in the energy histogram after warmup. E_ref measured during warmup introduces a bias.
  2. Fuse the multiple definitions of the min and max number of walkers per node. Use a single function.
  3. The old code sometimes assigned the variance (most of the time) and sometimes sqrt(variance) to vParam[B_SIGMA]. Now all vParam[B_SIGMA] are replaced by vParam[B_SIGMA2]=variance.
  4. Due to 3), the formula to compute the branch cutoff was not clearly defined. Now the code consistently uses one formula before and after warmup.
  5. More branch cutoff formulas are added, accessed via the BranchingCutoffScheme ("classic","YL","ZSGMA","UNR") parameter. The default "classic" remains the old production-stage formula. Using this option, you might see a larger cutoff during warmup, but the main-stage cutoff should remain the same as before.
  6. Change the formula for the trial energy during warmup
    from: E_trial=E_now+feedback*(log(pop_target/pop_now))
    to: E_trial=E_now+1/tau*(log(pop_target/pop_now))
    Warmup now uses tight population control to prevent population explosion or extinction.
    The main-stage formula remains untouched as
    E_trial=E_now+feedback*(log(pop_target/pop_now))

In summary, using the new code with an input file from an old DMC calculation, you will see:
During warmup, the branch cutoff is equal or larger. Equilibration goes faster if the branch cutoff is larger and your system is large. The population fluctuates much less.
After warmup, the branch cutoff remains the same as in the old code and the population fluctuates the same as in the old code.
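The trial-energy update in item 6 can be sketched as follows (hypothetical Python; the names are illustrative, not the actual QMCPACK identifiers):

```python
import math

# Item 6 above: during warmup the feedback prefactor is replaced by 1/tau,
# which pulls the population back toward its target much more aggressively.
def e_trial(e_now, pop_target, pop_now, tau, feedback, warmup):
    prefactor = 1.0 / tau if warmup else feedback
    return e_now + prefactor * math.log(pop_target / pop_now)

# With tau = 0.01 the warmup prefactor is 100: a ~7% population overshoot
# shifts E_trial by ~7 Ha, versus ~0.07 Ha with a feedback of 1.0.
warm = e_trial(-5941.0, 1024, 1100, tau=0.01, feedback=1.0, warmup=True)
prod = e_trial(-5941.0, 1024, 1100, tau=0.01, feedback=1.0, warmup=False)
print(warm, prod)
```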

@prckent (Contributor) commented Jun 5, 2017

OK, I will summarize this in the release changelog as

  • Updated equilibration algorithm for DMC with tighter population control
  • Updated initial energy at start of production DMC to reduce bias
  • BranchingCutoffScheme flag added to DMC for experimental changes of branching algorithm
  • No changes to final values of DMC observables expected, but statistics should be improved for large runs.

@prckent prckent self-requested a review June 5, 2017 16:44
@prckent (Contributor) left a review comment

+1

@prckent prckent merged commit f374e18 into QMCPACK:develop Jun 5, 2017
@jtkrogel (Contributor) commented Jun 5, 2017

I know this is closed now, but I missed the flag name. Camel case primarily appeared in QMCPACK's input in its early history (e.g. timeStep) and was largely replaced in favor of flat text (e.g. timestep) later on, as it is more forgiving for users (closer to natural language than the camel-case C++ convention).

@ye-luo Can we continue this trend here in the interest of avoiding user input errors, e.g. use "branch_cutoff" instead?

@prckent (Contributor) commented Jun 5, 2017

OK, agree. Ye, can you make a new PR? Let's use branch_cutoff.

@ye-luo (Contributor, Author) commented Jun 5, 2017

Sure. I will do branching_cutoff_scheme which matches the internal variable.
branch_cutoff sounds like setting the cutoff.
