
Possible bug in DMC of the complex version on GPUs #219

Closed
maxamsler opened this issue May 20, 2017 · 39 comments

@maxamsler

maxamsler commented May 20, 2017

I believe I have found a bug in QMCPACK when using the complex version compiled for GPUs. The energies obtained from a DMC simulation of a periodic system with twists do not agree with the values obtained from the CPU version. The issue arises with both the double and mixed precision versions of the GPU code. I have attached the Nexus file for a test case of a Si64 supercell. The twist-averaged DMC energy for the CPU version is:

qmca -a -q ev -e 50 silicon/dmccpu_64_444/*001.scalar.dat
LocalEnergy Variance ratio
avg series 1 -252.106622 +/- 0.000573 1.820444 +/- 0.000643 0.0072

This energy value is significantly lower than the value obtained from the GPU version:
qmca -a -q ev -e 50 silicon/dmcgpu_64_444/*001.scalar.dat
LocalEnergy Variance ratio
avg series 1 -252.084855 +/- 0.000572 1.792653 +/- 0.000634 0.0071

The bug does NOT seem to affect VMC calculations, since the CPU and GPU versions give results within statistical error of each other.

The calculations were performed on Daint at the Swiss National Supercomputing Centre CSCS, a Cray XC50 with an NVIDIA Tesla P100 and 16 GB per node. The following modules were loaded to compile QMCPACK:
module switch PrgEnv-cray/6.0.3 PrgEnv-intel
module switch intel/17.0.1.132 intel/16.0.3.210
module load cray-hdf5
module load daint-gpu
module load cudatoolkit

I configured the build as follows:
cmake -DCMAKE_C_COMPILER=cc \
      -DCMAKE_CXX_COMPILER=CC \
      -DQE_BIN=/project/s700/QMCPack/CPU/qmcpack/external_codes/quantum_espresso/espresso-5.3.0/bin \
      -DQMC_COMPLEX=1 \
      -DLibxml2_INCLUDE_DIRS=/project/s700/local/include/libxml2/ \
      -DLIBXML2_LIBRARIES=/project/s700/local/ \
      -DBOOST_INCLUDEDIR=/project/s700/boost_1_61_0/ \
      -DQMC_INCLUDE="-I/project/s700/local/include/libxml2" \
      -DQMC_EXTRA_LIBS="/project/s700/local/lib/libxml2.a /project/s700/local/lib/liblzma.a" \
      -DQMC_CUDA=1 \
      -DQMC_MIXED_PRECISION=0

Si64.zip

@prckent
Contributor

prckent commented May 24, 2017

We are currently trying to reproduce this with smaller runs. As a first step, #224 adds DMC tests for LiH at several twists/k-points with both locality and T-moves. We will run these nightly from now on, enabling comparisons between the CPU and GPU implementations, observation of any trends in reliability, etc.

@prckent
Contributor

prckent commented May 24, 2017

The LiH tests pass on GPU (Kepler with a Xeon host) with both locality and T-moves and the standard mixed precision settings, indicating that this is a more subtle problem than "just" the DMC driver being broken.

@maxamsler
Author

maxamsler commented May 24, 2017

I also observed the same discrepancy in simple cubic phosphorus with 8 atoms, which is a smaller test case. If you want I can provide the reference data and input files.

@prckent
Copy link
Contributor

prckent commented May 24, 2017

Yes, that would help.

@maxamsler
Author

I have attached the Nexus workflow together with the ".dat" files from the QMC calculations. The plot shows the VMC/DMC energies as a function of the amount of data excluded for equilibration, for the CPU and GPU versions of the code. The data for the plot and the gnuplot script are also attached in a separate tar file.
plot.tar.gz
P_SC.tar.gz

(plot attached)

@atillack

Hello Max,

I've been going over your data, and the one commonality that also leads to an energy offset when I use our smaller LiH DMC tests is the use of MPC in the Hamiltonian. Could you retest one set (CPU and GPU) with MPC turned off in your configuration files, to see if that fixes it?

Thanks,
Andreas

@ye-luo
Contributor

ye-luo commented May 25, 2017

@atillack the energy value (local energy) reported by qmca doesn't include MPC correction. MPC correction is only an auxiliary Hamiltonian and doesn't affect sampling.

@prckent prckent added this to the v3.1.0 Release milestone May 26, 2017
@atillack

@ye-luo You're right. I reran the LiH tests with fixed random numbers, and what I saw earlier (a difference with MPC on, none with it off) went away. Whether MPC is used or not did not make a difference.

At least for the LiH DMC test, CPU and GPU complex runs agree within their error bars.

@maxamsler
Author

OK. Has anyone tried to run the P-SC structure? It should run very quickly.

@atillack

@maxamsler I would like to. Could you please post a wave function and configuration file to save me the time of compiling Quantum Espresso (it is proving hard to compile on the P100 machine I am using right now)?

@atillack

atillack commented Jun 2, 2017

@maxamsler The Nexus workflow you provided in your P_SC tarball does not work for me and fails with the following error:

Crystal error: the variable constants must be provided exiting.

Could you please provide the wave function and configuration file?

@jtkrogel
Contributor

jtkrogel commented Jun 5, 2017

The problem with the workflow is a missing POSCAR file for the atomic structure. Perhaps Max (@maxamsler) can add it here.

@prckent
Contributor

prckent commented Jun 5, 2017

@jtkrogel Would be good for NEXUS to catch this.
@maxamsler We need the structure.

@maxamsler
Author

maxamsler commented Jun 6, 2017

Sorry, I forgot to attach the POSCAR.

POSCAR.tar.gz
Here it is! Let me know if you need anything else.

@atillack

atillack commented Jun 6, 2017

@maxamsler Thank you.

@jtkrogel
Contributor

jtkrogel commented Jun 6, 2017

@maxamsler I've tried to reproduce the initial parts of your workflow (orbital generation, Jastrow optimization).

I get the following for the scf total energy. Can you confirm this is what you have?
psi1>grep '! ' scf.out
! total energy = -104.92442028 Ry

For optimization, I get the following:
psi1>qmca -q ev *scalar*
LocalEnergy Variance ratio
opt series 0 -52.974221 +/- 0.011113 0.795427 +/- 0.017385 0.0150
opt series 1 -53.058415 +/- 0.009073 0.623527 +/- 0.013509 0.0118
opt series 2 -53.081506 +/- 0.012316 0.601191 +/- 0.015149 0.0113
opt series 3 -53.066947 +/- 0.008743 0.660168 +/- 0.010379 0.0124
opt series 4 -53.077711 +/- 0.010984 0.633892 +/- 0.011716 0.0119
opt series 5 -53.075013 +/- 0.007704 0.637929 +/- 0.009462 0.0120
opt series 6 -53.068671 +/- 0.005155 0.636694 +/- 0.009226 0.0120
opt series 7 -53.085528 +/- 0.006919 0.623939 +/- 0.008297 0.0118

The supercell twist used in the optimization was:
Using supercell twist 0: [ -0.12500 -0.12500 -0.12500]

We can proceed with the Jastrow I produced, but it would be better to have yours. Can you post a single input file, e.g. dmc-8at_444/dmc.g000.twistnum_0.in.xml?

@maxamsler
Author

@jtkrogel Thank you. I get very similar results:
! total energy = -104.92442028 Ry

And:
opt series 0 -52.975754 +/- 0.013494 0.821427 +/- 0.015755 0.0155
opt series 1 -53.075402 +/- 0.004519 0.630866 +/- 0.007337 0.0119
opt series 2 -53.074534 +/- 0.005798 0.636403 +/- 0.005169 0.0120
opt series 3 -53.076695 +/- 0.005168 0.631304 +/- 0.005981 0.0119
opt series 4 -53.070720 +/- 0.004128 0.629416 +/- 0.004843 0.0119
opt series 5 -53.072013 +/- 0.003839 0.641991 +/- 0.004586 0.0121
opt series 6 -53.070086 +/- 0.004683 0.632449 +/- 0.003970 0.0119
opt series 7 -53.071325 +/- 0.003409 0.637983 +/- 0.004869 0.0120

The dmc-8at_444/dmc.g000.twistnum_0.in.xml file is attached here:
dmc.g000.twistnum_0.in.xml.zip

I just realized that the workflow to produce the initial plot with the large difference between GPU and CPU was slightly different from the Nexus script I posted earlier. With the script posted above I get a smaller discrepancy between CPU and GPU, but still significant, as shown in the following plot:
(plot attached)

However, in the workflow where the discrepancy was much larger, I used a small k-grid of 2x2x2 for the Jastrow optimization, which I then fed into the DMC calculations with 3x3x3 and 4x4x4 grids. Since the Jastrows/orbitals for the GPU and CPU runs were identical, I did not expect this to matter; or am I wrong? It seems that the discrepancy between GPU and CPU depends on the Jastrow. Here is the workflow that optimizes the Jastrow on a 2x2x2 grid and runs QMC with 3x3x3/4x4x4 grids, together with the Jastrow that I used for the DMC runs.
P_SC_loop.py.zip
dmc.g000.twistnum_0.in.xml.zip

@prckent
Contributor

prckent commented Jun 14, 2017

We have found what looks to be a definite bug in the GPU DMC code's handling of certain twists. A DMC run with an X-point twist reliably gives a different result from the CPU code: https://cdash.qmcpack.org/CDash/testDetails.php?test=506898&build=7683 The equivalent VMC run is OK, and the CPU and GPU runs are consistent. This may not be the same problem as the bug discussed here, but there could well be some relation.

@jtkrogel
Contributor

Does this bug manifest in both the real and complex GPU builds, or complex only?

@prckent
Contributor

prckent commented Jun 14, 2017

This "new" bug affects only the real build of the GPU code. The newly written complex code is correct https://cdash.qmcpack.org/CDash/testDetails.php?test=507363&build=7687

@atillack

@prckent @jtkrogel We should probably open a new issue for this. Attached is a plot of the differences for the LiH-x-short runs:

(plot attached)

The real wave function GPU runs are consistently off while the CPU and GPU (complex) runs are consistent with each other and the reference value.

@atillack

@maxamsler Here is a (slightly hard to read) plot comparing CPU and GPU runs of your phosphorus example data at each twist number, overlaid with spot-check reruns of the same wave functions at every fourth twist (for good measure, with four times the number of blocks, 400, and a four times smaller time step of 0.01). For your runs I averaged over the last 30 blocks using qmca; for the four-times-longer spot checks, over the last 120.

(plot attached)

Long story short: while I do see the occasional data point off by more than 1 sigma (2 sigma is the most I've seen), there is nothing I would consider dramatically off.

@jtkrogel
Contributor

@atillack Can you also post the twist averaged final numbers (qmca -a ...) in text format for CPU/GPU, Max/Rerun?

@atillack

atillack commented Jun 15, 2017

@jtkrogel Great suggestion, I think we're getting somewhere... Let me explain:

Here are the values for the last 30 blocks of Max's data:
Overall CPU:
avg series 1 -52.837026 +/- 0.000367 0.622537 +/- 0.000317 0.0118
Overall GPU:
avg series 1 -52.835842 +/- 0.000392 0.620309 +/- 0.000297 0.0117

And for the rerun for the last 120:
Overall CPU:
avg series 1 -53.016886 +/- 0.000634 0.622187 +/- 0.000545 0.0117
Overall GPU:
avg series 1 -53.017832 +/- 0.000646 0.622489 +/- 0.000557 0.0117

Keep in mind that for the rerun I am averaging over only every fourth twist.

This is interesting, as the errors calculated by qmca are about an order of magnitude smaller than the errors reported for the individual twists. For Max's data:

CPU (twist #0): -53.196303 +/- 0.002710 0.619067 +/- 0.002518 0.0116
GPU (twist #0): -53.197061 +/- 0.001960 0.613733 +/- 0.002728 0.0115

test_results_max.txt

For the reruns:
CPU (twist #0): -53.200358 +/- 0.003109 0.619121 +/- 0.002440 0.0116
GPU (twist #0): -53.197160 +/- 0.002041 0.620312 +/- 0.003893 0.0117

test_results_rerun.txt

How does that explain anything, you might ask. qmca calculates the error by assuming all data points come from the same distribution; in other words, the variance of the mean drops as 1/n (n being the number of samples).

When averaging over different twists with a complex wave function, this assumption no longer holds (see the plot I posted earlier, with very discrete energy levels), and hence the error obtained with qmca is overly optimistic. Off the cuff, a simple (and likely too naive) correction for the phosphorus case with four energy levels could be to multiply qmca's error by sqrt(4)...
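As a toy illustration of the single-distribution assumption (synthetic numbers, not data from this issue): when the samples actually come from several distinct twist "levels", the pooled standard deviation no longer measures the same thing as the per-twist width, so an error estimate built on one can mislead about the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four hypothetical "twist levels" with distinct mean energies but the
# same intrinsic width; illustrative numbers only.
levels = [-53.20, -53.08, -53.00, -52.90]
per_twist = [rng.normal(mu, 0.05, size=1000) for mu in levels]

# Width seen *within* a single twist:
within = np.mean([t.std(ddof=1) for t in per_twist])

# Pooling everything into one distribution also picks up the spread
# *between* the levels, so the pooled width is very different:
pooled = np.concatenate(per_twist)
pooled_width = pooled.std(ddof=1)
```

Whether either number is the right basis for an error bar depends on what the per-twist and twist-averaged estimators are actually supposed to measure, which is exactly the question raised here.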

@jtkrogel
Contributor

So there are two potential issues:

  1. The error bars may be estimated poorly due to the low number of blocks involved. The error bar itself has an error bar, and I would recommend using upwards of 200 blocks to calculate it reliably.

  2. The use of gaussian statistics in general. qmca, and all other QMC statistical processing tools I am aware of, make the partially poor assumption of gaussian statistics. This applies to single-point as well as twist-averaged calculations, but as you point out, in the twist-averaged case the averaging itself places additional reliance on gaussian statistics (the gaussian variance-averaging formula), which may further expose the weakness of this assumption.

One way to correct for the additional assumption embedded in 2) in the case of twist averaging is to directly twist average the data files and then perform the statistics (difficulties arise when the files differ in length). I have an older tool that does this. @atillack, if you post all the relevant scalar.dat files here, I will investigate this difference (though the insufficient-blocks problem from 1) may remain).
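A minimal sketch of the direct-averaging approach described above, assuming whitespace-separated scalar.dat-style files with a `#`-prefixed header line and LocalEnergy in a known column (the column index and file layout here are assumptions, and this is not jtkrogel's actual tool):

```python
import glob

import numpy as np

def twist_average_energy(pattern, col=1, equil=0):
    """Row-wise average one column across all twist files matching
    `pattern`, then do the statistics on the averaged trace. Files are
    truncated to the shortest common length, sidestepping the
    unequal-length difficulty mentioned above."""
    traces = []
    for fname in sorted(glob.glob(pattern)):
        data = np.loadtxt(fname)  # lines starting with '#' are skipped
        traces.append(data[equil:, col])
    nmin = min(len(t) for t in traces)
    avg = np.mean([t[:nmin] for t in traces], axis=0)  # per-row twist average
    err = avg.std(ddof=1) / np.sqrt(len(avg))          # naive error of the mean
    return avg.mean(), err
```

The point of averaging first is that the statistics are then done on a single trace, so no gaussian variance-averaging formula across twists is needed.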

@prckent
Contributor

prckent commented Jun 15, 2017

@atillack Very interesting that there is no big difference at the individual twist level. Are you certain? Looking at the plot, I can't be sure there isn't a missing or misplaced point.

@atillack

atillack commented Jun 15, 2017

@jtkrogel Here are all the *.scalar.dat files from Max's runs and my rerun:

scalar_dat_files.zip

@prckent You can look into the two text files I posted containing the data the plot is made from. The largest deviation between CPU and GPU I see in Max's data is for twist 34, with a 3 sigma deviation. In the rerun it is for twist 32, with a 2 sigma deviation.

P.S. Just to be clear, when I write x sigma I mean x times the error bar reported by qmca.

@jtkrogel
Contributor

jtkrogel commented Jun 15, 2017

OK, here's my summary.

40 blocks (Max's data, post equilibration) is too few to calculate a reliable error bar, for two reasons: 1) large variation in the error bar due to too few blocks; 2) more importantly, too few blocks leads to an underestimated autocorrelation time, which in turn leads to an underestimated error bar. Overall, there is no statistically significant evidence for a difference between the CPU and GPU runs.
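The truncation effect in point 2) can be sketched as follows. This is a crude integrated-autocorrelation estimator for illustration only, not the estimator behind qmca's --sac option: a short trace effectively forces an early cutoff of the autocorrelation sum, biasing tau (and hence the corrected error bar, sem * sqrt(tau)) downward.

```python
import numpy as np

def integrated_autocorr_time(x, cutoff):
    """Crude integrated autocorrelation time of a 1-D trace:
    tau = 1 + 2 * sum_{k=1}^{cutoff-1} acf(k). Truncating the sum too
    early underestimates tau for strongly correlated data."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    var = np.var(x)
    tau = 1.0
    for k in range(1, cutoff):
        # Lag-k autocorrelation from the overlapping part of the trace.
        tau += 2.0 * np.dot(x[:-k], x[k:]) / ((n - k) * var)
    return tau
```

For an AR(1) process with coefficient rho, the exact value is tau = (1 + rho) / (1 - rho), which a long trace with a generous cutoff recovers while a tight cutoff does not.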

Max's results (twist averaged, last 40 blocks):
maui>qmca -a -e 60 -q e dmc.*s001*scalar* --sac
avg series 1 LocalEnergy = -52.836690 +/- 0.000445 2.5
maui>qmca -a -e 60 -q e dmcgpu.*s001*scalar* --sac
avg series 1 LocalEnergy = -52.835522 +/- 0.000510 3.1

difference: -0.0012 +/- 0.0007 (1.5 "sigma", underestimated error bar)
autocorrelation time: ~3 blocks

Andreas' longer runs (twist averaged, last 200 blocks):
maui>qmca -a -e 200 -q e *cpu_long*s001*scalar* --sac
avg series 1 LocalEnergy = -53.016785 +/- 0.000804 7.5
maui>qmca -a -e 200 -q e *gpu_long*s001*scalar* --sac
avg series 1 LocalEnergy = -53.018113 +/- 0.000841 8.5

difference: 0.0013 +/- 0.0011 (1 sigma, or essentially zero)
autocorrelation time: ~8 blocks

In either set of runs, the VMC data is too short (20 blocks) to get a meaningful errorbar.

On point 2) of my previous post, twist averaged error bars seem to be estimated accurately under the current assumptions:

Twist averaged error bar for Andreas' cpu runs: 0.000804
Amplify this by sqrt(17) (17 files in the average): 0.003315
Average over individual error bars for each twist: 0.003207

In other words, the error bar expected for a single twist (based on the twist-averaged estimate) matches the observed behavior of a single twist.
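The sqrt(17) consistency check above follows from the variance-averaging formula: for N independent twists with per-twist errors e_i, the twist-averaged error is sqrt(sum e_i^2)/N, which for similar e_i is roughly e/sqrt(N). A quick synthetic check (illustrative numbers, not this thread's data):

```python
import numpy as np

rng = np.random.default_rng(1)
ntwist, nblock = 17, 200

# Independent block traces for each twist, with distinct twist energies.
per_twist_err = []
for mu in rng.normal(-53.0, 0.2, size=ntwist):
    blocks = rng.normal(mu, 0.05, size=nblock)
    per_twist_err.append(blocks.std(ddof=1) / np.sqrt(nblock))
per_twist_err = np.array(per_twist_err)

# Variance-averaging formula for the error of the twist average:
ta_err = np.sqrt(np.sum(per_twist_err**2)) / ntwist

# Amplifying by sqrt(ntwist) should approximately recover the mean
# single-twist error bar, mirroring the check in the comment above.
amplified = ta_err * np.sqrt(ntwist)
```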

@jtkrogel
Contributor

Forgot to mention: averaging the full-file data and then doing the statistics gives an error bar of 0.000789, while estimating the error bar via the variance-averaging formula gives 0.000804.

@prckent
Contributor

prckent commented Jun 15, 2017

@maxamsler For now, our conclusion is that this is not a bug but a consequence of too few blocks in your runs, which gives unreliable error bars (and errors of error bars). We suggest you try, e.g., 200 blocks, leaving the step and walker counts unchanged; the problem should then go away. Please let us know. If the problem remains, we will revisit it. The last uncontrolled variable is the CSCS machine.

@prckent
Contributor

prckent commented Jun 16, 2017

Will reopen if necessary.

@prckent prckent closed this as completed Jun 16, 2017
@maxamsler
Author

maxamsler commented Jun 16, 2017 via email

@maxamsler
Author

maxamsler commented Jun 16, 2017 via email

@jtkrogel
Contributor

jtkrogel commented Jun 19, 2017

@maxamsler I would like to take a closer look at your 800 block output files. If you could post them here, that would be great (*.scalar.dat and *.dmc.dat files for all series as well as input files).

@maxamsler
Author

Here are the files, including the plot.

(plot attached)

800blocks_dat_gpu.tar.gz
800blocks_dat_cpu.tar.gz

@jtkrogel
Contributor

@maxamsler Earlier I missed the fact that your runs were performed with a timestep of 0.04 while Andreas' runs used 0.01. This alone explains the difference in autocorrelation times.

We will rerun at 0.04 to see if we reproduce the difference you see; it is possible that this reflects different timestep behavior between the CPU and GPU implementations. If so, moving to a smaller timestep (~0.01) might resolve your issue (the implementations are correct so long as they agree in the zero-timestep limit).
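The zero-timestep criterion in parentheses can be checked with a simple extrapolation, assuming an approximately linear timestep bias. The energies below are made-up illustrative numbers, not data from these runs:

```python
import numpy as np

# Hypothetical DMC energies (Ha) at several time steps; illustrative only.
tau = np.array([0.04, 0.02, 0.01, 0.005])
energy = np.array([-52.836, -52.928, -52.974, -52.997])

# Fit E(tau) = E0 + a * tau and read off the tau -> 0 intercept. Two
# implementations are consistent if their extrapolated E0 values agree
# within statistics, even if they differ at any single finite timestep.
a, e0 = np.polyfit(tau, energy, 1)
```

In practice each energy carries an error bar, so a weighted fit (and possibly a quadratic term at larger timesteps) would be more appropriate; this sketch only shows the shape of the check.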

@maxamsler
Author

Here are the results for runs with a time step of 0.01. This time I also increased the number of blocks to 2000. Nevertheless, I still see a significant discrepancy in the energies.
(plot attached)

I had to split the data into smaller chunks:
data_2000.tar.gz.partag.txt
data_2000.tar.gz.partaf.txt
data_2000.tar.gz.partae.txt
data_2000.tar.gz.partad.txt
data_2000.tar.gz.partac.txt
data_2000.tar.gz.partab.txt
data_2000.tar.gz.partaa.txt

@prckent prckent changed the title Bug in DMC of the complex version on GPUs Possible bug in DMC of the complex version on GPUs Jul 7, 2017
@prckent
Contributor

prckent commented Jul 7, 2017

Changed the title, since the current hypothesis is that this problem is probably statistical. It could also be related to the CSCS machine (software, Pascal?). For a single run, the individual twists are close enough when run on OLCF Kepler machines. Multiple runs to obtain robust statistics, plus twist-by-twist comparisons, could shed more light.

@prckent
Contributor

prckent commented Nov 28, 2017

I am going to close this issue, since we believe this to be a statistical problem intrinsic to the setup of the twists in this case, and not a bug in QMCPACK; i.e., it is a hazard of statistical methods. Our comparisons between the CPU and GPU runs for individual twists agree. We can reopen this issue if needed. It does point to the need for better statistical analysis procedures.
