
Possible bug in DMC of the complex version on GPUs #219

Closed
maxamsler opened this issue May 20, 2017 · 39 comments

@maxamsler

maxamsler commented May 20, 2017

I believe I have found a bug in QMCPACK when using the complex version compiled for GPUs. The energies obtained from a DMC simulation of a periodic system with twists do not agree with the values obtained from the CPU version. The issue arises with both the double and mixed precision versions of the GPU code. I have attached the Nexus file for a test case of a Si64 supercell. The twist-averaged DMC energy for the CPU version is:

qmca -a -q ev -e 50 silicon/dmccpu_64_444/*001.scalar.dat
LocalEnergy Variance ratio
avg series 1 -252.106622 +/- 0.000573 1.820444 +/- 0.000643 0.0072

This energy value is significantly lower than the value obtained from the GPU version:
qmca -a -q ev -e 50 silicon/dmcgpu_64_444/*001.scalar.dat
LocalEnergy Variance ratio
avg series 1 -252.084855 +/- 0.000572 1.792653 +/- 0.000634 0.0071

The bug does NOT seem to affect VMC calculations, since the CPU and GPU versions give results within statistical error of each other.

The calculations were performed on Daint at the Swiss National Supercomputing Centre CSCS, a Cray XC50 with an NVIDIA Tesla P100 and 16 GB per node. The following modules were loaded to compile QMCPACK:
module switch PrgEnv-cray/6.0.3 PrgEnv-intel
module switch intel/17.0.1.132 intel/16.0.3.210
module load cray-hdf5
module load daint-gpu
module load cudatoolkit

I configured the build as follows:
cmake -DCMAKE_C_COMPILER=cc \
      -DCMAKE_CXX_COMPILER=CC \
      -DQE_BIN=/project/s700/QMCPack/CPU/qmcpack/external_codes/quantum_espresso/espresso-5.3.0/bin \
      -DQMC_COMPLEX=1 \
      -DLibxml2_INCLUDE_DIRS=/project/s700/local/include/libxml2/ \
      -DLIBXML2_LIBRARIES=/project/s700/local/ \
      -DBOOST_INCLUDEDIR=/project/s700/boost_1_61_0/ \
      -DQMC_INCLUDE="-I/project/s700/local/include/libxml2" \
      -DQMC_EXTRA_LIBS="/project/s700/local/lib/libxml2.a /project/s700/local/lib/liblzma.a" \
      -DQMC_CUDA=1 \
      -DQMC_MIXED_PRECISION=0

Si64.zip

@prckent
Contributor

prckent commented May 24, 2017

We are currently trying to reproduce this with smaller runs. As a first step, #224 adds DMC tests for LiH at several twists/k-points with both locality and T-moves. We will run these nightly from now on, enabling comparisons between the CPU and GPU implementations, observation of any trends in reliability, etc.

@prckent
Contributor

prckent commented May 24, 2017

The LiH tests pass on GPU (Kepler with a Xeon host) with both locality and T-moves and the standard mixed precision settings, indicating that this is a more subtle problem than "just" the DMC driver being broken.

@maxamsler
Author

maxamsler commented May 24, 2017

I also observed the same discrepancy in simple cubic phosphorus with 8 atoms, which is a smaller test case. If you want I can provide the reference data and input files.

@prckent
Copy link
Contributor

prckent commented May 24, 2017

Yes, that would help.

@maxamsler
Author

I have attached the Nexus workflow together with the ".dat" files from the QMC calculations. The plot shows the VMC/DMC energies as a function of the amount of data excluded for equilibration, for the CPU and GPU versions of the code. The data for the plot and the gnuplot script are also attached in a separate tar file.
plot.tar.gz
P_SC.tar.gz

(plot attached)

@atillack

Hello Max,

I've been going over your data, and the one commonality that also leads to an energy offset when I use our smaller LiH DMC tests is the use of MPC in the Hamiltonian. Could you retest one set (CPU and GPU) with MPC turned off in your configuration files, to see if that fixes it?

Thanks,
Andreas

@ye-luo
Contributor

ye-luo commented May 25, 2017

@atillack the energy value (local energy) reported by qmca doesn't include MPC correction. MPC correction is only an auxiliary Hamiltonian and doesn't affect sampling.

@prckent prckent added this to the v3.1.0 Release milestone May 26, 2017
@atillack

@ye-luo You're right. I reran the LiH tests with fixed random numbers, and what I saw earlier (a difference with MPC on, none with it off) went away. Whether MPC is used or not did not make a difference.

At least for the LiH DMC test, CPU and GPU complex runs agree within their error bars.

@maxamsler
Author

OK. Has anyone tried to run the P-SC structure? It should run very quickly.

@atillack

@maxamsler I would like to. Could you please post a wave function and configuration file to save me the time of compiling Quantum Espresso (it is proving hard to compile on the P100 machine I am using right now)?

@atillack

atillack commented Jun 2, 2017

@maxamsler The Nexus workflow you provided in your P_SC tarball does not work for me and fails with the following error:

Crystal error: the variable constants must be provided exiting.

Could you please provide the wave function and configuration file?

@jtkrogel
Contributor

jtkrogel commented Jun 5, 2017

The problem with the workflow is a missing POSCAR file for the atomic structure. Perhaps Max (@maxamsler) can add it here.

@prckent
Contributor

prckent commented Jun 5, 2017

@jtkrogel Would be good for NEXUS to catch this.
@maxamsler We need the structure.

@maxamsler
Author

maxamsler commented Jun 6, 2017

Sorry, I forgot to attach the POSCAR.

POSCAR.tar.gz
Here it is! Let me know if you need anything else.

@atillack

atillack commented Jun 6, 2017

@maxamsler Thank you.

@jtkrogel
Contributor

jtkrogel commented Jun 6, 2017

@maxamsler I've tried to reproduce the initial parts of your workflow (orbital generation, Jastrow optimization).

I get the following for the scf total energy. Can you confirm this is what you have?
psi1>grep '! ' scf.out
! total energy = -104.92442028 Ry

For optimization, I get the following:
psi1>qmca -q ev *scalar*
LocalEnergy Variance ratio
opt series 0 -52.974221 +/- 0.011113 0.795427 +/- 0.017385 0.0150
opt series 1 -53.058415 +/- 0.009073 0.623527 +/- 0.013509 0.0118
opt series 2 -53.081506 +/- 0.012316 0.601191 +/- 0.015149 0.0113
opt series 3 -53.066947 +/- 0.008743 0.660168 +/- 0.010379 0.0124
opt series 4 -53.077711 +/- 0.010984 0.633892 +/- 0.011716 0.0119
opt series 5 -53.075013 +/- 0.007704 0.637929 +/- 0.009462 0.0120
opt series 6 -53.068671 +/- 0.005155 0.636694 +/- 0.009226 0.0120
opt series 7 -53.085528 +/- 0.006919 0.623939 +/- 0.008297 0.0118

The supercell twist used in the optimization was:
Using supercell twist 0: [ -0.12500 -0.12500 -0.12500]

We can proceed with the Jastrow I produced, but it would be better to have yours. Can you post a single input file, e.g. dmc-8at_444/dmc.g000.twistnum_0.in.xml?

@maxamsler
Author

@jtkrogel Thank you. I get very similar results:
! total energy = -104.92442028 Ry

And:
opt series 0 -52.975754 +/- 0.013494 0.821427 +/- 0.015755 0.0155
opt series 1 -53.075402 +/- 0.004519 0.630866 +/- 0.007337 0.0119
opt series 2 -53.074534 +/- 0.005798 0.636403 +/- 0.005169 0.0120
opt series 3 -53.076695 +/- 0.005168 0.631304 +/- 0.005981 0.0119
opt series 4 -53.070720 +/- 0.004128 0.629416 +/- 0.004843 0.0119
opt series 5 -53.072013 +/- 0.003839 0.641991 +/- 0.004586 0.0121
opt series 6 -53.070086 +/- 0.004683 0.632449 +/- 0.003970 0.0119
opt series 7 -53.071325 +/- 0.003409 0.637983 +/- 0.004869 0.0120

The dmc-8at_444/dmc.g000.twistnum_0.in.xml file is attached here:
dmc.g000.twistnum_0.in.xml.zip

I just realized that the workflow to produce the initial plot with the large difference between GPU and CPU was slightly different from the Nexus script I posted earlier. With the script posted above I get a smaller discrepancy between CPU and GPU, but still significant, as shown in the following plot:
(plot attached)

However, in the workflow where the discrepancy was much larger, I used a small k-grid of 2x2x2 for the Jastrow optimization, which I then fed into the DMC calculations with 3x3x3 and 4x4x4 grids. Since the Jastrows/orbitals for the GPU and CPU runs were identical, I did not expect this to matter; or am I wrong? It seems that the discrepancy between GPU and CPU depends on the Jastrow. Here is the workflow that optimizes the Jastrow on a 2x2x2 grid and runs QMC with 3x3x3/4x4x4 grids, together with the Jastrow that I used for the DMC runs.
P_SC_loop.py.zip
dmc.g000.twistnum_0.in.xml.zip

@prckent
Contributor

prckent commented Jun 14, 2017

We have found what looks to be a definite bug in the GPU DMC code's handling of certain twists. A DMC run with an X-point twist reliably gives a different result from the CPU code: https://cdash.qmcpack.org/CDash/testDetails.php?test=506898&build=7683 The equivalent VMC run is OK, and the CPU and GPU runs are consistent. This may not be the same problem as the bug discussed here, but there could well be some relation.

@jtkrogel
Contributor

Does this bug manifest in both the real and complex GPU builds, or complex only?

@prckent
Contributor

prckent commented Jun 14, 2017

This "new" bug affects only the real build of the GPU code. The newly written complex code is correct https://cdash.qmcpack.org/CDash/testDetails.php?test=507363&build=7687

@atillack

@prckent @jtkrogel We should probably open a new issue for this. Attached is a plot of the differences for the LiH-x-short runs:

(plot attached)

The real wave function GPU runs are consistently off while the CPU and GPU (complex) runs are consistent with each other and the reference value.

@atillack

@maxamsler Here is a (slightly hard to read) plot comparing CPU and GPU runs of your phosphorus example data at each twist number, overlaid with spot-check reruns of the same wave functions at every fourth twist (for good measure, with four times the number of blocks, 400, and a four times smaller time step of 0.01). For your runs I averaged over the last 30 blocks using qmca; for the four-times-longer spot checks, over the last 120.

(plot attached)

Long story short: while I do see the occasional data point off by more than 1 sigma (2 sigma is the most I've seen), there is nothing I would consider dramatically off.

@jtkrogel
Contributor

@atillack Can you also post the twist averaged final numbers (qmca -a ...) in text format for CPU/GPU, Max/Rerun?

@atillack

atillack commented Jun 15, 2017

@jtkrogel Great suggestion, I think we're getting somewhere... Let me explain:

Here are the values for the last 30 blocks of Max's data:
Overall CPU:
avg series 1 -52.837026 +/- 0.000367 0.622537 +/- 0.000317 0.0118
Overall GPU:
avg series 1 -52.835842 +/- 0.000392 0.620309 +/- 0.000297 0.0117

And for the rerun for the last 120:
Overall CPU:
avg series 1 -53.016886 +/- 0.000634 0.622187 +/- 0.000545 0.0117
Overall GPU:
avg series 1 -53.017832 +/- 0.000646 0.622489 +/- 0.000557 0.0117

Keep in mind that for the rerun I am averaging over only every fourth twist.

This is interesting, as the errors calculated by qmca are about an order of magnitude smaller than the errors reported for the individual twists. For Max's data:

CPU (twist #0): -53.196303 +/- 0.002710 0.619067 +/- 0.002518 0.0116
GPU (twist #0): -53.197061 +/- 0.001960 0.613733 +/- 0.002728 0.0115

test_results_max.txt

For the reruns:
CPU (twist #0): -53.200358 +/- 0.003109 0.619121 +/- 0.002440 0.0116
GPU (twist #0): -53.197160 +/- 0.002041 0.620312 +/- 0.003893 0.0117

test_results_rerun.txt

How does that explain anything, you might ask. qmca calculates the error by assuming all data points come from the same distribution; in other words, the variance of the mean drops as 1/n (n being the number of samples).

When averaging over different twists with a complex wave function, this assumption no longer holds (see the plot I posted earlier, with very discrete energy levels), and hence the error obtained with qmca is overly optimistic. Off the cuff, a simple (and likely too naive) correction for the phosphorus case with four energy levels could be to multiply qmca's error by sqrt(4)...
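As a toy illustration of the single-distribution assumption (synthetic numbers, not data from this issue): when the samples actually come from several distinct twist "levels", the pooled standard deviation no longer measures the same thing as the per-twist width, so an error estimate built on one can mislead about the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four hypothetical "twist levels" with distinct mean energies but the
# same intrinsic width; illustrative numbers only.
levels = [-53.20, -53.08, -53.00, -52.90]
per_twist = [rng.normal(mu, 0.05, size=1000) for mu in levels]

# Width seen *within* a single twist:
within = np.mean([t.std(ddof=1) for t in per_twist])

# Pooling everything into one distribution also picks up the spread
# *between* the levels, so the pooled width is very different:
pooled = np.concatenate(per_twist)
pooled_width = pooled.std(ddof=1)
```

Whether either number is the right basis for an error bar depends on what the per-twist and twist-averaged estimators are actually supposed to measure, which is exactly the question raised here.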

@jtkrogel
Contributor

So there are two potential issues:

  1. The error bars may be estimated poorly due to the low number of blocks involved. The error bar itself has an error bar, and I would recommend using upwards of 200 blocks to calculate it reliably.

  2. The use of gaussian statistics in general. qmca, and all other QMC statistical processing tools I am aware of, make the partially poor assumption of gaussian statistics. This applies to single-point as well as twist-averaged calculations, but as you point out, in the twist-averaged case the averaging itself places additional reliance on gaussian statistics (the gaussian variance-averaging formula), which may further expose the weakness of this assumption.

One way to correct for the additional assumption embedded in 2) in the case of twist averaging is to directly twist average the data files and then perform the statistics (difficulties arise when the files differ in length). I have an older tool that does this. @atillack, if you post all the relevant scalar.dat files here, I will investigate this difference (though the insufficient-blocks problem from 1) may remain).
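A minimal sketch of the direct-averaging approach described above, assuming whitespace-separated scalar.dat-style files with a `#`-prefixed header line and LocalEnergy in a known column (the column index and file layout here are assumptions, and this is not jtkrogel's actual tool):

```python
import glob

import numpy as np

def twist_average_energy(pattern, col=1, equil=0):
    """Row-wise average one column across all twist files matching
    `pattern`, then do the statistics on the averaged trace. Files are
    truncated to the shortest common length, sidestepping the
    unequal-length difficulty mentioned above."""
    traces = []
    for fname in sorted(glob.glob(pattern)):
        data = np.loadtxt(fname)  # lines starting with '#' are skipped
        traces.append(data[equil:, col])
    nmin = min(len(t) for t in traces)
    avg = np.mean([t[:nmin] for t in traces], axis=0)  # per-row twist average
    err = avg.std(ddof=1) / np.sqrt(len(avg))          # naive error of the mean
    return avg.mean(), err
```

The point of averaging first is that the statistics are then done on a single trace, so no gaussian variance-averaging formula across twists is needed.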

@prckent
Contributor

prckent commented Jun 15, 2017

@atillack Very interesting that there is no big difference at the individual twist level. Are you certain? Looking at the plot, I can't be sure there isn't a missing or misplaced point.

@atillack

atillack commented Jun 15, 2017

@jtkrogel Here are all the *.scalar.dat files from Max's runs and my rerun:

scalar_dat_files.zip

@prckent You can look into the two text files I posted containing the data the plot is made from. The largest deviation between CPU and GPU I see in Max's data is for twist 34, with a 3 sigma deviation. In the rerun it is for twist 32, with a 2 sigma deviation.

P.S. Just to be clear, when I write x sigma I mean x times the error bar reported by qmca.

@jtkrogel
Contributor

jtkrogel commented Jun 15, 2017

OK, here's my summary.

40 blocks (Max's data, post equilibration) is too few to calculate a reliable error bar, for two reasons: 1) large variation in the error bar due to too few blocks; 2) more importantly, too few blocks leads to an underestimated autocorrelation time, which in turn leads to an underestimated error bar. Overall, there is no statistically significant evidence for a difference between the CPU and GPU runs.
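The truncation effect in point 2) can be sketched as follows. This is a crude integrated-autocorrelation estimator for illustration only, not the estimator behind qmca's --sac option: a short trace effectively forces an early cutoff of the autocorrelation sum, biasing tau (and hence the corrected error bar, sem * sqrt(tau)) downward.

```python
import numpy as np

def integrated_autocorr_time(x, cutoff):
    """Crude integrated autocorrelation time of a 1-D trace:
    tau = 1 + 2 * sum_{k=1}^{cutoff-1} acf(k). Truncating the sum too
    early underestimates tau for strongly correlated data."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    var = np.var(x)
    tau = 1.0
    for k in range(1, cutoff):
        # Lag-k autocorrelation from the overlapping part of the trace.
        tau += 2.0 * np.dot(x[:-k], x[k:]) / ((n - k) * var)
    return tau
```

For an AR(1) process with coefficient rho, the exact value is tau = (1 + rho) / (1 - rho), which a long trace with a generous cutoff recovers while a tight cutoff does not.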

Max's results (twist averaged, last 40 blocks):
maui>qmca -a -e 60 -q e dmc.*s001*scalar* --sac
avg series 1 LocalEnergy = -52.836690 +/- 0.000445 2.5
maui>qmca -a -e 60 -q e dmcgpu.*s001*scalar* --sac
avg series 1 LocalEnergy = -52.835522 +/- 0.000510 3.1

difference: -0.0012 +/- 0.0007 (1.5 "sigma", underestimated error bar)
autocorrelation time: ~3 blocks

Andreas' longer runs (twist averaged, last 200 blocks):
maui>qmca -a -e 200 -q e *cpu_long*s001*scalar* --sac
avg series 1 LocalEnergy = -53.016785 +/- 0.000804 7.5
maui>qmca -a -e 200 -q e *gpu_long*s001*scalar* --sac
avg series 1 LocalEnergy = -53.018113 +/- 0.000841 8.5

difference: 0.0013 +/- 0.0011 (1 sigma, or essentially zero)
autocorrelation time: ~8 blocks

In either set of runs, the VMC data is too short (20 blocks) to get a meaningful errorbar.

On point 2) of my previous post, twist averaged error bars seem to be estimated accurately under the current assumptions:

Twist averaged error bar for Andreas' cpu runs: 0.000804
Amplify this by sqrt(17) (17 files in the average): 0.003315
Average over individual error bars for each twist: 0.003207

In other words, the error bar expected for a single twist (based on the twist-averaged estimate) matches the observed behavior of a single twist.
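The sqrt(17) consistency check above follows from the variance-averaging formula: for N independent twists with per-twist errors e_i, the twist-averaged error is sqrt(sum e_i^2)/N, which for similar e_i is roughly e/sqrt(N). A quick synthetic check (illustrative numbers, not this thread's data):

```python
import numpy as np

rng = np.random.default_rng(1)
ntwist, nblock = 17, 200

# Independent block traces for each twist, with distinct twist energies.
per_twist_err = []
for mu in rng.normal(-53.0, 0.2, size=ntwist):
    blocks = rng.normal(mu, 0.05, size=nblock)
    per_twist_err.append(blocks.std(ddof=1) / np.sqrt(nblock))
per_twist_err = np.array(per_twist_err)

# Variance-averaging formula for the error of the twist average:
ta_err = np.sqrt(np.sum(per_twist_err**2)) / ntwist

# Amplifying by sqrt(ntwist) should approximately recover the mean
# single-twist error bar, mirroring the check in the comment above.
amplified = ta_err * np.sqrt(ntwist)
```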

@jtkrogel
Contributor

Forgot to mention: averaging the full-file data and then doing the statistics gives an error bar of 0.000789, while estimating the error bar via the variance-averaging formula gives 0.000804.

@prckent
Contributor

prckent commented Jun 15, 2017

@maxamsler For now, our conclusion is that this is not a bug but a consequence of too few blocks in your runs, which gives unreliable error bars (and errors of error bars). We suggest you try, e.g., 200 blocks, leaving the step and walker counts unchanged; the problem should then go away. Please let us know. If the problem remains, we will revisit it. The last uncontrolled variable is the CSCS machine.

@prckent
Contributor

prckent commented Jun 16, 2017

Will reopen if necessary.

@prckent prckent closed this as completed Jun 16, 2017
@maxamsler
Author

maxamsler commented Jun 16, 2017 via email

@maxamsler
Author

maxamsler commented Jun 16, 2017 via email

@jtkrogel
Contributor

jtkrogel commented Jun 19, 2017

@maxamsler I would like to take a closer look at your 800 block output files. If you could post them here, that would be great (*.scalar.dat and *.dmc.dat files for all series as well as input files).

@maxamsler
Author

Here are the files, including the plot.

(plot attached)

800blocks_dat_gpu.tar.gz
800blocks_dat_cpu.tar.gz

@jtkrogel
Contributor

@maxamsler Earlier I missed the fact that your runs were performed with a timestep of 0.04 while Andreas' runs used 0.01. This alone explains the difference in autocorrelation times.

We will rerun at 0.04 to see if we reproduce the difference you see; it is possible that this reflects different timestep behavior between the CPU and GPU implementations. If so, moving to a smaller timestep (~0.01) might resolve your issue (the implementations are correct so long as they agree in the zero-timestep limit).
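The zero-timestep criterion in parentheses can be checked with a simple extrapolation, assuming an approximately linear timestep bias. The energies below are made-up illustrative numbers, not data from these runs:

```python
import numpy as np

# Hypothetical DMC energies (Ha) at several time steps; illustrative only.
tau = np.array([0.04, 0.02, 0.01, 0.005])
energy = np.array([-52.836, -52.928, -52.974, -52.997])

# Fit E(tau) = E0 + a * tau and read off the tau -> 0 intercept. Two
# implementations are consistent if their extrapolated E0 values agree
# within statistics, even if they differ at any single finite timestep.
a, e0 = np.polyfit(tau, energy, 1)
```

In practice each energy carries an error bar, so a weighted fit (and possibly a quadratic term at larger timesteps) would be more appropriate; this sketch only shows the shape of the check.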

@maxamsler
Author

Here are the results for runs with a time step of 0.01. This time I also increased the number of blocks to 2000. Nevertheless, I still see a significant discrepancy in the energies.
(plot attached)

I had to split the data into smaller chunks:
data_2000.tar.gz.partag.txt
data_2000.tar.gz.partaf.txt
data_2000.tar.gz.partae.txt
data_2000.tar.gz.partad.txt
data_2000.tar.gz.partac.txt
data_2000.tar.gz.partab.txt
data_2000.tar.gz.partaa.txt

@prckent prckent changed the title Bug in DMC of the complex version on GPUs Possible bug in DMC of the complex version on GPUs Jul 7, 2017
@prckent
Contributor

prckent commented Jul 7, 2017

Changed the title, since the current hypothesis is that this problem is probably statistical. It could also be related to the CSCS machine (software, Pascal?). For a single run, the individual twists are close enough when run on OLCF Kepler machines. Multiple runs to obtain robust statistics, plus twist-by-twist comparisons, could shed more light.

@prckent
Contributor

prckent commented Nov 28, 2017

I am going to close this issue, since we believe this to be a statistical problem intrinsic to the setup of the twists in this case, and not a bug in QMCPACK; i.e., it is a hazard of statistical methods. Our comparisons between the CPU and GPU runs for individual twists agree. We can reopen this issue if needed. It does point to the need for better statistical analysis procedures.
