Possible bug in DMC of the complex version on GPUs #219
We are currently trying to reproduce this with smaller runs. As a first step, #224 adds DMC tests for LiH at several twists/k-points with both locality and t-moves. We will run these nightly from now on, enabling comparisons between the CPU and GPU implementations, observation of any trends in reliability, etc.
The LiH tests pass on GPU (Kepler with Xeon host) with locality and t-moves and the standard mixed precision settings, indicating that this is a more subtle problem than "just" the DMC driver being broken.
I also observed the same discrepancy in simple cubic phosphorus with 8 atoms, which is a smaller test case. If you want I can provide the reference data and input files.
Yes, that would help.
I have attached the Nexus workflow together with the ".dat" files from the qmc calculations. The plot shows the VMC/DMC energies as a function of the amount of data excluded for equilibration, for the CPU and GPU versions of the code. The data for the plot and the gnuplot script are also attached in a separate tar file.
Hello Max, I've been going over your data and the one commonality that also leads to an offset in energy when I use our smaller LiH DMC tests is the use of MPC in the Hamiltonian. I wonder if you could retest one set (CPU and GPU) with it turned off in your configuration files and see if this fixes it. Thanks,
@atillack the energy value (local energy) reported by qmca doesn't include the MPC correction. MPC is only an auxiliary Hamiltonian and doesn't affect sampling.
@ye-luo You're right. I reran the LiH tests with fixed random numbers and what I saw earlier (a difference with MPC on, none with it off) went away. Whether MPC is used or not did not make a difference. At least for the LiH DMC test, CPU and GPU complex runs agree within their error bars.
Ok. Has anyone tried to run the P-SC structure? It should run very quickly.
@maxamsler I would like to. Could you please post a wave function and configuration file to save me the time of compiling Quantum Espresso (it is proving hard to compile right now on the P100 machine I am on).
@maxamsler The Nexus workflow you provided in your P_SC tarball does not work for me and fails with the following error:
Could you please provide the wave function and configuration file?
The problem with the workflow is a missing POSCAR file for the atomic structure. Perhaps Max (@maxamsler) can add it here.
@jtkrogel Would be good for NEXUS to catch this.
Sorry, I forgot to attach the POSCAR. POSCAR.tar.gz
@maxamsler Thank you.
@maxamsler I've tried to reproduce the initial parts of your workflow (orbital generation, Jastrow optimization). I get the following for the SCF total energy. Can you confirm this is what you have? For the optimization, I get the following: The supercell twist used in the optimization was: We can proceed with the Jastrow I produced, but it would be better to have yours. Can you post a single input file, e.g. dmc-8at_444/dmc.g000.twistnum_0.in.xml?
@jtkrogel Thank you. I get very similar results: And: The dmc-8at_444/dmc.g000.twistnum_0.in.xml file is attached here: I just realized that the workflow used to produce the initial plot with the large difference between GPU and CPU was slightly different from the Nexus script I posted earlier. With the script posted above I get a smaller discrepancy between CPU and GPU, but it is still significant, as shown in the following plot: However, in the workflow where the discrepancy was much larger, I used a small k-grid of 2x2x2 for the Jastrow optimization, which I then fed into the DMC calculations with 3x3x3 and 4x4x4 grids. Since the Jastrows/orbitals for the GPU and CPU runs were identical I did not expect this to matter, or am I wrong? It seems that the discrepancy between GPU and CPU depends on the Jastrow. Here is the workflow that optimizes the Jastrow on a 2x2x2 grid and runs QMC with a 3x3x3/4x4x4 grid, together with the Jastrow that I used for the DMC runs.
We have found what looks to be a definite bug in the DMC GPU code and its handling of certain twists. A DMC run with an x-point twist reliably gives a different result to the CPU code. https://cdash.qmcpack.org/CDash/testDetails.php?test=506898&build=7683 The equivalent VMC run is OK and the CPU and GPU runs are consistent. This may not be the same problem as the bug discussed here, but there could well be some relation.
Does this bug manifest in real and complex gpu, or complex only?
This "new" bug affects only the real build of the GPU code. The newly written complex code is correct https://cdash.qmcpack.org/CDash/testDetails.php?test=507363&build=7687
@maxamsler Here is a (slightly hard to read) plot of your phosphorus example data at each twist number, comparing CPU and GPU runs, overlaid with reruns of the same wave functions at selected twists (every fourth) as a spot check; for good measure, the reruns use four times the number of blocks (400) and a four times smaller time step of 0.01. For your runs I averaged over the last 30 blocks using qmca, and for the four times longer spot checks over the last 120. Long story short, while I do see the occasional data point being off by more than 1 sigma (2 sigma is the most I've seen), there is nothing I would consider dramatically off.
@atillack Can you also post the twist averaged final numbers (qmca -a ...) in text format for CPU/GPU, Max/Rerun?
@jtkrogel Great suggestion, I think we're getting somewhere... Let me explain: Here are the values for the last 30 blocks of Max's data: And for the rerun, for the last 120: Keep in mind that for the rerun I am only averaging over every fourth twist. This is interesting, as the errors calculated by qmca are about an order of magnitude smaller than the errors reported for the individual twists; for Max's data: CPU (twist #0): -53.196303 +/- 0.002710 0.619067 +/- 0.002518 0.0116 For the reruns: How does that explain anything, you might ask... qmca calculates the error by assuming all data points come from the same distribution; in other words, the variance of the mean drops as 1/n (n being the number of samples). When averaging over different twists with a complex wave function this assumption no longer holds (see the plot I posted earlier, with very discrete energy levels), and hence the error obtained with qmca is overly optimistic. Off the cuff, a simple correction (as such likely too naive) for the phosphorus case with four energy levels could be to multiply qmca's error by sqrt(4)...
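The statistical point being made here can be illustrated with a toy Python sketch (all numbers below are invented; this is not qmca code, and the follow-up comments discuss which estimate is actually appropriate for twist averaging):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 4 distinct twist energy levels, 40 blocks per twist,
# ~0.01 Ha of per-block noise (all values invented for illustration).
levels = np.array([-53.20, -53.16, -53.12, -53.08])
blocks_per_twist = 40
samples = np.concatenate(
    [mu + 0.01 * rng.standard_normal(blocks_per_twist) for mu in levels])

# Same-distribution assumption: pool every block and let the error of the
# mean shrink as sigma / sqrt(n).
naive_err = samples.std(ddof=1) / np.sqrt(samples.size)

# Grouped estimate: average each twist first, then take the error of the
# mean of the four twist means; this respects the distinct energy levels.
twist_means = samples.reshape(len(levels), blocks_per_twist).mean(axis=1)
grouped_err = twist_means.std(ddof=1) / np.sqrt(len(twist_means))

print(f"pooled (same-distribution) error: {naive_err:.5f}")
print(f"per-twist-mean error:             {grouped_err:.5f}")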
So there are two potential issues: 1) too few blocks to obtain reliable error bars, and 2) the twist-averaged error estimate assumes that all twists are drawn from the same distribution.
One way to correct for the additional assumption embedded in 2) for the case of twist averaging is to directly twist average the data files and then perform the statistics (difficulties arise when the files differ in length). I have an older tool that does this. @atillack If you post all the relevant scalar.dat files here, I will investigate this difference (though the insufficient blocks problem (1) may remain).
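A minimal sketch of the "twist average the data files first, then do the statistics" approach described above (this is not the tool mentioned; the file pattern, column index, and equilibration cut are assumptions to be adapted to the actual files):

import glob
import numpy as np

# Assumed layout: one scalar.dat per twist, a '#' header line, and
# LocalEnergy in the second column.
files = sorted(glob.glob("*.s001.scalar.dat"))
traces = [np.loadtxt(f)[:, 1] for f in files]

# Truncate to the shortest run so the block-by-block average is well defined
# (this is the "files differ in length" difficulty mentioned above).
nblocks = min(len(t) for t in traces)
ta_trace = np.mean([t[:nblocks] for t in traces], axis=0)

equil = 50                                    # equilibration blocks to discard (assumed)
data = ta_trace[equil:]
mean = data.mean()
err = data.std(ddof=1) / np.sqrt(data.size)   # naive estimate; ignores autocorrelation
print(f"twist-averaged energy: {mean:.6f} +/- {err:.6f}")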
@atillack Very interesting that there is no big difference at the individual twist level. Are you certain? Looking at the plot, I can't be certain that there is not a missing point or one out of place.
@jtkrogel Here are all the *.scalar.dat files from Max and my rerun: @prckent You can look into the two text files I posted containing the data the plot is made from. The largest deviation between CPU and GPU I see in Max's data is for twist 34, with a 3 sigma deviation. In the rerun it is for twist 32, with a 2 sigma deviation. P.S. Just to be clear, when I write x sigma I mean x times the error bar reported by qmca.
OK, here's my summary. 40 blocks (Max's data, post equilibration) is too few to calculate a reliable error bar for two reasons: 1) large variation in the error bar due to too few blocks, and 2) more importantly, too few blocks leads to an underestimated autocorrelation time, which in turn leads to an underestimated error bar. Overall, there is no statistically significant evidence for a difference between the CPU and GPU runs. Max's results (twist averaged, last 40 blocks): difference -0.0012 +/- 0.0007 (1.5 "sigma", underestimated error bar). Andreas' longer runs (twist averaged, last 200 blocks): difference 0.0013 +/- 0.0011 (1 sigma, or essentially zero). In either set of runs, the VMC data is too short (20 blocks) to get a meaningful error bar. On point 2) of my previous post, twist-averaged error bars seem to be estimated accurately under the current assumptions: the twist-averaged error bar for Andreas' CPU runs is 0.000804. In other words, the error bar expected for a single twist (based on the TA estimate) matches the observed behavior of a single twist.
Forgot to mention, averaging the full file data and then doing the statistics gives an error bar of 0.000789, while estimating the error bar via the variance averaging formula gives 0.000804.
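For reference, here is a generic reblocking sketch (not a QMCPACK tool) of the kind of check that exposes an underestimated autocorrelation time: if the error estimate keeps rising as adjacent blocks are merged, the blocks are too short or too few and the quoted error bar is too small.

import numpy as np

def naive_error(x):
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / np.sqrt(x.size)

def reblock_errors(x, max_levels=6):
    # Error-of-the-mean estimates as adjacent blocks are repeatedly merged.
    data = np.asarray(x, dtype=float)
    errors = []
    for _ in range(max_levels):
        if data.size < 4:
            break
        errors.append(naive_error(data))
        n = (data.size // 2) * 2          # drop a trailing odd block
        data = data[:n].reshape(-1, 2).mean(axis=1)
    return errors

# Usage (hypothetical): pass the post-equilibration LocalEnergy blocks of one
# twist; a plateau in the returned list is the autocorrelation-corrected error,
# while a list that keeps rising means the naive error bar is underestimated.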
@maxamsler For now, our conclusion is that this is not a bug but a consequence of too few blocks in your runs. This gives unreliable error bars (and errors of error bars). We suggest you try, e.g., 200 blocks, leaving the step and walker counts unchanged, and the problem should go away. Please let us know. If the problem remains, then we will revisit it. The last uncontrolled variable is the CSCS machine.
Will reopen if necessary.
Thanks a lot. I am running now with 300 blocks to see if the error bars decrease.
Best
Max
Hi Paul,
I just redid the calculations using 800 blocks, and I still get some discrepancies. However, I did not change how the errors were computed. Should I multiply the error bars by sqrt(ntwist)? Otherwise, it seems to be an issue with my computer cluster.
Best
Max
@maxamsler I would like to take a closer look at your 800-block output files. If you could post them here, that would be great (*.scalar.dat and *.dmc.dat files for all series as well as input files).
Here are the files, incl. the plot.
@maxamsler Earlier I missed the fact that your runs were performed with a timestep of 0.04 while Andreas' runs used 0.01. This alone explains the difference between the autocorrelation times. We will rerun at 0.04 to see if we reproduce the difference you see; it is possible that this reflects differing timestep behavior between the CPU and GPU implementations. If so, moving to a smaller timestep (~0.01) might resolve your issue (the implementations are correct so long as they agree in the zero-timestep limit).
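To illustrate the zero-timestep check mentioned here, one could fit the DMC energy against timestep for each code and compare the intercepts. The sketch below uses invented placeholder numbers, not results from this issue:

import numpy as np

# Placeholder timesteps and energies (NOT actual data from these runs).
taus  = np.array([0.04, 0.02, 0.01])
e_cpu = np.array([-53.1980, -53.1965, -53.1958])
e_gpu = np.array([-53.1960, -53.1955, -53.1952])

# Linear fit E(tau) ~ a*tau + E0; correct implementations should agree on the
# tau -> 0 intercept E0 within error bars, even if they differ at finite tau.
a_cpu, e0_cpu = np.polyfit(taus, e_cpu, 1)
a_gpu, e0_gpu = np.polyfit(taus, e_gpu, 1)
print(f"tau -> 0 intercepts: CPU {e0_cpu:.5f}, GPU {e0_gpu:.5f}")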
Here are the results for runs with a time step of 0.01. This time, I also increased the number of blocks to 2000. Nevertheless, I get a significant discrepancy in the energies. I had to split the data into smaller chunks:
Changed the title, since the current hypothesis is that this problem is probably statistical. It could also be related to the CSCS machine (software, Pascal?). For a single run the twists are individually close enough when run on OLCF Kepler machines. Multiple runs to obtain robust statistics and twist-by-twist comparisons could shed more light.
I am going to close this issue since we believe this to be a statistical problem intrinsic to the setup of the twists in this case and not a bug in QMCPACK, i.e. it is a hazard of statistical methods. Our comparisons between the CPU and GPU runs for individual twists agree. We can reopen this issue if needed. It does point to the need for better statistical analysis procedures.
I believe I found a bug in QMCPACK when using the complex version compiled for GPUs. The energies obtained with a DMC simulation in a periodic system with twists do not agree with the values obtained with the CPU version. This issue arises with both the double and mixed precision versions of the GPU code. I have attached the Nexus file for a test case of a Si64 supercell. The twist-averaged DMC energy for the CPU version is:
qmca -a -q ev -e 50 silicon/dmccpu_64_444/*001.scalar.dat
LocalEnergy Variance ratio
avg series 1 -252.106622 +/- 0.000573 1.820444 +/- 0.000643 0.0072
This energy value is significantly lower than the value obtained from the GPU version:
qmca -a -q ev -e 50 silicon/dmcgpu_64_444/*001.scalar.dat
LocalEnergy Variance ratio
avg series 1 -252.084855 +/- 0.000572 1.792653 +/- 0.000634 0.0071
The bug does NOT seem to affect VMC calculations, since the CPU and GPU versions give results within statistical error of each other.
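For concreteness, the two qmca results quoted above can be combined into a difference with a combined error bar (a quick sketch taking the quoted error bars at face value; the discussion above concludes these error bars are themselves underestimated):

import math

e_cpu, err_cpu = -252.106622, 0.000573   # CPU qmca output above
e_gpu, err_gpu = -252.084855, 0.000572   # GPU qmca output above

diff = e_gpu - e_cpu
err = math.sqrt(err_cpu**2 + err_gpu**2)
print(f"GPU - CPU = {diff:.6f} +/- {err:.6f} Ha ({abs(diff) / err:.0f} sigma)")
# -> about 0.021767 +/- 0.000810, i.e. roughly 27 "sigma" at face value.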
The calculations were performed on Daint at the Swiss National Supercomputing Centre (CSCS), a Cray XC50 with an NVIDIA Tesla P100 with 16 GB per node. The following modules were loaded to compile QMCPACK:
module switch PrgEnv-cray/6.0.3 PrgEnv-intel
module switch intel/17.0.1.132 intel/16.0.3.210
module load cray-hdf5
module load daint-gpu
module load cudatoolkit
I configured the make as follows:
cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
  -DQE_BIN=/project/s700/QMCPack/CPU/qmcpack/external_codes/quantum_espresso/espresso-5.3.0/bin \
  -DQMC_COMPLEX=1 \
  -DLibxml2_INCLUDE_DIRS=/project/s700/local/include/libxml2/ \
  -DLIBXML2_LIBRARIES=/project/s700/local/ \
  -DBOOST_INCLUDEDIR=/project/s700/boost_1_61_0/ \
  -DQMC_INCLUDE="-I/project/s700/local/include/libxml2" \
  -DQMC_EXTRA_LIBS="/project/s700/local/lib/libxml2.a /project/s700/local/lib/liblzma.a" \
  -DQMC_CUDA=1 -DQMC_MIXED_PRECISION=0
Si64.zip