Fix race conditions in CUDA kernels causing incorrect energies #2883

prckent · 2021-02-08T19:21:24Z

Proposed changes

Fix reported bugs by Andrea Zen / Dario Alfe where the legacy CUDA code will give clearly wrong energies for certain calculations. The symptom is a total energy as well as several subcomponents of the total energy that vary significantly with walker count. This problem was initially discovered via non-sensical DMC results, but the problems also occur in VMC.

Bugs are traced to handful of potential race conditions in the kernels associated with Coulomb energy evaluation. These bugs are present in Volta generation and more recent GPUs due to their independent thread scheduling. Our regular testing for e.g. carbon diamond did not identify them.

A side effect of these fixes is that the legacy CUDA code now appears to be completely deterministic.

Despite the extra syncthreads calls, running the performance tests showed no change in runtime within measurement error.

I do not have a small reproducer for the problems.

Thanks to @jefflarkin for reminders of CUDA tools (cuda-memcheck) that helped track these problems down efficiently.

The racecheck tool still issues a warning in find_core_electrons_PBC_kernel in NLPP.cu . It looks to be a false warning but needs more study. Otherwise the code is clean of races in VMC for the standard spline+j1+j2 inputs that I tried.

The initcheck tool flags a couple of uninitialized memory references associated with determinant setup and update.

What type(s) of changes does this code introduce?

Bugfix

Does this introduce a breaking change?

No

What systems has this change been tested on?

nitrogen: AMD Rome + NVIDIA V100 32GB. LLVM dev + CUDA 11.2.0

Checklist

Yes. This PR is up to date with current the current state of 'develop'
No, only CUDA changes. We have not reformatted the CUDA. Code added or changed in the PR has been clang-formatted.
NA. This PR adds tests to cover any new code, or to catch a bug that is being fixed
NA. Documentation has been added (if appropriate)

( Created the PR from within vscode, hence initial draft. )

ye-luo

Does this make deterministic tests passing more robustly?
https://cdash.qmcpack.org/CDash/viewTest.php?buildid=183717&onlydelta

ye-luo · 2021-02-08T19:25:40Z

src/QMCHamiltonians/NLPP.cu

@@ -312,6 +312,7 @@ __global__ void find_core_electrons_PBC_kernel(T** R,
        disp[tid][1] = r[tid][1] - i[ion][1];
        disp[tid][2] = r[tid][2] - i[ion][2];
        dist[tid]    = min_dist<T>(disp[tid][0], disp[tid][1], disp[tid][2], L, Linv);
+        __syncthreads();


doubt this really matters.

ye-luo · 2021-02-08T19:25:54Z

src/QMCWaveFunctions/detail/CUDA_legacy/determinant_update.cu

@@ -1412,6 +1412,7 @@ __global__ void calc_ratio_grad_lapl(T** Ainv_list,
  // ratio to make it w.r.t. new position
  if (tid < 4)
    ratio_prod[(tid + 1) * BS1] /= ratio_prod[0];
+  __syncthreads();  


doubt this really matter.

ye-luo · 2021-02-08T19:37:37Z

I think the bug is not releated to Volta independent thread scheduling. Deterministic tests are failing on my Pascal as well. I think more efficient warp scheduling tricks the bug.

The diamond test just has too low electron counts.

prckent · 2021-02-08T19:50:54Z

As you mention, some of the deterministic tests are failing. I don't think any of these changes will help with them.

An isolated water molecule in a periodic supercell tripped the problem. This has the same electron count as our diamond tests. However the wavefunction is large. Possibly we could make a workflow test out of it.

There are some excess syncthreads calls to try to quiet racecheck. Since they didn't slow the code noticeably and they make the code a bit easier to "parse" mentally for threading issues, I prefer to keep them.

prckent · 2021-02-08T19:54:46Z

Actually the diamondC_2x1x1 are working locally, see below. Hopefully the runs on bora will also improve:

[pk7@nitrogen build_llvmdev_legacy_cuda_mixed_precision]$ ctest -R deterministic-diamondC_2x1x1
Test project /home/pk7/projects/qmc/git_QMCPACK_prckent/build_llvmdev_legacy_cuda_mixed_precision
      Start 453: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1
 1/48 Test #453: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1 ..............................   Passed    1.69 sec
      Start 454: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-kinetic
 2/48 Test #454: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-kinetic ......................   Passed    0.03 sec
      Start 455: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-totenergy
 3/48 Test #455: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-totenergy ....................   Passed    0.03 sec
      Start 456: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-eeenergy
 4/48 Test #456: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-eeenergy .....................   Passed    0.03 sec
      Start 457: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-samples
 5/48 Test #457: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-samples ......................   Passed    0.03 sec
      Start 458: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-potential
 6/48 Test #458: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-potential ....................   Passed    0.03 sec
      Start 459: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-ionion
 7/48 Test #459: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-ionion .......................   Passed    0.03 sec
      Start 460: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-localecp
 8/48 Test #460: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-localecp .....................   Passed    0.03 sec
      Start 461: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-nonlocalecp
 9/48 Test #461: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-nonlocalecp ..................   Passed    0.03 sec
      Start 462: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-flux
10/48 Test #462: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-flux .........................   Passed    0.03 sec
      Start 463: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1
11/48 Test #463: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1 ......................   Passed    1.64 sec
      Start 464: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-kinetic
12/48 Test #464: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-kinetic ..............   Passed    0.04 sec
      Start 465: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-totenergy
13/48 Test #465: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-totenergy ............   Passed    0.03 sec
      Start 466: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-eeenergy
14/48 Test #466: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-eeenergy .............   Passed    0.04 sec
      Start 467: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-samples
15/48 Test #467: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-samples ..............   Passed    0.04 sec
      Start 468: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-potential
16/48 Test #468: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-potential ............   Passed    0.04 sec
      Start 469: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-ionion
17/48 Test #469: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-ionion ...............   Passed    0.04 sec
      Start 470: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-localecp
18/48 Test #470: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-localecp .............   Passed    0.03 sec
      Start 471: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-nonlocalecp
19/48 Test #471: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-nonlocalecp ..........   Passed    0.03 sec
      Start 472: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-flux
20/48 Test #472: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-flux .................   Passed    0.03 sec
      Start 473: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1
21/48 Test #473: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1 ...............   Passed    1.66 sec
      Start 474: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-kinetic
22/48 Test #474: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-kinetic .......   Passed    0.04 sec
      Start 475: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-totenergy
23/48 Test #475: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-totenergy .....   Passed    0.03 sec
      Start 476: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-eeenergy
24/48 Test #476: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-eeenergy ......   Passed    0.03 sec
      Start 477: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-samples
25/48 Test #477: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-samples .......   Passed    0.03 sec
      Start 478: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-potential
26/48 Test #478: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-potential .....   Passed    0.03 sec
      Start 479: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-ionion
27/48 Test #479: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-ionion ........   Passed    0.03 sec
      Start 480: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-localecp
28/48 Test #480: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-localecp ......   Passed    0.03 sec
      Start 481: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-nonlocalecp
29/48 Test #481: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-nonlocalecp ...   Passed    0.03 sec
      Start 482: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-flux
30/48 Test #482: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-flux ..........   Passed    0.03 sec
      Start 483: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1
31/48 Test #483: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1 ..............................   Passed    1.60 sec
      Start 484: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-kinetic
32/48 Test #484: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-kinetic ......................   Passed    0.03 sec
      Start 485: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-totenergy
33/48 Test #485: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-totenergy ....................   Passed    0.04 sec
      Start 486: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-eeenergy
34/48 Test #486: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-eeenergy .....................   Passed    0.03 sec
      Start 487: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-samples
35/48 Test #487: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-samples ......................   Passed    0.03 sec
      Start 488: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-potential
36/48 Test #488: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-potential ....................   Passed    0.03 sec
      Start 489: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-ionion
37/48 Test #489: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-ionion .......................   Passed    0.03 sec
      Start 490: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-localecp
38/48 Test #490: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-localecp .....................   Passed    0.03 sec
      Start 491: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-nonlocalecp
39/48 Test #491: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-nonlocalecp ..................   Passed    0.03 sec
      Start 492: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1
40/48 Test #492: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1 ......................   Passed    1.60 sec
      Start 493: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-kinetic
41/48 Test #493: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-kinetic ..............   Passed    0.04 sec
      Start 494: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-totenergy
42/48 Test #494: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-totenergy ............   Passed    0.03 sec
      Start 495: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-eeenergy
43/48 Test #495: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-eeenergy .............   Passed    0.03 sec
      Start 496: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-samples
44/48 Test #496: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-samples ..............   Passed    0.03 sec
      Start 497: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-potential
45/48 Test #497: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-potential ............   Passed    0.03 sec
      Start 498: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-ionion
46/48 Test #498: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-ionion ...............   Passed    0.03 sec
      Start 499: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-localecp
47/48 Test #499: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-localecp .............   Passed    0.03 sec
      Start 500: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-nonlocalecp
48/48 Test #500: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-nonlocalecp ..........   Passed    0.03 sec

100% tests passed, 0 tests failed out of 48

Label Time Summary:
QMCPACK                     =   8.18 sec*proc (5 tests)
QMCPACK-checking-results    =   1.46 sec*proc (43 tests)
deterministic               =   9.65 sec*proc (48 tests)
quality_unknown             =   9.65 sec*proc (48 tests)

Total Test time (real) =   9.69 sec

ye-luo · 2021-02-08T20:00:19Z

The bora failure is sporadic. let us hope your fix helps it.

Fix race conditions

f74f6da

prckent marked this pull request as ready for review February 8, 2021 19:22

prckent added bug gpu labels Feb 8, 2021

ye-luo reviewed Feb 8, 2021

View reviewed changes

ye-luo approved these changes Feb 8, 2021

View reviewed changes

Merge branch 'develop' into fixcuda

c441d27

ye-luo merged commit a53fcc2 into QMCPACK:develop Feb 9, 2021

prckent deleted the fixcuda branch March 23, 2021 14:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race conditions in CUDA kernels causing incorrect energies #2883

Fix race conditions in CUDA kernels causing incorrect energies #2883

prckent commented Feb 8, 2021 •

edited

Loading

ye-luo left a comment

ye-luo Feb 8, 2021

ye-luo Feb 8, 2021

ye-luo commented Feb 8, 2021 •

edited

Loading

prckent commented Feb 8, 2021

prckent commented Feb 8, 2021

ye-luo commented Feb 8, 2021

Fix race conditions in CUDA kernels causing incorrect energies #2883

Fix race conditions in CUDA kernels causing incorrect energies #2883

Conversation

prckent commented Feb 8, 2021 • edited Loading

Proposed changes

What type(s) of changes does this code introduce?

Does this introduce a breaking change?

What systems has this change been tested on?

Checklist

ye-luo left a comment

Choose a reason for hiding this comment

ye-luo Feb 8, 2021

Choose a reason for hiding this comment

ye-luo Feb 8, 2021

Choose a reason for hiding this comment

ye-luo commented Feb 8, 2021 • edited Loading

prckent commented Feb 8, 2021

prckent commented Feb 8, 2021

ye-luo commented Feb 8, 2021

prckent commented Feb 8, 2021 •

edited

Loading

ye-luo commented Feb 8, 2021 •

edited

Loading