-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race conditions in CUDA kernels causing incorrect energies #2883
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this make deterministic tests passing more robustly?
https://cdash.qmcpack.org/CDash/viewTest.php?buildid=183717&onlydelta
@@ -312,6 +312,7 @@ __global__ void find_core_electrons_PBC_kernel(T** R, | |||
disp[tid][1] = r[tid][1] - i[ion][1]; | |||
disp[tid][2] = r[tid][2] - i[ion][2]; | |||
dist[tid] = min_dist<T>(disp[tid][0], disp[tid][1], disp[tid][2], L, Linv); | |||
__syncthreads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doubt this really matters.
@@ -1412,6 +1412,7 @@ __global__ void calc_ratio_grad_lapl(T** Ainv_list, | |||
// ratio to make it w.r.t. new position | |||
if (tid < 4) | |||
ratio_prod[(tid + 1) * BS1] /= ratio_prod[0]; | |||
__syncthreads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doubt this really matter.
I think the bug is not releated to Volta independent thread scheduling. Deterministic tests are failing on my Pascal as well. I think more efficient warp scheduling tricks the bug. The diamond test just has too low electron counts. |
As you mention, some of the deterministic tests are failing. I don't think any of these changes will help with them. An isolated water molecule in a periodic supercell tripped the problem. This has the same electron count as our diamond tests. However the wavefunction is large. Possibly we could make a workflow test out of it. There are some excess syncthreads calls to try to quiet racecheck. Since they didn't slow the code noticeably and they make the code a bit easier to "parse" mentally for threading issues, I prefer to keep them. |
Actually the diamondC_2x1x1 are working locally, see below. Hopefully the runs on bora will also improve:
|
The bora failure is sporadic. let us hope your fix helps it. |
Proposed changes
Fix reported bugs by Andrea Zen / Dario Alfe where the legacy CUDA code will give clearly wrong energies for certain calculations. The symptom is a total energy as well as several subcomponents of the total energy that vary significantly with walker count. This problem was initially discovered via non-sensical DMC results, but the problems also occur in VMC.
Bugs are traced to handful of potential race conditions in the kernels associated with Coulomb energy evaluation. These bugs are present in Volta generation and more recent GPUs due to their independent thread scheduling. Our regular testing for e.g. carbon diamond did not identify them.
A side effect of these fixes is that the legacy CUDA code now appears to be completely deterministic.
Despite the extra syncthreads calls, running the performance tests showed no change in runtime within measurement error.
I do not have a small reproducer for the problems.
Thanks to @jefflarkin for reminders of CUDA tools (cuda-memcheck) that helped track these problems down efficiently.
The racecheck tool still issues a warning in find_core_electrons_PBC_kernel in NLPP.cu . It looks to be a false warning but needs more study. Otherwise the code is clean of races in VMC for the standard spline+j1+j2 inputs that I tried.
The initcheck tool flags a couple of uninitialized memory references associated with determinant setup and update.
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
nitrogen: AMD Rome + NVIDIA V100 32GB. LLVM dev + CUDA 11.2.0
Checklist
( Created the PR from within vscode, hence initial draft. )