NaN check failure with clang in offload build in multi slater determinant test #4767
Addresses QMCPACK#4767 by assigning each component of the TinyVector individually. Also adds ratioGradRef_list_ptr to the mapping list.
This was the cause of #3922 |
Could you check what happens with the complex build? IIRC, there are more failures related to the missing second-component assignment in the same area. |
Scratch that: build configuration had a typo, didn't test the complex build. |
Complex offload build deterministic test failures:
|
On my machine, with cuda 12.2
If I commented that out, I got OK gradients. |
For the complex build, if I fix a number of reductions in a similar manner (effectively copy real and imaginary components separately), those gradient tests pass. |
The type of the index variable in the outer loop (over nw) seems to matter. Changing from |
I wrote up an analysis of the correct-looking PTX and apparently incorrect SASS here: llvm/llvm-project#54633 (comment) |
Loop index over nw must be 32 bits in size. Bug affects offload with NVidia. See QMCPACK#4767
I still saw failures in the mixed precision builds (tested with CUDA 11.8 and 12.2). Full precision builds are fine. |
The test in mixed precision fails on the same assignment after a reduction. This time the x and y components copy correctly (64 bits' worth of data), but the z component is zero. Similar workarounds are effective: copy each element individually or, in keeping with the pattern of the bug, reduce the outer loop variable over the number of walkers from 32 to 16 bits (uint16_t). |
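To make the loop-index workaround concrete, here is a minimal host-only sketch of a per-walker reduction where the width of the walker index is an explicit template parameter. The function `sum_per_walker` and its data layout are hypothetical and are not taken from QMCPACK. The sketch does not reproduce the compiler bug, it only shows how the index type can be pinned to the width reported to work in this thread (32 bits in full precision, 16 bits in mixed precision).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-walker sum. IndexT fixes the width of the index
// used in the outer loop over walkers (nw).
template<typename IndexT>
std::vector<float> sum_per_walker(const std::vector<float>& values, IndexT nw, int n)
{
  // values holds nw consecutive blocks of n entries each.
  assert(values.size() == static_cast<std::size_t>(nw) * static_cast<std::size_t>(n));
  std::vector<float> sums(nw, 0.0f);
  for (IndexT iw = 0; iw < nw; iw++)
  {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
      s += values[iw * n + i];
    sums[iw] = s;
  }
  return sums;
}
```

Calling `sum_per_walker<std::uint32_t>(values, nw, n)` or `sum_per_walker<std::uint16_t>(values, nw, n)` selects the index width at the call site, so the two variants can be compared without touching the loop body. On the host both give the same result.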
Closed by #4797. Tests are passing now. |
Built using Clang for offload (-D ENABLE_OFFLOAD=1; the setting of ENABLE_CUDA does not matter).
Tested LLVM versions are 16, 17 and 18.
Reproduces in release and debug builds.
The unit test test_wavefunction_determinant fails:
It may depend on the CUDA version. With CUDA 11.2.2 the test fails, but in a different way (values don't match). CUDA 11.4, 11.6 and 12.2 show the NaN failure.
The problem seems to be an assignment of a TinyVector after a reduction, here:
qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp
Line 795 in bb8c887
Adding the print statements
Produces the output
The assignment fails for the second component of the vector.
(Side note on CUDA version: the build with CUDA 11.2.2 shows the correct assignment behavior here.)
A workaround is to assign the components individually.
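The workaround can be sketched roughly as follows. This is a host-only illustration, not the QMCPACK code: `Vec3` is a hypothetical stand-in for a 3-component TinyVector, and `reduce_gradients` and its data layout are assumptions made for the example. It shows only the pattern of storing each component separately after a reduction instead of assigning the whole vector at once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 3-component TinyVector (not the QMCPACK class).
struct Vec3
{
  double data[3];
  double& operator[](int i) { return data[i]; }
  const double& operator[](int i) const { return data[i]; }
};

// Sums n contributions per walker, then stores the result one component
// at a time rather than assigning a whole Vec3 after the reduction.
// contributions holds nw consecutive blocks of n entries each.
void reduce_gradients(const std::vector<Vec3>& contributions, int nw, int n, std::vector<Vec3>& grads)
{
  assert(contributions.size() == static_cast<std::size_t>(nw) * static_cast<std::size_t>(n));
  assert(grads.size() == static_cast<std::size_t>(nw));
  for (int iw = 0; iw < nw; iw++)
  {
    double gx = 0.0, gy = 0.0, gz = 0.0;
    for (int i = 0; i < n; i++)
    {
      const Vec3& c = contributions[iw * n + i];
      gx += c[0];
      gy += c[1];
      gz += c[2];
    }
    // Component-wise store: the workaround pattern described above.
    grads[iw][0] = gx;
    grads[iw][1] = gy;
    grads[iw][2] = gz;
  }
}
```

In the offload build the same idea applies inside the target region: the reduction accumulates into scalars, and each scalar is written to its own vector component.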