Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN check failure with clang in offload build in multi slater determinant test #4767

Closed
markdewing opened this issue Oct 11, 2023 · 11 comments
Closed

Comments

@markdewing
Copy link
Contributor

Built using Clang for offload (-D ENABLE_OFFLOAD=1, the setting of ENABLE_CUDA does not matter)
Tested LLVM versions are 16, 17 and 18.
Reproduces in release and debug builds.

The unit test test_wavefunction_determinant fails:

-------------------------------------------------------------------------------
LiH multi Slater dets precomputed_table_method
-------------------------------------------------------------------------------
/home/mdewing/nvme/physics/qmcpack/main/qmcpack/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp:360
...............................................................................

/home/mdewing/nvme/physics/qmcpack/main/qmcpack/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp:360: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
  NaN check in TWF::mw_calcRatioGrad found
    particle 1 grads[1] is NaN.

It may depend on the CUDA version. With CUDA 11.2.2 the test fails, but in a different way (values don't match). CUDA 11.4, 11.6 and 12.2 show the NaN failure.

The problem seems to be an assignment of a TinyVector after a reduction, here:

ratioGradRef_list_ptr[iw] = ratioGradRef_local;

Adding the print statements

      ratioGradRef_list_ptr[iw] = ratioGradRef_local;
      printf("ratiograd1 local  %lu %g %g %g \n",iw,ratioGradRef_local[0], ratioGradRef_local[1], ratioGradRef_local[2]);
      printf("ratiograd1 device %lu %g %g %g \n",iw,ratioGradRef_list_ptr[iw][0], ratioGradRef_list_ptr[iw][1], ratioGradRef_list_ptr[iw][2]);

Produces the output

ratiograd1 local  1 -0.785575 0.785575 -3.95485 
ratiograd1 local  0 -0.477015 -0.477015 -0.965852 
ratiograd1 device 1 -0.785575 0 -3.95485 
ratiograd1 device 0 -0.477015 0 -0.965852 

The assignment fails for the second component of the vector.
(Side note on cuda version: the build with cuda 11.2.2 shows the correct assignment behavior here.)

A workaround is to assign the components individually.

markdewing added a commit to markdewing/qmcpack that referenced this issue Oct 11, 2023
Addresses QMCPACK#4767 by assigning
each component of the TinyVector individually.

Also adds ratioGradRef_list_ptr to the mapping list.
@ye-luo
Copy link
Contributor

ye-luo commented Oct 11, 2023

This was the cause of #3922

@ye-luo
Copy link
Contributor

ye-luo commented Oct 11, 2023

Could you check what happens with the complex build? IIRC, there are more failure related to missing the second component assignment in the same area.

@markdewing
Copy link
Contributor Author

markdewing commented Oct 11, 2023

Scatch that: build configuration had a typo, didn't test the complex build.

@markdewing
Copy link
Contributor Author

Complex offload build deterministic test failures:

  • test_multi_slater_determinant, Bi-spinor multi Slater dets - values of ratios don't match
  • deterministic-diamondC_2x1x1-gaussian_pp_MSD-vmcbatch-dmcbatch-mwalkers_sdj

@ye-luo
Copy link
Contributor

ye-luo commented Oct 11, 2023

On my machine, with cuda 12.2
Ratio tests failed.

diff --git a/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp b/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp
index a52932ebf..7ebd19f28 100644
--- a/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp
+++ b/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp
@@ -530,8 +530,8 @@ void test_Bi_msd(const std::string& spo_xml_string,
 
   std::vector<PsiValueType> ratios(num_walkers);
   TrialWaveFunction::mw_calcRatio(twf_list, p_list, moved_elec_id, ratios);
-  for (int iw = 0; iw < num_walkers; iw++)
-    CHECK(ValueType(std::abs(ratios[iw])) == ValueApprox(0.991503).epsilon(1e-4));
+  //for (int iw = 0; iw < num_walkers; iw++)
+  //  CHECK(ValueType(std::abs(ratios[iw])) == ValueApprox(0.991503).epsilon(1e-4));
   std::fill(ratios.begin(), ratios.end(), ValueType(0));
 
   TWFGrads<CoordsType::POS_SPIN> grads_new(num_walkers);

If I commented that out, I got OK gradients.

@markdewing
Copy link
Contributor Author

For the complex build, if I fix a number of reductions in a similar manner (effectively copy real and imaginary components separately), those gradient tests pass.

@markdewing
Copy link
Contributor Author

The type of the index variable in the outer loop (over nw) seems to matter. Changing from size_t to uint32_t seems to make the issue go away.

@markdewing
Copy link
Contributor Author

I wrote up an analysis of the correct-looking PTX and apparently incorrect SASS here: llvm/llvm-project#54633 (comment)

markdewing added a commit to markdewing/qmcpack that referenced this issue Oct 13, 2023
Loop index over nw must be 32 bits in size.
Bug affects offload with NVidia.
See QMCPACK#4767
@ye-luo ye-luo closed this as completed Oct 14, 2023
@ye-luo
Copy link
Contributor

ye-luo commented Oct 16, 2023

I still saw failure in the mixed precision builds (tested CUDA 11.8 and 12.2). Full precision builds are fine.
https://cdash.qmcpack.org/CDash/testSummary.php?project=1&name=deterministic-unit_test_wavefunction_determinant&date=2023-10-16

@ye-luo ye-luo reopened this Oct 16, 2023
@markdewing
Copy link
Contributor Author

The test in mixed precision fails on the same assignment after a reduction. This time the x and y components copy okay (64 bits worth of data), but the z component is zero. Similar workarounds are effective (copy each element individually) or, with astounding regularity to the bug pattern, reduce the outer loop variable over the number of walkers from 32 to 16 bits (uint16_t).
Reducing the loop variable from 64 bits to 32 bits was a reasonable change. I do not think reducing it further to 16 bits is a reasonable change for the workaround.

@prckent
Copy link
Contributor

prckent commented Oct 30, 2023

Closed by #4797 . Tests are passing now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants