NaN check failure with clang in offload build in multi slater determinant test #4767

markdewing · 2023-10-11T15:31:38Z

Built using Clang for offload (-D ENABLE_OFFLOAD=1, the setting of ENABLE_CUDA does not matter)
Tested LLVM versions are 16, 17 and 18.
Reproduces in release and debug builds.

The unit test test_wavefunction_determinant fails:

-------------------------------------------------------------------------------
LiH multi Slater dets precomputed_table_method
-------------------------------------------------------------------------------
/home/mdewing/nvme/physics/qmcpack/main/qmcpack/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp:360
...............................................................................

/home/mdewing/nvme/physics/qmcpack/main/qmcpack/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp:360: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
  NaN check in TWF::mw_calcRatioGrad found
    particle 1 grads[1] is NaN.

It may depend on the CUDA version. With CUDA 11.2.2 the test fails, but in a different way (values don't match). CUDA 11.4, 11.6 and 12.2 show the NaN failure.

The problem seems to be an assignment of a TinyVector after a reduction, here:

qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp

Line 795 in bb8c887

ratioGradRef_list_ptr[iw] = ratioGradRef_local;

Adding the print statements

      ratioGradRef_list_ptr[iw] = ratioGradRef_local;
      printf("ratiograd1 local  %lu %g %g %g \n",iw,ratioGradRef_local[0], ratioGradRef_local[1], ratioGradRef_local[2]);
      printf("ratiograd1 device %lu %g %g %g \n",iw,ratioGradRef_list_ptr[iw][0], ratioGradRef_list_ptr[iw][1], ratioGradRef_list_ptr[iw][2]);

Produces the output

ratiograd1 local  1 -0.785575 0.785575 -3.95485 
ratiograd1 local  0 -0.477015 -0.477015 -0.965852 
ratiograd1 device 1 -0.785575 0 -3.95485 
ratiograd1 device 0 -0.477015 0 -0.965852

The assignment fails for the second component of the vector.
(Side note on cuda version: the build with cuda 11.2.2 shows the correct assignment behavior here.)

A workaround is to assign the components individually.

The text was updated successfully, but these errors were encountered:

Addresses QMCPACK#4767 by assigning each component of the TinyVector individually. Also adds ratioGradRef_list_ptr to the mapping list.

ye-luo · 2023-10-11T15:48:05Z

This was the cause of #3922

ye-luo · 2023-10-11T16:04:46Z

Could you check what happens with the complex build? IIRC, there are more failure related to missing the second component assignment in the same area.

markdewing · 2023-10-11T16:35:26Z

Scatch that: build configuration had a typo, didn't test the complex build.

markdewing · 2023-10-11T17:27:26Z

Complex offload build deterministic test failures:

test_multi_slater_determinant, Bi-spinor multi Slater dets - values of ratios don't match
deterministic-diamondC_2x1x1-gaussian_pp_MSD-vmcbatch-dmcbatch-mwalkers_sdj

ye-luo · 2023-10-11T19:41:41Z

On my machine, with cuda 12.2
Ratio tests failed.

diff --git a/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp b/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp
index a52932ebf..7ebd19f28 100644
--- a/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp
+++ b/src/QMCWaveFunctions/tests/test_multi_slater_determinant.cpp
@@ -530,8 +530,8 @@ void test_Bi_msd(const std::string& spo_xml_string,
 
   std::vector<PsiValueType> ratios(num_walkers);
   TrialWaveFunction::mw_calcRatio(twf_list, p_list, moved_elec_id, ratios);
-  for (int iw = 0; iw < num_walkers; iw++)
-    CHECK(ValueType(std::abs(ratios[iw])) == ValueApprox(0.991503).epsilon(1e-4));
+  //for (int iw = 0; iw < num_walkers; iw++)
+  //  CHECK(ValueType(std::abs(ratios[iw])) == ValueApprox(0.991503).epsilon(1e-4));
   std::fill(ratios.begin(), ratios.end(), ValueType(0));
 
   TWFGrads<CoordsType::POS_SPIN> grads_new(num_walkers);

If I commented that out, I got OK gradients.

markdewing · 2023-10-11T19:47:31Z

For the complex build, if I fix a number of reductions in a similar manner (effectively copy real and imaginary components separately), those gradient tests pass.

markdewing · 2023-10-13T04:51:46Z

The type of the index variable in the outer loop (over nw) seems to matter. Changing from size_t to uint32_t seems to make the issue go away.

markdewing · 2023-10-13T15:36:03Z

I wrote up an analysis of the correct-looking PTX and apparently incorrect SASS here: llvm/llvm-project#54633 (comment)

Loop index over nw must be 32 bits in size. Bug affects offload with NVidia. See QMCPACK#4767

ye-luo · 2023-10-16T19:10:28Z

I still saw failure in the mixed precision builds (tested CUDA 11.8 and 12.2). Full precision builds are fine.
https://cdash.qmcpack.org/CDash/testSummary.php?project=1&name=deterministic-unit_test_wavefunction_determinant&date=2023-10-16

markdewing · 2023-10-23T19:11:10Z

The test in mixed precision fails on the same assignment after a reduction. This time the x and y components copy okay (64 bits worth of data), but the z component is zero. Similar workarounds are effective (copy each element individually) or, with astounding regularity to the bug pattern, reduce the outer loop variable over the number of walkers from 32 to 16 bits (uint16_t).
Reducing the loop variable from 64 bits to 32 bits was a reasonable change. I do not think reducing it further to 16 bits is a reasonable change for the workaround.

prckent · 2023-10-30T19:37:50Z

Closed by #4797 . Tests are passing now.

markdewing added a commit to markdewing/qmcpack that referenced this issue Oct 11, 2023

Fix NaN check failure in offload build with clang

52e5fb6

Addresses QMCPACK#4767 by assigning each component of the TinyVector individually. Also adds ratioGradRef_list_ptr to the mapping list.

markdewing mentioned this issue Oct 11, 2023

Fix NaN check failure in offload build with clang #4768

Closed

markdewing added a commit to markdewing/qmcpack that referenced this issue Oct 13, 2023

Workaround for assignment-after-reduction bug

dccc148

Loop index over nw must be 32 bits in size. Bug affects offload with NVidia. See QMCPACK#4767

markdewing mentioned this issue Oct 13, 2023

Workaround for assignment-after-reduction offload bug #4773

Merged

ye-luo closed this as completed Oct 14, 2023

ye-luo reopened this Oct 16, 2023

markdewing mentioned this issue Oct 23, 2023

Workaround for assignment-after-reduction bug in mixed precision #4797

Merged

prckent closed this as completed Oct 30, 2023

ye-luo mentioned this issue Nov 17, 2023

Check compilers before handling MPI in CMake #4834

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NaN check failure with clang in offload build in multi slater determinant test #4767

NaN check failure with clang in offload build in multi slater determinant test #4767

markdewing commented Oct 11, 2023

ye-luo commented Oct 11, 2023

ye-luo commented Oct 11, 2023

markdewing commented Oct 11, 2023 •

edited

Loading

markdewing commented Oct 11, 2023

ye-luo commented Oct 11, 2023

markdewing commented Oct 11, 2023

markdewing commented Oct 13, 2023

markdewing commented Oct 13, 2023

ye-luo commented Oct 16, 2023

markdewing commented Oct 23, 2023

prckent commented Oct 30, 2023

NaN check failure with clang in offload build in multi slater determinant test #4767

NaN check failure with clang in offload build in multi slater determinant test #4767

Comments

markdewing commented Oct 11, 2023

ye-luo commented Oct 11, 2023

ye-luo commented Oct 11, 2023

markdewing commented Oct 11, 2023 • edited Loading

markdewing commented Oct 11, 2023

ye-luo commented Oct 11, 2023

markdewing commented Oct 11, 2023

markdewing commented Oct 13, 2023

markdewing commented Oct 13, 2023

ye-luo commented Oct 16, 2023

markdewing commented Oct 23, 2023

prckent commented Oct 30, 2023

markdewing commented Oct 11, 2023 •

edited

Loading