Sacado: Build failures with cuda 11.4 #10342

Closed
fryeguy52 opened this issue Mar 17, 2022 · 10 comments
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. type: bug The primary issue is a bug in Trilinos code or tests

Comments

@fryeguy52
Contributor

Bug Report

@trilinos/sacado
@jwillenbring
@e10harvey

Description

Building the Trilinos CUDA PR configuration with CUDA 11.4 produces build failures in Sacado.

See this build on CDash.

I am seeing the same failures when building on the SEMS GPU machine and on the weaver dev queue.

Steps to Reproduce

Build on sems-son-rhel7-gpu-01 or sems-srn-rhel7-gpu-01 using the environment and configure steps that follow.

Environment

export TRILINOS_DIR=~/Trilinos
source /projects/sems/modulefiles/utils/sems-modules-init.sh
module purge

module load sems-dev
module load sems-dev-gcc/10.1.0
module load sems-dev-openmpi/4.0.5-cuda-11.4.2
module load sems-dev-cuda/11.4.2

module load sems-dev-cgns/4.2.0
module load sems-dev-netcdf-c/4.8.1
module load sems-dev-netcdf-fortran/4.5.3
module load sems-dev-parmetis/4.0.3
module load sems-dev-hdf5/1.10.7
module load sems-dev-netcdf-cxx/4.2
module load sems-dev-parallel-netcdf/1.12.2
module load sems-dev-superlu-dist/7.1.1
module load sems-dev-boost/1.70.0
module load sems-dev-openblas/0.3.18
module load sems-dev-scotch/6.0.3
module load sems-dev-yaml-cpp/0.6.2
module load sems-dev-metis/5.1.0
module load sems-dev-zlib/1.2.11
module load sems-dev-superlu/5.3.0
module load sems-cmake/3.21.1
module load sems-ninja/1.10.1

################################

export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper
export BLAS_ROOT=$OPENBLAS_ROOT

Configure

cmake \
  -C "${TRILINOS_DIR}/cmake/std/PullRequestLinuxCuda11.4.2uvmOffTestingSettings.cmake" \
  -DTrilinos_ENABLE_Sacado=ON \
  -D Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=OFF \
  -DTrilinos_ENABLE_TESTS:BOOL=ON \
  ${TRILINOS_DIR}

Build log

[ 17%] Building CXX object packages/sacado/test/UnitTests/CMakeFiles/Sacado_FadKokkosTests_Cuda_Hierarchical.dir/Fad_KokkosTests_Cuda_Hierarchical.cpp.o
cd /ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/sacado/test/UnitTests && "/net/watson.sandia.gov/storage/fast/projects/sems/install/rhel7-x86_64/sems/v2/utility/cmake/3.21.1/gcc/7.3.0/mxfpluq/bin/ctest" --launch --target-name Sacado_FadKokkosTests_Cuda_Hierarchical --build-dir /ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/sacado/test/UnitTests --output CMakeFiles/Sacado_FadKokkosTests_Cuda_Hierarchical.dir/Fad_KokkosTests_Cuda_Hierarchical.cpp.o --source /ascldap/users/jfrye/Trilinos/packages/sacado/test/UnitTests/Fad_KokkosTests_Cuda_Hierarchical.cpp --language CXX -- /ascldap/users/jfrye/trilinos-builds/sacado-cuda11/build_stat_cxx_wrapper.sh  -I/ascldap/users/jfrye/trilinos-builds/sacado-cuda11 -I/ascldap/users/jfrye/Trilinos/packages/sacado/test/utils -I/ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/sacado/src -I/ascldap/users/jfrye/Trilinos/packages/sacado/src -I/ascldap/users/jfrye/Trilinos/packages/sacado/src/new_design -I/ascldap/users/jfrye/Trilinos/packages/sacado/src/template -I/ascldap/users/jfrye/Trilinos/packages/sacado/src/parameter -I/ascldap/users/jfrye/Trilinos/packages/sacado/src/mpl -I/ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/teuchos/kokkoscomm/src -I/ascldap/users/jfrye/Trilinos/packages/teuchos/kokkoscomm/src -I/ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/teuchos/kokkoscompat/src -I/ascldap/users/jfrye/Trilinos/packages/teuchos/kokkoscompat/src -I/ascldap/users/jfrye/Trilinos/packages/teuchos/parameterlist/src -I/ascldap/users/jfrye/Trilinos/packages/teuchos/parser/src -I/ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/teuchos/core/src -I/ascldap/users/jfrye/Trilinos/packages/teuchos/core/src -I/ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/kokkos/core/src -I/ascldap/users/jfrye/Trilinos/packages/kokkos/core/src -I/ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/kokkos 
-I/projects/sems/install/rhel7-x86_64/sems-dev/tpl/cuda/11.4.2/gcc/10.1.0/base/5cgr5ga/include -I/projects/sems/install/rhel7-x86_64/sems-dev/tpl/boost/1.70.0/gcc/10.1.0/base/2vztcwf/include -I/ascldap/users/jfrye/Trilinos/packages/teuchos/comm/src -I/ascldap/users/jfrye/Trilinos/packages/teuchos/numerics/src -I/ascldap/users/jfrye/trilinos-builds/sacado-cuda11/packages/kokkos/containers/src -I/ascldap/users/jfrye/Trilinos/packages/kokkos/containers/src -pedantic -Wall -Wno-long-long -Wwrite-strings    -expt-extended-lambda -Wext-lambda-captures-this -lineinfo -arch=sm_70  -O3 -DNDEBUG -std=c++14 -MD -MT packages/sacado/test/UnitTests/CMakeFiles/Sacado_FadKokkosTests_Cuda_Hierarchical.dir/Fad_KokkosTests_Cuda_Hierarchical.cpp.o -MF CMakeFiles/Sacado_FadKokkosTests_Cuda_Hierarchical.dir/Fad_KokkosTests_Cuda_Hierarchical.cpp.o.d -o CMakeFiles/Sacado_FadKokkosTests_Cuda_Hierarchical.dir/Fad_KokkosTests_Cuda_Hierarchical.cpp.o -c /ascldap/users/jfrye/Trilinos/packages/sacado/test/UnitTests/Fad_KokkosTests_Cuda_Hierarchical.cpp
/ascldap/users/jfrye/Trilinos/packages/sacado/src/Kokkos_DynRankView_Fad.hpp(408): error: identifier "Kokkos::Impl::ViewDimension1<(unsigned long)5ul, (unsigned int)1u> ::N1" is undefined in device code

/ascldap/users/jfrye/Trilinos/packages/sacado/src/Kokkos_DynRankView_Fad.hpp(431): error: identifier "Kokkos::Impl::ViewDimension2<(unsigned long)5ul, (unsigned int)2u> ::N2" is undefined in device code

/ascldap/users/jfrye/Trilinos/packages/sacado/src/Kokkos_DynRankView_Fad.hpp(385): error: identifier "Kokkos::Impl::ViewDimension0<(unsigned long)5ul, (unsigned int)0u> ::N0" is undefined in device code

3 errors detected in the compilation of "/ascldap/users/jfrye/Trilinos/packages/sacado/test/UnitTests/Fad_KokkosTests_Cuda_Hierarchical.cpp".
make[2]: *** [packages/sacado/test/UnitTests/CMakeFiles/Sacado_FadKokkosTests_Cuda_Hierarchical.dir/Fad_KokkosTests_Cuda_Hierarchical.cpp.o] Error 1
make[2]: Leaving directory `/home/jfrye/trilinos-builds/sacado-cuda11'
make[1]: *** [packages/sacado/test/UnitTests/CMakeFiles/Sacado_FadKokkosTests_Cuda_Hierarchical.dir/all] Error 2
make[1]: Leaving directory `/home/jfrye/trilinos-builds/sacado-cuda11'
make: *** [all] Error 2

@fryeguy52 fryeguy52 added the type: bug The primary issue is a bug in Trilinos code or tests label Mar 17, 2022
@etphipp
Contributor

etphipp commented Mar 23, 2022

With an updated Trilinos-dev, there is no file Trilinos/cmake/std/PullRequestLinuxCuda11.4.2uvmOffTestingSettings.cmake. Has it been committed to the repo?

@fryeguy52
Contributor Author

fryeguy52 commented Mar 29, 2022

@etphipp Thanks for looking into this. That file is on a branch and not ready to be committed. I have done some more testing and am able to reproduce this on develop by setting my environment with:

export TRILINOS_DIR=~/Trilinos
source /projects/sems/modulefiles/utils/sems-modules-init.sh
module purge

module load sems-dev
module load sems-dev-gcc/10.1.0
module load sems-dev-openmpi/4.0.5-cuda-11.4.2
module load sems-dev-cuda/11.4.2

module load sems-dev-cgns/4.2.0
module load sems-dev-netcdf-c/4.8.1
module load sems-dev-netcdf-fortran/4.5.3
module load sems-dev-parmetis/4.0.3
module load sems-dev-hdf5/1.10.7
module load sems-dev-netcdf-cxx/4.2
module load sems-dev-parallel-netcdf/1.12.2
module load sems-dev-superlu-dist/7.1.1
module load sems-dev-boost/1.70.0
module load sems-dev-openblas/0.3.18
module load sems-dev-scotch/6.0.3
module load sems-dev-yaml-cpp/0.6.2
module load sems-dev-metis/5.1.0
module load sems-dev-zlib/1.2.11
module load sems-dev-superlu/5.3.0
module load sems-cmake/3.21.1
module load sems-ninja/1.10.1

################################
export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper

export BLAS_ROOT=$OPENBLAS_ROOT

export SEMS_BOOST_INCLUDE_PATH=BOOST_INC
export SEMS_BOOST_LIBRARY_PATH=BOOST_LIB

export SEMS_BOOST_INCLUDE_PATH=BOOST_INC
export SEMS_BOOST_LIBRARY_PATH=BOOST_LIB

export SEMS_PARMETIS_INCLUDE_PATH=PARMETIS_INC
export SEMS_PARMETIS_LIBRARY_PATH=PARMETIS_LIB

export SEMS_ZLIB_INCLUDE_PATH=ZLIB_INC
export SEMS_ZLIB_LIBRARY_PATH=ZLIB_LIB

export SEMS_HDF5_INCLUDE_PATH=HDF5_INC
export SEMS_HDF5_LIBRARY_PATH=HDF5_LIB

export SEMS_NETCDF_INCLUDE_PATH=NETCDF_C_INC
export SEMS_NETCDF_LIBRARY_PATH=NETCDF_C_LIB

export SEMS_SUPERLU_INCLUDE_PATH=SUPERLU_INC
export SEMS_SUPERLU_LIBRARY_PATH=SUPERLU_LIB

export SEMS_SCOTCH_INCLUDE_PATH=SCOTCH_INC
export SEMS_SCOTCH_LIBRARY_PATH=SCOTCH_LIB

and configuring with:

cmake \
  -C "${TRILINOS_DIR}/cmake/std/PullRequestLinuxCommonTestingSettings.cmake" \
  -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=OFF \
  -DTrilinos_ENABLE_Sacado=ON \
  -DTPL_BLAS_LIBRARIES="-L$ENV{BLAS_ROOT}/lib;-lopenblas;-lgfortran;-lgomp;-lm" \
  -DTPL_LAPACK_LIBRARIES="-L$ENV{LAPACK_ROOT}/lib;-lopenblas;-lgfortran;-lgomp" \
  -DKokkos_ENABLE_CUDA=ON \
  -DKokkos_ARCH_VOLTA70=ON \
  -DKokkos_ENABLE_CUDA_LAMBDA=ON \
  -DTrilinos_ENABLE_TESTS:BOOL=ON \
  ${TRILINOS_DIR}

@fryeguy52
Contributor Author

@etphipp do the instructions above help to reproduce?

@ccober6
Contributor

ccober6 commented Apr 27, 2022

Hi @etphipp, do you have a status on this? This issue seems to be the next hurdle in getting the CUDA build going. Thanks!

@etphipp
Contributor

etphipp commented Apr 27, 2022

Yes, I was finally able to reproduce it. I'm at a loss as to the fix though, or why this hasn't come up in other CUDA builds. @ndellingwood, do you have any thoughts?

@etphipp
Contributor

etphipp commented Apr 27, 2022

After some googling, I appear to have come across the solution. According to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#const-variables, device code may use the value of a static constexpr variable, but it cannot bind a reference to one or take its address. Binding a reference is what we are doing here, since src_dim is a const reference:

template <unsigned> struct AssignDim7 {
  template <typename Dst>
  KOKKOS_INLINE_FUNCTION
  static void eval(Dst& dst, const size_t& src_dim) {}
};
template <> struct AssignDim7<0u> {
  template <typename Dst>
  KOKKOS_INLINE_FUNCTION
  static void eval(Dst& dst, const size_t& src_dim) {
    dst.N7 = src_dim;
  }
};

Passing src_dim by value instead appears to work.
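
For illustration, the by-value variant could look roughly like the sketch below. This is not the actual Sacado patch: StaticDims, DynamicDims, and copy_extent_by_value are hypothetical stand-ins for the Kokkos dimension types, and the fallback definition of KOKKOS_INLINE_FUNCTION is an assumption so the snippet compiles as plain host C++ without Kokkos. The only intended change from the code above is that src_dim is taken by value, so only the value of the static constexpr member is read and no reference to it is formed.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: outside a Kokkos build this macro is undefined, so fall back to
// plain `inline`. In a real Kokkos CUDA build it carries host/device annotations.
#ifndef KOKKOS_INLINE_FUNCTION
#define KOKKOS_INLINE_FUNCTION inline
#endif

// Hypothetical stand-in for a dimension type with a compile-time extent.
struct StaticDims {
  static constexpr std::size_t N7 = 5;
};

// Hypothetical stand-in for a destination whose extent is set at run time.
struct DynamicDims {
  std::size_t N7 = 0;
};

// Primary template: nothing to assign.
template <unsigned> struct AssignDim7 {
  template <typename Dst>
  KOKKOS_INLINE_FUNCTION
  static void eval(Dst&, const std::size_t /*src_dim*/) {}
};

// Specialization that stores the extent. src_dim is passed by value, so the
// caller only reads the value of the constexpr member.
template <> struct AssignDim7<0u> {
  template <typename Dst>
  KOKKOS_INLINE_FUNCTION
  static void eval(Dst& dst, const std::size_t src_dim) {
    dst.N7 = src_dim;
  }
};

// Hypothetical helper: copies the static extent into a runtime destination.
inline std::size_t copy_extent_by_value() {
  DynamicDims dst;
  AssignDim7<0u>::eval(dst, StaticDims::N7);
  return dst.N7;
}
```

Calling copy_extent_by_value() exercises the specialization with a static constexpr source. The same pattern would presumably apply to the N0, N1, and N2 cases named in the build log, though that has not been verified here.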

@ndellingwood
Contributor

I'm at a loss as to the fix though, or why this hasn't come up in other Cuda builds. @ndellingwood, do you have any thoughts?

@etphipp I'm not sure, though I think this showed up somewhere else (I can't recall where at the moment, but I'll add a link once I track it down). I'm guessing that even though it is not supported, it just happened to work until the more recent releases of CUDA.

@ndellingwood
Contributor

ndellingwood commented Apr 27, 2022

This might be related (as in an underlying CUDA issue with taking a reference to a variable in device code that shows up with cuda/11), though it is a runtime issue rather than a compilation issue: kokkos/kokkos-kernels#1373 (comment)

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Apr 29, 2023
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label May 31, 2023