Stokhos: Build errors with new CUDA 11 build #11297

Closed
srbdev opened this issue Nov 21, 2022 · 8 comments
Labels
pkg: Stokhos type: bug The primary issue is a bug in Trilinos code or tests

Comments

@srbdev
Contributor

srbdev commented Nov 21, 2022

Bug Report

@trilinos/stokhos

Description

There are build errors in (and related to) the Stokhos package for the rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_all build. It is currently run on the fork https://github.com/e10harvey/Trilinos, branch sprint17_cpp17.

See the CDash link for additional details.

@ndellingwood
Contributor

Based on the compilation error messages in the CDash link, this likely follows the merge of #11263. @vqd8a, can you help take a look?
@srbdev, is Stokhos disabled in the cuda/11 PR testing? If so, that is likely how this slipped through.

@srbdev
Contributor Author

srbdev commented Nov 21, 2022

It used to be, but not anymore. I've removed all of the explicit package exclusions from the build in order to catch as many of these errors as possible. But yes, that's the reason it slipped through in the past.

@ndellingwood
Contributor

@vqd8a, here is a snippet of the first cuda/11.4.2 compilation error showing in Stokhos:

Error while building C++ object file " packages/stokhos/src/CMakeFiles/stokhos_ifpack2_mp_16_serial.dir/Ifpack2_LocalSparseTriangularSolver_MP_Vector_16_Serial.cpp.o" in target stokhos_ifpack2_mp_16_serial

trilinos/workspace/Trilinos_nightly_pipeline/Trilinos/packages/kokkos-kernels/src/sparse/KokkosSparse_Utils_cusparse.hpp(120): error: static assertion failed with "cuSparse TPL does not support scalar type"
          detected during:
            instantiation of "cudaDataType KokkosSparse::Impl::cuda_data_type_from<T>() [with T=Sacado::MP::Vector<SFS_int_double_16_DFN_CPU_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode>]" 
trilinos/workspace/Trilinos_nightly_pipeline/Trilinos/packages/kokkos-kernels/src/sparse/impl/KokkosSparse_sptrsv_cuSPARSE_impl.hpp(119): here
            instantiation of "void KokkosSparse::Impl::sptrsvcuSPARSE_symbolic(KernelHandle *, KernelHandle::nnz_lno_t, ain_row_index_view_type, ain_nonzero_index_view_type, ain_values_scalar_view_type, __nv_bool) [with KernelHandle=KokkosSparse::Experimental::SPTRSVHandle<const size_t, const int, const Sacado::MP::Vector<SFS_int_double_16_DFN_CPU_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode>, Kokkos::Serial::execution_space, Kokkos::HostSpace::memory_space, Kokkos::HostSpace::memory_space>, ain_row_index_view_type=Kokkos::View<const size_t *, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::HostSpace::execution_space, Kokkos::HostSpace::memory_space>, Kokkos::MemoryTraits<3U>>, ain_nonzero_index_view_type=Kokkos::View<const int *, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::HostSpace::execution_space, Kokkos::HostSpace::memory_space>, Kokkos::MemoryTraits<3U>>, ain_values_scalar_view_type=Kokkos::View<std::add_const_t<Sacado::MP::Vector<SFS_int_double_16_DFN_CPU_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode>> *, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::HostSpace::execution_space, Kokkos::HostSpace::memory_space>, Kokkos::MemoryTraits<3U>>]" 
trilinos/workspace/Trilinos_nightly_pipeline/Trilinos/packages/kokkos-kernels/src/sparse/KokkosSparse_sptrsv.hpp(211): here
            instantiation of "void KokkosSparse::Experimental::sptrsv_symbolic(KernelHandle *, lno_row_view_t_, lno_nnz_view_t_, scalar_nnz_view_t_) [with KernelHandle=KokkosKernels::Experimental::KokkosKernelsHandle<std::add_const_t<const size_t>, std::add_const_t<int>, Sacado::MP::Vector<SFS_int_double_16_DFN_CPU_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode>, Kokkos::HostSpace::execution_space, Kokkos::HostSpace::memory_space, Kokkos::HostSpace::memory_space>, lno_row_view_t_=Kokkos::View<const size_t *, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::HostSpace::execution_space, Kokkos::HostSpace::memory_space>, Kokkos::MemoryTraits<0U>>, lno_nnz_view_t_=Kokkos::View<std::remove_const_t<const int> *, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::HostSpace::execution_space, Kokkos::HostSpace::memory_space>, Kokkos::MemoryTraits<0U>>, scalar_nnz_view_t_=Kokkos::View<Sacado::MP::Vector<SFS_int_double_16_DFN_CPU_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode> *, Kokkos::LayoutRight, Kokkos::Device<Kokkos::HostSpace::execution_space, Kokkos::HostSpace::memory_space>, void>]" 
trilinos/workspace/Trilinos_nightly_pipeline/Trilinos/packages/ifpack2/src/Ifpack2_LocalSparseTriangularSolver_def.hpp(589): here
            instantiation of "void Ifpack2::LocalSparseTriangularSolver<MatrixType>::compute() [with MatrixType=Tpetra::RowMatrix<Sacado::MP::Vector<SFS_int_double_16_DFN_CPU_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode>, default_local_ordinal_type, default_global_ordinal_type, Kokkos_Compat_KokkosSerialWrapperNode>]" 
trilinos/workspace/Trilinos_nightly_pipeline/pull_request_test/packages/stokhos/src/Ifpack2_LocalSparseTriangularSolver_MP_Vector_16_Serial.cpp(55): here

1 error detected in the compilation of "trilinos/workspace/Trilinos_nightly_pipeline/pull_request_test/packages/stokhos/src/Ifpack2_LocalSparseTriangularSolver_MP_Vector_16_Serial.cpp".

@vqd8a
Contributor

vqd8a commented Nov 22, 2022

@ndellingwood I am not sure why it picks SPTRSV_CUSPARSE for the KK sptrsv when the execution space is Serial and the memory space is HostSpace. Do you have any idea?

@etphipp
Contributor

etphipp commented Nov 23, 2022

I am guessing the issue is that Stokhos is ultimately calling cuSPARSE by way of Ifpack2 with its special scalar types, which cuSPARSE doesn't support. Presumably there was previously a native Ifpack2 implementation that did work with Stokhos scalar types, but it now calls cuSPARSE instead. So maybe Ifpack2 should only call cuSPARSE if it is enabled and the scalar type is one of those supported by cuSPARSE.
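
To make the suggestion concrete, here is a minimal C++17 sketch of that kind of guard. It is not the actual Ifpack2 or Kokkos Kernels code: the macro SKETCH_ENABLE_CUSPARSE, the trait is_cusparse_scalar, the type EnsembleScalar, and the functions tpl_solve, native_solve, and solve are all hypothetical names, and the list of supported scalar types is only an assumption for illustration. The idea is that the TPL path, which carries the static assertion, is only instantiated when the TPL is enabled and the scalar type is supported; everything else falls back to a native path.

```cpp
#include <cassert>
#include <complex>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a build-time "cuSPARSE TPL enabled" switch.
#ifndef SKETCH_ENABLE_CUSPARSE
#define SKETCH_ENABLE_CUSPARSE 1
#endif

// Hypothetical trait: true only for scalar types the TPL can handle.
// The set of types listed here is an assumption made for illustration.
template <class T>
struct is_cusparse_scalar : std::false_type {};
template <> struct is_cusparse_scalar<float> : std::true_type {};
template <> struct is_cusparse_scalar<double> : std::true_type {};
template <> struct is_cusparse_scalar<std::complex<float>> : std::true_type {};
template <> struct is_cusparse_scalar<std::complex<double>> : std::true_type {};

// Stand-in for an ensemble scalar type such as Sacado::MP::Vector.
struct EnsembleScalar {
  double v[16];
};

// Mimics a TPL wrapper that rejects unsupported scalars at compile time,
// like the static assertion reported in the error above.
template <class T>
std::string tpl_solve() {
  static_assert(is_cusparse_scalar<T>::value,
                "cuSparse TPL does not support scalar type");
  return "tpl";
}

// Mimics a native implementation that works for any scalar type.
template <class T>
std::string native_solve() {
  return "native";
}

// The guard: inside a template, the discarded branch of `if constexpr`
// is not instantiated, so tpl_solve<T> (and its static_assert) is never
// touched for unsupported scalar types.
template <class T>
std::string solve() {
  if constexpr (SKETCH_ENABLE_CUSPARSE && is_cusparse_scalar<T>::value) {
    return tpl_solve<T>();
  } else {
    return native_solve<T>();
  }
}
```

With this in place, solve<double>() takes the TPL path while solve<EnsembleScalar>() compiles and takes the native path. A plain runtime `if` would not be enough here, since both branches would still be instantiated and the static assertion would fire.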

@etphipp
Contributor

etphipp commented Nov 25, 2022

After further inspection, I don't believe these errors are due to the aforementioned PR. They have been lurking for a while now in Kokkos Kernels and Ifpack2, and it is just that Trilinos testing is using a recent enough CUDA toolkit to turn on the "new" cuSPARSE interface in Kokkos Kernels. I have submitted a PR #11308 to Kokkos-Kernels to address this.

@vqd8a
Contributor

vqd8a commented Nov 27, 2022

@etphipp Thanks for the PR to fix the compile errors.
I have opened PR #11311, which only allows Ifpack2 to call the cuSPARSE SPTRSV if cuSPARSE is enabled and the scalar type is supported by cuSPARSE.

@etphipp
Contributor

etphipp commented Nov 28, 2022

I believe this is resolved now so closing. If you still run into trouble, please re-open.

@etphipp etphipp closed this as completed Nov 28, 2022