Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLAS and cuBLAS TPL interface for TRSM #532

Merged
merged 13 commits into from
Dec 20, 2019
Merged

BLAS and cuBLAS TPL interface for TRSM #532

merged 13 commits into from
Dec 20, 2019

Conversation

vqd8a
Copy link
Contributor

@vqd8a vqd8a commented Dec 13, 2019

Address the user request in #513 . Only call cuBLAS for CUDA and BLAS for non-CUDA. The KK implementation might be added in another future PR.

@vqd8a vqd8a changed the base branch from master to develop December 13, 2019 23:41
Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK to merge this if the fall-back case is not implemented? I think it's OK to have a sequential fall-back, even if that means copying to host and running sequentially (e.g., for CudaSpace Views of element types that cuBLAS does not support).

src/blas/KokkosBlas3_trsm.hpp Outdated Show resolved Hide resolved
src/blas/KokkosBlas3_trsm.hpp Outdated Show resolved Hide resolved
src/blas/impl/KokkosBlas3_trsm_impl.hpp Outdated Show resolved Hide resolved
src/blas/impl/KokkosBlas3_trsm_spec.hpp Outdated Show resolved Hide resolved
@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 16, 2019

@mhoemmen Thanks for your comments. I think the user is OK to have only TPL interfaces in the 3.0 release (#513). Add @iyamazaki for this. I can cover the sequential fall-back in another PR.

@vqd8a vqd8a requested a review from srajama1 December 16, 2019 20:03
@iyamazaki
Copy link
Contributor

Thank you @vqd8a, and @mhoemmen, I want to use this for CholQR in our native Tpetra solvers (which currently copies the basis vectors to the host, and calls BLAS::TRSM).

@mhoemmen
Copy link
Contributor

@iyamazaki wrote:

I want to use this for CholQR in our native Tpetra solvers (which currently copies the basis vectors to the host, and calls BLAS::TRSM).

Sure, that's fine. I already have cuSOLVER wrapped for GEQRF in my branch; it's almost ready. I actually have a CholQR implementation in there. I haven't tried it in a while, but it should be straightforward to revive.

@srajama1
Copy link
Contributor

It would be great to have have GEQRF interface in Kokkos Kernels as well. Looking forward to it !

@mhoemmen
Copy link
Contributor

I would be happy if kokkos-kernels imitated the design. In particular, I was careful not to expose any TPL headers or types in TSQR header files. That's important especially because of the cuBLAS v1 vs. v2 header compatibility problem that we've seen before. It's also important not to put extern "C" declarations for BLAS functions in header files, since they might conflict with applications' own extern "C" declarations.

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 17, 2019

@mhoemmen I am working on adding the sequential fall-back implementation now. Almost done. Needs to change unit tests though since I used BLAS TRMM to generate data.

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 17, 2019

It would be great to have have GEQRF interface in Kokkos Kernels as well. Looking forward to it !

@srajama1 I can work on this GEQRF interface after this PR.

KOKKOSBLAS3_TRSM_TPL_SPEC_AVAIL_BLAS( Kokkos::complex<double>, Kokkos::LayoutLeft, Kokkos::LayoutLeft, Kokkos::HostSpace)
#endif
#if defined (KOKKOSKERNELS_INST_KOKKOS_COMPLEX_FLOAT_) \
&& defined (KOKKOSKERNELS_INST_LAYOUTLEFT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to figure out which order this PR and PR #526 should go in. This would affect how this file gets implemented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjwilke This file is only for TPL. I do not think it will be affected by the new ETI implementation.

unit_test/Makefile Show resolved Hide resolved
@srajama1
Copy link
Contributor

@vqd8a Please talk to @mhoemmen to address the requirements mentioned above when you start on GEQRF.

@ndellingwood
Copy link
Contributor

@vqd8a is this PR ready for merge?

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 19, 2019

@ndellingwood I am still checking on the implementation of conjugate transpose.

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 19, 2019

@mhoemmen @ndellingwood I added the sequential fall-back implementation. If it is okay, let me know and I can run spotcheck.

@ndellingwood
Copy link
Contributor

@vqd8a doesn't hurt to start a spot-check, please also run tests with tpls enabled and disabled to confirm behavior is as expected. Thanks!

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 19, 2019

@ndellingwood one question: this PR now have fall-back implementation, should I need to generate ETI files and add them too or should I wait for @jjwilke new ETI implementation?

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 19, 2019

@ndellingwood If I run tests, I have to generate these ETI files using the old script. and do I need to push those files to this PR?

src/blas/impl/KokkosBlas3_trsm_impl.hpp Outdated Show resolved Hide resolved
@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 20, 2019

Running spotchecks now.

@mhoemmen mhoemmen dismissed their stale review December 20, 2019 18:56

Author addressed my concern about __restrict__.

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 20, 2019

One spotcheck (with BLAS TPL) passed. Two more are running

../kokkos-kernels/scripts/test_all_sandia gcc --spot-check --with-tpls=blas

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0 gcc/7.4.0
Testing compiler gcc/6.4.0
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.4.0
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.4.0-OpenMP-release
  PASSED gcc-7.4.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
#######################################################
PASSED TESTS
#######################################################
gcc-6.4.0-OpenMP_Serial-release build_time=580 run_time=829
gcc-7.2.0-OpenMP-release build_time=414 run_time=317
gcc-7.2.0-OpenMP_Serial-release build_time=591 run_time=723
gcc-7.2.0-Serial-release build_time=241 run_time=449
gcc-7.4.0-OpenMP-release build_time=417 run_time=307
#######################################################
FAILED TESTS
#######################################################

@ndellingwood
Copy link
Contributor

@vqd8a just to check, does the spot-check include the fallback or just the tpls, and if the fallback is tested does this work with/without ETI files?

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 20, 2019

@ndellingwood This one was just with BLAS TPL. I have two others running (one with CUBLAS and one with the fallback). For the fallback, I have to generate the ETI files. So, for merging, should I push the ETI files here too?

@ndellingwood
Copy link
Contributor

@vqd8a I just checked with @jjwilke, go ahead and merge in the ETI files here.

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 20, 2019

Thanks @ndellingwood @jjwilke

Spotcheck with cuBLAS passed:

../kokkos-kernels/scripts/test_all_sandia cuda --spot-check --arch=Power8,Pascal60 --with-cuda-options=enable_lamda --with-tpls=cublas

Running on machine: white
Going to test compilers:  cuda/9.2.88 cuda/10.0.130
Testing compiler cuda/9.2.88
Testing compiler cuda/10.0.130
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1015 run_time=768
cuda-9.2.88-Cuda_OpenMP-release build_time=1026 run_time=584
#######################################################
FAILED TESTS
#######################################################

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 20, 2019

Spotcheck with fall-back implementation passed. ETI files were pushed in.

../kokkos-kernels/scripts/test_all_sandia --spot-check --arch=Power8,Pascal60 --with-cuda-options=enable_lamda

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0 gcc/7.4.0 ibm/16.1.0 cuda/9.2.88 cuda/10.0.130
Testing compiler gcc/6.4.0
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
Testing compiler gcc/7.4.0
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler ibm/16.1.0
  Starting job gcc-7.4.0-OpenMP-release
  PASSED gcc-7.4.0-OpenMP-release
Testing compiler cuda/9.2.88
  Starting job ibm-16.1.0-Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
  PASSED ibm-16.1.0-Serial-release
Testing compiler cuda/10.0.130
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1052 run_time=846
cuda-9.2.88-Cuda_OpenMP-release build_time=1009 run_time=655
gcc-6.4.0-OpenMP_Serial-release build_time=556 run_time=873
gcc-7.2.0-OpenMP-release build_time=398 run_time=317
gcc-7.2.0-OpenMP_Serial-release build_time=563 run_time=1167
gcc-7.2.0-Serial-release build_time=230 run_time=461
gcc-7.4.0-OpenMP-release build_time=398 run_time=311
ibm-16.1.0-Serial-release build_time=1420 run_time=784
#######################################################
FAILED TESTS
#######################################################

@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 20, 2019

@ndellingwood Spotchecks passed. If possible, can you merge it?

Copy link
Contributor

@ndellingwood ndellingwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spot-checks have passed, I think @mhoemmen 's comments have been addressed.

@ndellingwood ndellingwood merged commit d29beba into develop Dec 20, 2019
@vqd8a
Copy link
Contributor Author

vqd8a commented Dec 20, 2019

Thanks @ndellingwood for merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants