Dot based gemm #490

seheracer · 2019-11-05T00:14:42Z

Description:

This PR adds an improved version of KokkosBlas::gemm for CUDA: DotBasedGEMM. DotBasedGEMM optimizes C = beta*C + alpha*A^T*B for the case where A and B are both tall and skinny. C is assumed to be small, so each entry of C is computed as the dot product of the corresponding columns of A and B. Because these dot products run over very long vectors, each one is distributed among multiple teams.
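To make the idea concrete, here is a minimal serial sketch in plain C++ of the dot-product formulation described above. It is not the Kokkos/CUDA code from this PR: the function name `dot_based_gemm`, the column-major `std::vector` storage, and the `chunks` parameter (standing in for the teams that share one dot product) are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative serial sketch; NOT the PR's Kokkos/CUDA implementation.
// A is k x m, B is k x n, C is m x n, all stored column-major.
// Computes C = beta*C + alpha*A^T*B with one dot product per entry of C.
// `chunks` is a hypothetical parameter: it mimics splitting each long dot
// product among several teams, each producing a partial sum.
void dot_based_gemm(double alpha, const std::vector<double>& A,
                    const std::vector<double>& B, double beta,
                    std::vector<double>& C, std::size_t k, std::size_t m,
                    std::size_t n, std::size_t chunks) {
  assert(A.size() == k * m && B.size() == k * n && C.size() == m * n);
  assert(chunks > 0);
  const std::size_t chunk_len = (k + chunks - 1) / chunks;  // ceil(k / chunks)
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      double dot = 0.0;
      // Each chunk plays the role of one team working on a slice of the
      // long vectors; the partial sums are then combined.
      for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t begin = std::min(c * chunk_len, k);
        const std::size_t end = std::min(begin + chunk_len, k);
        double partial = 0.0;
        for (std::size_t r = begin; r < end; ++r) {
          partial += A[r + i * k] * B[r + j * k];  // column i of A, column j of B
        }
        dot += partial;
      }
      C[i + j * m] = beta * C[i + j * m] + alpha * dot;
    }
  }
}
```

On a GPU the chunks would run concurrently and their partial sums would be combined with a reduction; the serial loop only shows how the work is partitioned.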

When the operation has the form alpha*A^T*B and the matrices are tall and skinny, DotBasedGEMM is called instead of CUBLAS' gemm. It is meant as an improvement over CUBLAS' gemm, so DotBasedGEMM is never used if CUBLAS is not enabled.
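The dispatch described above could be sketched roughly as follows. The PR description does not state the actual conditions, so everything here is hypothetical: the name `choose_gemm_path`, the `GemmPath` enum, and especially the two threshold constants are placeholders, not values taken from the PR.

```cpp
#include <cstddef>

enum class GemmPath { CublasGemm, DotBasedGemm };

// Placeholder thresholds, chosen only for illustration (assumptions).
constexpr std::size_t kMaxSmallDim = 32;  // "C is small": m and n at most this
constexpr std::size_t kTallFactor = 100;  // "tall and skinny": k >= factor * m, n

// A is k x m and is used transposed, B is k x n, C is m x n.
// Picks the dot-based path only for A^T * B with tall-and-skinny inputs;
// every other case falls back to the CUBLAS gemm path.
constexpr GemmPath choose_gemm_path(char transA, char transB, std::size_t k,
                                    std::size_t m, std::size_t n) {
  return ((transA == 'T' || transA == 't') &&
          (transB == 'N' || transB == 'n') &&
          m <= kMaxSmallDim && n <= kMaxSmallDim &&
          k >= kTallFactor * m && k >= kTallFactor * n)
             ? GemmPath::DotBasedGemm
             : GemmPath::CublasGemm;
}
```

For example, under these placeholder thresholds `choose_gemm_path('T', 'N', 1000000, 4, 4)` selects the dot-based path, while a square 64 x 64 problem stays on the CUBLAS path.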

Output of test_all_sandia on white (without and with cublas):

../scripts/test_all_sandia --spot-check --arch=Power8,Pascal60

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0 ibm/16.1.0 cuda/9.2.88 cuda/10.0.130
Testing compiler gcc/6.4.0
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler ibm/16.1.0
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-Serial-release
Testing compiler cuda/9.2.88
  Starting job ibm-16.1.0-Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
  PASSED ibm-16.1.0-Serial-release
Testing compiler cuda/10.0.130
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1124 run_time=283
cuda-9.2.88-Cuda_OpenMP-release build_time=1034 run_time=222
gcc-6.4.0-OpenMP_Serial-release build_time=555 run_time=290
gcc-7.2.0-OpenMP-release build_time=390 run_time=107
gcc-7.2.0-OpenMP_Serial-release build_time=623 run_time=368
gcc-7.2.0-Serial-release build_time=233 run_time=182
ibm-16.1.0-Serial-release build_time=1336 run_time=262
#######################################################
FAILED TESTS
#######################################################

../scripts/test_all_sandia cuda --spot-check --with-cuda-options=enable_lambda --with-tpls=cublas --arch=Power8,Pascal60

Running on machine: white
Going to test compilers:  cuda/9.2.88 cuda/10.0.130
Testing compiler cuda/9.2.88
Testing compiler cuda/10.0.130
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1177 run_time=287
cuda-9.2.88-Cuda_OpenMP-release build_time=1206 run_time=223
#######################################################
FAILED TESTS
#######################################################

ndellingwood · 2019-11-05T04:07:01Z

@seheracer please edit the PR to change the base to the develop branch instead of master, that should clean up a lot of the commit history.

srajama1

@seheracer Thanks a lot !

src/impl/tpls/KokkosBlas3_gemm_tpl_spec_decl.hpp

ndellingwood · 2019-11-06T03:43:40Z

Thanks @seheracer !

jhux2 · 2019-11-06T21:18:19Z

Thank you, @seheracer. Could this also be applied to Trilinos dev so that EMPIRE can use it soon?

seheracer · 2019-11-06T23:26:45Z

> Thank you, @seheracer. Could this also be applied to Trilinos dev so that EMPIRE can use it soon?

@jhux2, I pushed the same PR into Trilinos develop too: trilinos/Trilinos#6226

seheracer added 3 commits November 2, 2019 22:10

Dot Based GEMM for tall and skinny matrices

4e0955b

Merge remote-tracking branch 'origin/develop' into dot_based_gemm

4bf5762

Added some comments in DotBasedGEMM.

2527223

seheracer added the enhancement label Nov 5, 2019

seheracer requested a review from srajama1 November 5, 2019 00:14

seheracer self-assigned this Nov 5, 2019

seheracer changed the base branch from master to develop November 5, 2019 15:14

srajama1 approved these changes Nov 5, 2019

src/impl/tpls/KokkosBlas3_gemm_tpl_spec_decl.hpp

kyungjoo-kim reviewed Nov 5, 2019

src/impl/tpls/KokkosBlas3_gemm_tpl_spec_decl.hpp

mhoemmen reviewed Nov 5, 2019

Named some constants, changed typedefs to using, covered more members in the initializer list.

8ffa886

seheracer mentioned this pull request Nov 6, 2019

KokkosKernels: Optimized GEMM for A^TB with tall and skinny matrices on CUDA. trilinos/Trilinos#6226

Merged

ndellingwood merged commit f587956 into kokkos:develop Nov 6, 2019

ndellingwood mentioned this pull request Jan 28, 2020

Kokkos + KokkosKernels Promotion To 2.9.99 trilinos/Trilinos#6671

Merged
