Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

two-level parallel version of transpose GEMV #514

Merged
merged 3 commits into from
Dec 13, 2019

Conversation

iyamazaki
Copy link
Contributor

No description provided.

Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to write an optimized GEMV? Why can't we just call cuBLAS for CUDA and the BLAS for non-CUDA, and have a slow fall-back for unsupported types?

const bool conj,
class IndexType = typename AViewType::size_type>
struct TwoLevelTransposeGEMV {
typedef typename YViewType::non_const_value_type y_value_type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer using new_type = old_type; C++11 type alias syntax to typedef old_type new_type;.

@ndellingwood
Copy link
Contributor

Cross-referencing #443
@mhoemmen I believe the intention of the PR is to resolve issues with the fallback impl in transpose mode, the 2-level parallelism approach seemed a natural way to implement.

@mhoemmen
Copy link
Contributor

@ndellingwood wrote:

I believe the intention of the PR is to resolve issues with the fallback impl in transpose mode, the 2-level parallelism approach seemed a natural way to implement.

Cool, I'm OK with it as long as it's fixing a bug.

Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK. The usual kokkos-kernels idiom that Christian has been promoting, is to reduce the number of instantiations of the functor, by having the function that calls the functor assign the input Views to "canonical" View types. For example, the (outer) function might be templated on ViewType, but if the functor really only needs View<const T*, AnonymousSpace>, it would be best to have the functor be templated only on T.

src/blas/impl/KokkosBlas2_gemv_impl.hpp Show resolved Hide resolved
@iyamazaki
Copy link
Contributor Author

Thank you for the comments @mhoemmen. I am running the spot-check on white, but it is taking a long time @ndellingwood.

@ndellingwood
Copy link
Contributor

@iyamazaki spot-check on kokkos-dev-2 passed, merging in. Thanks!

Running on machine: kokkos-dev-2
Going to test compilers:  gcc/7.3.0 gcc/9.1 intel/18.0.5 clang/8.0 cuda/10.1
Testing compiler gcc/7.3.0
Testing compiler gcc/9.1
  Starting job gcc-7.3.0-OpenMP-release
  Starting job gcc-7.3.0-Pthread-release
  PASSED gcc-7.3.0-OpenMP-release
  Starting job gcc-9.1-OpenMP-release
  PASSED gcc-7.3.0-Pthread-release
Testing compiler intel/18.0.5
  Starting job gcc-9.1-Serial-release
  PASSED gcc-9.1-OpenMP-release
Testing compiler clang/8.0
  Starting job intel-18.0.5-OpenMP-release
  PASSED gcc-9.1-Serial-release
  Starting job clang-8.0-Cuda_OpenMP-release
  PASSED intel-18.0.5-OpenMP-release
Testing compiler cuda/10.1
  Starting job clang-8.0-Pthread_Serial-release
  PASSED clang-8.0-Cuda_OpenMP-release
  PASSED clang-8.0-Pthread_Serial-release
  Starting job cuda-10.1-Cuda_OpenMP-release
  PASSED cuda-10.1-Cuda_OpenMP-release
#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=567 run_time=712
clang-8.0-Pthread_Serial-release build_time=356 run_time=883
cuda-10.1-Cuda_OpenMP-release build_time=626 run_time=660
gcc-7.3.0-OpenMP-release build_time=218 run_time=264
gcc-7.3.0-Pthread-release build_time=203 run_time=501
gcc-9.1-OpenMP-release build_time=186 run_time=235
gcc-9.1-Serial-release build_time=157 run_time=486
intel-18.0.5-OpenMP-release build_time=301 run_time=292
#######################################################

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants