@jennloe saw a substantial speedup in the MultiVector * MultiVector product by calling gemv when the RHS has only one column. This could be implemented easily in KokkosKernels by adding a special path in `KokkosBlas::gemm()` that calls gemv instead (outside of the unification layer). We just have to find the right heuristics for when this should be done. The things to consider are TPLs (Jennifer's results were with cuBLAS gemm), the layout of the LHS matrix, and the dimensions.
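The special path described above could look roughly like the following. This is a minimal, self-contained sketch in plain C++, not the KokkosKernels implementation: the `Matrix` struct, the `Path` enum, and the `gemv`, `gemm_general`, and `gemm` functions here are hypothetical stand-ins, the storage is assumed to be column-major (LayoutLeft-like), and transpose modes and TPL selection are left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major dense matrix (a stand-in, not a Kokkos::View).
struct Matrix {
  std::size_t rows, cols;
  std::vector<double> data;  // element (i, j) lives at data[i + j * rows]
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i + j * rows]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i + j * rows]; }
};

// Reports which kernel the dispatcher picked (only for illustration).
enum class Path { Gemv, Gemm };

// Stand-in GEMV: y = alpha * A * x + beta * y.
void gemv(double alpha, const Matrix& A, const double* x, double beta, double* y) {
  for (std::size_t i = 0; i < A.rows; ++i) y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
  for (std::size_t j = 0; j < A.cols; ++j)
    for (std::size_t i = 0; i < A.rows; ++i) y[i] += alpha * A(i, j) * x[j];
}

// Stand-in general GEMM: C = alpha * A * B + beta * C.
void gemm_general(double alpha, const Matrix& A, const Matrix& B, double beta, Matrix& C) {
  for (double& c : C.data) c = (beta == 0.0) ? 0.0 : beta * c;
  for (std::size_t j = 0; j < B.cols; ++j)
    for (std::size_t k = 0; k < A.cols; ++k)
      for (std::size_t i = 0; i < A.rows; ++i) C(i, j) += alpha * A(i, k) * B(k, j);
}

// Entry point with the special path: a single-column RHS is forwarded to GEMV.
// The only heuristic here is "B has one column"; layout and TPL checks are omitted.
Path gemm(double alpha, const Matrix& A, const Matrix& B, double beta, Matrix& C) {
  assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
  if (B.cols == 1) {
    // With one column, B and C are contiguous vectors in column-major storage.
    gemv(alpha, A, B.data.data(), beta, C.data.data());
    return Path::Gemv;
  }
  gemm_general(alpha, A, B, beta, C);
  return Path::Gemm;
}
```

Calling `gemm(1.0, A, B, 0.0, C)` with a one-column `B` takes the GEMV path, while any wider `B` falls through to the general kernel. A real heuristic would presumably also look at the layouts and the enabled TPLs, which is the open question in this issue.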
brian-kelley changed the title from "KokkosBlas::gemm should use a gemv kernel if the 2nd arg has only 1 column" to "KokkosBlas::gemm should use a gemv kernel if the RHS has only 1 column" on Apr 6, 2021.
Performance measurements on V100, double precision. A (LHS) is m x n, B (RHS) is n x 1, with 1 <= n <= 50 in the ICGS orthogonalization use case.
Here m = 1 million, testing only n = 1 and n = 50.
n = 1

| Layouts | Flops (KK GEMM) | Flops (cuBLAS GEMM) | Flops (KK GEMV) | Flops (cuBLAS GEMV) |
| --- | --- | --- | --- | --- |
| A LayoutLeft, B LayoutLeft | 9.005e+08 | 2.878e+10 | 3.111e+10 | 5.425e+10 |
| A LayoutLeft, B LayoutRight | 1.017e+09 | 1.007e+09 | | |
| A LayoutRight, B LayoutLeft | 9.144e+08 | 9.140e+08 | 1.263e+09 | 1.263e+09 |
| A LayoutRight, B LayoutRight | 1.051e+09 | 1.051e+09 | | |

n = 50

| Layouts | Flops (KK GEMM) | Flops (cuBLAS GEMM) | Flops (KK GEMV) | Flops (cuBLAS GEMV) |
| --- | --- | --- | --- | --- |
| A LayoutLeft, B LayoutLeft | 1.443e+10 | 1.978e+11 | 1.143e+11 | 1.978e+11 |
| A LayoutLeft, B LayoutRight | 1.675e+10 | 1.675e+10 | | |
| A LayoutRight, B LayoutLeft | 1.501e+10 | 1.500e+10 | 6.315e+10 | 6.314e+10 |
| A LayoutRight, B LayoutRight | 1.794e+10 | 1.794e+10 | | |

This suggests that cuBLAS GEMM isn't even getting called except in the LayoutLeft/LayoutLeft case, and that using GEMV instead gives a significant improvement in every case except cuBLAS with n = 50 and left/left layouts, where it seems cuBLAS is already using GEMV in the k=1 case.
See discussion at trilinos/Trilinos#8923