Investigate half precision kernel performance #1075

e10harvey · 2021-08-04T13:44:12Z

When testing GMRES with half precision, @jennloe found that performance with half_t drops relative to single and double precision.

Steps:

- Investigate performance of batched GEMM with double, single, and half precision (related to "Benchmark new batched GEMM interface", #933)
- Investigate performance of reductions with half precision
  - Investigate performance of simple reduction with single and half precision
  - Investigate blas GEMV performance with single and half precision prior to "Use float as accumulator for GEMV on half_t (Fix #1081)", #1082
- Investigate performance of GMRES with double, single, and half precision

e10harvey · 2021-09-28T16:12:46Z

CC: @jennloe, @vqd8a, @srajama1

Investigate performance of batched GEMM with double, single, and half precision

Nsight shows the expected metrics when comparing batched GEMM with half precision to batched GEMM with single precision.
The average speedup for --test=batched_heuristic over matrix sizes in [2, 4, 6, ..., 52] is:

double -> float: ~1.5x speedup on average
float -> half_t: ~1.03x speedup on average

Half precision metrics

  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<KokkosBatched::Impl::BatchedDblBufGemm<KokkosBatched::Trans::NoTranspose,KokkosBatched::Trans::NoTranspose,KokkosBatched::BatchLayout::Left,KokkosBatched::BatchedGemmHandle,Kokkos::Experimental::half_t,Kokkos::View<Kokkos::Experimenta$
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                          36.93
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          41.60
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                          99.85
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_hadd_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_hfma_pred_on.sum                               inst                  5,536,481,280
    smsp__sass_thread_inst_executed_op_hmul_pred_on.sum                               inst                  5,368,709,120
    smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_               %                          40.12
    peak_sustained_elapsed
    ---------------------------------------------------------------------- --------------- ------------------------------

Single precision metrics

  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<KokkosBatched::Impl::BatchedDblBufGemm<KokkosBatched::Trans::NoTranspose,KokkosBatched::Trans::NoTranspose,KokkosBatched::BatchLayout::Left,KokkosBatched::BatchedGemmHandle,float,Kokkos::View<float***>,Kokkos::View<float***>,Kokkos::V$
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                          70.55
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          49.33
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                          99.87
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum                               inst                  5,536,481,280
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum                               inst                  5,368,709,120
    smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_               %                          37.57
    peak_sustained_elapsed
    ---------------------------------------------------------------------- --------------- ------------------------------

Double precision metrics

  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<KokkosBatched::Impl::BatchedDblBufGemm<KokkosBatched::Trans::NoTranspose,KokkosBatched::Trans::NoTranspose,KokkosBatched::BatchLayout::Left,KokkosBatched::BatchedGemmHandle,double,Kokkos::View<double***>,Kokkos::View<double***>,Kokkos::View<double***>,KokkosBatched::BoundsCheck::No,int=32,int=32,int=8>::__Functor<Kokkos::Impl::CudaTeamMember,int=4,int=4,int=8,int=8>,Kokkos::TeamPolicy<>,Kokkos::Cuda>>(KokkosBatched::Trans::NoTranspose), 2021-Sep-28 10:15:06, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                          81.93
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          29.62
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                          99.89
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_dadd_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_dfma_pred_on.sum                               inst                  5,536,481,280
    smsp__sass_thread_inst_executed_op_dmul_pred_on.sum                               inst                  5,368,709,120
    smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_               %                          44.65
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------

DRAM throughput for half_t is ~1/2 the DRAM throughput for float (36.93% vs. 70.55% of peak).
Occupancy is ~8 percentage points lower for half_t than for float (41.60% vs. 49.33%).
The {f,h}fma and {f,h}mul instruction counts are identical for half_t and float.
Flop efficiency is ~2.5 percentage points higher for half_t than for float (40.12% vs. 37.57%).
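
As a cross-check of the summary above, the comparison can be recomputed from the profiler values. This is only a sketch: the struct, constants, and helper names are made up for illustration, and the numbers are copied by hand from the three metric tables.

```cpp
#include <cassert>
#include <cstdio>

// Percent-of-peak values copied by hand from the profiler output above.
// The type and field names are hypothetical, chosen for this sketch only.
struct KernelMetrics {
  const char* scalar;
  double dram_throughput_pct;  // dram__throughput.avg.pct_of_peak_sustained_elapsed
  double warps_active_pct;     // sm__warps_active.avg.pct_of_peak_sustained_active
  double flop_pipe_pct;        // ..._ops_{h,f,d}add_{h,f,d}mul_{h,f,d}fma_pred_on...
};

constexpr KernelMetrics kHalf{"half_t", 36.93, 41.60, 40.12};
constexpr KernelMetrics kFloat{"float", 70.55, 49.33, 37.57};
constexpr KernelMetrics kDouble{"double", 81.93, 29.62, 44.65};

// Ratio of two percent-of-peak values.
inline double ratio(double a, double b) {
  assert(b > 0.0);
  return a / b;
}

// Difference of two percent-of-peak values, in percentage points.
inline double point_diff(double a, double b) { return a - b; }

// Prints the three half_t vs. float comparisons quoted in the text.
inline void print_half_vs_float() {
  std::printf("DRAM throughput, half_t / float: %.2f\n",
              ratio(kHalf.dram_throughput_pct, kFloat.dram_throughput_pct));
  std::printf("Occupancy, float - half_t: %.2f points\n",
              point_diff(kFloat.warps_active_pct, kHalf.warps_active_pct));
  std::printf("Flop efficiency, half_t - float: %.2f points\n",
              point_diff(kHalf.flop_pipe_pct, kFloat.flop_pipe_pct));
}
```

Calling `print_half_vs_float()` from a `main` gives the ratio and the two percentage-point differences. The occupancy and flop efficiency figures are differences in points, not relative changes.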

e10harvey · 2021-09-28T16:59:04Z

CC: @jennloe, @vqd8a, @srajama1

Investigate performance of simple reduction with single and half precision

The timing output below shows a ~1.004x speedup for half_t over float using lsum+=one; (N=100000).
The timing output below shows a ~1.008x speedup for half_t over float using lsum*=T(0.42); (N=100000).

Using the following test provided by @crtrott, we compare half precision reductions with single precision reductions.

#include <Kokkos_Core.hpp>
#include <cmath>
#include <cstdio>   // printf
#include <cstdlib>  // atoi

template<class T>
void run(int N, int R) {
  Kokkos::Timer timer;
  T result;
  const T one(1);
  for(int r=0; r<=R; r++) {
    Kokkos::parallel_reduce("test", Kokkos::RangePolicy<Kokkos::Cuda>(0, N), KOKKOS_LAMBDA(int i, T& lsum) {
        lsum+=one;
      },result);
    if(r==0) timer.reset();  // iteration 0 is a warm-up; only the remaining R iterations are timed
  }
  printf("Time: %lf %lf sizeof: %i\n",timer.seconds(),double(result),int(sizeof(T)));
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {

    int N = argc > 1 ? atoi(argv[1]) : 10000;
    int R = argc > 2 ? atoi(argv[2]) : 10;

    run<Kokkos::Experimental::half_t>(N,R);
    Kokkos::fence();
    run<float>(N,R);
  }
  Kokkos::finalize();
}

Using `lsum+=one`

$ export OMP_PROC_BIND=spread
$ export OMP_PLACES=threads
$ N=512; ./KokkosCore_reduce $N 100000
Time: 2.203284 512.000000 sizeof: 2
Time: 2.216436 512.000000 sizeof: 4
$ N=100000; ./KokkosCore_reduce $N 100000
Time: 2.306173 inf sizeof: 2
Time: 2.315804 100000.000000 sizeof: 4

Half_t - `run<Kokkos::Experimental::half_t>(N,R);`

_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_18CudaFunctorAdapterIZ3runINS_12Experimental6half_tEEviiEUliRS6_E_NS_11RangePolicyIJNS_4CudaEEEES6_vEESB_NS_11InvalidTypeESA_EEEEvT_, 2021-Sep-28 10:24:16, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                           0.08
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          11.91
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                           1.36
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_hadd_pred_on.sum                               inst                          3,877
    smsp__sass_thread_inst_executed_op_hfma_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_hmul_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_               %                           0.01
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------

Float - `run<float>(N,R);`

_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_18CudaFunctorAdapterIZ3runIfEviiEUliRfE_NS_11RangePolicyIJNS_4CudaEEEEfvEES9_NS_11InvalidTypeES8_EEEEvT_, 2021-Sep-28 10:24:17, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                           0.06
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          12.06
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                           1.41
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum                               inst                          3,877
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_               %                           0.01
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------

Using `lsum*=T(0.42)`

$ N=512; ./KokkosCore_reduce $N 100000
Time: 2.190250 1.000000 sizeof: 2
Time: 2.198491 1.000000 sizeof: 4
$ N=100000; ./KokkosCore_reduce $N 100000
Time: 2.286281 0.419922 sizeof: 2
Time: 2.305678 0.420000 sizeof: 4
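
Two of the half_t results printed above (`inf` for the sum with N=100000, and `0.419922` instead of `0.420000`) are consistent with the limits of IEEE 754 binary16. The sketch below rounds a double to the nearest binary16 value on the host. `round_to_half` is a hypothetical helper written for this illustration and is not part of Kokkos. It assumes half_t behaves like binary16 and that the default round-to-nearest mode is active.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Largest finite IEEE 754 binary16 value: (2 - 2^-10) * 2^15.
constexpr double kHalfMax = 65504.0;

// Hypothetical helper: round x to the nearest binary16-representable value,
// returning it as a double. Relies on the default round-to-nearest-even mode.
inline double round_to_half(double x) {
  if (x == 0.0 || !std::isfinite(x)) return x;
  int exp = 0;
  const double mant = std::frexp(x, &exp);  // x = mant * 2^exp, 0.5 <= |mant| < 1
  if (exp < -13) {
    // Below the smallest normal value (2^-14) the spacing is fixed at 2^-24.
    return std::ldexp(std::nearbyint(std::ldexp(x, 24)), -24);
  }
  // binary16 keeps 11 significant bits (1 implicit + 10 stored).
  const double rounded = std::nearbyint(std::ldexp(mant, 11));
  const double result = std::ldexp(rounded, exp - 11);
  if (std::fabs(result) > kHalfMax) {
    // Anything that rounds past the largest finite value overflows.
    return std::copysign(std::numeric_limits<double>::infinity(), x);
  }
  return result;
}
```

With this helper, `round_to_half(100000.0)` overflows to infinity because 100000 exceeds 65504, and `round_to_half(0.42)` gives 0.419921875, which prints as `0.419922` with `%lf`. The N=512 sum is unaffected, since 512 is exactly representable.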

Half_t - `run<Kokkos::Experimental::half_t>(N,R);`

_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_18CudaFunctorAdapterIZ3runINS_12Experimental6half_tEEviiEUliRS6_E_NS_11RangePolicyIJNS_4CudaEEEES6_vEESB_NS_11InvalidTypeESA_EEEEvT_, 2021-Sep-28 10:35:52, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                           0.08
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          11.80
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                           1.36
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_hadd_pred_on.sum                               inst                          3,365
    smsp__sass_thread_inst_executed_op_hfma_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_hmul_pred_on.sum                               inst                            512
    smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_               %                           0.01
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------

Float - `run<float>(N,R);`

_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_18CudaFunctorAdapterIZ3runIfEviiEUliRfE_NS_11RangePolicyIJNS_4CudaEEEEfvEES9_NS_11InvalidTypeES8_EEEEvT_, 2021-Sep-28 10:35:54, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                           0.06
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          12.05
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                           1.41
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum                               inst                          3,365
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum                               inst                            512
    smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_               %                           0.01
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------
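
The speedups quoted at the top of this comment follow from the wall-clock times in the timing output. A minimal sketch of that arithmetic, with the times copied by hand and all names (`TimingPair`, `kReductionTimings`, `speedup`) invented for the example:

```cpp
#include <cassert>
#include <cstdio>

// One half_t/float pair of timings (seconds) from the runs above.
struct TimingPair {
  const char* label;
  double half_s;
  double float_s;
};

// Values copied by hand from the "Time:" lines in this comment.
constexpr TimingPair kReductionTimings[] = {
    {"lsum+=one, N=512", 2.203284, 2.216436},
    {"lsum+=one, N=100000", 2.306173, 2.315804},
    {"lsum*=T(0.42), N=512", 2.190250, 2.198491},
    {"lsum*=T(0.42), N=100000", 2.286281, 2.305678},
};

// Speedup of a candidate over a baseline: values above 1 mean the candidate is faster.
inline double speedup(double baseline_s, double candidate_s) {
  assert(candidate_s > 0.0);
  return baseline_s / candidate_s;
}

// Prints the half_t-over-float speedup for every pair.
inline void print_reduction_speedups() {
  for (const TimingPair& t : kReductionTimings) {
    std::printf("%-26s half_t over float: %.4fx\n", t.label,
                speedup(t.float_s, t.half_s));
  }
}
```

All four cases come out below 1.01x, so in these runs half_t and float reductions take essentially the same time.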

Note that the PTX shows generation of scalar f16.op instructions rather than packed f16x2.op instructions.

e10harvey · 2021-09-28T18:09:43Z

CC: @jennloe, @srajama1, @vqd8a

Investigate blas GEMV performance with single and half precision prior to #1082

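Below we see that half_t is slightly slower than float at n = 10 (~0.98x), roughly even at n = 400 (~1.02x), and faster at n = 1000 (~1.18x), so the speedup from half_t appears as n grows to 400 and beyond.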

Using the tip of kokkos develop and https://github.com/e10harvey/kokkos-kernels/tree/revert-1082 with the following local change:

$ git diff
diff --git a/perf_test/blas/blas2/KokkosBlas2_gemv_perf_test.cpp b/perf_test/blas/blas2/KokkosBlas2_gemv_perf_test.cpp
index 1ad1289..408f5a5 100644
--- a/perf_test/blas/blas2/KokkosBlas2_gemv_perf_test.cpp
+++ b/perf_test/blas/blas2/KokkosBlas2_gemv_perf_test.cpp
@@ -122,7 +122,7 @@ void run(int m, int n, int repeat)
   using Scalar = double;
   using MemSpace = typename ExecSpace::memory_space;
   using Device = Kokkos::Device<ExecSpace, MemSpace>;
-  std::cout << "Running GEMV experiment (" << ExecSpace::name() << ")\n";
+  std::cout << "Running GEMV experiment (" << ExecSpace::name() << ") - " << typeid(Scalar).name() << "\n";
   Kokkos::View<Scalar**, Layout, Device> A(Kokkos::view_alloc(Kokkos::WithoutInitializing, "A"), m, n);
   Kokkos::View<Scalar*, Device> x(Kokkos::view_alloc(Kokkos::WithoutInitializing, "x"), n);
   Kokkos::View<Scalar*, Device> y(Kokkos::view_alloc(Kokkos::WithoutInitializing, "y"), m);

we see the following GEMV timing:

Half_t

$ ./KokkosBlas2_gemv_perf_test.h --m 1000000 --n 10 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - N6Kokkos12Experimental6half_tE
Avg GEMV time: 0.002363 s.
Avg GEMV FLOP/s: 8.465e+09

$ ./KokkosBlas2_gemv_perf_test.h --m 1000000 --n 400 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - N6Kokkos12Experimental6half_tE
Avg GEMV time: 0.002980 s.
Avg GEMV FLOP/s: 2.685e+11

$ ./KokkosBlas2_gemv_perf_test.h --m 1000000 --n 1000 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - N6Kokkos12Experimental6half_tE
Avg GEMV time: 0.004403 s.
Avg GEMV FLOP/s: 4.543e+11

Float

$ ./KokkosBlas2_gemv_perf_test.f --m 1000000 --n 10 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - f
Avg GEMV time: 0.002312 s.
Avg GEMV FLOP/s: 8.652e+09

$ ./KokkosBlas2_gemv_perf_test.f --m 1000000 --n 400 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - f
Avg GEMV time: 0.003037 s.
Avg GEMV FLOP/s: 2.634e+11

$ ./KokkosBlas2_gemv_perf_test.f --m 1000000 --n 1000 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - f
Avg GEMV time: 0.005188 s.
Avg GEMV FLOP/s: 3.855e+11
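
The reported FLOP/s figures are consistent with counting 2*m*n floating point operations per GEMV (one multiply and one add per matrix entry). That count is inferred from the numbers above, not taken from the perf test source. The sketch below recomputes the rates and the half_t-over-float speedup from the printed average times. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Assumed flop count for y = A*x with an m-by-n matrix A:
// one multiply and one add per matrix entry.
inline double gemv_flops(double m, double n) { return 2.0 * m * n; }

// FLOP/s for a single GEMV that took `seconds` on average.
inline double gemv_flop_rate(double m, double n, double seconds) {
  assert(seconds > 0.0);
  return gemv_flops(m, n) / seconds;
}

// Speedup of half_t over float for the same problem size.
inline double half_over_float(double float_s, double half_s) {
  assert(half_s > 0.0);
  return float_s / half_s;
}

// Prints rate and speedup for the three problem sizes run above (m = 1000000).
inline void print_gemv_summary() {
  const double m = 1000000.0;
  const double n[] = {10.0, 400.0, 1000.0};
  const double half_s[] = {0.002363, 0.002980, 0.004403};   // from the Half_t runs
  const double float_s[] = {0.002312, 0.003037, 0.005188};  // from the Float runs
  for (int i = 0; i < 3; ++i) {
    std::printf("n=%4.0f  half_t %.3e FLOP/s  float %.3e FLOP/s  speedup %.3fx\n",
                n[i], gemv_flop_rate(m, n[i], half_s[i]),
                gemv_flop_rate(m, n[i], float_s[i]),
                half_over_float(float_s[i], half_s[i]));
  }
}
```

The recomputed rates differ from the printed ones only in the last digit, which is expected because the average times are printed to six decimal places.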

e10harvey · 2021-09-30T15:22:13Z

GMRES should be investigated with bfloat16 rather than float16. This will require adding a bhalf_t type to Kokkos.

e10harvey added the help wanted label Aug 4, 2021

e10harvey self-assigned this Aug 4, 2021

jennloe mentioned this issue Aug 11, 2021

Very Slow GEMV performance for Kokkos::Experimental::half_t #1081

Closed

e10harvey closed this as completed Sep 30, 2021
