Investigate half precision kernel performance #1075

Closed
5 tasks done
e10harvey opened this issue Aug 4, 2021 · 4 comments

e10harvey commented Aug 4, 2021

CC: @srajama1

When testing GMRES with half precision, @jennloe found that performance with half_t drops relative to single and double precision.

Steps:

e10harvey commented Sep 28, 2021

CC: @jennloe, @vqd8a, @srajama1

Investigate performance of batched GEMM with double, single, and half precision

Nsight shows the expected metrics when comparing batched GEMM with half precision to batched GEMM with single precision.
[Figures: gemm-square-82k-LL, gemm-square-82k-LR]
The average speedup for --test=batched_heuristic over matrix sizes in [2, 4, 6, ..., 52] is:

  • double -> float: ~1.5x speedup on average
  • float -> half_t: ~1.03x speedup on average

Half precision metrics

  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<KokkosBatched::Impl::BatchedDblBufGemm<KokkosBatched::Trans::NoTranspose,KokkosBatched::Trans::NoTranspose,KokkosBatched::BatchLayout::Left,KokkosBatched::BatchedGemmHandle,Kokkos::Experimental::half_t,Kokkos::View<Kokkos::Experimenta$
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                          36.93
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          41.60
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                          99.85
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_hadd_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_hfma_pred_on.sum                               inst                  5,536,481,280
    smsp__sass_thread_inst_executed_op_hmul_pred_on.sum                               inst                  5,368,709,120
    smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_               %                          40.12
    peak_sustained_elapsed
    ---------------------------------------------------------------------- --------------- ------------------------------

Single precision metrics

  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<KokkosBatched::Impl::BatchedDblBufGemm<KokkosBatched::Trans::NoTranspose,KokkosBatched::Trans::NoTranspose,KokkosBatched::BatchLayout::Left,KokkosBatched::BatchedGemmHandle,float,Kokkos::View<float***>,Kokkos::View<float***>,Kokkos::V$
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                          70.55
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          49.33
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                          99.87
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum                               inst                  5,536,481,280
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum                               inst                  5,368,709,120
    smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_               %                          37.57
    peak_sustained_elapsed
    ---------------------------------------------------------------------- --------------- ------------------------------

Double precision metrics

  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<KokkosBatched::Impl::BatchedDblBufGemm<KokkosBatched::Trans::NoTranspose,KokkosBatched::Trans::NoTranspose,KokkosBatched::BatchLayout::Left,KokkosBatched::BatchedGemmHandle,double,Kokkos::View<double***>,Kokkos::View<double***>,Kokkos::View<double***>,KokkosBatched::BoundsCheck::No,int=32,int=32,int=8>::__Functor<Kokkos::Impl::CudaTeamMember,int=4,int=4,int=8,int=8>,Kokkos::TeamPolicy<>,Kokkos::Cuda>>(KokkosBatched::Trans::NoTranspose), 2021-Sep-28 10:15:06, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                          81.93
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          29.62
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                          99.89
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_dadd_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_dfma_pred_on.sum                               inst                  5,536,481,280
    smsp__sass_thread_inst_executed_op_dmul_pred_on.sum                               inst                  5,368,709,120
    smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_               %                          44.65
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------
  • DRAM throughput for half_t is ~1/2 of the DRAM throughput for float (36.93% vs. 70.55% of peak).
  • Occupancy (sm__warps_active) is ~8 percentage points lower for half_t than for float (41.60% vs. 49.33%).
  • The {f,h}fma and {f,h}mul instruction counts are identical for half_t and float.
  • Flop efficiency is ~2.5 percentage points higher for half_t than for float (40.12% vs. 37.57%).

e10harvey commented Sep 28, 2021

CC: @jennloe, @vqd8a, @srajama1

Investigate performance of simple reduction with single and half precision

Timing output below shows a ~1.004x speedup for half_t over float using lsum+=one; (N=100000).
Timing output below shows a ~1.008x speedup for half_t over float using lsum*=T(0.42); (N=100000).

Using the following test provided by @crtrott, we compare half precision reductions with single precision reductions.

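#include <Kokkos_Core.hpp>
#include <cmath>
#include <cstdio>   // std::printf
#include <cstdlib>  // std::atoi

// Time R repetitions of an N-element parallel_reduce over scalar type T.
template<class T>
void run(int N, int R) {
  Kokkos::Timer timer;
  T result;
  const T one(1);
  // Iteration r == 0 is an untimed warm-up; the timer restarts after it.
  for(int r=0; r<=R; r++) {
    Kokkos::parallel_reduce("test", Kokkos::RangePolicy<Kokkos::Cuda>(0, N), KOKKOS_LAMBDA(int i, T& lsum) {
        lsum+=one;
      },result);
    if(r==0) timer.reset();
  }
  // Prints elapsed seconds, the reduction result, and sizeof(T).
  std::printf("Time: %lf %lf sizeof: %i\n",timer.seconds(),double(result),int(sizeof(T)));
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Usage: ./KokkosCore_reduce [N] [R]
    int N = argc > 1 ? std::atoi(argv[1]) : 10000;
    int R = argc > 2 ? std::atoi(argv[2]) : 10;

    run<Kokkos::Experimental::half_t>(N,R);
    Kokkos::fence();
    run<float>(N,R);
  }
  Kokkos::finalize();
}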

Using lsum+=one

$ export OMP_PROC_BIND=spread
$ export OMP_PLACES=threads
$ N=512; ./KokkosCore_reduce $N 100000
Time: 2.203284 512.000000 sizeof: 2
Time: 2.216436 512.000000 sizeof: 4
$ N=100000; ./KokkosCore_reduce $N 100000
Time: 2.306173 inf sizeof: 2
Time: 2.315804 100000.000000 sizeof: 4

Half_t - run<Kokkos::Experimental::half_t>(N,R);

_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_18CudaFunctorAdapterIZ3runINS_12Experimental6half_tEEviiEUliRS6_E_NS_11RangePolicyIJNS_4CudaEEEES6_vEESB_NS_11InvalidTypeESA_EEEEvT_, 2021-Sep-28 10:24:16, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                           0.08
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          11.91
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                           1.36
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_hadd_pred_on.sum                               inst                          3,877
    smsp__sass_thread_inst_executed_op_hfma_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_hmul_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_               %                           0.01
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------

Float - run<float>(N,R);

_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_18CudaFunctorAdapterIZ3runIfEviiEUliRfE_NS_11RangePolicyIJNS_4CudaEEEEfvEES9_NS_11InvalidTypeES8_EEEEvT_, 2021-Sep-28 10:24:17, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                           0.06
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          12.06
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                           1.41
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum                               inst                          3,877
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_               %                           0.01
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------

Using lsum*=T(0.42)

$ N=512; ./KokkosCore_reduce $N 100000
Time: 2.190250 1.000000 sizeof: 2
Time: 2.198491 1.000000 sizeof: 4
$ N=100000; ./KokkosCore_reduce $N 100000
Time: 2.286281 0.419922 sizeof: 2
Time: 2.305678 0.420000 sizeof: 4

Half_t - run<Kokkos::Experimental::half_t>(N,R);

_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_18CudaFunctorAdapterIZ3runINS_12Experimental6half_tEEviiEUliRS6_E_NS_11RangePolicyIJNS_4CudaEEEES6_vEESB_NS_11InvalidTypeESA_EEEEvT_, 2021-Sep-28 10:35:52, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                           0.08
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          11.80
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                           1.36
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_hadd_pred_on.sum                               inst                          3,365
    smsp__sass_thread_inst_executed_op_hfma_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_hmul_pred_on.sum                               inst                            512
    smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_               %                           0.01
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------

Float - run<float>(N,R);

_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_18CudaFunctorAdapterIZ3runIfEviiEUliRfE_NS_11RangePolicyIJNS_4CudaEEEEfvEES9_NS_11InvalidTypeES8_EEEEvT_, 2021-Sep-28 10:35:54, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__throughput.avg.pct_of_peak_sustained_elapsed                                   %                           0.06
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          12.05
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                                %                           1.41
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                                                    (!) n/a
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum                               inst                          3,365
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum                               inst                              0
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum                               inst                            512
    smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_               %                           0.01
    peak_sustained_elapsed                                                                                               
    ---------------------------------------------------------------------- --------------- ------------------------------

Note that the PTX shows scalar f16.op instructions being generated rather than packed f16x2.op instructions.

e10harvey commented Sep 28, 2021

CC: @jennloe, @srajama1, @vqd8a

Investigate BLAS GEMV performance with single and half precision prior to #1082

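Below we see that as n increases to 400 and beyond, half_t provides a speedup over float: half_t is slightly slower at n = 10, ~1.02x faster at n = 400, and ~1.18x faster at n = 1000.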

Using the tip of kokkos develop and https://github.com/e10harvey/kokkos-kernels/tree/revert-1082 with the following local change:

$ git diff
diff --git a/perf_test/blas/blas2/KokkosBlas2_gemv_perf_test.cpp b/perf_test/blas/blas2/KokkosBlas2_gemv_perf_test.cpp
index 1ad1289..408f5a5 100644
--- a/perf_test/blas/blas2/KokkosBlas2_gemv_perf_test.cpp
+++ b/perf_test/blas/blas2/KokkosBlas2_gemv_perf_test.cpp
@@ -122,7 +122,7 @@ void run(int m, int n, int repeat)
   using Scalar = double;
   using MemSpace = typename ExecSpace::memory_space;
   using Device = Kokkos::Device<ExecSpace, MemSpace>;
-  std::cout << "Running GEMV experiment (" << ExecSpace::name() << ")\n";
+  std::cout << "Running GEMV experiment (" << ExecSpace::name() << ") - " << typeid(Scalar).name() << "\n";
   Kokkos::View<Scalar**, Layout, Device> A(Kokkos::view_alloc(Kokkos::WithoutInitializing, "A"), m, n);
   Kokkos::View<Scalar*, Device> x(Kokkos::view_alloc(Kokkos::WithoutInitializing, "x"), n);
   Kokkos::View<Scalar*, Device> y(Kokkos::view_alloc(Kokkos::WithoutInitializing, "y"), m);

we see the following GEMV timing:

Half_t

$ ./KokkosBlas2_gemv_perf_test.h --m 1000000 --n 10 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - N6Kokkos12Experimental6half_tE
Avg GEMV time: 0.002363 s.
Avg GEMV FLOP/s: 8.465e+09

$ ./KokkosBlas2_gemv_perf_test.h --m 1000000 --n 400 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - N6Kokkos12Experimental6half_tE
Avg GEMV time: 0.002980 s.
Avg GEMV FLOP/s: 2.685e+11

$ ./KokkosBlas2_gemv_perf_test.h --m 1000000 --n 1000 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - N6Kokkos12Experimental6half_tE
Avg GEMV time: 0.004403 s.
Avg GEMV FLOP/s: 4.543e+11

Float

$ ./KokkosBlas2_gemv_perf_test.f --m 1000000 --n 10 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - f
Avg GEMV time: 0.002312 s.
Avg GEMV FLOP/s: 8.652e+09

$ ./KokkosBlas2_gemv_perf_test.f --m 1000000 --n 400 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - f
Avg GEMV time: 0.003037 s.
Avg GEMV FLOP/s: 2.634e+11

$ ./KokkosBlas2_gemv_perf_test.f --m 1000000 --n 1000 --layout right --cuda 0 --repeat 1000
Running GEMV experiment (Cuda) - f
Avg GEMV time: 0.005188 s.
Avg GEMV FLOP/s: 3.855e+11

@e10harvey

GMRES should be investigated with bfloat16 rather than float16. This will require adding a bhalf_t type to Kokkos.
