Investigate half precision kernel performance #1075
CC: @jennloe, @vqd8a, @srajama1
Investigate performance of batched GEMM with double, single, and half precision
Nsight shows the expected metrics when comparing batched GEMM with half precision to batched GEMM with single precision.
Half precision metrics
Single precision metrics
Double precision metrics
CC: @jennloe, @vqd8a, @srajama1
Investigate performance of simple reduction with single and half precision
Timing output below shows ~1.004x speedup for half_t over float using
Using the following test provided by @crtrott, we compare half precision reductions with single precision reductions.
#include <Kokkos_Core.hpp>
#include <cmath>
#include <cstdio>   // printf
#include <cstdlib>  // atoi
// Sum N ones of type T on the GPU, repeated R+1 times; only the last R runs are timed.
template<class T>
void run(int N, int R) {
  Kokkos::Timer timer;
  T result;
  const T one(1);
  for(int r=0; r<=R; r++) {
    Kokkos::parallel_reduce("test", Kokkos::RangePolicy<Kokkos::Cuda>(0, N), KOKKOS_LAMBDA(int i, T& lsum) {
      lsum+=one;
    },result);
    // Iteration 0 is a warm-up: restart the timer once it has finished.
    if(r==0) timer.reset();
  }
  // Prints elapsed seconds, the reduction result, and the size of T in bytes.
  printf("Time: %lf %lf sizeof: %i\n",timer.seconds(),double(result),int(sizeof(T)));
}
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Optional arguments: N (reduction length) and R (timed repetitions).
    int N = argc > 1 ? atoi(argv[1]) : 10000;
    int R = argc > 2 ? atoi(argv[2]) : 10;
    run<Kokkos::Experimental::half_t>(N,R);
    Kokkos::fence();
    run<float>(N,R);
  }
  Kokkos::finalize();
}
Using
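The timing pattern in the test above (one untimed warm-up iteration, then R timed repetitions) and the way a figure such as "~1.004x" is usually derived can be sketched on the host without Kokkos. This is only an illustrative sketch: `TimedResult`, `time_sum_of_ones`, and `speedup` are hypothetical names that do not appear in the issue, the loop is serial rather than a `parallel_reduce`, and the speedup is assumed to be the baseline time divided by the candidate time.

```cpp
#include <chrono>

// Hypothetical host-only helper types and functions; not part of Kokkos.
struct TimedResult {
  double seconds;  // wall time of the R timed repetitions
  double value;    // sum produced by the last repetition
};

// Sums N ones of type T, R+1 times. Iteration 0 is treated as a warm-up,
// mirroring the `if(r==0) timer.reset();` line in the test above.
template <class T>
TimedResult time_sum_of_ones(int N, int R) {
  using clock = std::chrono::steady_clock;
  const T one(1);
  T result(0);
  auto start = clock::now();
  for (int r = 0; r <= R; ++r) {
    T sum(0);
    for (int i = 0; i < N; ++i) sum += one;
    result = sum;
    if (r == 0) start = clock::now();  // discard the warm-up iteration
  }
  const std::chrono::duration<double> elapsed = clock::now() - start;
  return {elapsed.count(), static_cast<double>(result)};
}

// Assumed definition: speedup of a candidate over a baseline.
inline double speedup(double baseline_seconds, double candidate_seconds) {
  return baseline_seconds / candidate_seconds;
}
```

With this definition, a half_t-over-float speedup would be computed as `speedup(float_seconds, half_seconds)`, so a value close to 1 means the two precisions take roughly the same time.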
CC: @jennloe, @srajama1, @vqd8a
Investigate BLAS GEMV performance with single and half precision prior to #1082
Below we see that as
Using the tip of Kokkos develop and https://github.com/e10harvey/kokkos-kernels/tree/revert-1082 with the following local change:
we see the following GEMV timing:
Half_t
Float
GMRES should be investigated with bfloat16 rather than float16. This will require the addition of a
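For context on how the two 16-bit formats differ: bfloat16 keeps the 8 exponent bits of a 32-bit float and stores only 7 mantissa bits, while IEEE float16 has 5 exponent bits and 10 mantissa bits (largest finite value 65504). The issue does not say what must be added to support bfloat16, so the sketch below only illustrates the format itself. The helper names `float_to_bf16_bits` and `bf16_bits_to_float` are hypothetical and are not Kokkos or Kokkos Kernels API.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: convert a float to bfloat16 storage bits
// using round-to-nearest-even on the discarded low 16 bits.
inline std::uint16_t float_to_bf16_bits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  const bool is_nan = (u & 0x7F800000u) == 0x7F800000u && (u & 0x007FFFFFu) != 0;
  if (is_nan) {
    // Set a mantissa bit that survives truncation so the value stays a NaN.
    return static_cast<std::uint16_t>((u >> 16) | 0x0040u);
  }
  u += 0x7FFFu + ((u >> 16) & 1u);  // round to nearest, ties to even
  return static_cast<std::uint16_t>(u >> 16);
}

// Hypothetical helper: widen bfloat16 storage bits back to a float.
// This is exact, because bfloat16 is the upper half of a float.
inline float bf16_bits_to_float(std::uint16_t b) {
  const std::uint32_t u = static_cast<std::uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}
```

For example, 65536.0f round-trips through these helpers unchanged even though it exceeds the float16 range, whereas 1.00390625f (1 + 2^-8) rounds to 1.0f because only 7 mantissa bits are kept.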
CC: @srajama1
When testing GMRES with half precision, @jennloe found that performance drops with half_t compared to single and double precision.
Steps: