Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kokkos::ArithTraits<double>::nan() is very slow #35

Closed
ibaned opened this issue Jul 6, 2017 · 11 comments
Closed

Kokkos::ArithTraits<double>::nan() is very slow #35

ibaned opened this issue Jul 6, 2017 · 11 comments

Comments

@ibaned
Copy link
Contributor

ibaned commented Jul 6, 2017

I just got a report from Albany users (@lxmota and @calleman21). They switched from Teuchos::ScalarTraits<double>::nan() to Kokkos::ArithTraits<double>::nan() which slowed down their entire application by over 3X because all variables are initialized to NaN using this function.

KokkosKernels uses strtod() to implement this function (on the host), while Teuchos returns a global variable which is initialized to 0.0/0.0. @lxmota also recommended that we simply call std::numeric_traits<double>::quiet_NaN().

All the above also applies to float.

I think we should switch to either what Teuchos does or the quiet_NaN() from the standard library.

@crtrott @mhoemmen any thoughts?

I'll pick one of these and submit a PR soon.

@mhoemmen
Copy link
Contributor

mhoemmen commented Jul 7, 2017

Kokkos::ArithTraits needs to work on the GPU, so you can't use std::numeric_limits if __CUDA_ARCH__ is defined. It could be that this works now, given how CUDA's constexpr support has been evolving. However, you'll have to try it to be sure.

If std::numeric_limits doesn't work, you could try this:
https://stackoverflow.com/questions/15514286/way-to-get-floating-point-special-values-in-cuda

@ibaned
Copy link
Contributor Author

ibaned commented Jul 7, 2017

Actually, this is the code right now, so the CUDA part is already separate:

#ifdef __CUDA_ARCH__
    return CUDART_NAN;
    //return nan (); // this returns 0 ???
#else
    // http://pubs.opengroup.org/onlinepubs/009696899/functions/nan.html
    return strtod ("NAN", (char**) NULL);
#endif // __CUDA_ARCH__

So we only have to worry about the slow host code, which has no GPU restrictions.

@hcedwar
Copy link
Contributor

hcedwar commented Jul 7, 2017

Suggest the strtod only be called once to initialize a singleton:

#ifdef __CUDA_ARCH__
__device__ inline
double kokkos_kernels_nan_function() // whatever 
{  return CUDART_NAN ; }
#else
inline
double kokkos_kernels_nan_function() // whatever 
{ static double x = strtod("NAN",(char**)NULL); return x ; }
#endif

@ibaned
Copy link
Contributor Author

ibaned commented Jul 7, 2017

This is quite similar to what Teuchos does. They also use a singleton, just a different initialization.

Any reason not to use the numeric_limits quiet_NAN to initialize the singleton instead of strtod ?

@ibaned
Copy link
Contributor Author

ibaned commented Jul 7, 2017

Actually, std::numeric_limits<double>::quiet_NaN() is constexpr in C++11, so calling it directly might be faster than loading a singleton in that case.

@crtrott
Copy link
Member

crtrott commented Jul 7, 2017

Just try out what is fastest and do that. Though I hope its the std::numeric_limits solution ;-)

@ibaned
Copy link
Contributor Author

ibaned commented Jul 7, 2017

Could someone add me to this repository so I can assign this to myself?

@crtrott
Copy link
Member

crtrott commented Jul 7, 2017

Ah this is Kokkos-Kernels totally overlooked this.

@ibaned ibaned self-assigned this Jul 7, 2017
@mhoemmen
Copy link
Contributor

mhoemmen commented Jul 7, 2017

std::numeric_limits should work. We could even make ArithTraits::nan constexpr in the non-CUDA case (and eventually, even in the CUDA case, since CUDA looks like it's starting to get more constexpr support).

Singletons could be bad if calling this in a thread-parallel context; we would want pthread_once in that case.

ibaned added a commit to ibaned/Trilinos that referenced this issue Jul 11, 2017
using this instead of strtod() and friends
dramatically speeds up the
Kokkos::ArithTraits<T>::nan()
function.
The performance test
packages/minitensor/test/perf_test_01.cc
went from 5.5 seconds to 0.6 seconds.

see kokkos/kokkos-kernels#35
ibaned added a commit to ibaned/Trilinos that referenced this issue Jul 11, 2017
using this instead of strtod() and friends
dramatically speeds up the
Kokkos::ArithTraits<T>::nan()
function.
The performance test
packages/minitensor/test/perf_test_01.cc
went from 5.5 seconds to 0.6 seconds.

see kokkos/kokkos-kernels#35

This is being patched directly into Trilinos
because (1) the performance issue is urgent
and (2) kokkos-kernels is undergoing heavy
refactoring and may not perform a formal
snapshot for a couple of months.
ibaned added a commit to ibaned/kokkos-kernels that referenced this issue Jul 11, 2017
using this instead of strtod() and friends
dramatically speeds up the
Kokkos::ArithTraits<T>::nan()
function.
The performance test
Trilinos/packages/minitensor/test/perf_test_01.cc
went from 5.5 seconds to 0.6 seconds.

see kokkos#35
@ibaned
Copy link
Contributor Author

ibaned commented Jul 11, 2017

trilinos/Trilinos#1490 fixes this in Trilinos, including the addition of a performance test in the affected MiniTensor package to measure the difference.

#36 fixes this in the KokkosKernels develop branch.

They are separate since KokkosKernels is unlikely to promote into Trilinos in the near term, and MiniTensor needs this now.

@mhoemmen
Copy link
Contributor

I submitted a trilinos/Trilinos#1490 review; thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants