Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kokkos::Controls and cusparse merge algorithm, see issue #670 #725

Merged
merged 9 commits into from
Jun 9, 2020

Conversation

lucbv
Copy link
Contributor

@lucbv lucbv commented May 22, 2020

@brian-kelley @kyungjoo-kim @srajama1

Brief content

This PR implements two new features:

  1. the Controls class that contains an UnorderedMap and handles for various backends.
  2. more logic is added in the cusparse SpMV backend to allow the merge algorithm to be used

Customers

The Controls feature is an enhancement for no particular customer.
The merge algorithm path is implemented to support Aria work by @vbrunini

Details on merge algorithm backend

The main work here is to add some logic to choose the right cusparseSpMVAlg_t argument for the SpMV routine. This is now decided based on the algorithm parameter passed to controls. Additional work has been performed to expand the number of type combinations in ETI, it is now possible to choose any combinations of:

Scalar={float, double}
OrdinalType={int, int64_t}
OffsetType={int, int64_t}
Layout={left, right}

Details on Controls

So far this is not very fleshed out, the new class contains a std::unordered_map<string, string> to store kernels parameters and it also stores backend handles for cusparse and cublas.
A new interface in KokkosSparse::spmv is implemented to accept a controls object that drives the algorithmic and performance parameters of the kernel. It also provides the backend handle, thus avoiding to create and destroy it within each call.

lucbv added 7 commits May 20, 2020 21:38
Something bugs out in the cmake logic. Could be a Lassen issue too.
This allows the cusparse specialization layer to use either
the default or the merge algorithm depending on input parameters
from the user.
This test needs to be guarded as the SpMV merge algorithm
is only available in cuSPARSE, not natively in Kokkos-Kernels
at this point.
Fixing a name clashing in sptrsv_cuSPARSE: error_t which is usually
used as a type not a variable.
Adding isParamter method to Controls.
Listing more parameters as controled parameters in SpMV
@lucbv lucbv self-assigned this May 22, 2020
@lucbv lucbv changed the title Kokkos::Controls and cusparse merge algorithm, see #670 Kokkos::Controls and cusparse merge algorithm, see issue #670 May 22, 2020
@lucbv
Copy link
Contributor Author

lucbv commented May 22, 2020

For some reason the issue in the title does not provide a link... so see issue #670

@lucbv
Copy link
Contributor Author

lucbv commented May 22, 2020

kokkos-dev: spot-check

#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=352 run_time=177
clang-4.0.1-Pthread_Serial-release build_time=395 run_time=143
clang-7.0.1-Cuda_OpenMP-hwloc-release build_time=1080 run_time=241
clang-7.0.1-Cuda_OpenMP-release build_time=1080 run_time=244
cuda-9.2-Cuda_OpenMP-release build_time=1124 run_time=261
gcc-5.3.0-OpenMP-hwloc-release build_time=220 run_time=62
gcc-5.3.0-OpenMP-release build_time=216 run_time=61
gcc-7.3.0-Serial-hwloc-release build_time=186 run_time=65
gcc-7.3.0-Serial-release build_time=187 run_time=65
intel-16.0.3-Pthread-hwloc-release build_time=279 run_time=55
intel-16.0.3-Pthread-release build_time=289 run_time=66
intel-16.0.3-Serial-hwloc-release build_time=297 run_time=76
intel-16.0.3-Serial-release build_time=296 run_time=75
intel-17.0.1-OpenMP-hwloc-release build_time=474 run_time=58
intel-17.0.1-OpenMP-release build_time=412 run_time=54

kokkos-dev: spot-check-tpls

#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=316 run_time=128
clang-4.0.1-Pthread_Serial-release build_time=314 run_time=129
clang-7.0.1-Cuda_OpenMP-hwloc-release build_time=1082 run_time=240
clang-7.0.1-Cuda_OpenMP-release build_time=1087 run_time=244
gcc-5.3.0-OpenMP-hwloc-release build_time=216 run_time=63
gcc-5.3.0-OpenMP-release build_time=216 run_time=64
gcc-7.3.0-Serial-hwloc-release build_time=191 run_time=66
gcc-7.3.0-Serial-release build_time=186 run_time=67
intel-16.0.3-Pthread-hwloc-release build_time=285 run_time=55
intel-16.0.3-Pthread-release build_time=289 run_time=64
intel-16.0.3-Serial-hwloc-release build_time=300 run_time=74
intel-16.0.3-Serial-release build_time=299 run_time=74
intel-17.0.1-OpenMP-hwloc-release build_time=415 run_time=52
intel-17.0.1-OpenMP-release build_time=418 run_time=52
#######################################################
FAILED TESTS
#######################################################
cuda-9.2-Cuda_OpenMP-release (build failed)
#######################################################

so close but need to clean-up something in the tpl handles management... let's put a band-aid on it for now.

Copy link
Contributor

@brian-kelley brian-kelley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucbv I just had a couple questions, and the duplicated copyright headers in one file. Other than that, everything looks good.

src/impl/tpls/KokkosKernels_tpl_handles.cpp Outdated Show resolved Hide resolved
if(std::is_same<value_type, float>::value) {myCudaDataType = CUDA_R_32F;}
if(std::is_same<value_type, double>::value) {myCudaDataType = CUDA_R_64F;}
if(std::is_same<value_type, float>::value) {myCudaDataType = CUDA_R_32F;}
if(std::is_same<value_type, double>::value) {myCudaDataType = CUDA_R_64F;}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why CUDA_C_32F and CUDA_C_64F aren't supported here? It's ok if those use the fallback SPMV for now, but if it will be done later, we should make a note/issue for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I answered that in your comment below.

KOKKOSSPARSE_SPMV_CUSPARSE(float, int, int, Kokkos::LayoutRight, true)
// KOKKOSSPARSE_SPMV_CUSPARSE(float, int, int, Kokkos::LayoutRight, false)
#if defined(CUSPARSE_VERSION) && (10300 <= CUSPARSE_VERSION)
KOKKOSSPARSE_SPMV_CUSPARSE(double, int64_t, int, Kokkos::LayoutLeft, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if we added the complex cudaDataTypes above, we could instantiate all of these for complex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should, I have not had time to look into complex yet though.
If in your experience it is easy to just instantiate and have it work OK through cusparse then I can do it. If it is an unknown I would suggest holding-off on it and do it in a later PR.

// y is the quantity being tested here,
// so let us use y_value_type to determine
// the appropriate tolerance precision.
const y_value_mag_type eps = std::is_same<y_value_mag_type, float>::value ? 2*1e-3 : 1e-7;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change fix the SPMV unit tests for float (#715)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is a new test so it won't fix the old one, this should look pretty much the same as the old one though which means that we might have to do something about it in the future, although this test uses cuSPARSE so that algo might not have the same issues as native kokkos-kenrels

@brian-kelley brian-kelley self-requested a review May 22, 2020 22:57
@lucbv
Copy link
Contributor Author

lucbv commented May 24, 2020

spot check cleared on kokkos-dev:

spot-check

#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=349 run_time=228
clang-4.0.1-Pthread_Serial-release build_time=348 run_time=143
clang-7.0.1-Cuda_OpenMP-hwloc-release build_time=1259 run_time=248
clang-7.0.1-Cuda_OpenMP-release build_time=1437 run_time=244
cuda-9.2-Cuda_OpenMP-release build_time=1284 run_time=275
gcc-5.3.0-OpenMP-hwloc-release build_time=239 run_time=86
gcc-5.3.0-OpenMP-release build_time=241 run_time=98
gcc-7.3.0-Serial-hwloc-release build_time=202 run_time=73
gcc-7.3.0-Serial-release build_time=204 run_time=71
intel-16.0.3-Pthread-hwloc-release build_time=303 run_time=120
intel-16.0.3-Pthread-release build_time=306 run_time=75
intel-16.0.3-Serial-hwloc-release build_time=319 run_time=89
intel-16.0.3-Serial-release build_time=319 run_time=88
intel-17.0.1-OpenMP-hwloc-release build_time=461 run_time=58
intel-17.0.1-OpenMP-release build_time=507 run_time=58

spot-check-tpls

#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=430 run_time=308
clang-4.0.1-Pthread_Serial-release build_time=345 run_time=182
clang-7.0.1-Cuda_OpenMP-hwloc-release build_time=1253 run_time=251
clang-7.0.1-Cuda_OpenMP-release build_time=1248 run_time=247
cuda-9.2-Cuda_OpenMP-release build_time=1312 run_time=282
gcc-5.3.0-OpenMP-hwloc-release build_time=245 run_time=86
gcc-5.3.0-OpenMP-release build_time=244 run_time=99
gcc-7.3.0-Serial-hwloc-release build_time=202 run_time=67
gcc-7.3.0-Serial-release build_time=205 run_time=73
intel-16.0.3-Pthread-hwloc-release build_time=374 run_time=99
intel-16.0.3-Pthread-release build_time=306 run_time=74
intel-16.0.3-Serial-hwloc-release build_time=317 run_time=87
intel-16.0.3-Serial-release build_time=315 run_time=85
intel-17.0.1-OpenMP-hwloc-release build_time=464 run_time=55
intel-17.0.1-OpenMP-release build_time=476 run_time=54

@lucbv
Copy link
Contributor Author

lucbv commented May 24, 2020

I do not have access to kokkos-dev2 yet so if someone feels like doing:
git fetch upstream pull/725/head spmv_merge
to get a local copy on kokkos-dev2 and running the spot-check there, that would be a great help moving this forward.

Copy link
Contributor

@kyungjoo-kim kyungjoo-kim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With runtime control parameters, do you have a rough idea of how much overhead is added for default algorihtm ? I do not mean the time increased by the map indexing(this should be trivial). I mean unexpected time increase which might be due to binery size increase and so on.

Is unordered map better than map ? When the directory size is small (like the control object), the indexing time is hardly different and the std::unordered_map could have a larger default hash map size (I do not know what is default hash map size).

@brian-kelley
Copy link
Contributor

@lucbv I'm running spot check on kokkos-dev2 now, and I'll get on the Tpetra integration this week.

@brian-kelley
Copy link
Contributor

@lucbv Everything's clean, except for one build with a broken configuration:

#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=613 run_time=132
clang-8.0-Pthread_Serial-release build_time=200 run_time=110
clang-9.0.0-Pthread-release build_time=119 run_time=50
clang-9.0.0-Serial-release build_time=126 run_time=57
cuda-10.1-Cuda_OpenMP-release build_time=787 run_time=128
gcc-4.8.4-OpenMP-release build_time=118 run_time=59
gcc-7.3.0-OpenMP-release build_time=140 run_time=57
gcc-7.3.0-Pthread-release build_time=118 run_time=53
gcc-8.3.0-Serial-release build_time=137 run_time=58
gcc-9.1-OpenMP-release build_time=175 run_time=57
gcc-9.1-Serial-release build_time=153 run_time=58
intel-17.0.1-Serial-release build_time=245 run_time=57
intel-18.0.5-OpenMP-release build_time=354 run_time=54
intel-19.0.5-Pthread-release build_time=421 run_time=48
#######################################################
FAILED TESTS
#######################################################
cuda-9.2-Cuda_Serial-release (build failed)
#######################################################

Here's what the failing one printed:

-- Check for working CXX compiler: /ascldap/users/bmkelle/Testing725/kokkos/bin/nvcc_wrapper -- broken
CMake Error at /net/watson.sandia.gov/storage/fast/projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.12.2/share/cmake-3.12/Modules/CMakeTestCXXCompiler.cmake:45 (message):
  The C++ compiler

    "/ascldap/users/bmkelle/Testing725/kokkos/bin/nvcc_wrapper"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /ascldap/users/bmkelle/Testing725/Testing/TestAll_2020-05-26_12.20.58/cuda/9.2/Cuda_Serial-release/kokkos-install/CMakeFiles/CMakeTmp

    Run Build Command:"/usr/bin/gmake" "cmTC_5d19c/fast"
    /usr/bin/gmake -f CMakeFiles/cmTC_5d19c.dir/build.make CMakeFiles/cmTC_5d19c.dir/build
    gmake[1]: Entering directory `/home/bmkelle/Testing725/Testing/TestAll_2020-05-26_12.20.58/cuda/9.2/Cuda_Serial-release/kokkos-install/CMakeFiles/CMakeTmp'
    Building CXX object CMakeFiles/cmTC_5d19c.dir/testCXXCompiler.cxx.o
    /ascldap/users/bmkelle/Testing725/kokkos/bin/nvcc_wrapper    -O3 -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized    -o CMakeFiles/cmTC_5d19c.dir/testCXXCompiler.cxx.o -c /ascldap/users/bmkelle/Testing725/Testing/TestAll_2020-05-26_12.20.58/cuda/9.2/Cuda_Serial-release/kokkos-install/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
    sh: cicc: command not found
    gmake[1]: *** [CMakeFiles/cmTC_5d19c.dir/testCXXCompiler.cxx.o] Error 127
    gmake[1]: Leaving directory `/home/bmkelle/Testing725/Testing/TestAll_2020-05-26_12.20.58/cuda/9.2/Cuda_Serial-release/kokkos-install/CMakeFiles/CMakeTmp'
    gmake: *** [cmTC_5d19c/fast] Error 2

So maybe something broke in the module loading for that build. Anyway, that's not a problem with these changes.

@brian-kelley
Copy link
Contributor

@ndellingwood Do you know what cicc is? As far as I can tell, the modules are OK ("which nvcc" points to /projects/sems/install/rhel7-x86_64/sems/compiler/cuda/9.2/base/bin which is correct). But "which cicc" doesn't find anything.

@ndellingwood
Copy link
Contributor

ndellingwood commented May 26, 2020

@brian-kelley No, I'm not sure what that is, something messed up with the module it seems. I just logged onto kokkos-dev-2 and tried to compile a simple cuda driver code with sems-cuda/9.2 module and had the same error:

cuda_device.cu

[ndellin@kokkos-dev-2 CudaDevice]$ cat cuda_device.cu 
#include <cstdio>

// Load modules
//   module load sems-cuda/9.2

// Compilation line:
//   nvcc cuda_device.cu

// Run code:
//   ./a.out

// Expected output:
// Device names and ids...

// See https://devblogs.nvidia.com/how-query-device-properties-and-handle-errors-cuda-cc/

int main() {
  int nDevices;

  cudaGetDeviceCount(&nDevices);
  for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("  Device name: %s\n", prop.name);
  }
}

Load module

module load sems-cuda/9.2

Compilation attempt:

[ndellin@kokkos-dev-2 CudaDevice]$ nvcc cuda_device.cu 
sh: cicc: command not found

@ndellingwood
Copy link
Contributor

@brian-kelley swapping 9.2 for 10.1 and I had no issues compiling as expected since the 10.1 tests you ran passed. I'll test my small driver code on another system with the sems-cuda/9.2 module to determine if it's the module or some issue on kokkos-dev-2 with the module and email to request a fix.

@ndellingwood
Copy link
Contributor

@brian-kelley the driver compiled fine on kokkos-dev with sems-cuda/9.2. Let me email a sys admin for help.

@lucbv
Copy link
Contributor Author

lucbv commented May 26, 2020

@brian-kelley @ndellingwood
unfortunate event but at least not impacting, I am also happy to see that cuda 10.1 worked, I was using a similar version on Lassen but not exactly the same so I was not 100% sure.

@brian-kelley
Copy link
Contributor

@lucbv I reran spot checks on kokkos-dev2 since the CUDA 9 seems OK now:
Regular spot-check:

#######################################################
PASSED TESTS 
#######################################################
clang-8.0-Pthread_Serial-release build_time=214 run_time=105
clang-9.0.0-Pthread-release build_time=133 run_time=53
clang-9.0.0-Serial-release build_time=135 run_time=53
gcc-4.8.4-OpenMP-release build_time=111 run_time=49
gcc-7.3.0-OpenMP-release build_time=136 run_time=50
gcc-7.3.0-Pthread-release build_time=154 run_time=44
gcc-8.3.0-Serial-release build_time=134 run_time=47
gcc-9.1-OpenMP-release build_time=165 run_time=51
gcc-9.1-Serial-release build_time=152 run_time=48
#######################################################
FAILED TESTS 
#######################################################
clang-8.0-Cuda_OpenMP-release (test failed)
cuda-10.1-Cuda_OpenMP-release (test failed)
cuda-9.2-Cuda_Serial-release (test failed)
intel-17.0.1-Serial-release (test failed)
intel-18.0.5-OpenMP-release (test failed)
intel-19.0.5-Pthread-release (test failed)

Failures were all #715 .

TPLs spot check:

#######################################################
PASSED TESTS
#######################################################
clang-8.0-Pthread_Serial-release build_time=205 run_time=115
clang-9.0.0-Pthread-release build_time=126 run_time=58
clang-9.0.0-Serial-release build_time=123 run_time=59
cuda-10.1-Cuda_OpenMP-release build_time=1103 run_time=1801
gcc-4.8.4-OpenMP-release build_time=122 run_time=56
gcc-7.3.0-OpenMP-release build_time=144 run_time=58
gcc-7.3.0-Pthread-release build_time=117 run_time=52
gcc-8.3.0-Serial-release build_time=141 run_time=61
gcc-9.1-OpenMP-release build_time=182 run_time=57
gcc-9.1-Serial-release build_time=168 run_time=57
intel-17.0.1-Serial-release build_time=254 run_time=57
intel-18.0.5-OpenMP-release build_time=361 run_time=87
intel-19.0.5-Pthread-release build_time=478 run_time=55
#######################################################
FAILED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release (test failed)

I didn't run this one with float. The only failure was actually gauss_seidel timing out on OpenMP (that will be fixed soon).

Since this PR looks good in testing, is there anything else blocking it?

@brian-kelley brian-kelley merged commit cb30ac0 into kokkos:develop Jun 9, 2020
@lucbv lucbv deleted the Controls branch August 31, 2022 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants