Sorting utilities #461

Merged Dec 6, 2019 (6 commits)
Conversation

brian-kelley
Contributor

@brian-kelley brian-kelley commented Aug 28, 2019

This PR adds a new utilities file, "KokkosKernels_Sorting.hpp", with sorting routines for situations where Kokkos::sort can't be used (it may become usable in the future, so these routines live in the Impl namespace).

  • 16-bucket radix sort for integers (radixSort). The sorting can be called from inside a parallel region (RangePolicy, TeamPolicy or TeamThreadRange).
  • Version of the above radix sort "radixSort2" that applies the same swaps to another array, of arbitrary type
  • Bitonic sort implementation "bitonicSortTeam" that works within a TeamPolicy, running in parallel over a TeamThreadRange. It supports any data type that provides "operator<", "operator=", and a copy constructor on device; a custom comparator can optionally be provided as a template argument, which makes it more general than Kokkos::sort. Bitonic sort is asymptotically slow at O(n log^2(n)), so it should generally only be used on GPUs, where it is a very good fit for the hardware, with high parallelism (>90% occupancy on CUDA) and coalesced memory references.
  • bitonicSort2, which behaves like radixSort2
  • bitonicSort, which is called from host and sorts a single view in parallel. Faster than Kokkos::sort for smallish arrays (< 10^7 elements) on GPUs.
  • both algorithms are tested in test_common_***
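To illustrate the radix approach above, here is a minimal serial sketch of a 16-bucket LSD radix sort. This is standalone C++ with hypothetical names, not the PR's code: the real radixSort runs inside a parallel region and uses caller-provided auxiliary storage.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Serial sketch of a 16-bucket (radix-16) LSD radix sort for unsigned keys.
// Processes keys 4 bits at a time; each pass is a stable counting sort.
void radix16Sort(std::vector<uint32_t>& values) {
  std::vector<uint32_t> aux(values.size());  // auxiliary storage, as in the PR
  uint32_t maxVal = 0;
  for (uint32_t v : values) maxVal = std::max(maxVal, v);
  for (int shift = 0; shift < 32; shift += 4) {
    // Stop once all remaining digits are zero.
    if (shift && (maxVal >> shift) == 0) break;
    size_t count[16] = {0};
    for (uint32_t v : values) count[(v >> shift) & 0xF]++;
    size_t offset[16];
    size_t sum = 0;
    for (int b = 0; b < 16; b++) { offset[b] = sum; sum += count[b]; }
    // Stable scatter into the auxiliary array, then swap buffers.
    for (uint32_t v : values) aux[offset[(v >> shift) & 0xF]++] = v;
    values.swap(aux);
  }
}
```

Because each pass swaps the buffers, the sorted result always ends up back in `values` regardless of how many digit passes run.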

For the CUDA execution space, I replaced radix sort with bitonicSortTeam in unsorted sparse matrix addition (spadd). This gave a ~12% overall speedup for Tpetra::MatrixMatrix::add on CUDA. The CPU execution spaces still use radix sort.

kokkos-dev spot check:
#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=752 run_time=361
clang-4.0.1-Pthread_Serial-release build_time=772 run_time=567
cuda-8.0.44-Cuda_OpenMP-release build_time=1125 run_time=341
gcc-5.3.0-Serial-hwloc-release build_time=666 run_time=313
gcc-5.3.0-Serial-release build_time=671 run_time=309
gcc-7.2.0-Serial-hwloc-release build_time=496 run_time=171
gcc-7.2.0-Serial-release build_time=499 run_time=169
intel-17.0.1-OpenMP-hwloc-release build_time=924 run_time=111
intel-17.0.1-OpenMP-release build_time=919 run_time=115
#######################################################
FAILED TESTS
#######################################################

bowman spot check:
#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1503 run_time=572
intel-16.4.258-Pthread_Serial-release build_time=2442 run_time=1221
intel-16.4.258-Serial-release build_time=1459 run_time=609
intel-17.2.174-OpenMP-release build_time=1932 run_time=370
intel-17.2.174-OpenMP_Serial-release build_time=2876 run_time=1100
intel-17.2.174-Pthread-release build_time=1300 run_time=604
intel-17.2.174-Pthread_Serial-release build_time=2396 run_time=1253
intel-17.2.174-Serial-release build_time=1382 run_time=649
intel-18.2.199-OpenMP-release build_time=1584 run_time=402
intel-18.2.199-OpenMP_Serial-release build_time=2760 run_time=993
intel-18.2.199-Pthread-release build_time=1230 run_time=617
intel-18.2.199-Pthread_Serial-release build_time=2377 run_time=1199
intel-18.2.199-Serial-release build_time=1202 run_time=653
#######################################################
FAILED TESTS
#######################################################

white spot check:
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1096 run_time=364
cuda-9.2.88-Cuda_OpenMP-release build_time=1156 run_time=303
gcc-6.4.0-OpenMP_Serial-release build_time=568 run_time=321
gcc-7.2.0-OpenMP-release build_time=400 run_time=132
gcc-7.2.0-OpenMP_Serial-release build_time=638 run_time=363
gcc-7.2.0-Serial-release build_time=249 run_time=183
ibm-16.1.0-Serial-release build_time=1334 run_time=266
#######################################################
FAILED TESTS
#######################################################

@brian-kelley
Contributor Author

@kyungjoo-kim Thanks for your suggestion about using Thrust sort. I will look into replacing this implementation with wrappers for that.

@srajama1
Contributor

@brian-kelley: Thank you for adding these. A couple of things: can we add documentation to the internal wiki on when to use which algorithm, add the performance results to the benchmark section of the wiki, open a GitHub issue for the new feature request (so we document it in the release), and add the performance test you are using to the perf-test directory?

@srajama1 srajama1 self-requested a review September 11, 2019 16:07

@srajama1 srajama1 left a comment


@brian-kelley Thanks for creating this PR. See comments below.

//Con: requires auxiliary storage, and this version only works for integers
template<typename Ordinal, typename ValueType, typename Thread>
KOKKOS_INLINE_FUNCTION void
radixSort(ValueType* values, ValueType* valuesAux, Ordinal n, const Thread thread)
Contributor


Would it be useful to call this teamRadixSort? We use this convention for other kernels.

Contributor Author


This is a thread-local radix sort, with only vector-level parallelism inside. Should I call it threadRadixSort?

//While sorting, also permute "perm" array along with the values.
//Pros: few diverging branches, so good for sorting on a single GPU thread/warp.
//Con: requires auxiliary storage, this version only works for integers (although float/double is possible)
template<typename Ordinal, typename ValueType, typename PermType, typename Thread>
Contributor


If the above call becomes teamRadixSort, this could be just radixSort.

Contributor Author


This is exactly the same sorting algorithm (thread-local), the only difference is it permutes "perm" along with the values while sorting.

maxVal = values[i];
}
//apply a bias so that key range always starts at 0
//also invert key values here for a descending sort
Contributor


Will this work on Volta? I thought you would need a barrier here, since you use minVal right away. Maybe not, because minVal is thread-local?

Contributor Author


Yes, everything inside radixSort/radixSort2 is thread-local, so I don't think warp divergence can cause any issues.
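The key transform discussed above can be sketched as two small helpers. This is a hedged illustration with hypothetical names, not the PR's code: subtracting the minimum biases every key so the range starts at 0 (fewer radix digits to process), and flipping keys against the maximum turns an ascending radix sort into a descending one.

```cpp
#include <cstdint>

// Bias a key so the smallest key in the array maps to 0.
uint32_t biasKey(uint32_t key, uint32_t minVal) {
  return key - minVal;
}

// Invert a biased key: ascending order on the inverted keys is
// descending order on the original keys.
uint32_t invertKey(uint32_t key, uint32_t minVal, uint32_t maxVal) {
  return (maxVal - minVal) - (key - minVal);
}
```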

//Good diagram of the algorithm at https://en.wikipedia.org/wiki/Bitonic_sorter
template<typename Ordinal, typename ValueType, typename TeamMember, typename Comparator = DefaultComparator<ValueType>>
KOKKOS_INLINE_FUNCTION void
bitonicSortTeam(ValueType* values, Ordinal n, const TeamMember mem)
Contributor


You went with "Team" as a suffix. Let's make sure we are consistent.

Contributor Author


I will add "Thread" as a suffix to radix then, that seems the most clear.
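The bitonicSortTeam signature quoted above takes the comparator as a defaulted template parameter. As a hedged, serial sketch of the same sorting network (standalone C++, hypothetical names; the PR's version distributes these compare-exchange steps across a TeamThreadRange and handles non-power-of-two sizes):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Serial bitonic sorting network with a custom comparator defaulting to
// "operator<", as in the PR. This sketch assumes n is a power of two.
template <typename T, typename Comparator = std::less<T>>
void bitonicSortSerial(std::vector<T>& a, Comparator comp = Comparator()) {
  const std::size_t n = a.size();              // must be a power of two
  for (std::size_t k = 2; k <= n; k *= 2) {    // length of bitonic runs
    for (std::size_t j = k / 2; j > 0; j /= 2) {  // compare-exchange distance
      for (std::size_t i = 0; i < n; i++) {
        std::size_t partner = i ^ j;
        if (partner > i) {
          bool ascending = (i & k) == 0;       // direction of this run
          // Swap when the pair is out of order for the run's direction.
          if (ascending ? comp(a[partner], a[i]) : comp(a[i], a[partner]))
            std::swap(a[i], a[partner]);
        }
      }
    }
  }
}
```

Every element participates in the same sequence of compare-exchange steps regardless of the data, which is why the network maps well to GPUs: no diverging branches, and the fixed access pattern can be coalesced. Passing `std::greater<T>()` as the comparator yields a descending sort.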


//Sort "values", while applying the same swaps to "perm"
template<typename Ordinal, typename ValueType, typename PermType, typename TeamMember, typename Comparator = DefaultComparator<ValueType>>
KOKKOS_INLINE_FUNCTION void
Contributor


Same as above: "Team" is a suffix here.

else
{
//Partition the data equally among fixed number of teams
const Ordinal numTeams = 256;
Contributor


Should this be some function of n and the hardware?

Contributor Author


I just tried to pick a power-of-two value greater than the number of SMs on any existing GPU, so that this function gets full occupancy. Does Kokkos have a way to get the number of SMs? With that I could compute this at runtime. That said, I don't think it will have a big effect on performance, since CUDA will schedule multiple blocks onto one SM if there is enough register space (I think).

Contributor Author


@srajama1 I just checked and Kokkos::Cuda::concurrency() returns the maximum concurrent number of threads over all SMs. Kokkos::Cuda does provide the SM count through cuda_internal_multiprocessor_count(). I'll look at how other KK functions choose team size and then do something based on that.

@@ -310,6 +212,33 @@ namespace Experimental {
CcolindsT ABpermAux;
};

#ifdef KOKKOS_ENABLE_CUDA
Contributor


Why is this only on CUDA?

Contributor Author


@srajama1 Because radix sort is the best on CPUs, and bitonic is the best on CUDA. This is just a template specialization of SortEntriesFunctor that chooses the best implementation for the given execution space.
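The dispatch pattern described here can be sketched without Kokkos. This is a hedged illustration with stand-in tag types, not the PR's code; the real specialization is guarded by #ifdef KOKKOS_ENABLE_CUDA and is templated on a Kokkos execution space.

```cpp
#include <string>

struct SerialSpace {};  // stand-in for a host execution space
struct CudaSpace {};    // stand-in for Kokkos::Cuda

// Generic version: thread-local radix sort, which wins on CPUs.
template <typename ExecSpace>
struct SortEntriesFunctorSketch {
  static std::string algorithm() { return "radix"; }
};

// CUDA specialization: bitonic's uniform compare-exchange network keeps
// occupancy high and memory accesses coalesced on GPUs.
template <>
struct SortEntriesFunctorSketch<CudaSpace> {
  static std::string algorithm() { return "bitonic"; }
};
```

The specialization is resolved at compile time, so each execution space pays nothing for the algorithm it does not use.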

CheckSortedFunctor<ValView>(data), Kokkos::Min<int>(ordered));
ASSERT_TRUE(ordered);
}

Contributor


It would be interesting to add a test that overrides the default comparator.

@@ -0,0 +1,271 @@
/*
//@HEADER
Contributor


I don't see a perf test, but I see comments on performance. How do you test performance? If you have a test, can you please commit it to perf-test? If you use the unit tests, that is fine.

Contributor Author


For the numbers I posted, I just use the unit tests with the array sizes temporarily set much higher. It would be good to copy those over into perf tests.

@srajama1
Contributor

@brian-kelley : Just checking to see if this is ready to go in.

- Implemented fast sorting algorithms meant to be used inside parallel kernels, since Kokkos::sort cannot be called from inside a kernel. Using those now in sparse matrix add for unsorted matrix compression.
- The comparator is the last template param for bitonicSort, bitonicSortTeam and bitonicSortTeam2. The default just does "operator<".
- Use a reasonable fixed chunk size and vary the number of teams.
- Follow KokkosBatched naming conventions (Serial or Team prefix).
- Added testing for custom comparators in bitonicSort.
@ndellingwood
Contributor

Reran spot-check on kokkos-dev-2 with a merge of develop to confirm all tests are passing:

Running on machine: kokkos-dev-2
Going to test compilers:  gcc/7.3.0 gcc/9.1 intel/18.0.5 clang/8.0 cuda/10.1
Testing compiler gcc/7.3.0
Testing compiler gcc/9.1
  Starting job gcc-7.3.0-OpenMP-release
  Starting job gcc-7.3.0-Pthread-release
  PASSED gcc-7.3.0-OpenMP-release
  Starting job gcc-9.1-OpenMP-release
  PASSED gcc-7.3.0-Pthread-release
Testing compiler intel/18.0.5
  Starting job gcc-9.1-Serial-release
  PASSED gcc-9.1-OpenMP-release
Testing compiler clang/8.0
  Starting job intel-18.0.5-OpenMP-release
  PASSED gcc-9.1-Serial-release
  Starting job clang-8.0-Cuda_OpenMP-release
  PASSED intel-18.0.5-OpenMP-release
Testing compiler cuda/10.1
  Starting job clang-8.0-Pthread_Serial-release
  PASSED clang-8.0-Cuda_OpenMP-release
  PASSED clang-8.0-Pthread_Serial-release
  Starting job cuda-10.1-Cuda_OpenMP-release
  PASSED cuda-10.1-Cuda_OpenMP-release
#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=569 run_time=709
clang-8.0-Pthread_Serial-release build_time=352 run_time=879
cuda-10.1-Cuda_OpenMP-release build_time=633 run_time=654
gcc-7.3.0-OpenMP-release build_time=225 run_time=272
gcc-7.3.0-Pthread-release build_time=209 run_time=502
gcc-9.1-OpenMP-release build_time=188 run_time=249
gcc-9.1-Serial-release build_time=162 run_time=502
intel-18.0.5-OpenMP-release build_time=309 run_time=288
#######################################################
FAILED TESTS
#######################################################

Merging, thanks @brian-kelley
