Sorting utilities #461
Conversation
@kyungjoo-kim Thanks for your suggestion about using Thrust sort. I will look into replacing this implementation with wrappers for that.
Force-pushed from 79deef0 to 7bd18ed (Compare)
@brian-kelley: Thank you for adding these. A couple of things: can we add documentation to the internal wiki on when to use what, add the performance results to the benchmark section of the wiki, add a GitHub issue for the new feature request (so we document it in the release), and add the performance test you are using to the perf-test directory?
@brian-kelley Thanks for creating this PR. See comments below.
src/common/KokkosKernels_Sorting.hpp
Outdated
//Con: requires auxiliary storage, and this version only works for integers
template<typename Ordinal, typename ValueType, typename Thread>
KOKKOS_INLINE_FUNCTION void
radixSort(ValueType* values, ValueType* valuesAux, Ordinal n, const Thread thread)
Would it be useful to call this teamRadixSort? We use this convention for other kernels.
This is a thread-local radix sort, with only vector-level parallelism inside. Should I call it threadRadixSort?
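For reference, the thread-level routine follows the classic least-significant-digit radix scheme. Below is a minimal serial sketch of that idea, without the Kokkos vector-level parallelism and assuming unsigned keys; `radixSortSketch` is an illustrative name, not the PR's function:

```cpp
#include <cstdint>
#include <cstddef>
#include <utility>

// Serial LSD radix sort sketch: four 8-bit digit passes over uint32_t keys,
// ping-ponging between "values" and the auxiliary buffer "valuesAux".
// After an even number of passes the result lands back in "values".
void radixSortSketch(uint32_t* values, uint32_t* valuesAux, size_t n) {
  uint32_t* src = values;
  uint32_t* dst = valuesAux;
  for (int shift = 0; shift < 32; shift += 8) {
    size_t count[256] = {0};
    for (size_t i = 0; i < n; i++)            // histogram of the current digit
      count[(src[i] >> shift) & 0xFF]++;
    size_t offset[256];
    size_t sum = 0;
    for (int d = 0; d < 256; d++) {           // exclusive prefix sum of counts
      offset[d] = sum;
      sum += count[d];
    }
    for (size_t i = 0; i < n; i++)            // stable scatter by digit
      dst[offset[(src[i] >> shift) & 0xFF]++] = src[i];
    std::swap(src, dst);
  }
}
```

The kernel version distributes the histogram and scatter loops over vector lanes, and the radixSort2 variant additionally carries a permutation array through the scatter.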
src/common/KokkosKernels_Sorting.hpp
Outdated
//While sorting, also permute "perm" array along with the values.
//Pros: few diverging branches, so good for sorting on a single GPU thread/warp.
//Con: requires auxiliary storage, this version only works for integers (although float/double is possible)
template<typename Ordinal, typename ValueType, typename PermType, typename Thread>
If the above call becomes teamRadixSort, this one could be just radixSort.
This is exactly the same sorting algorithm (thread-local); the only difference is that it permutes "perm" along with the values while sorting.
maxVal = values[i];
}
//apply a bias so that key range always starts at 0
//also invert key values here for a descending sort
Will this work on Volta? I thought you would need a barrier here, since you use minVal right away. Maybe not, because minVal is thread-local?
Yes, everything inside radixSort/radixSort2 is thread-local, so I don't think warp divergence can cause any issues.
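For reference, the bias/inversion that the quoted comments describe can be isolated: subtracting the minimum shifts every key so the range starts at 0, and flipping against the maximum makes an ascending radix pass produce a descending order. A minimal sketch with illustrative names (not the PR's code):

```cpp
#include <cstdint>

// Bias keys so the range starts at 0; the 64-bit intermediate avoids
// overflow when the value range spans most of int32_t.
uint32_t ascendingKey(int32_t v, int32_t minVal) {
  return static_cast<uint32_t>(int64_t(v) - int64_t(minVal));
}

// Invert against the max so the largest value gets key 0; an ascending
// radix pass over these keys then yields a descending sort of the values.
uint32_t descendingKey(int32_t v, int32_t maxVal) {
  return static_cast<uint32_t>(int64_t(maxVal) - int64_t(v));
}
```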
src/common/KokkosKernels_Sorting.hpp
Outdated
//Good diagram of the algorithm at https://en.wikipedia.org/wiki/Bitonic_sorter
template<typename Ordinal, typename ValueType, typename TeamMember, typename Comparator = DefaultComparator<ValueType>>
KOKKOS_INLINE_FUNCTION void
bitonicSortTeam(ValueType* values, Ordinal n, const TeamMember mem)
You went with "Team" as a suffix. Let us make sure we are consistent.
I will add "Thread" as a suffix to radix then, that seems the most clear.
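For reference, the bitonic network behind bitonicSortTeam can be sketched serially; the team-level version runs each compare-exchange stage in parallel with a team barrier between stages. This minimal sketch requires n to be a power of two (a real implementation must also handle general n, e.g. by padding):

```cpp
#include <cstddef>
#include <utility>

// Serial sketch of the bitonic sorting network (ascending, n a power of two).
// Outer loop k: size of the bitonic sequences being merged.
// Inner loop j: distance between compare-exchange partners.
template <typename T>
void bitonicSortSketch(T* values, size_t n) {
  for (size_t k = 2; k <= n; k *= 2) {
    for (size_t j = k / 2; j > 0; j /= 2) {
      for (size_t i = 0; i < n; i++) {
        size_t partner = i ^ j;
        if (partner > i) {
          // direction alternates per k-sized block to build bitonic runs
          bool ascending = ((i & k) == 0);
          if (ascending ? (values[i] > values[partner])
                        : (values[i] < values[partner]))
            std::swap(values[i], values[partner]);
        }
      }
    }
  }
}
```

Every iteration of the innermost loop touches a disjoint pair of elements, which is why the team version can map it to team threads with only a barrier between (k, j) stages.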
//Sort "values", while applying the same swaps to "perm"
template<typename Ordinal, typename ValueType, typename PermType, typename TeamMember, typename Comparator = DefaultComparator<ValueType>>
KOKKOS_INLINE_FUNCTION void
Same as above: "Team" is in the suffix.
src/common/KokkosKernels_Sorting.hpp
Outdated
else
{
//Partition the data equally among fixed number of teams
const Ordinal numTeams = 256;
Should this be some function of n and the hardware?
I just tried to pick a power-of-two value that is greater than the number of SMs on any existing GPU, so that this function will get full occupancy. Does Kokkos have a way to get the number of SMs? With that I could compute this at runtime. Although I don't think it will have a big effect on performance, since CUDA will schedule multiple blocks onto one SM if there is register space (I think).
@srajama1 I just checked: Kokkos::Cuda::concurrency() returns the maximum number of concurrent threads over all SMs. Kokkos::Cuda does provide the SM count through cuda_internal_multiprocessor_count(). I'll look at how other KK functions choose team size and then do something based on that.
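One machine-independent way to derive the team count is to compute it from the reported concurrency rather than hard-coding 256. A hypothetical sketch in plain C++ (the real code would query the execution space, e.g. Kokkos::Cuda().concurrency(); chooseNumTeams is an illustrative helper, not the PR's code):

```cpp
#include <cstddef>

// Hypothetical helper: largest power-of-two team count the reported
// concurrency can keep busy, capped so every team still has elements.
size_t chooseNumTeams(size_t concurrency, size_t teamSize, size_t n) {
  size_t teams = 1;
  while (teams * 2 * teamSize <= concurrency) teams *= 2;
  while (teams > 1 && n / teams == 0) teams /= 2;  // avoid empty partitions
  return teams;
}
```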
@@ -310,6 +212,33 @@ namespace Experimental {
CcolindsT ABpermAux;
};

#ifdef KOKKOS_ENABLE_CUDA
Why is this only on CUDA?
@srajama1 Because radix sort is the best on CPUs, and bitonic is the best on CUDA. This is just a template specialization of SortEntriesFunctor that chooses the best for the given execution space.
CheckSortedFunctor<ValView>(data), Kokkos::Min<int>(ordered));
ASSERT_TRUE(ordered);
}
It would be interesting to add a test that overrides the default comparator.
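A comparator overriding the default could take this shape; it is shown with std::sort as a stand-in driver, and a comparator passed to the device routines would additionally need KOKKOS_INLINE_FUNCTION on operator():

```cpp
#include <algorithm>
#include <vector>

// Matches the Comparator template-parameter shape: a callable taking two
// values and returning true when "a" should come before "b".
template <typename T>
struct Descending {
  bool operator()(const T& a, const T& b) const { return a > b; }
};
```

Usage with the stand-in driver: `std::sort(v.begin(), v.end(), Descending<int>());` sorts v from largest to smallest.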
@@ -0,0 +1,271 @@
/*
//@HEADER
I don't see a perf test, but I see comments on performance. How do you test performance? If you have a test, can you please commit it to perf-test? If you use the unit tests, then that is fine.
For the numbers I posted, I just use the unit tests with the array sizes temporarily set much higher. It would be good to copy those over into perf tests.
@brian-kelley: Just checking to see if this is ready to go in.
Implemented fast sorting algorithms meant to be used inside parallel kernels, since Kokkos::sort can only be launched device-wide and cannot be called from inside a kernel.
Using those now in the sparse matrix add for unsorted matrix compression.
The comparator is the last template param for bitonicSort, bitonicSortTeam, and bitonicSortTeam2. The default just does "operator<".
- Use a reasonable fixed chunk size and vary the number of teams
- Follow KokkosBatched naming conventions (Serial or Team prefix)
- Added testing for custom comparators in bitonicSort
Force-pushed from 7bd18ed to 54178c2 (Compare)
Reran spot-check on kokkos-dev-2 with a merge of develop to confirm all tests are passing:
Merging, thanks @brian-kelley
This commit adds a new utils file "KokkosKernels_Sorting.hpp". This adds some useful sorting routines for situations where Kokkos::sort can't be used (but might be in the future, which is why these changes are in the Impl namespace).
For CUDA execution space, I replaced radix sort with bitonicSortTeam in unsorted sparse matrix addition (spadd). This gave a ~12% overall speedup for Tpetra::MatrixMatrix::add on CUDA. The CPU exec spaces still use radix sort.
kokkos-dev spot check:
#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=752 run_time=361
clang-4.0.1-Pthread_Serial-release build_time=772 run_time=567
cuda-8.0.44-Cuda_OpenMP-release build_time=1125 run_time=341
gcc-5.3.0-Serial-hwloc-release build_time=666 run_time=313
gcc-5.3.0-Serial-release build_time=671 run_time=309
gcc-7.2.0-Serial-hwloc-release build_time=496 run_time=171
gcc-7.2.0-Serial-release build_time=499 run_time=169
intel-17.0.1-OpenMP-hwloc-release build_time=924 run_time=111
intel-17.0.1-OpenMP-release build_time=919 run_time=115
#######################################################
FAILED TESTS
#######################################################
bowman spot check:
#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1503 run_time=572
intel-16.4.258-Pthread_Serial-release build_time=2442 run_time=1221
intel-16.4.258-Serial-release build_time=1459 run_time=609
intel-17.2.174-OpenMP-release build_time=1932 run_time=370
intel-17.2.174-OpenMP_Serial-release build_time=2876 run_time=1100
intel-17.2.174-Pthread-release build_time=1300 run_time=604
intel-17.2.174-Pthread_Serial-release build_time=2396 run_time=1253
intel-17.2.174-Serial-release build_time=1382 run_time=649
intel-18.2.199-OpenMP-release build_time=1584 run_time=402
intel-18.2.199-OpenMP_Serial-release build_time=2760 run_time=993
intel-18.2.199-Pthread-release build_time=1230 run_time=617
intel-18.2.199-Pthread_Serial-release build_time=2377 run_time=1199
intel-18.2.199-Serial-release build_time=1202 run_time=653
#######################################################
FAILED TESTS
#######################################################
white spot check:
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1096 run_time=364
cuda-9.2.88-Cuda_OpenMP-release build_time=1156 run_time=303
gcc-6.4.0-OpenMP_Serial-release build_time=568 run_time=321
gcc-7.2.0-OpenMP-release build_time=400 run_time=132
gcc-7.2.0-OpenMP_Serial-release build_time=638 run_time=363
gcc-7.2.0-Serial-release build_time=249 run_time=183
ibm-16.1.0-Serial-release build_time=1334 run_time=266
#######################################################
FAILED TESTS
#######################################################