Sorting utilities #461

Merged Dec 6, 2019 (6 commits)
Conversation

brian-kelley
Contributor

@brian-kelley brian-kelley commented Aug 28, 2019

This PR adds a new utilities file, "KokkosKernels_Sorting.hpp", with sorting routines for situations where Kokkos::sort can't be used (it may become usable in the future, so these routines live in the Impl namespace).

  • 16-bucket radix sort for integers (radixSort). The sorting can be called from inside a parallel region (RangePolicy, TeamPolicy or TeamThreadRange).
  • Version of the above radix sort "radixSort2" that applies the same swaps to another array, of arbitrary type
  • Bitonic sort implementation "bitonicSortTeam" that works within a TeamPolicy, running in parallel over a TeamThreadRange. It supports any data type that provides "operator<", "operator=", and a copy constructor on device; a custom comparator can optionally be provided as a template argument, which makes it more general than Kokkos::sort. Bitonic sort is asymptotically slow at O(n log^2(n)), so it should generally only be used on GPUs, where it is a very good fit for the hardware, with high parallelism (>90% occupancy on CUDA) and coalesced memory references.
  • bitonicSort2, which behaves like radixSort2
  • bitonicSort, which is called from host and sorts a single view in parallel. Faster than Kokkos::sort for smallish arrays (< 10^7 elements) on GPUs.
  • both algorithms are tested in test_common_***
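To illustrate the radix approach above, here is a minimal serial sketch of a 16-bucket LSD radix sort. This is standalone C++ with hypothetical names, not the PR's code: the real radixSort runs inside a parallel region and uses caller-provided auxiliary storage.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Serial sketch of a 16-bucket (radix-16) LSD radix sort for unsigned keys.
// Processes keys 4 bits at a time; each pass is a stable counting sort.
void radix16Sort(std::vector<uint32_t>& values) {
  std::vector<uint32_t> aux(values.size());  // auxiliary storage, as in the PR
  uint32_t maxVal = 0;
  for (uint32_t v : values) maxVal = std::max(maxVal, v);
  for (int shift = 0; shift < 32; shift += 4) {
    // Stop once all remaining digits are zero.
    if (shift && (maxVal >> shift) == 0) break;
    size_t count[16] = {0};
    for (uint32_t v : values) count[(v >> shift) & 0xF]++;
    size_t offset[16];
    size_t sum = 0;
    for (int b = 0; b < 16; b++) { offset[b] = sum; sum += count[b]; }
    // Stable scatter into the auxiliary array, then swap buffers.
    for (uint32_t v : values) aux[offset[(v >> shift) & 0xF]++] = v;
    values.swap(aux);
  }
}
```

Because each pass swaps the buffers, the sorted result always ends up back in `values` regardless of how many digit passes run.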

For the CUDA execution space, I replaced radix sort with bitonicSortTeam in unsorted sparse matrix addition (spadd). This gave a ~12% overall speedup for Tpetra::MatrixMatrix::add on CUDA. The CPU execution spaces still use radix sort.

kokkos-dev spot check:
#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=752 run_time=361
clang-4.0.1-Pthread_Serial-release build_time=772 run_time=567
cuda-8.0.44-Cuda_OpenMP-release build_time=1125 run_time=341
gcc-5.3.0-Serial-hwloc-release build_time=666 run_time=313
gcc-5.3.0-Serial-release build_time=671 run_time=309
gcc-7.2.0-Serial-hwloc-release build_time=496 run_time=171
gcc-7.2.0-Serial-release build_time=499 run_time=169
intel-17.0.1-OpenMP-hwloc-release build_time=924 run_time=111
intel-17.0.1-OpenMP-release build_time=919 run_time=115
#######################################################
FAILED TESTS
#######################################################

bowman spot check:
#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1503 run_time=572
intel-16.4.258-Pthread_Serial-release build_time=2442 run_time=1221
intel-16.4.258-Serial-release build_time=1459 run_time=609
intel-17.2.174-OpenMP-release build_time=1932 run_time=370
intel-17.2.174-OpenMP_Serial-release build_time=2876 run_time=1100
intel-17.2.174-Pthread-release build_time=1300 run_time=604
intel-17.2.174-Pthread_Serial-release build_time=2396 run_time=1253
intel-17.2.174-Serial-release build_time=1382 run_time=649
intel-18.2.199-OpenMP-release build_time=1584 run_time=402
intel-18.2.199-OpenMP_Serial-release build_time=2760 run_time=993
intel-18.2.199-Pthread-release build_time=1230 run_time=617
intel-18.2.199-Pthread_Serial-release build_time=2377 run_time=1199
intel-18.2.199-Serial-release build_time=1202 run_time=653
#######################################################
FAILED TESTS
#######################################################

white spot check:
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1096 run_time=364
cuda-9.2.88-Cuda_OpenMP-release build_time=1156 run_time=303
gcc-6.4.0-OpenMP_Serial-release build_time=568 run_time=321
gcc-7.2.0-OpenMP-release build_time=400 run_time=132
gcc-7.2.0-OpenMP_Serial-release build_time=638 run_time=363
gcc-7.2.0-Serial-release build_time=249 run_time=183
ibm-16.1.0-Serial-release build_time=1334 run_time=266
#######################################################
FAILED TESTS
#######################################################

@brian-kelley
Contributor Author

@kyungjoo-kim Thanks for your suggestion about using Thrust sort. I will look into replacing this implementation with wrappers for that.

@srajama1
Contributor

@brian-kelley: Thank you for adding these. A couple of things: can we add documentation to the internal wiki on when to use which algorithm, add the performance results to the benchmark section of the wiki, open a GitHub issue for the new feature request (so we document it in the release), and add the performance test you are using to the perf-test directory?

@srajama1 srajama1 self-requested a review September 11, 2019 16:07

@srajama1 srajama1 left a comment


@brian-kelley Thanks for creating this PR. See comments below.

//Con: requires auxiliary storage, and this version only works for integers
template<typename Ordinal, typename ValueType, typename Thread>
KOKKOS_INLINE_FUNCTION void
radixSort(ValueType* values, ValueType* valuesAux, Ordinal n, const Thread thread)
Contributor


Would it be useful to call this teamRadixSort? We use this convention for other kernels.

Contributor Author


This is a thread-local radix sort, with only vector-level parallelism inside. Should I call it threadRadixSort?

//While sorting, also permute "perm" array along with the values.
//Pros: few diverging branches, so good for sorting on a single GPU thread/warp.
//Con: requires auxiliary storage, this version only works for integers (although float/double is possible)
template<typename Ordinal, typename ValueType, typename PermType, typename Thread>
Contributor


If the above call becomes teamRadixSort, this could be just radixSort.

Contributor Author


This is exactly the same sorting algorithm (thread-local), the only difference is it permutes "perm" along with the values while sorting.

maxVal = values[i];
}
//apply a bias so that key range always starts at 0
//also invert key values here for a descending sort
Contributor


Will this work on Volta? I thought you would need a barrier here, since you use minVal right away. Maybe not, because minVal is thread-local?

Contributor Author


Yes, everything inside radixSort/radixSort2 is thread-local, so I don't think warp divergence can cause any issues.
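The key transform discussed above can be sketched as two small helpers. This is a hedged illustration with hypothetical names, not the PR's code: subtracting the minimum biases every key so the range starts at 0 (fewer radix digits to process), and flipping keys against the maximum turns an ascending radix sort into a descending one.

```cpp
#include <cstdint>

// Bias a key so the smallest key in the array maps to 0.
uint32_t biasKey(uint32_t key, uint32_t minVal) {
  return key - minVal;
}

// Invert a biased key: ascending order on the inverted keys is
// descending order on the original keys.
uint32_t invertKey(uint32_t key, uint32_t minVal, uint32_t maxVal) {
  return (maxVal - minVal) - (key - minVal);
}
```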

//Good diagram of the algorithm at https://en.wikipedia.org/wiki/Bitonic_sorter
template<typename Ordinal, typename ValueType, typename TeamMember, typename Comparator = DefaultComparator<ValueType>>
KOKKOS_INLINE_FUNCTION void
bitonicSortTeam(ValueType* values, Ordinal n, const TeamMember mem)
Contributor


You went with "Team" as a suffix. Let's make sure we are consistent.

Contributor Author


I will add "Thread" as a suffix to radix then, that seems the most clear.
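The bitonicSortTeam signature quoted above takes the comparator as a defaulted template parameter. As a hedged, serial sketch of the same sorting network (standalone C++, hypothetical names; the PR's version distributes these compare-exchange steps across a TeamThreadRange and handles non-power-of-two sizes):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Serial bitonic sorting network with a custom comparator defaulting to
// "operator<", as in the PR. This sketch assumes n is a power of two.
template <typename T, typename Comparator = std::less<T>>
void bitonicSortSerial(std::vector<T>& a, Comparator comp = Comparator()) {
  const std::size_t n = a.size();              // must be a power of two
  for (std::size_t k = 2; k <= n; k *= 2) {    // length of bitonic runs
    for (std::size_t j = k / 2; j > 0; j /= 2) {  // compare-exchange distance
      for (std::size_t i = 0; i < n; i++) {
        std::size_t partner = i ^ j;
        if (partner > i) {
          bool ascending = (i & k) == 0;       // direction of this run
          // Swap when the pair is out of order for the run's direction.
          if (ascending ? comp(a[partner], a[i]) : comp(a[i], a[partner]))
            std::swap(a[i], a[partner]);
        }
      }
    }
  }
}
```

Every element participates in the same sequence of compare-exchange steps regardless of the data, which is why the network maps well to GPUs: no diverging branches, and the fixed access pattern can be coalesced. Passing `std::greater<T>()` as the comparator yields a descending sort.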


//Sort "values", while applying the same swaps to "perm"
template<typename Ordinal, typename ValueType, typename PermType, typename TeamMember, typename Comparator = DefaultComparator<ValueType>>
KOKKOS_INLINE_FUNCTION void
Contributor


Same as above: "Team" is a suffix here.

else
{
//Partition the data equally among fixed number of teams
const Ordinal numTeams = 256;
Contributor


Should this be some function of n and the hardware?

Contributor Author


I just tried to pick a power-of-two value greater than the number of SMs on any existing GPU, so that this function gets full occupancy. Does Kokkos have a way to get the number of SMs? With that I could compute this at runtime. That said, I don't think it will have a big effect on performance, since CUDA will schedule multiple blocks onto one SM if there is enough register space (I think).

Contributor Author


@srajama1 I just checked and Kokkos::Cuda::concurrency() returns the maximum concurrent number of threads over all SMs. Kokkos::Cuda does provide the SM count through cuda_internal_multiprocessor_count(). I'll look at how other KK functions choose team size and then do something based on that.

@@ -310,6 +212,33 @@ namespace Experimental {
CcolindsT ABpermAux;
};

#ifdef KOKKOS_ENABLE_CUDA
Contributor


Why is this only on CUDA?

Contributor Author


@srajama1 Because radix sort is the best on CPUs, and bitonic is the best on CUDA. This is just a template specialization of SortEntriesFunctor that chooses the best implementation for the given execution space.
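The dispatch pattern described here can be sketched without Kokkos. This is a hedged illustration with stand-in tag types, not the PR's code; the real specialization is guarded by #ifdef KOKKOS_ENABLE_CUDA and is templated on a Kokkos execution space.

```cpp
#include <string>

struct SerialSpace {};  // stand-in for a host execution space
struct CudaSpace {};    // stand-in for Kokkos::Cuda

// Generic version: thread-local radix sort, which wins on CPUs.
template <typename ExecSpace>
struct SortEntriesFunctorSketch {
  static std::string algorithm() { return "radix"; }
};

// CUDA specialization: bitonic's uniform compare-exchange network keeps
// occupancy high and memory accesses coalesced on GPUs.
template <>
struct SortEntriesFunctorSketch<CudaSpace> {
  static std::string algorithm() { return "bitonic"; }
};
```

The specialization is resolved at compile time, so each execution space pays nothing for the algorithm it does not use.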

CheckSortedFunctor<ValView>(data), Kokkos::Min<int>(ordered));
ASSERT_TRUE(ordered);
}

Contributor


It would be interesting to add a test that overrides the default comparator.

@@ -0,0 +1,271 @@
/*
//@HEADER
Contributor


I don't see a perf test, but I see comments on performance. How do you test performance? If you have a test, can you please commit it to perf-test? If you use the unit tests, that is fine.

Contributor Author


For the numbers I posted, I just use the unit tests with the array sizes temporarily set much higher. It would be good to copy those over into perf tests.

@srajama1
Contributor

@brian-kelley : Just checking to see if this is ready to go in.

- Implemented fast sorting algorithms meant to be used inside parallel kernels, since Kokkos::sort cannot be called from inside a kernel. Using those now in sparse matrix add for unsorted matrix compression.
- The comparator is the last template param for bitonicSort, bitonicSortTeam and bitonicSortTeam2. The default just does "operator<".
- Use a reasonable fixed chunk size and vary the number of teams.
- Follow KokkosBatched naming conventions (Serial or Team prefix).
- Added testing for custom comparators in bitonicSort.
@ndellingwood
Contributor

Reran spot-check on kokkos-dev-2 with a merge of develop to confirm all tests are passing:

Running on machine: kokkos-dev-2
Going to test compilers:  gcc/7.3.0 gcc/9.1 intel/18.0.5 clang/8.0 cuda/10.1
Testing compiler gcc/7.3.0
Testing compiler gcc/9.1
  Starting job gcc-7.3.0-OpenMP-release
  Starting job gcc-7.3.0-Pthread-release
  PASSED gcc-7.3.0-OpenMP-release
  Starting job gcc-9.1-OpenMP-release
  PASSED gcc-7.3.0-Pthread-release
Testing compiler intel/18.0.5
  Starting job gcc-9.1-Serial-release
  PASSED gcc-9.1-OpenMP-release
Testing compiler clang/8.0
  Starting job intel-18.0.5-OpenMP-release
  PASSED gcc-9.1-Serial-release
  Starting job clang-8.0-Cuda_OpenMP-release
  PASSED intel-18.0.5-OpenMP-release
Testing compiler cuda/10.1
  Starting job clang-8.0-Pthread_Serial-release
  PASSED clang-8.0-Cuda_OpenMP-release
  PASSED clang-8.0-Pthread_Serial-release
  Starting job cuda-10.1-Cuda_OpenMP-release
  PASSED cuda-10.1-Cuda_OpenMP-release
#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=569 run_time=709
clang-8.0-Pthread_Serial-release build_time=352 run_time=879
cuda-10.1-Cuda_OpenMP-release build_time=633 run_time=654
gcc-7.3.0-OpenMP-release build_time=225 run_time=272
gcc-7.3.0-Pthread-release build_time=209 run_time=502
gcc-9.1-OpenMP-release build_time=188 run_time=249
gcc-9.1-Serial-release build_time=162 run_time=502
intel-18.0.5-OpenMP-release build_time=309 run_time=288
#######################################################
FAILED TESTS
#######################################################

Merging, thanks @brian-kelley
