Skip to content

Commit

Permalink
Use grid_stride_range in kernel loops (#1178)
Browse files Browse the repository at this point in the history
Adds a dependency on [Ranger](https://github.com/harrism/ranger) and uses `ranger::grid_stride_loop` with a range-based for loop in outer loops in every raw kernel in cuSpatial.  In the few cases where we had raw kernels without grid-stride loops, this adds one. Also adds guidance to the C++ developer guide.

Authors:
  - Mark Harris (https://github.com/harrism)
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #1178
  • Loading branch information
harrism authored Jun 27, 2023
1 parent 3da3332 commit 7cb3000
Show file tree
Hide file tree
Showing 15 changed files with 168 additions and 70 deletions.
3 changes: 3 additions & 0 deletions cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,8 @@ include(cmake/thirdparty/get_cudf.cmake)
if (CUSPATIAL_BUILD_TESTS)
include(cmake/thirdparty/get_gtest.cmake)
endif()
# find or add ranger
include (cmake/thirdparty/get_ranger.cmake)

###################################################################################################
# - library targets -------------------------------------------------------------------------------
Expand Down Expand Up @@ -201,6 +203,7 @@ target_compile_definitions(cuspatial PUBLIC "SPDLOG_ACTIVE_LEVEL=SPDLOG_LEVEL_${

# Specify the target module library dependencies
target_link_libraries(cuspatial PUBLIC cudf::cudf)
target_link_libraries(cuspatial PRIVATE ranger::ranger)

add_library(cuspatial::cuspatial ALIAS cuspatial)

Expand Down
1 change: 1 addition & 0 deletions cpp/benchmarks/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ target_compile_features(cuspatial_benchmark_common PUBLIC cxx_std_17 cuda_std_17
target_link_libraries(cuspatial_benchmark_common
PUBLIC benchmark::benchmark
cudf::cudftestutil
ranger::ranger
cuspatial)

target_compile_options(cuspatial_benchmark_common
Expand Down
40 changes: 40 additions & 0 deletions cpp/cmake/thirdparty/get_ranger.cmake
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
#=============================================================================
# Copyright (c) 2023, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

function(find_and_configure_ranger)

if(TARGET ranger::ranger)
return()
endif()

set(global_targets ranger::ranger)
set(find_package_args "")

rapids_cpm_find(
ranger 00.01.00
GLOBAL_TARGETS "${global_targets}"
BUILD_EXPORT_SET cuspatial-exports
INSTALL_EXPORT_SET cuspatial-exports
CPM_ARGS
GIT_REPOSITORY https://github.com/harrism/ranger.git
GIT_TAG main
GIT_SHALLOW TRUE
OPTIONS "BUILD_TESTS OFF"
FIND_PACKAGE_ARGUMENTS "${find_package_args}"
)
endfunction()

find_and_configure_ranger()
47 changes: 46 additions & 1 deletion cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -120,9 +120,10 @@ caught during code review, or not enforced.
In general, we recommend following
[C++ Core Guidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines). We also
recommend watching Sean Parent's [C++ Seasoning talk](https://www.youtube.com/watch?v=W2tWOdzgXHA),
and we try to follow his rules: "No raw loops. No raw pointers. No raw synchronization primitives."
and we try to follow his rules: "No raw loops. No raw pointers. No raw synchronization primitives." We also wherever possible add a fourth rule: "No raw kernels".
* Prefer algorithms from STL and Thrust to raw loops.
* Prefer Thrust algorithms to raw kernels.
* For device storage, prefer libcudf and RMM
[owning data structures and views](#libcuspatial-data-structures) to raw pointers and raw memory
allocation. When pointers are used, prefer smart pointers (e.g. `std::shared_ptr` and
Expand All @@ -131,6 +132,50 @@ and we try to follow his rules: "No raw loops. No raw pointers. No raw synchroni
Documentation is discussed in the [Documentation Guide](DOCUMENTATION.md).
### Loops and Grid-stride Loops
Prefer algorithms over raw loops wherever possible, as mentioned above. However, avoiding raw loops is not always possible. C++ range-based for loops can make raw loops much
clearer, and cuSpatial uses [Ranger](https://github.com/harrism/ranger) for this purpose.
Ranger provides range helpers with iterators that can be passed to range-based for loops. Of special importance is `ranger::grid_stride_range()`, which can be used to iterate over
a range in parallel using all threads of a CUDA grid.
When writing custom kernels, grid stride ranges help ensure kernels are adaptable to a
variety of grid shapes, most notably when there are fewer total threads than there are
data items. Instead of:
```c++
__global__ void foo(int n, int* data) {
auto const idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) return;
// process data
}
```

A grid-stride loop ensures all of data is processed even if there are fewer than n threads:

```c++
__global__ void foo(int n, int* data) {
for (auto const idx = threadIdx.x + blockIdx.x * blockDim.x;
idx < n;
idx += blockDim.x * gridDim.x) {
// process data
}
}
```
With ranger, the code is even clearer and less error prone:
```c++
#include <ranger/ranger.hpp>
__global__ void foo(int n, int* data) {
for (auto const idx = ranger::grid_stride_range(n)) {
// process data
}
}
```

### Includes

The following guidelines apply to organizing `#include` lines.
Expand Down
85 changes: 42 additions & 43 deletions cpp/include/cuspatial/detail/distance/hausdorff.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
* Copyright (c) 2022-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -24,6 +24,8 @@
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <ranger/ranger.hpp>

#include <thrust/advance.h>
#include <thrust/binary_search.h>
#include <thrust/distance.h>
Expand Down Expand Up @@ -87,49 +89,46 @@ __global__ void kernel_hausdorff(
using Point = typename std::iterator_traits<PointIt>::value_type;

// determine the LHS point this thread is responsible for.
auto const thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
Index const lhs_p_idx = thread_idx;

if (lhs_p_idx >= num_points) { return; }

auto const lhs_space_iter =
thrust::upper_bound(thrust::seq, space_offsets, space_offsets + num_spaces, lhs_p_idx);
// determine the LHS space this point belongs to.
Index const lhs_space_idx = thrust::distance(space_offsets, thrust::prev(lhs_space_iter));

// get the coordinates of this LHS point.
Point const lhs_p = points[lhs_p_idx];

// loop over each RHS space, as determined by spa ce_offsets
for (uint32_t rhs_space_idx = 0; rhs_space_idx < num_spaces; rhs_space_idx++) {
// determine the begin/end offsets of points contained within this RHS space.
Index const rhs_p_idx_begin = space_offsets[rhs_space_idx];
Index const rhs_p_idx_end =
(rhs_space_idx + 1 == num_spaces) ? num_points : space_offsets[rhs_space_idx + 1];

// each space must contain at least one point, this initial value is just an identity value to
// simplify calculations. If a space contains <= 0 points, then this initial value will be
// written to the output, which can serve as a signal that the input is ill-formed.
auto min_distance_squared = std::numeric_limits<T>::max();

// loop over each point in the current RHS space
for (uint32_t rhs_p_idx = rhs_p_idx_begin; rhs_p_idx < rhs_p_idx_end; rhs_p_idx++) {
// get the x and y coordinate of this RHS point
Point const rhs_p = thrust::raw_reference_cast(points[rhs_p_idx]);

// get distance between the LHS and RHS point
auto const distance_squared = magnitude_squared(rhs_p.x - lhs_p.x, rhs_p.y - lhs_p.y);

// remember only smallest distance from this LHS point to any RHS point.
min_distance_squared = ::min(min_distance_squared, distance_squared);
for (auto lhs_p_idx : ranger::grid_stride_range(num_points)) {
auto const lhs_space_iter =
thrust::upper_bound(thrust::seq, space_offsets, space_offsets + num_spaces, lhs_p_idx);
// determine the LHS space this point belongs to.
Index const lhs_space_idx = thrust::distance(space_offsets, thrust::prev(lhs_space_iter));

// get the coordinates of this LHS point.
Point const lhs_p = points[lhs_p_idx];

// loop over each RHS space, as determined by spa ce_offsets
for (uint32_t rhs_space_idx = 0; rhs_space_idx < num_spaces; rhs_space_idx++) {
// determine the begin/end offsets of points contained within this RHS space.
Index const rhs_p_idx_begin = space_offsets[rhs_space_idx];
Index const rhs_p_idx_end =
(rhs_space_idx + 1 == num_spaces) ? num_points : space_offsets[rhs_space_idx + 1];

// each space must contain at least one point, this initial value is just an identity value to
// simplify calculations. If a space contains <= 0 points, then this initial value will be
// written to the output, which can serve as a signal that the input is ill-formed.
auto min_distance_squared = std::numeric_limits<T>::max();

// loop over each point in the current RHS space
for (uint32_t rhs_p_idx = rhs_p_idx_begin; rhs_p_idx < rhs_p_idx_end; rhs_p_idx++) {
// get the x and y coordinate of this RHS point
Point const rhs_p = thrust::raw_reference_cast(points[rhs_p_idx]);

// get distance between the LHS and RHS point
auto const distance_squared = magnitude_squared(rhs_p.x - lhs_p.x, rhs_p.y - lhs_p.y);

// remember only smallest distance from this LHS point to any RHS point.
min_distance_squared = ::min(min_distance_squared, distance_squared);
}

// determine the output offset for this pair of spaces (LHS, RHS)
Index output_idx = rhs_space_idx * num_spaces + lhs_space_idx;

// use atomicMax to find the maximum of the minimum distance calculated for each space pair.
atomicMax(&thrust::raw_reference_cast(*(results + output_idx)),
static_cast<T>(std::sqrt(min_distance_squared)));
}

// determine the output offset for this pair of spaces (LHS, RHS)
Index output_idx = rhs_space_idx * num_spaces + lhs_space_idx;

// use atomicMax to find the maximum of the minimum distance calculated for each space pair.
atomicMax(&thrust::raw_reference_cast(*(results + output_idx)),
static_cast<T>(std::sqrt(min_distance_squared)));
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,8 @@
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <ranger/ranger.hpp>

#include <thrust/uninitialized_fill.h>

namespace cuspatial {
Expand All @@ -37,8 +39,7 @@ void __global__ simple_find_and_combine_segments_kernel(OffsetRange offsets,
SegmentRange segments,
OutputIt merged_flag)
{
for (auto pair_idx = threadIdx.x + blockIdx.x * blockDim.x; pair_idx < offsets.size() - 1;
pair_idx += gridDim.x * blockDim.x) {
for (auto pair_idx : ranger::grid_stride_range(offsets.size() - 1)) {
// Zero-initialize flags for all segments in current space.
for (auto i = offsets[pair_idx]; i < offsets[pair_idx + 1]; i++) {
merged_flag[i] = 0;
Expand Down
5 changes: 3 additions & 2 deletions cpp/include/cuspatial/detail/find/find_duplicate_points.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <ranger/ranger.hpp>

#include <thrust/uninitialized_fill.h>

namespace cuspatial {
Expand All @@ -35,8 +37,7 @@ template <typename MultiPointRange, typename OutputIt>
void __global__ find_duplicate_points_kernel_simple(MultiPointRange multipoints,
OutputIt duplicate_flags)
{
for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multipoints.size();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multipoints.size())) {
auto multipoint = multipoints[idx];
auto global_offset = multipoints.offsets_begin()[idx];

Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
* Copyright (c) 2022-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -23,6 +23,8 @@

#include <rmm/cuda_stream_view.hpp>

#include <ranger/ranger.hpp>

#include <thrust/tuple.h>

namespace cuspatial {
Expand All @@ -38,8 +40,7 @@ __global__ void count_intersection_and_overlaps_simple(MultiLinestringRange1 mul
OutputIt2 segment_count_it)
{
using T = typename MultiLinestringRange1::element_t;
for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multilinestrings1.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multilinestrings1.num_points())) {
auto const part_idx = multilinestrings1.part_idx_from_point_idx(idx);
if (!multilinestrings1.is_valid_segment_id(idx, part_idx)) continue;
auto const geometry_idx = multilinestrings1.geometry_idx_from_part_idx(part_idx);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@
#include <rmm/exec_policy.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <ranger/ranger.hpp>

#include <thrust/binary_search.h>
#include <thrust/distance.h>
#include <thrust/iterator/counting_iterator.h>
Expand Down Expand Up @@ -395,8 +397,7 @@ void __global__ pairwise_linestring_intersection_simple(MultiLinestringRange1 mu
using types_t = uint8_t;
using count_t = iterator_value_type<Offsets1>;

for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multilinestrings1.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multilinestrings1.num_points())) {
if (auto const part_idx_opt = multilinestrings1.part_idx_from_segment_idx(idx);
part_idx_opt.has_value()) {
auto const part_idx = part_idx_opt.value();
Expand Down
13 changes: 7 additions & 6 deletions cpp/include/cuspatial/detail/kernel/pairwise_distance.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,14 @@
#include <cuspatial/detail/utility/device_atomics.cuh>
#include <cuspatial/detail/utility/linestring.cuh>

#include <limits>
#include <rmm/device_uvector.hpp>

#include <ranger/ranger.hpp>

#include <thrust/optional.h>

#include <limits>

namespace cuspatial {
namespace detail {

Expand Down Expand Up @@ -52,8 +55,7 @@ __global__ void linestring_distance(MultiLinestringRange1 multilinestrings1,
{
using T = typename MultiLinestringRange1::element_t;

for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multilinestrings1.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multilinestrings1.num_points())) {
auto const part_idx = multilinestrings1.part_idx_from_point_idx(idx);
if (!multilinestrings1.is_valid_segment_id(idx, part_idx)) continue;
auto const geometry_idx = multilinestrings1.geometry_idx_from_part_idx(part_idx);
Expand Down Expand Up @@ -89,15 +91,14 @@ __global__ void linestring_distance(MultiLinestringRange1 multilinestrings1,
* set to nullopt, no distance computation will be bypassed.
*/
template <class MultiPointRange, class MultiLinestringRange, class OutputIterator>
void __global__ point_linestring_distance(MultiPointRange multipoints,
__global__ void point_linestring_distance(MultiPointRange multipoints,
MultiLinestringRange multilinestrings,
thrust::optional<uint8_t*> intersects,
OutputIterator distances)
{
using T = typename MultiPointRange::element_t;

for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multilinestrings.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multilinestrings.num_points())) {
// Search from the part offsets array to determine the part idx of current linestring point
auto part_idx = multilinestrings.part_idx_from_point_idx(idx);
// Pointer to the last point in the linestring, skip iteration.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,8 @@
#include <cuspatial/range/range.cuh>
#include <cuspatial/traits.hpp>

#include <ranger/ranger.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
Expand All @@ -47,8 +49,7 @@ void __global__ pairwise_multipoint_equals_count_kernel(MultiPointRangeA lhs,
{
using T = typename MultiPointRangeA::point_t::value_type;

for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < lhs.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(lhs.num_points())) {
auto geometry_id = lhs.geometry_idx_from_point_idx(idx);
vec_2d<T> lhs_point = lhs.point_begin()[idx];
auto rhs_multipoint = rhs[geometry_id];
Expand Down
Loading

0 comments on commit 7cb3000

Please sign in to comment.