Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use grid_stride_range in kernel loops #1178

Merged
merged 10 commits into from
Jun 27, 2023
3 changes: 3 additions & 0 deletions cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,8 @@ include(cmake/thirdparty/CUSPATIAL_GetCUDF.cmake)
if (CUSPATIAL_BUILD_TESTS)
include(cmake/thirdparty/get_gtest.cmake)
endif()
# find or add ranger
include (cmake/thirdparty/get_ranger.cmake)

###################################################################################################
# - library targets -------------------------------------------------------------------------------
Expand Down Expand Up @@ -204,6 +206,7 @@ target_compile_definitions(cuspatial PUBLIC "SPDLOG_ACTIVE_LEVEL=SPDLOG_LEVEL_${

# Specify the target module library dependencies
target_link_libraries(cuspatial PUBLIC cudf::cudf)
target_link_libraries(cuspatial PRIVATE ranger::ranger)

add_library(cuspatial::cuspatial ALIAS cuspatial)

Expand Down
1 change: 1 addition & 0 deletions cpp/benchmarks/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ target_compile_features(cuspatial_benchmark_common PUBLIC cxx_std_17 cuda_std_17
target_link_libraries(cuspatial_benchmark_common
PUBLIC benchmark::benchmark
cudf::cudftestutil
ranger::ranger
cuspatial)

target_compile_options(cuspatial_benchmark_common
Expand Down
40 changes: 40 additions & 0 deletions cpp/cmake/thirdparty/get_ranger.cmake
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
#=============================================================================
# Copyright (c) 2023, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

function(find_and_configure_ranger)

if(TARGET ranger::ranger)
return()
endif()

set(global_targets ranger::ranger)
set(find_package_args "")

rapids_cpm_find(
ranger 00.01.00
GLOBAL_TARGETS "${global_targets}"
BUILD_EXPORT_SET cuspatial-exports
harrism marked this conversation as resolved.
Show resolved Hide resolved
INSTALL_EXPORT_SET cuspatial-exports
CPM_ARGS
GIT_REPOSITORY https://github.com/harrism/ranger.git
GIT_TAG main
GIT_SHALLOW TRUE
OPTIONS "BUILD_TESTS OFF"
FIND_PACKAGE_ARGUMENTS "${find_package_args}"
)
endfunction()

find_and_configure_ranger()
47 changes: 46 additions & 1 deletion cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -120,9 +120,10 @@ caught during code review, or not enforced.
In general, we recommend following
[C++ Core Guidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines). We also
recommend watching Sean Parent's [C++ Seasoning talk](https://www.youtube.com/watch?v=W2tWOdzgXHA),
and we try to follow his rules: "No raw loops. No raw pointers. No raw synchronization primitives."
and we try to follow his rules: "No raw loops. No raw pointers. No raw synchronization primitives." We also wherever possible add a fourth rule: "No raw kernels".

* Prefer algorithms from STL and Thrust to raw loops.
* Prefer Thrust algorithms to raw kernels.
* For device storage, prefer libcudf and RMM
[owning data structures and views](#libcuspatial-data-structures) to raw pointers and raw memory
allocation. When pointers are used, prefer smart pointers (e.g. `std::shared_ptr` and
Expand All @@ -131,6 +132,50 @@ and we try to follow his rules: "No raw loops. No raw pointers. No raw synchroni

Documentation is discussed in the [Documentation Guide](DOCUMENTATION.md).

### Loops and Grid-stride Loops

Prefer algorithms over raw loops wherever possible, as mentioned above. However, avoiding raw loops is not always possible. C++ range-based for loops can make raw loops much
clearer, and cuSpatial uses [Ranger](https://github.com/harrism/ranger) for this purpose.
Ranger provides range helpers with iterators that can be passed to range-based for loops. Of special importance is `ranger::grid_stride_range()`, which can be used to iterate over
a range in parallel using all threads of a CUDA grid.

When writing custom kernels, grid stride ranges help ensure kernels are adaptable to a
variety of grid shapes, most notably when there are fewer total threads than there are
data items. Instead of:

```c++
__global__ void foo(int n, int* data) {
auto const idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) return;

// process data
}
```

A grid-stride loop ensures all of data is processed even if there are fewer than n threads:

```c++
__global__ void foo(int n, int* data) {
for (auto const idx = threadIdx.x + blockIdx.x * blockDim.x;
idx < n;
idx += blockDim.x * gridDim.x) {
// process data
}
}
```

With ranger, the code is even clearer and less error prone:

```c++
#include <ranger/ranger.hpp>

__global__ void foo(int n, int* data) {
for (auto const idx = ranger::grid_stride_range(n)) {
// process data
}
}
```

### Includes

The following guidelines apply to organizing `#include` lines.
Expand Down
85 changes: 42 additions & 43 deletions cpp/include/cuspatial/detail/distance/hausdorff.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
* Copyright (c) 2022-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -24,6 +24,8 @@
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <ranger/ranger.hpp>

#include <thrust/advance.h>
#include <thrust/binary_search.h>
#include <thrust/distance.h>
Expand Down Expand Up @@ -87,49 +89,46 @@ __global__ void kernel_hausdorff(
using Point = typename std::iterator_traits<PointIt>::value_type;

// determine the LHS point this thread is responsible for.
auto const thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
Index const lhs_p_idx = thread_idx;

if (lhs_p_idx >= num_points) { return; }

auto const lhs_space_iter =
thrust::upper_bound(thrust::seq, space_offsets, space_offsets + num_spaces, lhs_p_idx);
// determine the LHS space this point belongs to.
Index const lhs_space_idx = thrust::distance(space_offsets, thrust::prev(lhs_space_iter));

// get the coordinates of this LHS point.
Point const lhs_p = points[lhs_p_idx];

// loop over each RHS space, as determined by spa ce_offsets
for (uint32_t rhs_space_idx = 0; rhs_space_idx < num_spaces; rhs_space_idx++) {
// determine the begin/end offsets of points contained within this RHS space.
Index const rhs_p_idx_begin = space_offsets[rhs_space_idx];
Index const rhs_p_idx_end =
(rhs_space_idx + 1 == num_spaces) ? num_points : space_offsets[rhs_space_idx + 1];

// each space must contain at least one point, this initial value is just an identity value to
// simplify calculations. If a space contains <= 0 points, then this initial value will be
// written to the output, which can serve as a signal that the input is ill-formed.
auto min_distance_squared = std::numeric_limits<T>::max();

// loop over each point in the current RHS space
for (uint32_t rhs_p_idx = rhs_p_idx_begin; rhs_p_idx < rhs_p_idx_end; rhs_p_idx++) {
// get the x and y coordinate of this RHS point
Point const rhs_p = thrust::raw_reference_cast(points[rhs_p_idx]);

// get distance between the LHS and RHS point
auto const distance_squared = magnitude_squared(rhs_p.x - lhs_p.x, rhs_p.y - lhs_p.y);

// remember only smallest distance from this LHS point to any RHS point.
min_distance_squared = ::min(min_distance_squared, distance_squared);
for (auto lhs_p_idx : ranger::grid_stride_range(num_points)) {
auto const lhs_space_iter =
thrust::upper_bound(thrust::seq, space_offsets, space_offsets + num_spaces, lhs_p_idx);
// determine the LHS space this point belongs to.
Index const lhs_space_idx = thrust::distance(space_offsets, thrust::prev(lhs_space_iter));

// get the coordinates of this LHS point.
Point const lhs_p = points[lhs_p_idx];

// loop over each RHS space, as determined by spa ce_offsets
for (uint32_t rhs_space_idx = 0; rhs_space_idx < num_spaces; rhs_space_idx++) {
// determine the begin/end offsets of points contained within this RHS space.
Index const rhs_p_idx_begin = space_offsets[rhs_space_idx];
Index const rhs_p_idx_end =
(rhs_space_idx + 1 == num_spaces) ? num_points : space_offsets[rhs_space_idx + 1];

// each space must contain at least one point, this initial value is just an identity value to
// simplify calculations. If a space contains <= 0 points, then this initial value will be
// written to the output, which can serve as a signal that the input is ill-formed.
auto min_distance_squared = std::numeric_limits<T>::max();

// loop over each point in the current RHS space
for (uint32_t rhs_p_idx = rhs_p_idx_begin; rhs_p_idx < rhs_p_idx_end; rhs_p_idx++) {
// get the x and y coordinate of this RHS point
Point const rhs_p = thrust::raw_reference_cast(points[rhs_p_idx]);

// get distance between the LHS and RHS point
auto const distance_squared = magnitude_squared(rhs_p.x - lhs_p.x, rhs_p.y - lhs_p.y);

// remember only smallest distance from this LHS point to any RHS point.
min_distance_squared = ::min(min_distance_squared, distance_squared);
}

// determine the output offset for this pair of spaces (LHS, RHS)
Index output_idx = rhs_space_idx * num_spaces + lhs_space_idx;

// use atomicMax to find the maximum of the minimum distance calculated for each space pair.
atomicMax(&thrust::raw_reference_cast(*(results + output_idx)),
static_cast<T>(std::sqrt(min_distance_squared)));
}

// determine the output offset for this pair of spaces (LHS, RHS)
Index output_idx = rhs_space_idx * num_spaces + lhs_space_idx;

// use atomicMax to find the maximum of the minimum distance calculated for each space pair.
atomicMax(&thrust::raw_reference_cast(*(results + output_idx)),
static_cast<T>(std::sqrt(min_distance_squared)));
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,8 @@
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <ranger/ranger.hpp>

#include <thrust/uninitialized_fill.h>

namespace cuspatial {
Expand All @@ -37,8 +39,7 @@ void __global__ simple_find_and_combine_segments_kernel(OffsetRange offsets,
SegmentRange segments,
OutputIt merged_flag)
{
for (auto pair_idx = threadIdx.x + blockIdx.x * blockDim.x; pair_idx < offsets.size() - 1;
pair_idx += gridDim.x * blockDim.x) {
for (auto pair_idx : ranger::grid_stride_range(offsets.size() - 1)) {
// Zero-initialize flags for all segments in current space.
for (auto i = offsets[pair_idx]; i < offsets[pair_idx + 1]; i++) {
merged_flag[i] = 0;
Expand Down
5 changes: 3 additions & 2 deletions cpp/include/cuspatial/detail/find/find_duplicate_points.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <ranger/ranger.hpp>

#include <thrust/uninitialized_fill.h>

namespace cuspatial {
Expand All @@ -35,8 +37,7 @@ template <typename MultiPointRange, typename OutputIt>
void __global__ find_duplicate_points_kernel_simple(MultiPointRange multipoints,
OutputIt duplicate_flags)
{
for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multipoints.size();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multipoints.size())) {
auto multipoint = multipoints[idx];
auto global_offset = multipoints.offsets_begin()[idx];

Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
* Copyright (c) 2022-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -23,6 +23,8 @@

#include <rmm/cuda_stream_view.hpp>

#include <ranger/ranger.hpp>

#include <thrust/tuple.h>

namespace cuspatial {
Expand All @@ -38,8 +40,7 @@ __global__ void count_intersection_and_overlaps_simple(MultiLinestringRange1 mul
OutputIt2 segment_count_it)
{
using T = typename MultiLinestringRange1::element_t;
for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multilinestrings1.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multilinestrings1.num_points())) {
auto const part_idx = multilinestrings1.part_idx_from_point_idx(idx);
if (!multilinestrings1.is_valid_segment_id(idx, part_idx)) continue;
auto const geometry_idx = multilinestrings1.geometry_idx_from_part_idx(part_idx);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@
#include <rmm/exec_policy.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <ranger/ranger.hpp>

#include <thrust/binary_search.h>
#include <thrust/distance.h>
#include <thrust/iterator/counting_iterator.h>
Expand Down Expand Up @@ -395,8 +397,7 @@ void __global__ pairwise_linestring_intersection_simple(MultiLinestringRange1 mu
using types_t = uint8_t;
using count_t = iterator_value_type<Offsets1>;

for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multilinestrings1.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multilinestrings1.num_points())) {
if (auto const part_idx_opt = multilinestrings1.part_idx_from_segment_idx(idx);
part_idx_opt.has_value()) {
auto const part_idx = part_idx_opt.value();
Expand Down
13 changes: 7 additions & 6 deletions cpp/include/cuspatial/detail/kernel/pairwise_distance.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,14 @@
#include <cuspatial/detail/utility/device_atomics.cuh>
#include <cuspatial/detail/utility/linestring.cuh>

#include <limits>
#include <rmm/device_uvector.hpp>

#include <ranger/ranger.hpp>

#include <thrust/optional.h>

#include <limits>

namespace cuspatial {
namespace detail {

Expand Down Expand Up @@ -52,8 +55,7 @@ __global__ void linestring_distance(MultiLinestringRange1 multilinestrings1,
{
using T = typename MultiLinestringRange1::element_t;

for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multilinestrings1.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multilinestrings1.num_points())) {
auto const part_idx = multilinestrings1.part_idx_from_point_idx(idx);
if (!multilinestrings1.is_valid_segment_id(idx, part_idx)) continue;
auto const geometry_idx = multilinestrings1.geometry_idx_from_part_idx(part_idx);
Expand Down Expand Up @@ -89,15 +91,14 @@ __global__ void linestring_distance(MultiLinestringRange1 multilinestrings1,
* set to nullopt, no distance computation will be bypassed.
*/
template <class MultiPointRange, class MultiLinestringRange, class OutputIterator>
void __global__ point_linestring_distance(MultiPointRange multipoints,
__global__ void point_linestring_distance(MultiPointRange multipoints,
MultiLinestringRange multilinestrings,
thrust::optional<uint8_t*> intersects,
OutputIterator distances)
{
using T = typename MultiPointRange::element_t;

for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < multilinestrings.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(multilinestrings.num_points())) {
// Search from the part offsets array to determine the part idx of current linestring point
auto part_idx = multilinestrings.part_idx_from_point_idx(idx);
// Pointer to the last point in the linestring, skip iteration.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,8 @@
#include <cuspatial/range/range.cuh>
#include <cuspatial/traits.hpp>

#include <ranger/ranger.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
Expand All @@ -47,8 +49,7 @@ void __global__ pairwise_multipoint_equals_count_kernel(MultiPointRangeA lhs,
{
using T = typename MultiPointRangeA::point_t::value_type;

for (auto idx = threadIdx.x + blockIdx.x * blockDim.x; idx < lhs.num_points();
idx += gridDim.x * blockDim.x) {
for (auto idx : ranger::grid_stride_range(lhs.num_points())) {
auto geometry_id = lhs.geometry_idx_from_point_idx(idx);
vec_2d<T> lhs_point = lhs.point_begin()[idx];
auto rhs_multipoint = rhs[geometry_id];
Expand Down
Loading