Prevent grid stride loop overflow in libcudf kernels #10368

nvdbaranec · 2022-02-28T19:30:02Z

(updated Aug 2023), (updated Nov 2024)

Background

We found a kernel indexing overflow issue, first discovered in the fused_concatenate kernels (#10333) and this issue is present in a number of our CUDA kernels that take the following form:

size_type output_index = threadIdx.x + blockIdx.x * blockDim.x;  
while (output_index < output_size) {
  output_index += blockDim.x * gridDim.x;
}

If we have an output_size of say 1.2 billion and a grid size that's the same, the following happens: Some late thread id, say 1.19 billion attempts to add 1.2 billion (blockDim.x * gridDim.x) and overflows the size_type (signed 32 bits).

We made a round of fixes in #10448, and then later found another instance of this error in #13838. Our first pass of investigation was not adequate to contain the issue, so we need to take another close look.

Part 1 - First pass fix kernels with this issue

Source file	Kernels	Status
`copying/concatenate.cu`	`fused_concatenate_kernel`	#10448
`valid_if.cuh`	`valid_if_kernel`	#10448
`scatter.cu`	`marking_bitmask_kernel`	#10448
`replace/nulls.cu`	`replace_nulls_strings`	#10448
`replace/nulls.cu`	`replace_nulls`	#10448
`rolling/rolling_detail.cuh`	`gpu_rolling`	#10448
`rolling/jit/kernel.cu`	`gpu_rolling_new`	#10448
`transform/compute_column.cu`	`compute_column_kernel`	#10448
`copying/concatenate.cu`	`fused_concatenate_string_offset_kernel`	#13838
`replace/replace.cu`	`replace_strings_first_pass` `replace_strings_second_pass` `replace_kernel`	#13905
`copying/concatenate.cu`	`concatenate_masks_kernel` `fused_concatenate_string_offset_kernel` `fused_concatenate_string_chars_kernel` `fused_concatenate_kernel` (int64)	#13906
`hash/helper_functions.cuh`	`init_hashtbl`	#13895
`null_mask.cu`	`set_null_mask_kernel` `copy_offset_bitmask` `count_set_bits_kernel`	#13895
`transform/row_bit_count.cu`	`compute_row_sizes`	#13895
`multibyte_split.cu`	`multibyte_split_init_kernel` `multibyte_split_seed_kernel` (auto??) `multibyte_split_kernel`	#13910
IO modules: parquet, orc, json		#13910
`io/utilities/parsing_utils.cu`	`count_and_set_positions` (uint64_t)	#13910
`conditional_join_kernels.cuh`	`compute_conditional_join_output_size` `conditional_join`	#13971
`merge.cu`	`materialize_merged_bitmask_kernel`	#13972
`partitioning.cu`	`compute_row_partition_numbers` `compute_row_output_locations` `copy_block_partitions`	#13973
`json_path.cu`	`get_json_object_kernel`	#13962
`tdigest`	`compute_percentiles_kernel` (int)	#13962
`strings/attributes.cu`	`count_characters_parallel_fn`	#13968
`strings/convert/convert_urls.cu`	`url_decode_char_counter` (int) `url_decode_char_replacer` (int)	#13968
`text/subword/data_normalizer.cu`	`kernel_data_normalizer` (uint32_t)	#13915
`text/subword/subword_tokenize.cu`	`kernel_compute_tensor_metadata` (uint32_t)	#13915
`text/subword/wordpiece_tokenizer.cu`	`init_data_and_mark_word_start_and_ends` (uint32_t) `mark_string_start_and_ends` (uint32_t) `kernel_wordpiece_tokenizer` (uint32_t)	#13915

Part 2 - Take another pass over more challenging kernels

Source file	Kernels	Status
null_mask.cuh	subtract_set_bits_range_boundaries_kernel
valid_if.cuh	valid_if_n_kernel
copy_if_else.cuh	copy_if_else_kernel
gather.cuh	gather_chars_fn_string_parallel
more?	search `gridDim.x` or `blockDim.x` to find more examples

Part 2b - results of searching

Source file: line content	Overflow risk (yes/no)	Status
`cpp/src/binaryop/compiled/binary_ops.cuh`	yes	#17354
`cpp/src/binaryop/jit/kernel.cu`	yes	#17420
`include/cudf/detail/copy_if.cuh: int tid = threadIdx.x + per_thread * block_size * blockIdx.x;` `include/cudf/detail/copy_if.cuh: int tid = threadIdx.x + per_thread * block_size * blockIdx.x;` `include/cudf/detail/copy_if.cuh: int const wid = threadIdx.x / cudf::detail::warp_size;` `include/cudf/detail/copy_if.cuh: int const lane = threadIdx.x % cudf::detail::warp_size;`		17520
`include/cudf/detail/copy_if_else.cuh: int const lane_id = threadIdx.x % cudf::detail::warp_size;`	no	17473
`include/cudf/detail/copy_range.cuh: int const lane_id = threadIdx.x % warp_size;` `include/cudf/detail/copy_range.cuh: cudf::size_type const tid = threadIdx.x + blockIdx.x * blockDim.x;` `include/cudf/detail/copy_range.cuh: source_idx += blockDim.x * gridDim.x;`	yes	#17409
`include/cudf/detail/null_mask.cuh: auto const tid = threadIdx.x + blockIdx.x * blockDim.x;` `include/cudf/detail/null_mask.cuh: destination_word_index += blockDim.x * gridDim.x) {` `include/cudf/detail/null_mask.cuh: size_type const tid = threadIdx.x + blockIdx.x * blockDim.x;` `include/cudf/detail/null_mask.cuh: range_id += blockDim.x * gridDim.x;`	no
`include/cudf/detail/valid_if.cuh: auto block_offset = blockIdx.x * blockDim.x;` `include/cudf/detail/valid_if.cuh: auto const thread_idx = block_offset + threadIdx.x;` `include/cudf/detail/valid_if.cuh: block_offset += blockDim.x * gridDim.x;`	no
`include/cudf/strings/detail/gather.cuh: int global_thread_id = blockIdx.x * blockDim.x + threadIdx.x;` `include/cudf/strings/detail/gather.cuh: int nwarps = gridDim.x * blockDim.x / cudf::detail::warp_size;` `include/cudf/strings/detail/gather.cuh: size_type begin_out_string_idx = blockIdx.x * strings_per_threadblock;`		#17404
`src/join/mixed_join_kernel.cuh: cudf::size_type outer_row_index = threadIdx.x + blockIdx.x * block_size;` `src/join/mixed_join_size_kernel.cuh: cudf::size_type const start_idx = threadIdx.x + blockIdx.x * block_size;`		#17633
`src/partitioning/partitioning.cu: size_type partition_number = threadIdx.x;` `src/partitioning/partitioning.cu: size_type const write_location = partition_number * gridDim.x + blockIdx.x;` `src/partitioning/partitioning.cu: block_partition_offsets[partition_number * gridDim.x + blockIdx.x];` `src/partitioning/partitioning.cu: if (ELEMENTS_PER_THREAD * threadIdx.x + i < num_partitions) {` `src/partitioning/partitioning.cu: block_partition_sizes[blockIdx.x + (ELEMENTS_PER_THREAD * threadIdx.x + i) * gridDim.x];` `src/partitioning/partitioning.cu: if (ELEMENTS_PER_THREAD * threadIdx.x + i < num_partitions) {` `src/partitioning/partitioning.cu: scanned_block_partition_sizes[ipartition * gridDim.x + blockIdx.x];`	yes	#17473
`src/strings/regex/utilities.cuh: auto const thread_idx = threadIdx.x + blockIdx.x * blockDim.x;` `src/strings/regex/utilities.cuh: auto const thread_idx = threadIdx.x + blockIdx.x * blockDim.x;`		#17404
`src/strings/search/find.cu: size_type const idx = static_cast<size_type>(threadIdx.x + blockIdx.x * blockDim.x);` `src/strings/search/find.cu: size_type const idx = static_cast<size_type>(threadIdx.x + blockIdx.x * blockDim.x);`		#17404
`src/text/subword/data_normalizer.cu: uint32_t* block_base = code_points + blockIdx.x * blockDim.x * MAX_NEW_CHARS;`		#17404
`src/text/vocabulary_tokenize.cu: auto const idx = static_cast<std::size_t>(threadIdx.x + blockIdx.x * blockDim.x);`		#17404
`src/transform/compute_column.cu: &intermediate_storage[threadIdx.x * device_expression_data.num_intermediates];`	no
`src/transform/jit/kernel.cu: thread_index_type const start = threadIdx.x + blockIdx.x * blockDim.x;<br>src/transform/jit/kernel.cu: thread_index_type const stride = blockDim.x * gridDim.x;`		#17473
`src/transform/row_bit_count.cu: int const tid = threadIdx.x + blockIdx.x * blockDim.x;` `src/transform/row_bit_count.cu: row_span* my_branch_stack = thread_branch_stacks + (threadIdx.x * max_branch_depth);`		#17473

Additional information

There are also a number of kernels that have this pattern but probably don't ever overflow because they are indexing by bitmask words. (Example)
Additional, In this kernel, source_idx probably overflows, but harmlessly.

A snippet of code to see this in action:

size_type const size = 1200000000;
auto big = cudf::make_fixed_width_column(data_type{type_id::INT32}, size, mask_state::UNALLOCATED);  
auto x = cudf::rolling_window(*big, 1, 1, 1, cudf::detail::sum_aggregation{});

Note: rmm may mask out of bounds accesses in some cases, so it's helpful to run with the plain cuda allocator.

The text was updated successfully, but these errors were encountered:

nvdbaranec · 2022-02-28T19:35:52Z

The fix in basically all of these cases is quite simple: just make the index a size_t

jrhemstad · 2022-02-28T19:53:46Z

I'd love to just add an algorithm to do this.

https://godbolt.org/z/hK95z7zff

harrism · 2022-03-15T05:29:29Z

Or even a simple range helper for for loops: https://github.com/harrism/hemi#simple-grid-stride-loops

harrism · 2022-03-15T05:32:31Z

The fix in basically all of these cases is quite simple: just make the index a size_t

I think the general approach should be:

Use an algorithm (thrust:: or std::) if possible before ever writing a custom kernel -- this way you write a per-element device functor instead, and indexing is handled for you.
If a custom kernel must be written, it should use device-side algorithms instead of raw loops.
If a raw grid-stride loop is required and an existing algorithm won't work, we should provide utilities to abstract the iteration and or the range and use auto for the type to avoid these mistakes.

Partially addresses #10368 Specifically: - `valid_if` - `scatter` - `rolling_window` - `compute_column_kernel` (ast stuff) - `replace_nulls` (fixed-width and strings) The majority of the fixes are simply making the indexing variable a `std::size_t` instead of a `cudf::size_type`. Although scatter had an additional place it was overflowing outside the kernel. I didn't add tests for these fixes, but each of them were individually tested locally to make sure they actually manifested the issue and then were verified with the fixes. Authors: - https://github.com/nvdbaranec Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) - Mark Harris (https://github.com/harrism) - Nghia Truong (https://github.com/ttnghia) URL: #10448

github-actions · 2022-04-16T01:30:43Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec · 2022-04-19T15:31:47Z

Still relevant.

harrism · 2022-04-26T23:29:16Z

Created https://github.com/harrism/ranger as a solution to this. Needs to be moved into libcudf.

github-actions · 2022-05-27T00:11:00Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec · 2022-05-27T14:43:14Z

Still relevant.

github-actions · 2022-06-26T15:03:10Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec · 2022-06-27T15:25:51Z

Still relevant.

github-actions · 2022-09-26T05:30:51Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

GregoryKimball · 2022-11-29T01:12:59Z

Thanks @harrism for creating the ranger repo! Do you think we are ready to kick off an integration with libcudf, or does ranger need more development first?

wence- · 2023-08-10T16:56:32Z

wrt attempting to find locations where this might be happening. In host code, clang and gcc will warn if you add -Wsign-conversion (not covered by -Wall -Wextra) under some circumstances. Unfortunately there is no such option for nvcc.

#include <cstdint>
int what(int upper)
{
  int i = 0; // no warning if this is a std::int64_t
  unsigned int stride = 10;
  while (i < upper) {
    i = i + stride; // clang warns for this, so does gcc
  }
  i = 0;
  while (i < upper) {
    i += stride; // gcc warns for this, clang does not.
  }
  return i;
}

This PR adds `grid_1d::grid_stride()` and uses it in a handful of kernels. Follow-up to #13910, which added a `grid_1d::global_thread_id()`. We'll need to do a later PR that catches any missing instances where this should be used, since there are a large number of PRs in flight touching thread indexing code in various files. See #10368. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13996

…joins (#13971) See #10368 (and more recently #13771 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - David Wendt (https://github.com/davidwendt) URL: #13971

This PR refactors a few kernels to use `thread_index_type` and associated utilities. I started this before realizing how much scope was still left in issue #10368 ("Part 2 - Take another pass over more challenging kernels"), and then I stopped working on this due to time constraints. For the moment, I hope this PR makes a small dent in the number of remaining kernels to convert to using `thread_index_type`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - MithunR (https://github.com/mythrocks) - Mark Harris (https://github.com/harrism) - David Wendt (https://github.com/davidwendt) URL: #14107

For large columns, the computed stride might end up overflowing size_type. To fix this, use the grid_1d helper. See also rapidsai#10368. - Closes rapidsai#17353

For large columns, the computed stride might end up overflowing size_type. To fix this, use the grid_1d helper. See also #10368. - Closes #17353 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) - David Wendt (https://github.com/davidwendt) - Tianyu Liu (https://github.com/kingcrimsontianyu) - Muhammad Haseeb (https://github.com/mhaseeb123) - Nghia Truong (https://github.com/ttnghia) URL: #17354

…culations (#17404) Replaces `threadIdx.x + blockDim.x * blockIdx.x` logic with `grid_1d::global_thread_id()` and `blockDim.x * gridDim.x` with `grid_1d::grid_stride()` in libcudf strings and text source. Reference #10368 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: #17404

Fixes some possible thread-id calculations or usages that may possibly overflow `int32` type or `size_type`. Reference #10368 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Shruti Shivakumar (https://github.com/shrshi) URL: #17473

Fixes possible integer overflow condition when the number of rows is near max int32 in `compute_mixed_join_output_size` kernel function. Reference #10368 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #17633

GregoryKimball · 2025-01-16T23:36:13Z

I believe we've completed several passes over the library and addressed the issues. Please let me know if you identify any cases we've missed.

nvdbaranec added bug Something isn't working Needs Triage Need team to review and classify labels Feb 28, 2022

This was referenced Mar 17, 2022

Batch of fixes for index overflows in grid stride loops. #10448

Merged

Column to JCUDF row for tables with strings #10235

Merged

sameerz mentioned this issue Apr 12, 2022

[TASK] Big Reliability Epic NVIDIA/spark-rapids#1870

Closed

14 tasks

github-actions bot added the inactive-30d label Apr 16, 2022

github-actions bot removed the inactive-30d label Apr 19, 2022

github-actions bot added the inactive-30d label Jun 26, 2022

github-actions bot removed the inactive-30d label Jun 27, 2022

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 28, 2022

github-actions bot added the inactive-90d label Sep 26, 2022

nvdbaranec mentioned this issue Nov 9, 2022

[BUG] strided loop overflow observed while joining against build table with 1.2B rows #12108

Closed

GregoryKimball removed the inactive-90d label Apr 3, 2023

GregoryKimball added this to the Helps developer velocity milestone Apr 3, 2023

GregoryKimball mentioned this issue Aug 10, 2023

[BUG] Errors converting tables from arrow to cuDF #13771

Closed

GregoryKimball modified the milestones: Helps developer velocity, Stabilizing large workflows (OOM, spilling, partitioning) Aug 10, 2023

vyasr mentioned this issue Aug 18, 2023

Add minhash support for MurmurHash3_x64_128 #13796

Merged

3 tasks

vyasr mentioned this issue Aug 25, 2023

Use thread_index_type to avoid out of bounds accesses in conditional joins #13971

Merged

3 tasks

bdice mentioned this issue Aug 29, 2023

Use grid_stride for stride computations. #13996

Merged

3 tasks

bdice mentioned this issue Sep 13, 2023

Some additional kernel thread index refactoring. #14107

Merged

3 tasks

wence- mentioned this issue Sep 18, 2023

Long string optimization for string column parsing in JSON reader #13803

Merged

3 tasks

GregoryKimball removed the status in libcudf Oct 26, 2023

GregoryKimball moved this to To be revisited in libcudf Oct 26, 2023

bdice mentioned this issue Jan 2, 2024

[DRAFT] refactor to grid stride for data_type_detection kernel #14694

Closed

3 tasks

vyasr removed the helps: Dask label Feb 23, 2024

bdice mentioned this issue Nov 18, 2024

[BUG] Integer overflow in compiled binops for large tables. #17353

Closed

wence- added a commit to wence-/cudf that referenced this issue Nov 18, 2024

Fix integer overflow in compiled binaryop

28908e4

For large columns, the computed stride might end up overflowing size_type. To fix this, use the grid_1d helper. See also rapidsai#10368. - Closes rapidsai#17353

wence- mentioned this issue Nov 18, 2024

Fix integer overflow in compiled binaryop #17354

Merged

3 tasks

GregoryKimball changed the title ~~Use "ranger" to prevent grid stride loop overflow~~ Prevent grid stride loop overflow in libcudf kernels Nov 19, 2024

davidwendt mentioned this issue Nov 21, 2024

Update strings/text source to use grid_1d for thread/block/stride calculations #17404

Merged

3 tasks

davidwendt mentioned this issue Dec 2, 2024

Fix some possible thread-id overflow calculations #17473

Merged

3 tasks

GregoryKimball moved this from To be revisited to In progress in libcudf Dec 2, 2024

davidwendt mentioned this issue Dec 19, 2024

Fix possible int overflow in compute_mixed_join_output_size #17633

Merged

3 tasks

GregoryKimball closed this as completed Jan 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent grid stride loop overflow in libcudf kernels #10368

Prevent grid stride loop overflow in libcudf kernels #10368

nvdbaranec commented Feb 28, 2022 •

edited by davidwendt

Loading

nvdbaranec commented Feb 28, 2022

jrhemstad commented Feb 28, 2022

harrism commented Mar 15, 2022

harrism commented Mar 15, 2022

github-actions bot commented Apr 16, 2022

nvdbaranec commented Apr 19, 2022

harrism commented Apr 26, 2022

github-actions bot commented May 27, 2022

nvdbaranec commented May 27, 2022

github-actions bot commented Jun 26, 2022

nvdbaranec commented Jun 27, 2022

github-actions bot commented Sep 26, 2022

GregoryKimball commented Nov 29, 2022

wence- commented Aug 10, 2023

GregoryKimball commented Jan 16, 2025

Prevent grid stride loop overflow in libcudf kernels #10368

Prevent grid stride loop overflow in libcudf kernels #10368

Comments

nvdbaranec commented Feb 28, 2022 • edited by davidwendt Loading

Background

Part 1 - First pass fix kernels with this issue

Part 2 - Take another pass over more challenging kernels

Part 2b - results of searching

Additional information

nvdbaranec commented Feb 28, 2022

jrhemstad commented Feb 28, 2022

harrism commented Mar 15, 2022

harrism commented Mar 15, 2022

github-actions bot commented Apr 16, 2022

nvdbaranec commented Apr 19, 2022

harrism commented Apr 26, 2022

github-actions bot commented May 27, 2022

nvdbaranec commented May 27, 2022

github-actions bot commented Jun 26, 2022

nvdbaranec commented Jun 27, 2022

github-actions bot commented Sep 26, 2022

GregoryKimball commented Nov 29, 2022

wence- commented Aug 10, 2023

GregoryKimball commented Jan 16, 2025

nvdbaranec commented Feb 28, 2022 •

edited by davidwendt

Loading