Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use "ranger" to prevent grid stride loop overflow #10368

Open
nvdbaranec opened this issue Feb 28, 2022 · 15 comments
Open

Use "ranger" to prevent grid stride loop overflow #10368

nvdbaranec opened this issue Feb 28, 2022 · 15 comments
Labels
1 - On Deck To be worked on next bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@nvdbaranec
Copy link
Contributor

nvdbaranec commented Feb 28, 2022

(updated Aug 2023)

Background

We found a kernel indexing overflow issue, first discovered in the fused_concatenate kernels (#10333) and this issue is present in a number of our CUDA kernels that take the following form:

size_type output_index = threadIdx.x + blockIdx.x * blockDim.x;  
while (output_index < output_size) {
  output_index += blockDim.x * gridDim.x;
}

If we have an output_size of say 1.2 billion and a grid size that's the same, the following happens: Some late thread id, say 1.19 billion attempts to add 1.2 billion (blockDim.x * gridDim.x) and overflows the size_type (signed 32 bits).

We made a round of fixes in #10448, and then later found another instance of this error in #13838. Our first pass of investigation was not adequate to contain the issue, so we need to take another close look.

Part 1 - First pass fix kernels with this issue

Source file Kernels Status
copying/concatenate.cu fused_concatenate_kernel #10448
valid_if.cuh valid_if_kernel #10448
scatter.cu marking_bitmask_kernel #10448
replace/nulls.cu replace_nulls_strings #10448
replace/nulls.cu replace_nulls #10448
rolling/rolling_detail.cuh gpu_rolling #10448
rolling/jit/kernel.cu gpu_rolling_new #10448
transform/compute_column.cu compute_column_kernel #10448
copying/concatenate.cu fused_concatenate_string_offset_kernel #13838
replace/replace.cu replace_strings_first_pass
replace_strings_second_pass
replace_kernel
#13905
copying/concatenate.cu concatenate_masks_kernel
fused_concatenate_string_offset_kernel
fused_concatenate_string_chars_kernel
fused_concatenate_kernel (int64)
#13906
hash/helper_functions.cuh init_hashtbl #13895
null_mask.cu set_null_mask_kernel
copy_offset_bitmask
count_set_bits_kernel
#13895
transform/row_bit_count.cu compute_row_sizes #13895
multibyte_split.cu multibyte_split_init_kernel
multibyte_split_seed_kernel (auto??)
multibyte_split_kernel
#13910
IO modules: parquet, orc, json #13910
io/utilities/parsing_utils.cu count_and_set_positions (uint64_t) #13910
conditional_join_kernels.cuh compute_conditional_join_output_size
conditional_join
#13971
merge.cu materialize_merged_bitmask_kernel #13972
partitioning.cu compute_row_partition_numbers
compute_row_output_locations
copy_block_partitions
#13973
json_path.cu get_json_object_kernel #13962
tdigest compute_percentiles_kernel (int) #13962
strings/attributes.cu count_characters_parallel_fn #13968
strings/convert/convert_urls.cu url_decode_char_counter (int)
url_decode_char_replacer (int)
#13968
text/subword/data_normalizer.cu kernel_data_normalizer (uint32_t) #13915
text/subword/subword_tokenize.cu kernel_compute_tensor_metadata (uint32_t) #13915
text/subword/wordpiece_tokenizer.cu init_data_and_mark_word_start_and_ends (uint32_t)
mark_string_start_and_ends (uint32_t)
kernel_wordpiece_tokenizer (uint32_t)
#13915

Part 2 - Take another pass over more challenging kernels

Source file Kernels Status
null_mash.cuh subtract_set_bits_range_boundaries_kernel
valid_if.cuh valid_if_n_kernel
copy_if_else.cuh copy_if_else_kernel
gather.cuh gather_chars_fn_string_parallel
more? search gridDim.x or blockDim.x to find more examples

Part 3 - Use ranger to prevent grid stride loop overflow

  • incorporate the ranger header as a libcudf utility
  • use ranger instead of manual indexing in libcudf kernels

Additional information

There are also a number of kernels that have this pattern but probably don't ever overflow because they are indexing by bitmask words. (Example)
Additional, In this kernel, source_idx probably overflows, but harmlessly.

A snippet of code to see this in action:

size_type const size = 1200000000;
auto big = cudf::make_fixed_width_column(data_type{type_id::INT32}, size, mask_state::UNALLOCATED);  
auto x = cudf::rolling_window(*big, 1, 1, 1, cudf::detail::sum_aggregation{}); 

Note: rmm may mask out of bounds accesses in some cases, so it's helpful to run with the plain cuda allocator.

@nvdbaranec nvdbaranec added bug Something isn't working Needs Triage Need team to review and classify labels Feb 28, 2022
@nvdbaranec
Copy link
Contributor Author

The fix in basically all of these cases is quite simple: just make the index a size_t

@jrhemstad
Copy link
Contributor

I'd love to just add an algorithm to do this.

https://godbolt.org/z/hK95z7zff

@harrism
Copy link
Member

harrism commented Mar 15, 2022

Or even a simple range helper for for loops: https://github.com/harrism/hemi#simple-grid-stride-loops

@harrism
Copy link
Member

harrism commented Mar 15, 2022

The fix in basically all of these cases is quite simple: just make the index a size_t

I think the general approach should be:

  1. Use an algorithm (thrust:: or std::) if possible before ever writing a custom kernel -- this way you write a per-element device functor instead, and indexing is handled for you.
  2. If a custom kernel must be written, it should use device-side algorithms instead of raw loops.
  3. If a raw grid-stride loop is required and an existing algorithm won't work, we should provide utilities to abstract the iteration and or the range and use auto for the type to avoid these mistakes.

rapids-bot bot pushed a commit that referenced this issue Mar 23, 2022
Partially addresses  #10368

Specifically:
- `valid_if`
- `scatter`
- `rolling_window`
- `compute_column_kernel` (ast stuff)
- `replace_nulls` (fixed-width and strings)

The majority of the fixes are simply making the indexing variable a `std::size_t` instead of a `cudf::size_type`.  Although scatter had an additional place it was overflowing outside the kernel.

I didn't add tests for these fixes, but each of them were individually tested locally to make sure they actually manifested the issue and then were verified with the fixes.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10448
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@nvdbaranec
Copy link
Contributor Author

Still relevant.

@harrism
Copy link
Member

harrism commented Apr 26, 2022

Created https://github.com/harrism/ranger as a solution to this. Needs to be moved into libcudf.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@nvdbaranec
Copy link
Contributor Author

Still relevant.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@nvdbaranec
Copy link
Contributor Author

Still relevant.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 28, 2022
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball
Copy link
Contributor

Thanks @harrism for creating the ranger repo! Do you think we are ready to kick off an integration with libcudf, or does ranger need more development first?

@GregoryKimball GregoryKimball changed the title [BUG] Issues with grid stride loops overflowing in multiple cudf kernels. Use mharris' "ranger" to prevent grid stride loop overflow Apr 3, 2023
@GregoryKimball GregoryKimball changed the title Use mharris' "ranger" to prevent grid stride loop overflow Use "ranger" to prevent grid stride loop overflow Apr 3, 2023
@harrism
Copy link
Member

harrism commented Jun 6, 2023

@GregoryKimball I have now created a PR to use ranger in libcuspatial. You guys could use this as an example if you want to do the same in libcudf. rapidsai/cuspatial#1178

@wence-
Copy link
Contributor

wence- commented Aug 10, 2023

wrt attempting to find locations where this might be happening. In host code, clang and gcc will warn if you add -Wsign-conversion (not covered by -Wall -Wextra) under some circumstances. Unfortunately there is no such option for nvcc.

#include <cstdint>
int what(int upper)
{
  int i = 0; // no warning if this is a std::int64_t
  unsigned int stride = 10;
  while (i < upper) {
    i = i + stride; // clang warns for this, so does gcc
  }
  i = 0;
  while (i < upper) {
    i += stride; // gcc warns for this, clang does not.
  }
  return i;
}

rapids-bot bot pushed a commit that referenced this issue Sep 1, 2023
This PR adds `grid_1d::grid_stride()` and uses it in a handful of kernels. Follow-up to #13910, which added a `grid_1d::global_thread_id()`. We'll need to do a later PR that catches any missing instances where this should be used, since there are a large number of PRs in flight touching thread indexing code in various files. See #10368.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13996
rapids-bot bot pushed a commit that referenced this issue Sep 7, 2023
…joins (#13971)

See #10368 (and more recently #13771

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - David Wendt (https://github.com/davidwendt)

URL: #13971
@vyasr vyasr removed the helps: Dask label Feb 23, 2024
rapids-bot bot pushed a commit that referenced this issue May 7, 2024
This PR refactors a few kernels to use `thread_index_type` and associated utilities. I started this before realizing how much scope was still left in issue #10368 ("Part 2 - Take another pass over more challenging kernels"), and then I stopped working on this due to time constraints. For the moment, I hope this PR makes a small dent in the number of remaining kernels to convert to using `thread_index_type`.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Mark Harris (https://github.com/harrism)
  - David Wendt (https://github.com/davidwendt)

URL: #14107
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - On Deck To be worked on next bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
Status: In Progress
Status: To be revisited
Development

No branches or pull requests

6 participants