
Avoid overflow in fused_concatenate_kernel output_index #10344

Merged

Conversation


@abellina abellina commented Feb 22, 2022

Fixes #10333.

The repro case in the issue showed an illegal access error where the output_index of the strided loop in fused_concatenate_kernel can overflow for a large number of rows.

For example, given 5 tables of exactly 250M rows each we would expect a result with 1,250,000,000 rows.

The kernel is launched with 4,882,813 blocks (number of rows / 256 threads per block, rounded up), giving a stride of 1,250,000,128 (256 * 4,882,813). Once output_index reaches 897,483,520, adding the stride on the first strided iteration produces 2,147,483,648, which overflows the 32-bit signed output_index.

The change below prevents the overflow by making output_index an int64_t and adds a test that shows that we can now concatenate up to size_type::max - 1 rows.

@abellina abellina requested a review from a team as a code owner February 22, 2022 21:26
@abellina abellina added the non-breaking Non-breaking change label Feb 22, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 22, 2022
@abellina abellina added bug Something isn't working and removed libcudf Affects libcudf (C++/CUDA) code. labels Feb 22, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 22, 2022

abellina commented Feb 22, 2022

It looks like the python builds are failing with:

16:21:00 conda.CondaMultiError: Downloaded bytes did not match Content-Length
16:21:00   url: https://conda.anaconda.org/nvidia/linux-aarch64/cudatoolkit-11.5.1-h506b3e9_9.tar.bz2
16:21:00   target_path: /opt/conda/envs/rapids/pkgs/cudatoolkit-11.5.1-h506b3e9_9.tar.bz2

@codecov

codecov bot commented Feb 22, 2022

Codecov Report

Merging #10344 (3adaa31) into branch-22.04 (a7d88cd) will increase coverage by 0.19%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10344      +/-   ##
================================================
+ Coverage         10.42%   10.62%   +0.19%     
================================================
  Files               119      122       +3     
  Lines             20603    20973     +370     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18745     +290     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/strings/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_version.py 0.00% <ø> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf65ac3...3adaa31. Read the comment docs.

@@ -166,7 +166,7 @@ __global__ void fused_concatenate_kernel(column_device_view const* input_views,
auto const output_size = output_view.size();
auto* output_data = output_view.data<T>();

-  size_type output_index = threadIdx.x + blockIdx.x * blockDim.x;
+  std::size_t output_index = threadIdx.x + blockIdx.x * blockDim.x;

@ttnghia ttnghia Feb 23, 2022


Making output_index a size_t may lead to some casting problems later, since it will be compared with many other indices in int32_t. We just need to detect whether output_index here overflows. So let's try:

Suggested change
std::size_t output_index = threadIdx.x + blockIdx.x * blockDim.x;
std::size_t const tmp_index = threadIdx.x + blockIdx.x * blockDim.x;
if (tmp_index > std::numeric_limits<size_type>::max()) { return; }
size_type output_index = static_cast<size_type>(tmp_index);

Contributor


Let's just keep using a 64-bit integer type, but let's use int64_t instead of size_t to avoid unsigned comparison warnings.

Contributor Author


@ttnghia I moved to int64_t given @jrhemstad's comment. I looked at the comparisons involving output_index, and I believe the types are promoted correctly, so we shouldn't see an issue, at least as the code is currently written.

Contributor


Yeah, agree :)

Contributor


Isn't decltype(threadIdx.x + blockIdx.x * blockDim.x) a 32-bit type (the operands are all unsigned int), regardless of what we assign it to?
Would it be preferable to cast an operand to int64_t?

auto const tmp_index = threadIdx.x + static_cast<int64_t>(blockIdx.x)*blockDim.x;

Contributor Author


So the bug is actually later. I don't think the initial threadIdx.x + blockIdx.x * blockDim.x math can overflow, given the 256 threads per block, the way we compute the number of blocks, and the row limits in cuDF, if I understand your comment @mythrocks.

The real bug is later on at:

output_index += blockDim.x * gridDim.x;

As mentioned in the PR description, that is because output_index + (blockDim.x * gridDim.x) can exceed the maximum value of int32_t.

But this brings up a point. fused_concatenate checks here that output_size must be strictly less than size_type::max (2^31 - 1). That seems like a bug, but I am probably mistaken. Should it be output_size - 1 <= static_cast<std::size_t>(std::numeric_limits<size_type>::max())? I.e., if my type supports numbers from 0 to 10 and I have 11 rows, that should be OK (I can address them with indices 0..10). I ask because I wanted to test that we can materialize the maximum number of rows from the concatenate.

Contributor


I think there's an implicit limitation to INT_MAX - 1 elements in a column for several functions.


@abellina abellina Feb 24, 2022


I think there's an implicit limitation to INT_MAX - 1 elements in a column for several functions.

OK, so is it a cuDF limitation or a bug? I would have expected the number of elements in the column to be up to INT_MAX, so that we can index using 0..INT_MAX - 1.

Also I think the change should be to make the assert:

output_size <= static_cast<std::size_t>(std::numeric_limits<size_type>::max())

But I am confused by your comment @jrhemstad. Should I make this change?

Contributor


But I am confused by your comment @jrhemstad. Should I make this change?

Feel free to change it. I'm just saying, don't expect a column with INT_MAX elements to work everywhere 😉

@abellina
Contributor Author

@jrhemstad @mythrocks are you OK if we merge this? I think it is ready if you are OK with it.

@jrhemstad
Contributor

@gpucibot merge

Contributor

@mythrocks mythrocks left a comment


Yikes. I thought I'd approved this already.

@rapids-bot rapids-bot bot merged commit 64ee514 into rapidsai:branch-22.04 Feb 28, 2022
@abellina abellina deleted the fused_concatenate_illegal_access branch February 28, 2022 18:14
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] fused_concatenate_kernel can cause illegal memory access
4 participants