[BUG]: Seg fault when passing data pointer to thrust::sort #2606

Closed
1 task done
PointKernel opened this issue Oct 21, 2024 · 5 comments
Labels
bug Something isn't working right.

Comments

@PointKernel
Member

PointKernel commented Oct 21, 2024

Is this a duplicate?

Type of Bug

Runtime Error

Component

Thrust

Describe the bug

#include <thrust/device_vector.h>
#include <thrust/sort.h>

#include <cstddef>

using Key = int;

constexpr std::size_t num_keys{100};

int main() {
    thrust::device_vector<Key> data(num_keys);
    // Sort through the device_vector iterators.
    thrust::sort(data.begin(), data.end()); // this is fine

    // Sort the same device memory through a raw pointer, with no execution policy.
    auto* data_begin = data.data().get();
    thrust::sort(data_begin, data_begin + num_keys); // seg fault
}

cuda-gdb log:

(cuda-gdb) run
Starting program: /home/yunsongw/Work/cuCollections/build/examples/test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff57ff000 (LWP 124655)]
[New Thread 0x7fffeffff000 (LWP 124656)]
[Detaching after fork from child process 124657]
[New Thread 0x7fffed7bc000 (LWP 124669)]
[New Thread 0x7fffecfbb000 (LWP 124670)]

Thread 1 "STATIC_MULTISET" received signal SIGSEGV, Segmentation fault.
thrust::system::detail::sequential::radix_sort_detail::radix_sort<8u, false, thrust::system::cpp::detail::tag, int*, thrust::detail::normal_iterator<thrust::pointer<int, thrust::system::cpp::detail::tag, thrust::use_default, thrust::use_default> >, int*, int*> (keys1=0x7fffbd200000, keys2=..., N=100, vals2=<optimized out>, vals1=<optimized out>, exec=...)
    at /home/yunsongw/Work/cuCollections/build/_deps/cccl-src/thrust/thrust/cmake/../../thrust/system/detail/sequential/stable_radix_sort.inl:93
93	    return x ^ static_cast<unsigned int>(1) << (8 * sizeof(unsigned int) - 1);

The code above will cause a segmentation fault when running on an RTX8000 (sm75), but the same issue doesn't occur on a V100.

How to Reproduce

https://godbolt.org/z/rxjYrWj1E

Uncovered this issue when working on NVIDIA/cuCollections#623 (comment)

cuco uses rapids-cmake to fetch a specific commit of CCCL (e21d607), but the error also reproduces on trunk (see the godbolt link).

Expected behavior

The code should run without seg fault.

@PointKernel PointKernel added the bug Something isn't working right. label Oct 21, 2024
@jrhemstad
Collaborator

jrhemstad commented Oct 21, 2024

When you pass an iterator range to a thrust call without an execution policy, it attempts to deduce the correct execution policy based on the type of the input iterator.

In the first example:

    thrust::device_vector<Key> data(num_keys);
    thrust::sort(data.begin(), data.end()); 

This works because data.begin() returns an iterator whose type is tied to device_vector, so Thrust can deduce that it should use the device execution policy.

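When a raw pointer is passed in, Thrust can't infer from the type which execution policy to use and defaults to host. The host sort then dereferences a device pointer on the CPU, which is why you get a segfault. The backtrace above shows this: the crash is in the sequential radix sort under the cpp system tag.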
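To illustrate the mechanism described above, here is a minimal plain C++ sketch of type-based dispatch. It is not Thrust's actual implementation: `host_tag`, `device_tag`, `tagged_device_ptr`, `system_of`, and `selected_backend` are hypothetical names invented for this example. The only point is that the backend is chosen from the iterator type, so a raw pointer falls through to the host default no matter where its memory lives.

```cpp
#include <string>

// Hypothetical system tags, standing in for a host and a device backend.
struct host_tag {};
struct device_tag {};

// Hypothetical tagged pointer: its type records that the memory is on the device.
template <typename T>
struct tagged_device_ptr {
    T* raw;
};

// Map an iterator type to a system tag. The primary template is the fallback:
// any unrecognized type, including a raw pointer, is treated as host.
template <typename Iterator>
struct system_of {
    using type = host_tag;
};

// Only the tagged pointer type is recognized as device.
template <typename T>
struct system_of<tagged_device_ptr<T>> {
    using type = device_tag;
};

inline std::string sort_impl(host_tag) { return "host"; }
inline std::string sort_impl(device_tag) { return "device"; }

// Entry point without an explicit policy. The choice depends on the iterator
// type alone; the address stored in the pointer is never inspected.
template <typename Iterator>
std::string selected_backend(Iterator, Iterator) {
    return sort_impl(typename system_of<Iterator>::type{});
}
```

With these definitions, `selected_backend` called on two `int*` values picks the host path, while the same call on `tagged_device_ptr<int>` values picks the device path.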

Personally, I always recommend being explicit and passing in an execution policy and not relying on the inferred dispatch to avoid surprises like this.

@leofang
Member

leofang commented Oct 21, 2024

Can this be turned into a compile-time error?

@PointKernel
Member Author

> When a raw pointer is passed in, Thrust can't infer from the type which execution policy to use and defaults to host.

That's aligned with my assumption as well, but I'm not sure why it works on some GPUs like V100 and A100 but not on RTX8000.

> I always recommend being explicit and passing in an execution policy and not relying on the inferred dispatch to avoid surprises like this.

Makes sense.

@jrhemstad
Collaborator

> Can this be turned into a compile-time error?

We could, but it would turn every call that passes raw pointers to a Thrust algorithm without an execution policy into a compile error, which would likely break plenty of existing working code.
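The trade-off can be sketched in plain C++. The wrapper below is hypothetical (`strict_sort` and `accepted_without_policy` are names invented here, and `std::sort` is only a stand-in for the real dispatch); it is not a proposed Thrust API. It rejects raw pointers at compile time when no policy is given, which would catch the device-pointer mistake but also rejects valid host code that sorts a plain array.

```cpp
#include <algorithm>
#include <type_traits>
#include <vector>

// Compile-time probe: would the strict wrapper accept this iterator type?
template <typename Iterator>
constexpr bool accepted_without_policy() {
    return !std::is_pointer<Iterator>::value;
}

// Hypothetical strict entry point: refuses raw pointers when no policy is given.
template <typename Iterator>
void strict_sort(Iterator first, Iterator last) {
    static_assert(accepted_without_policy<Iterator>(),
                  "raw pointer passed without an execution policy");
    std::sort(first, last);  // stand-in for the real backend dispatch
}

// Valid host code like the following would stop compiling under this rule:
//
//     int host_keys[4] = {3, 1, 2, 0};
//     strict_sort(host_keys, host_keys + 4);  // error: static_assert fails
```

Class-type iterators such as `std::vector<int>::iterator` still pass the check, so only call sites that use bare pointers would need to be updated.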

@leofang
Member

leofang commented Oct 21, 2024

As a library developer, hitting a segfault at run time is the second-worst kind of error (the worst is silent data corruption or an incorrect result). I would much rather hit this myself at compile time so that I can fix it for all my users. Just my 2c.
