-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Cluster Charge Cut, optimize Histogram (bucket sorting) #171
Changes from 31 commits
eb29a8b
fe2af2e
8b47a72
90344c7
b1e7746
dca2f1e
6a06ece
50460d9
9103a9c
f05df88
304c607
b1e2810
f9dad13
81a15e6
8eb1210
960de3d
25b24c5
7097934
a108571
c711b5d
83879cd
ba38606
b8df4f8
1bb3bda
78e4aa5
0c19cd1
1b55356
70d2fcb
47a3f04
85453af
d93e14f
bda658a
75f2948
3f69a40
a960668
742399a
0e2812c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
<use name="cuda"/> | ||
<use name="cub"/> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
#ifndef HeterogeneousCore_CUDAUtilities_prefixScan_h | ||
#define HeterogeneousCore_CUDAUtilities_prefixScan_h | ||
|
||
#include <cstdint> | ||
#include <cassert> | ||
|
||
template<typename T> | ||
__device__ | ||
void | ||
__forceinline__ | ||
warpPrefixScan(T * c, uint32_t i) { | ||
auto x = c[i]; | ||
auto laneId = threadIdx.x & 0x1f; | ||
#pragma unroll | ||
for( int offset = 1 ; offset < 32 ; offset <<= 1 ) { | ||
auto y = __shfl_up_sync(0xffffffff,x, offset); | ||
if(laneId >= offset) x += y; | ||
} | ||
c[i] = x; | ||
} | ||
|
||
// limited to 32*32 elements.... | ||
template<typename T> | ||
__device__ | ||
void | ||
__forceinline__ | ||
blockPrefixScan(T * c, uint32_t size, T* ws) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. CUB seems to provide something similarly named
Is there a reason for not using them (e.g. they do different thing, interface is awkward etc)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. interface is a bit awkward, maybe I need to look at it again (now that I managed to use the host API) |
||
assert(size<=1024); | ||
assert(0==blockDim.x%32); | ||
|
||
auto first = threadIdx.x; | ||
|
||
for (auto i=first; i<size; i+=blockDim.x) { | ||
warpPrefixScan(c,i); | ||
auto laneId = threadIdx.x & 0x1f; | ||
auto warpId = i/32; | ||
assert(warpId<32); | ||
if (31==laneId) ws[warpId]=c[i]; | ||
} | ||
__syncthreads(); | ||
if (size<=32) return; | ||
if (threadIdx.x<32) warpPrefixScan(ws,threadIdx.x); | ||
__syncthreads(); | ||
for (auto i=first+32; i<size; i+=blockDim.x) { | ||
auto warpId = i/32; | ||
c[i]+=ws[warpId-1]; | ||
} | ||
__syncthreads(); | ||
} | ||
|
||
|
||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function actually called anywhere (besides a printout)? Or is the
temp_storage_size
in practice always smaller than4*totbins()
(that IIUC gets used for the temporary storage infInclusiveSum
)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in principle it should be used, I verified that temp_storage_size in practice always smaller than 4*totbins()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I was also waiting that we define a strategy for a reusable ws...