-
Notifications
You must be signed in to change notification settings - Fork 449
Faster Least Significant Digit Radix Sort Implementation #204
Conversation
canonizer
commented
Sep 24, 2020
- radix sort with decoupled look-back, 8 bits per pass and other optimizations
- pull request to the previous CUB repository: Faster Least Significant Digit Radix Sort Implementation brycelelbach/cub_historical_2019_2020#26
Looks good on nvc++, now retesting DVS CL 29114896. |
There are still some issues with our internal testing system that are preventing me from ok'ing this just yet. Hopefully this will be resolved soon. |
DVS-AUS only CL: 29227867 |
Rebased and squashed in last push. Another DVS-AUS CL: 29243517 |
#if defined(__NVCOMPILER_CUDA__) | ||
typedef OffsetT AtomicOffsetT; | ||
#else | ||
typedef cuda::atomic<OffsetT, cuda::thread_scope_device> AtomicOffsetT; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still an issue using libcu++ atomics in this patch:
#if defined(__CUDA_ARCH__) && ((!defined(_MSC_VER) && __CUDA_ARCH__ < 600) || (defined(_MSC_VER) && __CUDA_ARCH__ < 700))
# error "CUDA atomics are only supported for sm_60 and up on *nix and sm_70 and up on Windows."
#endif
Since this is host code, we need this to work across all SM versions.
It looks like we'll need to move the AtomicOffsetT
into the per-SM policies, accounting for the different minimum version on windows. However, that check is at file-scope in cuda\std\detail\__atomic
, so just including the atomic header while targeting older SMs is problematic.
I'll check with the libcu++ folks about this, maybe they know of a workaround.
//(volatile OffsetT&)loc = value; | ||
ThreadStore<STORE_CG>(&loc, value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not STORE_VOLATILE
to match the previous behavior? Was volatile too strong here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this the reason for the ~10% perf boost on GP100?
DVS CL: 29264925. |
Rebased and squash, resubmitted DVS CL 29278629. |
🎉 I'm excited to try this out. For my info, under what circumstances should I expect to see a speedup with this patch? |
Indeed, great work and congratulations on landing this! 👍 Regarding "under what circumstances". From what I see in the tuning policies in cub/device/dispatch/dispatch_radix_sort.cuh, it's for all GPU architectures of Pascal onwards, once the keys are at least 32 bits wide. I'll leave it to the author to provide more details... 😃 |
Thanks @elstehle for summing it up. Yes, the new approach is now used for 32-bit and 64-bit keys for Pascal and above. The main reason it is not enabled for earlier architectures is that I haven't done any performance experiments with them. If anyone does it and gets a speedup, they're welcome to send a pull request. |
Thanks! I ran some benchmarks on my RTX 2080 and didn't see a significant change for the <int, int> sort I'm doing. It's a relatively small sort though, only 300k elements and not even sorting on all 32 bits. |
Actually, I take it back -- PEBKAC. I get a nice speedup from this change. It's only zero speedup if I don't apply the patch. :) |