Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Faster Least Significant Digit Radix Sort Implementation #204

Merged
merged 1 commit into from
Nov 5, 2020

Conversation

canonizer
Copy link
Contributor

@alliepiper
Copy link
Collaborator

Looks good on nvc++, now retesting DVS CL 29114896.

@alliepiper alliepiper added testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI passed Passed gpuCI testing. labels Sep 28, 2020
@alliepiper
Copy link
Collaborator

There are still some issues with our internal testing system that are preventing me from ok'ing this just yet. Hopefully this will be resolved soon.

@alliepiper alliepiper assigned brycelelbach and unassigned alliepiper Oct 5, 2020
@alliepiper alliepiper added the blocked Currently cannot make progress. label Oct 5, 2020
@alliepiper alliepiper removed the testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). label Oct 19, 2020
@alliepiper
Copy link
Collaborator

DVS-AUS only CL: 29227867

@alliepiper alliepiper added the testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). label Oct 21, 2020
@alliepiper
Copy link
Collaborator

Rebased and squashed in last push.

Another DVS-AUS CL: 29243517

Comment on lines 1371 to 1375
#if defined(__NVCOMPILER_CUDA__)
typedef OffsetT AtomicOffsetT;
#else
typedef cuda::atomic<OffsetT, cuda::thread_scope_device> AtomicOffsetT;
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still an issue using libcu++ atomics in this patch:

#if defined(__CUDA_ARCH__) && ((!defined(_MSC_VER) && __CUDA_ARCH__ < 600) || (defined(_MSC_VER) && __CUDA_ARCH__ < 700))
#  error "CUDA atomics are only supported for sm_60 and up on *nix and sm_70 and up on Windows."
#endif

Since this is host code, we need this to work across all SM versions.

It looks like we'll need to move the AtomicOffsetT into the per-SM policies, accounting for the different minimum version on windows. However, that check is at file-scope in cuda\std\detail\__atomic, so just including the atomic header while targeting older SMs is problematic.

I'll check with the libcu++ folks about this, maybe they know of a workaround.

Comment on lines 214 to 215
//(volatile OffsetT&)loc = value;
ThreadStore<STORE_CG>(&loc, value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not STORE_VOLATILE to match the previous behavior? Was volatile too strong here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this the reason for the ~10% perf boost on GP100?

@alliepiper
Copy link
Collaborator

DVS CL: 29264925.

@alliepiper
Copy link
Collaborator

Rebased and squash, resubmitted DVS CL 29278629.

@alliepiper alliepiper added testing: gpuCI in progress Started gpuCI testing. and removed testing: gpuCI passed Passed gpuCI testing. labels Nov 3, 2020
@alliepiper alliepiper added testing: gpuCI passed Passed gpuCI testing. testing: internal ci passed Passed internal NVIDIA CI (DVS). and removed testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Nov 4, 2020
@alliepiper alliepiper merged commit 9ff77e3 into NVIDIA:main Nov 5, 2020
@jlebar
Copy link

jlebar commented Nov 5, 2020

🎉 I'm excited to try this out.

For my info, under what circumstances should I expect to see a speedup with this patch?

@elstehle
Copy link
Collaborator

elstehle commented Nov 5, 2020

Indeed, great work and congratulations on landing this! 👍
@jlebar: @canonizer has presented this at this year's GTC. Here's a preview on the performance gains:
Slides | Recording

Regarding "under what circumstances". From what I see in the tuning policies in cub/device/dispatch/dispatch_radix_sort.cuh, it's for all GPU architectures of Pascal onwards, once the keys are at least 32 bits wide.

I'll leave it to the author to provide more details... 😃

@canonizer
Copy link
Contributor Author

Thanks @elstehle for summing it up. Yes, the new approach is now used for 32-bit and 64-bit keys for Pascal and above.

The main reason it is not enabled for earlier architectures is that I haven't done any performance experiments with them. If anyone does it and gets a speedup, they're welcome to send a pull request.

@jlebar
Copy link

jlebar commented Nov 10, 2020

Thanks!

I ran some benchmarks on my RTX 2080 and didn't see a significant change for the <int, int> sort I'm doing. It's a relatively small sort though, only 300k elements and not even sorting on all 32 bits.

@jlebar
Copy link

jlebar commented Nov 10, 2020

Actually, I take it back -- PEBKAC. I get a nice speedup from this change. It's only zero speedup if I don't apply the patch. :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
blocked Currently cannot make progress. testing: gpuCI passed Passed gpuCI testing. testing: internal ci passed Passed internal NVIDIA CI (DVS).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants