Faster Least Significant Digit Radix Sort Implementation #204

canonizer · 2020-09-24T23:57:19Z

Radix sort with decoupled look-back, 8 bits per pass, and other optimizations.
Pull request to the previous CUB repository: Faster Least Significant Digit Radix Sort Implementation brycelelbach/cub_historical_2019_2020#26
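
For readers unfamiliar with the pass structure, here is a minimal sequential sketch of a least-significant-digit radix sort that consumes 8 bits per pass. It is a plain host C++ illustration, not the GPU kernels from this PR: it omits decoupled look-back entirely, and the function name `lsd_radix_sort_8bit` is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential LSD radix sort over 32-bit unsigned keys, 8 bits per pass.
// Illustration only; the name and structure are assumptions, not CUB code.
inline void lsd_radix_sort_8bit(std::vector<std::uint32_t>& keys)
{
    constexpr int RADIX_BITS = 8;
    constexpr std::size_t RADIX = std::size_t(1) << RADIX_BITS; // 256 buckets

    std::vector<std::uint32_t> tmp(keys.size());

    // Four passes cover all 32 bits, starting from the least significant digit.
    for (int shift = 0; shift < 32; shift += RADIX_BITS)
    {
        // 1. Histogram of the current 8-bit digit.
        std::array<std::size_t, RADIX> count{};
        for (std::uint32_t k : keys)
            ++count[(k >> shift) & (RADIX - 1)];

        // 2. Exclusive prefix sum turns counts into starting offsets.
        std::size_t sum = 0;
        for (std::size_t d = 0; d < RADIX; ++d)
        {
            const std::size_t c = count[d];
            count[d] = sum;
            sum += c;
        }

        // 3. Stable scatter into the temporary buffer, then swap buffers.
        for (std::uint32_t k : keys)
            tmp[count[(k >> shift) & (RADIX - 1)]++] = k;
        keys.swap(tmp);
    }
}
```

With 8-bit digits a 32-bit key needs four passes; a narrower digit would need more passes over the data, which is the trade-off the "8 bits per pass" choice addresses.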

alliepiper · 2020-09-28T17:30:44Z

Looks good on nvc++; now retesting with DVS CL 29114896.

alliepiper · 2020-10-05T20:47:04Z

There are still some issues with our internal testing system that are preventing me from ok'ing this just yet. Hopefully this will be resolved soon.

alliepiper · 2020-10-21T20:31:02Z

DVS-AUS only CL: 29227867

alliepiper · 2020-10-26T18:09:47Z

Rebased and squashed in last push.

Another DVS-AUS CL: 29243517

alliepiper · 2020-10-27T22:05:01Z

cub/device/dispatch/dispatch_radix_sort.cuh

+#if defined(__NVCOMPILER_CUDA__)
+        typedef OffsetT AtomicOffsetT;
+#else
+        typedef cuda::atomic<OffsetT, cuda::thread_scope_device> AtomicOffsetT;
+#endif


There's still an issue using libcu++ atomics in this patch:

#if defined(__CUDA_ARCH__) && ((!defined(_MSC_VER) && __CUDA_ARCH__ < 600) || (defined(_MSC_VER) && __CUDA_ARCH__ < 700))
#  error "CUDA atomics are only supported for sm_60 and up on *nix and sm_70 and up on Windows."
#endif

Since this is host code, we need this to work across all SM versions.

It looks like we'll need to move the AtomicOffsetT typedef into the per-SM policies, accounting for the different minimum SM version on Windows. However, that check is at file scope in cuda\std\detail\__atomic, so merely including the atomic header while targeting older SMs is problematic.

I'll check with the libcu++ folks about this, maybe they know of a workaround.
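
A rough sketch of what a per-SM selection of the offset type could look like is below. Everything in it is an assumption for illustration: `OffsetPolicySketch` and its parameters are hypothetical, and `std::atomic` stands in for `cuda::atomic<OffsetT, cuda::thread_scope_device>` so that the snippet compiles as plain host C++. It does not address the file-scope `#error`, which fires as soon as the libcu++ header is included for an older SM.

```cpp
#include <atomic>
#include <type_traits>

// Hypothetical policy: picks an atomic or a plain offset type per SM version.
// std::atomic is only a host-side stand-in for cuda::atomic in this sketch.
template <typename OffsetT, int SM_VERSION, bool IS_MSVC>
struct OffsetPolicySketch
{
    // Minimums taken from the libcu++ check quoted above:
    // sm_60 on *nix, sm_70 on Windows (MSVC).
    static constexpr int MIN_ATOMIC_SM = IS_MSVC ? 700 : 600;
    static constexpr bool USE_ATOMIC = SM_VERSION >= MIN_ATOMIC_SM;

    // Falls back to the raw offset type where atomics are unsupported.
    typedef typename std::conditional<USE_ATOMIC,
                                      std::atomic<OffsetT>,
                                      OffsetT>::type AtomicOffsetT;
};
```

For example, `OffsetPolicySketch<int, 600, true>::AtomicOffsetT` resolves to plain `int`, because the Windows minimum in the quoted check is sm_70.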

alliepiper · 2020-10-30T14:32:59Z

cub/agent/agent_radix_sort_onesweep.cuh

+                //(volatile OffsetT&)loc = value;
+                ThreadStore<STORE_CG>(&loc, value);


Why not STORE_VOLATILE to match the previous behavior? Was volatile too strong here?

Was this the reason for the ~10% perf boost on GP100?

alliepiper · 2020-10-30T14:39:21Z

DVS CL: 29264925.

alliepiper · 2020-11-03T19:09:39Z

Rebased and squashed; resubmitted DVS CL 29278629.

jlebar · 2020-11-05T17:06:20Z

🎉 I'm excited to try this out.

For my info, under what circumstances should I expect to see a speedup with this patch?

elstehle · 2020-11-05T17:30:36Z

Indeed, great work and congratulations on landing this! 👍
@jlebar: @canonizer presented this at this year's GTC. Here's a preview of the performance gains:
Slides | Recording

Regarding "under what circumstances": from what I see in the tuning policies in cub/device/dispatch/dispatch_radix_sort.cuh, it applies to all GPU architectures from Pascal onwards, provided the keys are at least 32 bits wide.

I'll leave it to the author to provide more details... 😃

canonizer · 2020-11-05T17:52:06Z

Thanks @elstehle for summing it up. Yes, the new approach is now used for 32-bit and 64-bit keys for Pascal and above.

The main reason it is not enabled for earlier architectures is that I haven't done any performance experiments with them. If anyone does it and gets a speedup, they're welcome to send a pull request.
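
The condition described above can be written down as a small compile-time predicate. This is only a sketch of the rule as stated in this thread (32-bit and 64-bit keys, Pascal and above); `UsesNewRadixSortSketch` is a hypothetical name and is not the actual policy code in dispatch_radix_sort.cuh.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical predicate mirroring the rule stated in this thread.
// SM_VERSION uses the major*100 + minor*10 encoding, so Pascal starts at 600.
template <typename KeyT, int SM_VERSION>
struct UsesNewRadixSortSketch
{
    static constexpr bool value =
        (SM_VERSION >= 600) &&                      // Pascal and above
        (sizeof(KeyT) == 4 || sizeof(KeyT) == 8);   // 32-bit and 64-bit keys
};
```

Under this reading, `int` keys on an sm_75 device satisfy the predicate, while 16-bit keys or pre-Pascal targets do not.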

jlebar · 2020-11-10T01:54:08Z

Thanks!

I ran some benchmarks on my RTX 2080 and didn't see a significant change for the <int, int> sort I'm doing. It's a relatively small sort though, only 300k elements and not even sorting on all 32 bits.

jlebar · 2020-11-10T04:14:41Z

Actually, I take it back -- PEBKAC. I get a nice speedup from this change. It's only zero speedup if I don't apply the patch. :)

canonizer mentioned this pull request Sep 25, 2020

Faster Least Significant Digit Radix Sort Implementation brycelelbach/cub_historical_2019_2020#26

Closed

alliepiper added this to the 1.11.0 milestone Sep 25, 2020

alliepiper self-assigned this Sep 28, 2020

alliepiper added the "testing: internal ci in progress" and "testing: gpuCI passed" labels Sep 28, 2020

alliepiper assigned brycelelbach and unassigned alliepiper Oct 5, 2020

alliepiper added the "blocked" label Oct 5, 2020

alliepiper removed the "testing: internal ci in progress" label Oct 19, 2020

alliepiper added the "testing: internal ci in progress" label Oct 21, 2020

alliepiper force-pushed the sort branch from 520a7ac to 6405882 Compare October 26, 2020 18:08

alliepiper reviewed Oct 27, 2020

alliepiper mentioned this pull request Oct 28, 2020

Abstract the decoupled-lookback pattern into a maintainable, stable, public API #226

Closed

alliepiper unassigned brycelelbach Oct 29, 2020

alliepiper reviewed Oct 30, 2020

Faster Least Significant Digit Radix Sort Implementation

c182515

alliepiper force-pushed the sort branch from 80bb75e to c182515 Compare November 3, 2020 19:07

alliepiper added the "testing: gpuCI in progress" label and removed the "testing: gpuCI passed" label Nov 3, 2020

alliepiper added the "testing: gpuCI passed" and "testing: internal ci passed" labels and removed the "testing: gpuCI in progress" and "testing: internal ci in progress" labels Nov 4, 2020

alliepiper approved these changes Nov 5, 2020

alliepiper merged commit 9ff77e3 into NVIDIA:main Nov 5, 2020

alliepiper mentioned this pull request Nov 12, 2020

Retune radix sort, run length encoding, reduce by key, scan, select if, and histogram for SM70 and SM80 #208

Closed
