Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

64-bit Offsets in DeviceRadixSort #340

Merged
merged 1 commit into from
Jan 27, 2022

Conversation

canonizer
Copy link
Contributor

  • 64-bit OffsetT is supported for onesweep sorting
  • for decoupled look-back, the partition kernel is broken into smaller parts (as before), and a separate 32-bit type is used there
  • for histograms, 32-bit counters are used in shared memory and OffsetT-sized counters in global memory

@alliepiper alliepiper linked an issue Jul 7, 2021 that may be closed by this pull request
@alliepiper alliepiper added type: enhancement New feature or request. P1: should have Necessary, but not critical. labels Jul 8, 2021
@alliepiper alliepiper added this to the 1.14.0 milestone Jul 8, 2021
@alliepiper alliepiper removed this from the 1.14.0 milestone Aug 17, 2021
@alliepiper alliepiper marked this pull request as draft August 17, 2021 18:09
@canonizer canonizer changed the title [WIP] 64-bit Offsets in Onesweep Sorting 64-bit Offsets in DeviceRadixSort Oct 21, 2021
@canonizer canonizer marked this pull request as ready for review October 21, 2021 00:45
@alliepiper alliepiper added this to the 1.16.0 milestone Jan 10, 2022
Copy link
Collaborator

@alliepiper alliepiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor changes suggested inline, but otherwise I'm happy with this 👍

It'd be good to rebase this on main to make sure that we didn't miss any other deprecated APIs from #364.

Once this is rebased and the comments addressed I'll start testing.

cub/agent/agent_radix_sort_histogram.cuh Show resolved Hide resolved
cub/block/radix_rank_sort_operations.cuh Outdated Show resolved Hide resolved
cub/agent/agent_radix_sort_histogram.cuh Outdated Show resolved Hide resolved
cub/device/device_radix_sort.cuh Outdated Show resolved Hide resolved
cub/device/device_radix_sort.cuh Outdated Show resolved Hide resolved
test/test_device_radix_sort.cu Show resolved Hide resolved
test/test_device_radix_sort.cu Outdated Show resolved Hide resolved
@canonizer
Copy link
Contributor Author

This pull request has been transformed into a single commit that has no conflicts with the main branch.

@alliepiper alliepiper self-assigned this Jan 14, 2022
@alliepiper
Copy link
Collaborator

Rebased to pull in recent fixes in main.

alliepiper added a commit to alliepiper/thrust that referenced this pull request Jan 19, 2022
@alliepiper
Copy link
Collaborator

gpuCI: NVIDIA/thrust#1593
DVS CL: 30893016

@alliepiper alliepiper added testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Jan 19, 2022
@alliepiper
Copy link
Collaborator

There were a few issues in CI: NVIDIA/thrust#1593

From the clang-9 build, I didn't check the others:

Missing util_math.cuh header maybe?

/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/usr/bin/clang++-9 -DCUB_IGNORE_DEPRECATED_CPP_11 -DCUB_WRAPPED_NAMESPACE=wrapped_cub -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DTHRUST_WRAPPED_NAMESPACE=wrapped_thrust -I/workspace/dependencies/cub -I/workspace -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -O3 -DNDEBUG -Xcompiler=-Werror -Xcompiler=-Wall -Xcompiler=-Wextra -Xcompiler=-Winit-self -Xcompiler=-Woverloaded-virtual -Xcompiler=-Wcast-qual -Xcompiler=-Wpointer-arith -Xcompiler=-Wunused-local-typedef -Xcompiler=-Wvla -Xcompiler=-Wgnu -Xcompiler=-Wno-gnu-zero-variadic-macro-arguments -Xcompiler=-Wno-unused-function -Xcompiler=-Wno-deprecated-declarations -Xcudafe=--display_error_number -Xcudafe=--promote_warnings -Wno-deprecated-gpu-targets -std=c++11 -MD -MT dependencies/cub/CMakeFiles/cub.cpp11.headers.dir/headers/agent/agent_radix_sort_histogram.cuh.cu.o -MF dependencies/cub/CMakeFiles/cub.cpp11.headers.dir/headers/agent/agent_radix_sort_histogram.cuh.cu.o.d -x cu -c /workspace/build/dependencies/cub/headers/agent/agent_radix_sort_histogram.cuh.cu -o dependencies/cub/CMakeFiles/cub.cpp11.headers.dir/headers/agent/agent_radix_sort_histogram.cuh.cu.o
/workspace/dependencies/cub/cub/agent/agent_radix_sort_histogram.cuh(229): error: namespace "wrapped_cub::cub" has no member "DivideAndRoundUp"

Using ' as a digit separator isn't supported in C++11:

[6109/6477] Building CUDA object dependencies/cub/test/CMakeFiles/cub.cpp11.test.device_radix_sort.bytes_4.pairs_3.dir/test_device_radix_sort.cu.o
FAILED: dependencies/cub/test/CMakeFiles/cub.cpp11.test.device_radix_sort.bytes_4.pairs_3.dir/test_device_radix_sort.cu.o 
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/usr/bin/clang++-9 -DCUB_IGNORE_DEPRECATED_CPP_11 -DTEST_KEY_BYTES=4 -DTEST_VALUE_TYPE=3 -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -I/workspace/dependencies/cub/test -I/workspace/dependencies/cub -I/workspace -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -O3 -DNDEBUG -Xcompiler=-Werror -Xcompiler=-Wall -Xcompiler=-Wextra -Xcompiler=-Winit-self -Xcompiler=-Woverloaded-virtual -Xcompiler=-Wcast-qual -Xcompiler=-Wpointer-arith -Xcompiler=-Wunused-local-typedef -Xcompiler=-Wvla -Xcompiler=-Wgnu -Xcompiler=-Wno-gnu-zero-variadic-macro-arguments -Xcompiler=-Wno-unused-function -Xcompiler=-Wno-deprecated-declarations -Xcudafe=--display_error_number -Xcudafe=--promote_warnings -Wno-deprecated-gpu-targets -std=c++11 -MD -MT dependencies/cub/test/CMakeFiles/cub.cpp11.test.device_radix_sort.bytes_4.pairs_3.dir/test_device_radix_sort.cu.o -MF dependencies/cub/test/CMakeFiles/cub.cpp11.test.device_radix_sort.bytes_4.pairs_3.dir/test_device_radix_sort.cu.o.d -x cu -c /workspace/dependencies/cub/test/test_device_radix_sort.cu -o dependencies/cub/test/CMakeFiles/cub.cpp11.test.device_radix_sort.bytes_4.pairs_3.dir/test_device_radix_sort.cu.o
/workspace/dependencies/cub/test/test_device_radix_sort.cu:1366:66: error: missing terminating ' character [-Werror,-Winvalid-pp-token]
        const std::size_t large_num_items = std::size_t(4'350'000'007ull);

@alliepiper alliepiper assigned canonizer and unassigned alliepiper Jan 20, 2022
@alliepiper alliepiper removed testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Jan 20, 2022
@canonizer
Copy link
Contributor Author

Fixed both and rebased.

List of individual changes:

- Fixed test errors
- OffsetT == unsigned long long for the 64-bit case
- using std::{is_same,conditional}
- using "portion" consistently for 2^28-2^30-sized chunks of the input array
- HasEnoughMemory() takes overwrite into account.
- moved checking for enough memory earlier.
- added a CTA_SYNC() to the histogram kernel
- disabled tests with NumItemsT != int for segmented sort
- testing with 4.5 bln. items
- tests for different NumItemsT
- NumItemsT for all device sorting functions
- wrapped ChooseOffsetT into namespace detail
- fixed typos
- templatized the type of num_items in 2 methods of DeviceRadixSort
- tuned radix sort with 64-bit OffsetT for V100
- tuned for 64-bit OffsetT for A100
- separate tuning parameters for 64-bit OffsetT
- improved downsweep policy for GP100
- option for 64-bit num_items with 32-bit shared memory histogram counters.
- introduced PartOffsetT into Onesweep kernel.
  - OffsetT is now only used for offsets into the whole array
    (e.g. bin counts or global read/write offsets)
  - PartOffsetT is used for offsets that do not exceed a single part
    (e.g. decoupled look-back, block index, number of items inside a part)
  - this fixes problems when OffsetT is unsigned, and also contributes
    towards supporting 64-bit num_items
alliepiper added a commit to alliepiper/thrust that referenced this pull request Jan 24, 2022
@alliepiper
Copy link
Collaborator

Did one more rebase to bring in some CI fixes -- gpuCI should not have any failures if all goes well.

gpuCI: NVIDIA/thrust#1593
DVS CL: 30911286

@alliepiper alliepiper added testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Jan 24, 2022
alliepiper added a commit to alliepiper/thrust that referenced this pull request Jan 25, 2022
alliepiper added a commit to alliepiper/thrust that referenced this pull request Jan 25, 2022
@alliepiper
Copy link
Collaborator

main should be clear now. Trying again.

gpuCI: NVIDIA/thrust#1593
DVS CL: 30914954

@alliepiper
Copy link
Collaborator

All set, thanks again @canonizer!

@alliepiper alliepiper merged commit 93f26ab into NVIDIA:main Jan 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P1: should have Necessary, but not critical. testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). type: enhancement New feature or request.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Radix sort issues with cuda 11.4
2 participants