
In-place guarantees for sort #499

Merged
gevtushenko merged 4 commits into NVIDIA:main on Jun 25, 2022

Conversation

gevtushenko
Collaborator

This PR clarifies the documentation on in-place guarantees for our sort facilities. I'll briefly mention the motivation behind these guarantees:

  1. segmented sort:
    1.1. in/out: this overload allocates intermediate storage so that the input data is only read once. Nonetheless, the sort facilities for small segments bypass this storage. These facilities use LOAD_LDG, which prohibits reading from and writing to the same memory. We could potentially permit this overload to work in place, but it would take significant re-tuning effort. It would also create a false impression that the double-buffer overload can alias data as well. I suggest we postpone this until there's a request; if one comes up, we could create an overload that takes a single set of values to emphasize the in-place guarantee.
    1.2. double buffer: inherits this limitation from segmented radix sort.
  2. segmented radix sort:
    2.1. in/out: this overload allocates intermediate storage so that the input data is only read once. Nonetheless, a short-cut single-tile kernel exists that bypasses the intermediate storage and uses LOAD_LDG. The motivation is the same as for the segmented sort. Besides that, we plan to rewrite this kernel, and the new kernel would be restricted by LOAD_LDG as well.
    2.2. double buffer: the main algorithm consists of three repeating steps. At the upsweep step, each CTA loads keys and converts them into bin ids, which it uses to compute a private histogram of the keys assigned to it. At the scan step, the private histograms are converted into a prefix sum that represents the offset of each private bin. At the final downsweep step, keys are loaded again to compute the local bin id as well as the offset within the local bin. This process is repeated until all bytes of the keys are covered. Even in a serial implementation, the radix sort variant used in CUB doesn't allow in-place execution: there's a data race at the downsweep step if the input and output arrays are aliased. Some CTA might overwrite k1 with k2 before k1 is read; in that case, k1 is lost and k2 is stored elsewhere, overwriting some k3 (see the sketch after this list).
  3. radix sort:
    3.1. in/out: same motivation as in the segmented version.
    3.2. double buffer: same motivation as in the segmented version. Besides that, there's a one-sweep version of the algorithm based on decoupled lookback. It's still unsafe to alias the input and output arrays there because a successor CTA might read overwritten data.
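
To make the downsweep race concrete, here is a toy scatter written in the spirit of that step. It is not the actual CUB kernel, and global_offsets stands in for the per-key destinations that the upsweep and scan steps would produce. If keys_out aliases keys_in, one CTA can clobber a key before the CTA responsible for it has read it.

```cpp
// Toy model of a downsweep-style scatter (illustrative only, not CUB's kernel).
// global_offsets[i] is assumed to hold the final destination of keys_in[i]
// for the current pass, as produced by the upsweep + scan steps.
__global__ void downsweep_scatter(const unsigned int* keys_in,
                                  unsigned int*       keys_out,
                                  const int*          global_offsets,
                                  int                 num_items)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_items)
    {
        // If keys_out == keys_in, another block may already have scattered
        // some key k2 over keys_in[i]; k1 is then lost and k2 ends up stored
        // twice, overwriting some k3 at its destination.
        unsigned int k = keys_in[i];
        keys_out[global_offsets[i]] = k;
    }
}
```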

@gevtushenko gevtushenko requested a review from alliepiper May 31, 2022 07:29
@gevtushenko gevtushenko added the only: docs and area: docs labels May 31, 2022
@canonizer
Contributor

To clarify why there's a problem with in-place radix sort (both the upsweep/downsweep and the onesweep variants):

@senior-zero As you've mentioned, if the input and output buffers of a binning iteration (the downsweep or onesweep kernel) are the same, there would be a race condition. Therefore, those buffers have to be different.

Regarding whether the input and output buffers can be the same (in the in/out version), it really depends on the number of binning iterations. If it's even, it's possible without extra overhead (the input data will be overwritten, of course). If it's odd, it's only possible at the cost of an extra copy (again, the input data is overwritten). This is possible because the in/out version allocates extra space equal in size to the input buffer.

The real problem, however, is that it's hard to guarantee that the number of binning iterations is even. It depends on begin_bit, end_bit, and the digit size (RADIX_BITS), which in turn depends on the GPU used (see the sketch below). And if in-place sorting is used with an odd number of binning iterations, there's the copy overhead, which is non-negligible.
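
As a rough sketch of that dependence (the digit width CUB actually picks is a tuning detail that varies by architecture, so the numbers below are purely illustrative):

```cpp
// Number of binning passes for a given bit range and digit width (illustrative).
int num_passes(int begin_bit, int end_bit, int radix_bits)
{
    return (end_bit - begin_bit + radix_bits - 1) / radix_bits; // ceiling division
}

// Hypothetical 32-bit keys over the full bit range:
//   num_passes(0, 32, 8) == 4 (even): the result can land back in the input buffer
//   num_passes(0, 32, 7) == 5 (odd):  the result lands in the temporary buffer,
//                                     so in-place behavior needs an extra copy
```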

So if in-place-like behavior is desired, it's always best to use the double-buffer version. There, it doesn't really matter in which of the two buffers the result ends up, so it doesn't matter whether the number of binning iterations is odd or even. In addition, it doesn't allocate temporary storage equal in size to the input buffer, and so uses less memory.
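
For reference, a minimal sketch of the double-buffer overload using the usual CUB two-phase pattern; the wrapper function and buffer names are placeholders, and error checking is omitted. After the sort, the result is in whichever buffer d_keys.Current() points to.

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// d_key_buf and d_key_alt_buf are two distinct, caller-allocated device buffers
// of num_items keys each; d_key_buf holds the unsorted input.
void sort_with_double_buffer(unsigned int* d_key_buf,
                             unsigned int* d_key_alt_buf,
                             int           num_items)
{
    cub::DoubleBuffer<unsigned int> d_keys(d_key_buf, d_key_alt_buf);

    // First call: query the required temporary storage size.
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys, num_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: run the sort. The sorted keys end up in d_keys.Current(),
    // which may be either of the two buffers depending on the number of passes.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys, num_items);

    unsigned int* d_sorted = d_keys.Current();
    (void)d_sorted; // use the sorted data from whichever buffer this points to

    cudaFree(d_temp_storage);
}
```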

@alliepiper alliepiper added the type: enhancement and P1: should have labels Jun 3, 2022
@alliepiper alliepiper added this to the 2.0.0 milestone Jun 3, 2022
@gevtushenko gevtushenko merged commit dca1f11 into NVIDIA:main Jun 25, 2022