Adds DeviceBatchMemcpy algorithm and tests #359
Thank you! You've done tremendous work here. It's also a vital algorithm to have. I was a bit concerned with the complexity of implementation, though. I've decided to benchmark this algorithm and found a strange performance drop. For buffers of 256 std::uint32_t items the performance seems quite impressive.
CUB here denotes your implementation, and the memcpy represents the bandwidth of cudaMemcpyAsync applied to the sum of all buffer sizes.
But when I changed the underlying type to std::uint64_t (that is, doubled the buffer size), I observed the following.
The code produces the correct result, so I'm not sure what the reason is. At this point, I decided to check a different approach. I applied a three-way partition, which produces reorderings for small/medium/large segments. Then I used existing facilities to copy the data.
To handle large buffers by multiple thread blocks, I used atomic operations. I increment a value assigned to a large buffer to get the tile position in that buffer. Here are the results where I vary the number of buffers with a fixed size of 64 MB.
The simple approach seems to perform a bit better in this particular case. I also checked the same test for 1GB segments, and it's still better there.
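The atomic tile assignment described above can be sketched on the host as follows. This is only a minimal illustration, not the code that was benchmarked: `LargeBufferTask`, `copy_worker`, and the tile size `kTileBytes` are hypothetical names and values, and each call to `copy_worker` stands in for one thread block.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical tile size; the comment above does not state the value used.
constexpr std::size_t kTileBytes = 4096;

// One large buffer plus the counter that hands out its tiles.
struct LargeBufferTask
{
  const unsigned char* src;
  unsigned char* dst;
  std::size_t num_bytes;
  std::atomic<std::size_t> next_tile;

  LargeBufferTask(const unsigned char* s, unsigned char* d, std::size_t n)
      : src(s), dst(d), num_bytes(n), next_tile(0) {}
};

// Each worker claims the next unprocessed tile by atomically incrementing the
// buffer's counter and copies that tile. Returns the number of tiles copied.
inline std::size_t copy_worker(LargeBufferTask& task)
{
  assert(task.num_bytes == 0 || (task.src != nullptr && task.dst != nullptr));
  const std::size_t num_tiles = (task.num_bytes + kTileBytes - 1) / kTileBytes;
  std::size_t tiles_copied = 0;
  for (;;)
  {
    const std::size_t tile = task.next_tile.fetch_add(1, std::memory_order_relaxed);
    if (tile >= num_tiles) break;  // every tile has already been claimed
    const std::size_t begin = tile * kTileBytes;
    const std::size_t count = std::min(kTileBytes, task.num_bytes - begin);
    std::memcpy(task.dst + begin, task.src + begin, count);
    ++tiles_copied;
  }
  return tiles_copied;
}
```

Because a tile is only ever handed to one claimant, several workers can call `copy_worker` on the same task concurrently without copying a tile twice.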
Here's the bandwidth for copying extremely small buffers - 2 items of std::uint32_t type:
Another interesting question is the temporary storage size. You currently require four num_buffers. I've managed to use only two num_buffers.
To summarize the data above, the simple implementation:
- uses half as much memory
- is faster in some cases
- almost fits into the screen 😄
- uses existing facilities.
It's just a proof of concept. Therefore I haven't considered unaligned buffers or sizes that are not a multiple of 32 bits. I don't expect this to change the results significantly, though. Anyway, we might consider requiring this if this is the source of performance issues. Padding arrays is quite a common practice.
Could you consider the simple algorithm and check how it can be helpful in your case? Even if the algorithm I've mentioned happens to be slower, I hope you'll be able to incorporate some of the ideas as building blocks of the proposed PR. For example, it could be used to deal with small numbers of buffers. Here is the CUB branch I've used for testing, and here is the benchmark and partition-based implementation.
I am looking forward to checking your results on the second stage of review!
Thanks for the feedback and the preliminary evaluation, @senior-zero 👍 Fundamentally, our ideas are quite similar. You do a three-way partition on all the problems. I proposed to have a kernel-fused version of the "three-way partitioning" that is fused with the implementation for copying small and medium buffers. The goal is to solve small and medium buffers straight in the kernel instead of having to write them into a "queue" first and later read them back in. I wanted to circumvent the extra reads of the problems' sizes and writing their id out, as well as another extra read of the partitioned id. This definitely makes the implementation more complex and, I totally agree, I'm not sure if that complexity is worth it.
When I had conceived this, I assumed the "worst case" scenario. In theory, let's assume these type sizes:
Now, we also see that, unfortunately, we cannot sustain anywhere near peak memory bandwidth for such tiny buffers. So the question is whether we want to take the theoretical model into consideration at all. I see the three decisions we need to make:
(2) What I also like is using atomics for the scheduling/load-balancing of large buffers. The performance drop you see going from 1KB to 2KB buffers is a combination of a configuration discrepancy (my bad) and a general performance regression when the tile size (or "task" size, i.e., the most granular unit getting assigned to thread blocks) is too small. The binary search seems to dominate in that case. I also want to see if streaming reads and writes will alleviate this. So we'll also need to compare these two mechanisms and factor out other side effects too.
(3) What is left is the actual implementation of how we're copying small buffers, medium buffers, and large buffers, respectively. I think it is easy to exchange one for the other. Once we have figured out the former two decisions, this will be easy.
So I would proceed in that order. Does that sound good?
As for:
That can easily be done for the kernel-fused version too, right? It's just a matter of trading memory for more coalesced accesses. I.e., I'm materialising the buffer's source and destination pointers for large buffers instead of having the indirection. I'm also fine to have indirection in this particular case.
I'm all in for fast 😁 We just need to have a more differentiated and elaborate evaluation to track down where the difference actually comes from.
💯
I'm all in for using existing building blocks. The problem is that I didn't assume the pointers to be aligned and so had to devise special treatment to be able to vectorise some loads/stores. If we can get the performance from existing building blocks, let's go for that. Otherwise let's make it a reusable building block.
I've long wanted a
I'm currently gathering results of a few more benchmarks that hopefully will help us make an informed decision about which of the scheduling mechanisms to pursue (preliminary three-way partition vs. single-pass prefix scan-based). I'll post the results shortly. In the meantime, PR #354, on which this PR builds, should be ready for review.
FYI, I'm starting the 1.15 RC next week so I'm bumping this to 1.16. I'll try to get to NVIDIA/cccl#1006 before the release.
So I ran the first batch of benchmarks. I'll add more throughout the week.
Methodology
Compilation example / details
Copying of small buffers logic
No Aliased Loads, No Buffer Size Variance
Data
Scheduling: TWP vs. SPPS; No Aliased Loads, No Buffer Size Variance
Here, the small buffer copying logic from TWP was moved into SPPS. Hence, we aim to limit the difference to be the scheduling (i.e., the partitioning into small, medium, and large buffers).
Data
No Aliased Loads, Varying Buffer Size
We now look at varying buffer sizes, where buffer sizes are uniformly distributed in
Data
16B-aligned buffers, 4B-aliased copies, Varying Buffer Size
This experiment analyses the benefit that aliasing loads (i.e.,
Data
Sorry for the wait. I did another cleanup pass over the code of this PR.
Agreed, @jrhemstad. I believe there's a recurring need for it, especially when dealing with string data. I often find myself needing to load string data into shared memory for further processing, so I tried to be transparent to the destination data space. I.e., to support vectorised stores of I hope I've been able to take a first step in that direction. In the interest of getting this PR through, I haven't exposed it as a stand-alone CG/block-level algorithm yet and hidden it under the
We don't expose CG in the CUB APIs; this would require some more discussion before we add anything like that. That may be better suited to the senders/receivers based APIs that @senior-zero is working on. For now, let's try to find a way to pass the same info in without adding any dependencies.
There are some issues in the example. Please check whether they are on the algorithm side or not.
A few minor comments
partial review until the agent
if (offset == 0)
{
  LoadVectorAndFunnelShiftR<true>(aligned_ptr, bit_shift, data_out);
}
// Otherwise, we need to load extra bytes and perform funnel-shifting
else
{
  LoadVectorAndFunnelShiftR<false>(aligned_ptr, bit_shift, data_out);
}
I wonder if there would be any advantage to dispatching to where the offset is statically known. It looks like that would allow bit_shift to be known statically as well.
switch (offset)
{
  case 0: LoadVectorAndFunnelShiftR<0>(...); break;
  case 1: LoadVectorAndFunnelShiftR<1>(...); break;
  case 2: LoadVectorAndFunnelShiftR<2>(...); break;
  case 3: LoadVectorAndFunnelShiftR<3>(...); break;
}
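To make the suggestion concrete, here is a host-side sketch of dispatching to a code path where the byte offset, and therefore the bit shift, is a compile-time constant. The names `load_word` and `load_word_dispatch` are hypothetical and do not reflect the signature of `LoadVectorAndFunnelShiftR` in the PR; the sketch assumes 4-byte words and little-endian byte order.

```cpp
#include <cassert>
#include <cstdint>

// Reads the 32-bit value that starts OFFSET bytes into aligned_ptr[0].
// OFFSET is a template parameter, so the shift amount is an immediate.
template <unsigned OFFSET>
inline std::uint32_t load_word(const std::uint32_t* aligned_ptr)
{
  static_assert(OFFSET < 4, "OFFSET is a byte offset within a 4-byte word");
  if (OFFSET == 0)
  {
    return aligned_ptr[0];  // already aligned: one load, no shifting
  }
  constexpr unsigned shift = OFFSET * 8;
  // Funnel shift right: combine the upper bytes of the first word with the
  // lower bytes of the following word.
  return (aligned_ptr[0] >> shift) | (aligned_ptr[1] << ((32 - shift) % 32));
}

// Runtime dispatch to the statically specialised variants.
inline std::uint32_t load_word_dispatch(const std::uint32_t* aligned_ptr, unsigned offset)
{
  switch (offset)
  {
    case 0: return load_word<0>(aligned_ptr);
    case 1: return load_word<1>(aligned_ptr);
    case 2: return load_word<2>(aligned_ptr);
    default: assert(offset == 3); return load_word<3>(aligned_ptr);
  }
}
```

Whether the immediate shift is measurably faster than a run-time shift is a separate question that only a benchmark can answer.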
Interesting idea! Preliminary results suggest the difference does not rise above the noise, but I'll do a more thorough run and follow up.
I've run some more benchmarks on this suggestion with code paths that use immediate bit-shift values. Performance remained unchanged for DeviceMemcpy::Batched.
My hypothesis is that we're bottlenecked by the memory subsystem, as I also don't see significant performance changes for some other changes that I'd expect to otherwise positively impact performance.
The work seems to be in progress, so I'll finish review for now to make comments visible.
// Ensure the prefix callback has finished using its temporary storage and that it can be reused
// in the next stage
CTA_SYNC();
temp_storage.blev_buffer_offset is not used in PartitionBuffersBySize. Since look-back is rather expensive, do you think there's any advantage in overlapping the decoupled look-back with PartitionBuffersBySize in other warps? The BLevBuffScanPrefixCallbackOpT storage should be about 4 ints, so putting it into a struct instead of a union shouldn't increase shared memory requirements significantly, but we would get rid of one sync and overlap some operations.
Algorithm Overview
The DeviceBatchMemcpy takes N input buffers and N output buffers and copies buffer_size[i] bytes from the i-th input buffer to the i-th output buffer. If any input buffer aliases memory from any output buffer the behavior is undefined. If any output buffer aliases memory of another output buffer the behavior is undefined. Input buffers can alias one another.
Implementation Details
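We distinguish each buffer by its size and assign it to one of three size classes:
- TLEV (thread-level) buffers (up to 32 bytes).
- WLEV (warp-level) buffers (more than 32 bytes but only up to 1024 bytes).
- BLEV (block-level) buffers (more than 1024 bytes).
Step 1: Partitioning Buffers by Size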
buffer_size[i].
buffer_size[ITEMS_PER_THREAD] chunk. Binning buffers by the size class they fall into.
{tile_buffer_id, buffer_size}, where tile_buffer_id is the buffer id, relative to the tile (i.e., from the interval [0, TILE_SIZE)). buffer_size is only defined for buffers that belong to the tlev partition and corresponds to the buffer's size (number of bytes) in that case.
Note, the partitioning does not necessarily need to be stable. Stability may be desirable if we expect neighbouring buffers to hold neighbouring byte segments.
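As a rough host-side illustration of the binning step, the sketch below classifies each buffer by size and collects the buffer ids per size class. It is not the kernel's tile-wise implementation: `SizeClass`, `classify`, `Partitioned`, and `partition_by_size` are hypothetical names, and treating exactly 32 and 1024 bytes as the inclusive upper bounds of the first two classes is an assumption based on the thresholds listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class SizeClass { TLEV, WLEV, BLEV };

// Thresholds from the description; the inclusive bounds are an assumption.
constexpr std::size_t kTlevMaxBytes = 32;
constexpr std::size_t kWlevMaxBytes = 1024;

inline SizeClass classify(std::size_t num_bytes)
{
  if (num_bytes <= kTlevMaxBytes) return SizeClass::TLEV;
  if (num_bytes <= kWlevMaxBytes) return SizeClass::WLEV;
  return SizeClass::BLEV;
}

// Buffer ids grouped by size class.
struct Partitioned
{
  std::vector<std::uint32_t> tlev;
  std::vector<std::uint32_t> wlev;
  std::vector<std::uint32_t> blev;
};

// Three-way partition of buffer ids. Appending ids in input order makes this
// particular sketch stable, although stability is not required.
inline Partitioned partition_by_size(const std::vector<std::size_t>& buffer_sizes)
{
  Partitioned result;
  for (std::size_t i = 0; i < buffer_sizes.size(); ++i)
  {
    const std::uint32_t id = static_cast<std::uint32_t>(i);
    switch (classify(buffer_sizes[i]))
    {
      case SizeClass::TLEV: result.tlev.push_back(id); break;
      case SizeClass::WLEV: result.wlev.push_back(id); break;
      case SizeClass::BLEV: result.blev.push_back(id); break;
    }
  }
  return result;
}
```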
After the partitioning, each partition represents all the buffers that belong to the respective size class (i.e., one of TLEV, WLEV, BLEV). Depending on the size class, a different logic is applied. We process each partition separately.
Step 2.a: Copying TLEV Buffers
Usually, TLEV buffers are buffers of only a few bytes. Vectorised loads and stores do not really pay off here, as there are only a few bytes that can actually be read from a four-byte-aligned address. It does not pay off to have two different code paths for (a) loading individual bytes from non-aligned addresses and (b) doing vectorised loads from aligned addresses.
Instead, we use the BlockRunLengthDecode algorithm to both (a) coalesce reads and writes as well as (b) load balance the number of bytes copied by each thread. Specifically, we are able to assign neighbouring bytes to neighbouring threads.
The following tables illustrate how the first 8 bytes from the TLEV buffers are getting assigned to threads.
[1] Use BlockRunLengthDecode using the tile_buffer_id as the "unique_items" and each buffer's size as the respective run's length. The result from the run-length decode yields the assignment from threads to the buffer along with the specific byte from that buffer.
Step 2.b: Copying WLEV Buffers
A full warp is assigned to each WLEV buffer. Loads from the input buffer are vectorised (aliased to a wider data type), loading 4, 8 or even 16 bytes at a time from the input buffer's first address that is aligned to such aliased data type. The implementation for the vectorised copy is based on @gaohao95's (thanks!) string gather improvement in https://github.com/rapidsai/cudf/pull/7980/files
I think we want to have the vectorised copy as a reusable component. But I wanted to coordinate on what exactly that would look like first. Should this be (a) a warp-/block-level copy or should we (b) separate it into a warp-&block-level vectorised load (which will also have the async copy, maybe) and a warp-&block-level vectorised store?
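For orientation, the idea of starting wide loads at the input buffer's first aligned address can be sketched sequentially on the host. This is a simplified illustration and not the warp-level implementation in the PR: `copy_with_aligned_loads` is a hypothetical name, only 4-byte words are used, and `std::memcpy` stands in for the aliased load and store to stay within the C++ aliasing rules.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies num_bytes from src to dst. Bytes before the first 4-byte-aligned
// source address are copied one by one, the aligned middle part is copied
// one 32-bit word at a time, and the remaining tail is copied byte-wise.
inline void copy_with_aligned_loads(const unsigned char* src,
                                    unsigned char* dst,
                                    std::size_t num_bytes)
{
  constexpr std::size_t kWordBytes = sizeof(std::uint32_t);
  std::size_t i = 0;

  // Head: advance until the source address is word-aligned.
  while (i < num_bytes && reinterpret_cast<std::uintptr_t>(src + i) % kWordBytes != 0)
  {
    dst[i] = src[i];
    ++i;
  }

  // Body: word-sized loads from aligned source addresses.
  for (; i + kWordBytes <= num_bytes; i += kWordBytes)
  {
    std::uint32_t word;
    std::memcpy(&word, src + i, kWordBytes);
    std::memcpy(dst + i, &word, kWordBytes);
  }

  // Tail: fewer than kWordBytes bytes remain.
  for (; i < num_bytes; ++i)
  {
    dst[i] = src[i];
  }
}
```

In a warp-level version, the body loop would presumably be strided across the 32 lanes, which is one reason the split into a separate load and store component is worth discussing.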
Step 2.c: Enqueueing BLEV Buffers
These are buffers that may be very large. We want to avoid a scenario where there's potentially one very large buffer that a single thread block is copying while other thread blocks are sitting idle. To avoid this, BLEV buffers will be put into a queue that will be picked up in a subsequent kernel. In the subsequent kernel, the number of thread blocks getting assigned to each buffer is proportional to the buffer's size.
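One way to picture the proportional assignment is to split every BLEV buffer into fixed-size tiles, take a prefix sum over the per-buffer tile counts, and let each thread block find its buffer with a binary search. The sketch below does this on the host; `tile_offsets`, `locate`, and `TileAssignment` are hypothetical names, and the tile size passed in is an assumption rather than a value taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct TileAssignment
{
  std::size_t buffer;          // index into the BLEV queue
  std::size_t tile_in_buffer;  // which tile of that buffer to copy
};

// Exclusive prefix sum over the number of tiles per buffer, with the total
// number of tiles appended as the last element.
inline std::vector<std::size_t> tile_offsets(const std::vector<std::size_t>& buffer_sizes,
                                             std::size_t tile_bytes)
{
  assert(tile_bytes > 0);
  std::vector<std::size_t> offsets(buffer_sizes.size() + 1, 0);
  for (std::size_t i = 0; i < buffer_sizes.size(); ++i)
  {
    const std::size_t tiles = (buffer_sizes[i] + tile_bytes - 1) / tile_bytes;
    offsets[i + 1] = offsets[i] + tiles;
  }
  return offsets;
}

// Maps a block id to the buffer and tile it should process, using a binary
// search over the tile offsets. Buffers with zero tiles are skipped.
inline TileAssignment locate(const std::vector<std::size_t>& offsets, std::size_t block_id)
{
  assert(!offsets.empty() && block_id < offsets.back());
  const auto it = std::upper_bound(offsets.begin(), offsets.end(), block_id);
  const std::size_t buffer = static_cast<std::size_t>(it - offsets.begin()) - 1;
  return {buffer, block_id - offsets[buffer]};
}
```

The last element of the offsets vector is the number of thread blocks to launch, so a buffer that is three times larger receives three times as many blocks.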