A small header-only library for CUDA block processing.
The purpose of cudablockproc is to make it simple to work with 3D datasets which are too large to fit in GPU memory. You simply provide a function, a collections of in- and output volumes and a block size; cudablockproc handles the rest. Specifically, the volumes are split into blocks which are then copied to the GPU and passed to your function as pointers. After processing, the output blocks are placed back in the output volumes.
The library itself is header only. To use it, copy the contents of the lib
directory to your project.
To compile the tests you can (maybe) use the provided makefile - however you will likely have to modify some of the paths.
The tests are written using the googletest library.
These examples show how the library can be used. For all examples, assume that these utility functions are defined:
inline __host__ __device__ size_t getIdx(const int x, const int y, const int z, const int3 siz)
{
return x + y * siz.x + z * siz.x * siz.y;
}
inline __host__ __device__ size_t getIdx(const int3 pos, const int3 siz)
{
return getIdx(pos.x, pos.y, pos.z, siz);
}
inline size_t gridLineBlocks(size_t nthr, const unsigned int siz)
{
return siz/nthr + ((siz % nthr != 0) ? 1 : 0);
}
inline dim3 gridBlocks(const dim3 thrConfig, const int3 siz)
{
return dim3(
gridLineBlocks(thrConfig.x, siz.x),
gridLineBlocks(thrConfig.y, siz.y),
gridLineBlocks(thrConfig.z, siz.z)
);
}
This example simply adds 1.0f
to every voxel in the input volume and places the results in the output volume.
Let the following kernel be defined:
__global__ void addOneKernel(float *out, const float *in, const int3 size)
{
const int3 pos = make_int3(threadIdx);
if (pos >= 0 && pos < size) {
out[getIdx(pos,size)] = in[getIdx(pos,size)] + 1.0f;
}
}
The block processing code is:
// Specify sizes
const int3 volSize = make_int3(10,10,10);
const int3 blockSize = make_int3(2,2,2);
const size_t numElemVol = volSize.x * volSize.y * volSize.z;
// Alloc host volumes
std::vector<float> volIn(numElemVol);
std::vector<float> volOut(numElemVol);
// Prepare function
auto addOne = [](auto b, auto s, auto in, auto out, auto tmp){
// Get size of block
const int3 siz = b.blockSize();
// Set up compute grid
const dim3 blockDim = dim3(8,8,8);
const dim3 gridDim = gridBlocks(blockDim, siz);
// Launch kernel
addOneKernel<<<gridDim,blockDim,0,s>>>(out[0], in[0], siz);
};
// Do block processing
cbp::BlockIndexIterator blockIter(volSize, blockSize);
cbp::CbpResult res = cbp::blockProc(addOne, volIn.data(), volOut.data(), blockIter);
This example sums the voxels in a 3x3x3 window and places the results in the output volume. Let the following kernel be defined:
__global__ void sum3x3x3Kernel(float *out, const float *in, const int3 size)
{
const int3 pos = make_int3(threadIdx);
if (pos >= 0 && pos < size) {
out[getIdx(pos,size)] = 0;
}
// Loop over 3x3x3 window
if (pos >= 1 && pos < size - 1) {
for (int ix = -1; ix <= 1; ix++) {
for (int iy = -1; iy <= 1; iy++) {
for (int iz = -1; iz <= 1; iz++) {
out[getIdx(pos,size)] += in[getIdx(pos + make_int3(ix,iy,iz),size)];
}
}
}
}
}
The block processing code is:
// Specify sizes
const int3 volSize = make_int3(10,10,10);
const int3 blockSize = make_int3(2,2,2);
const int3 borderSize = make_int3(1,1,1);
const size_t numElemVol = volSize.x * volSize.y * volSize.z;
// Alloc host volumes
std::vector<float> volIn(numElemVol);
std::vector<float> volOut(numElemVol);
// Prepare function
auto sum3x3x3 = [](auto b, auto s, auto in, auto out, auto tmp){
// Get size of block including the border
const int3 siz = b.blockSizeBorder();
// Setup compute grid
const dim3 blockDim = dim3(8,8,8);
const dim3 gridDim = gridBlocks(blockDim, siz);
// Launch kernel
sum3x3x3Kernel<<<gridDim,blockDim,0,s>>>(out[0], in[0], siz);
};
// Do block processing
cbp::BlockIndexIterator blockIter(volSize, blockSize, borderSize); // NOTE: We supply a borderSize too
cbp::CbpResult res = cbp::blockProc(sum3x3x3, volIn.data(), volOut.data(), blockIter);
You can supply any type of callable as a function as long as it can be called with the following signature:
RTy func(BlockIndex bi, cudaStream_t s, std::vector<InTy> inBlocks, std::vector<OutTy> outBlocks, void *tmp)
where:
RTy
is the return type of the funcion. Note, any return value is ignored.bi
is the currentBlockIndex
which gives the position and size of the current block. More on this in the next section.s
is the cuda stream associated to this block. If your function is asynchronous (i.e. launches kernels) you must supply this stream, see the section on asynchronous functions.inBlocks
anstd::vector
containing pointers to the input blocks. The order of the blocks is the same as the order of the input volumes. Their size is given bybi
.outBlocks
anstd::vector
containing pointers to the output blocks. The order of the blocks is the same as the order of the output volumes. Their size is given bybi
.tmp
is a pointer to temporary device memory which won't be copied back to the host.
The function must run on the host, so any kernels should be launced by you - cudablockproc does not launch kernels.
In order to reduce the overhead of block processing, cudablockproc will attempt to move data between the volumes to / from a set of page-locked (pinned) host buffers while your function runs. In order to do this, the transfer to and from the GPU is done asynchronously using CUDA streams to ensure the correct order of operations. Specifically, the order is:
- Copy block from input volume to host input buffer
- Copy input host block to device input block (async)
- Your function runs (maybe async)
- Copy device block to host output buffer (async)
- Copy host output buffer block to output volume
Thus, when launching kernels (or other async work) you must supply the provided CUDA stream
to ensure no memory transfers occur while your functions runs. If you do no async work, then call cudaStreamSynchronize
at the beginning of your function to ensure all data has been transfered to the device before processing.
You specify the size of the blocks by making a BlockIndexIterator
, where you supply:
- The size of the volumes. Currently, the library assumes all volumes have the same size.
- The size of the blocks.
- (Optional) The size of the block border. This specifies the amount of overlap between the blocks which is useful if your functions processes a small window of voxels - e.g. for smoothing. By default, a border of 0 voxels is assumed.
You can iterate over a BlockIndexIterator
like a normal iterator - e.g. using begin
and end
.
When it is dereferenced it returns a BlockIndex
which contains the start- and end xyz-indices for the current block -
both with and without borders.
You can also get the size of the current block by calling BlockIndex::blockSize
or BlockIndex::blockSizeBorder
to include the
size of the border.