
feat(extensions/nanoarrow_device): Draft DeviceArray interface #205

Merged: 104 commits into apache:main from device-array-ext on Jun 22, 2023

Conversation

@paleolimbot (Member) commented May 25, 2023


Still very much in draft form; however, it does implement arbitrary ArrowArray copies to/from ARROW_DEVICE_METAL, ARROW_DEVICE_CUDA, ARROW_DEVICE_CUDA_HOST, and ARROW_DEVICE_CPU.

The nanoarrow_device extension as drafted here serves a similar purpose to nanoarrow: a means by which to create and consume the C ABI with the intention of shipping those structures to other libraries to do transformations, and potentially retrieving them again after the computation is complete. Perhaps another way to put it is that nanoarrow is designed to help at the edges: it can create and consume. Similarly, the nanoarrow_device extension is designed to help at the edges: it can copy/move arrays to and from CPU-land.

With this PR, you can currently do something like:

struct ArrowDevice* gpu = ArrowDeviceMetalDefaultDevice();
// Alternatively, ArrowDeviceCuda(ARROW_DEVICE_CUDA, 0)
// or  ArrowDeviceCuda(ARROW_DEVICE_CUDA_HOST, 0)
struct ArrowDevice* cpu = ArrowDeviceCpu();
struct ArrowArray array;
struct ArrowDeviceArray device_array;
struct ArrowDeviceArrayView device_array_view;

// Build a CPU array
ASSERT_EQ(ArrowArrayInitFromType(&array, NANOARROW_TYPE_STRING), NANOARROW_OK);
ASSERT_EQ(ArrowArrayStartAppending(&array), NANOARROW_OK);
ASSERT_EQ(ArrowArrayAppendString(&array, ArrowCharView("abc")), NANOARROW_OK);
ASSERT_EQ(ArrowArrayAppendString(&array, ArrowCharView("defg")), NANOARROW_OK);
ASSERT_EQ(ArrowArrayAppendNull(&array, 1), NANOARROW_OK);
ASSERT_EQ(ArrowArrayFinishBuildingDefault(&array, nullptr), NANOARROW_OK);

// Convert to a DeviceArray, still on the CPU
ArrowDeviceArrayInit(&device_array, cpu);
ArrowArrayMove(&array, &device_array.array);

// Parse contents into a view that can be copied to another device
ArrowDeviceArrayViewInit(&device_array_view);
ArrowArrayViewInitFromType(&device_array_view.array_view, NANOARROW_TYPE_STRING);
ASSERT_EQ(ArrowDeviceArrayViewSetArray(&device_array_view, &device_array, nullptr),
          NANOARROW_OK);

// Try to zero-copy move to another device or copy if that is not possible. Zero-copy move
// is implemented for ARROW_DEVICE_METAL and ARROW_DEVICE_CUDA_HOST for the
// gpu -> cpu case.
struct ArrowDeviceArray device_array2;
device_array2.array.release = nullptr;
ASSERT_EQ(
    ArrowDeviceArrayTryMove(&device_array, &device_array_view, gpu, &device_array2),
    NANOARROW_OK);

In concrete terms, that means we need to know enough about a device to (1) copy and/or move an arbitrary ArrowArray/ArrowSchema pair to a device from the CPU and (2) copy/move an arbitrary ArrowDeviceArray/ArrowSchema pair back to the CPU. The three types of copying I support (and maybe there could be fewer/need to be more) are:

  • ArrowDeviceBufferInit(): Make a non-owning buffer into an owning buffer on a device. This is the entry point if you want to take a slice of an ArrowArrayView and ship it to a device.
  • ArrowDeviceBufferMove(): Move an existing (owning) buffer to a device. For devices like the CPU, this is a true zero-copy move (a sketch of the CPU case follows this list); for shared memory this can also sometimes be zero-copy (e.g., Apple Metal -> CPU) but might also involve a copy.
  • ArrowDeviceBufferCopy(): Copy a section of a buffer into a preallocated section of another buffer. I'm envisioning this to be necessary when copying a String, Binary, or List: we need the first and last values of the offsets buffer in order to know what portion of the data buffer to copy. It seems unnecessary to copy 4 bytes of a buffer into an owning variant covered by the first bullet, but 🤷.
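As a hedged illustration of the zero-copy case in the second bullet, a test for the CPU -> CPU direction might look like the following; the four-argument shape of ArrowDeviceBufferMove() mirrors the buffer_move vtable member quoted later in this thread, so treat the exact signature as an assumption:

struct ArrowDevice* cpu = ArrowDeviceCpu();
struct ArrowBuffer src;
ArrowBufferInit(&src);
ASSERT_EQ(ArrowBufferAppendInt32(&src, 1234), NANOARROW_OK);
uint8_t* data_before_move = src.data;

// For the CPU device this should be a pure ownership transfer: no bytes copied
struct ArrowBuffer dst;
ASSERT_EQ(ArrowDeviceBufferMove(cpu, &src, cpu, &dst), NANOARROW_OK);
EXPECT_EQ(dst.data, data_before_move);  // same allocation, new owner
EXPECT_EQ(src.data, nullptr);           // src was released/reset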

This PR currently provides support for the CPU device, Apple Metal, CUDA, and CUDA_HOST (i.e., CPU memory that has been registered with CUDA, which CUDA copies under the hood).

@paleolimbot marked this pull request as draft May 25, 2023 17:39
@codecov-commenter commented May 26, 2023

Codecov Report

Merging #205 (3b5ee6c) into main (e811cfe) will decrease coverage by 0.48%.
The diff coverage is 71.26%.

@@            Coverage Diff             @@
##             main     #205      +/-   ##
==========================================
- Coverage   87.64%   87.17%   -0.48%     
==========================================
  Files          63       66       +3     
  Lines        9789    10061     +272     
==========================================
+ Hits         8580     8771     +191     
- Misses       1209     1290      +81     
Impacted Files Coverage Δ
src/nanoarrow/array.c 91.42% <0.00%> (-0.60%) ⬇️
src/nanoarrow/array_inline.h 89.72% <ø> (ø)
src/nanoarrow/nanoarrow.h 100.00% <ø> (ø)
.../nanoarrow_device/src/nanoarrow/nanoarrow_device.c 66.35% <66.35%> (ø)
.../nanoarrow_device/src/nanoarrow/nanoarrow_device.h 100.00% <100.00%> (ø)
...anoarrow_device/src/nanoarrow/nanoarrow_device.hpp 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes


@paleolimbot force-pushed the device-array-ext branch 2 times, most recently from 3c95a4d to dbb4404 June 9, 2023 17:01
@paleolimbot marked this pull request as ready for review June 9, 2023 17:32
extensions/nanoarrow_device/CMakeLists.txt (outdated; resolved)
Comment on lines 22 to 27
static void ArrowDeviceCudaAllocatorFree(struct ArrowBufferAllocator* allocator,
                                         uint8_t* ptr, int64_t old_size) {
  if (ptr != NULL) {
    cudaFree(ptr);
  }
}
Contributor:

Most GPU libraries / frameworks have their own memory pool / memory management implementations that are often asynchronous (and are ordered by CUDA streams) where they won't be able to benefit from this implementation. This is generally true for most operations: free, alloc, realloc, memset, memcpy, etc.

I'm not sure if we need an actual implementation to live within nanoarrow or if we can just define an interface for downstream libraries to implement.

Member Author (@paleolimbot):

An early version of the ArrowDeviceBufferXXX() functions had a sync_event* argument, which I removed before I saw cudaMemcpyAsync and friends in the documentation. I don't know if that's the perfect interface, but nanoarrow's generic "copy this array to the device" implementation would benefit a lot, since essentially all of those buffers can be copied in parallel.


// Pointer vs. not pointer...is there memory ownership to consider here?
cudaEvent_t* cuda_event = (cudaEvent_t*)sync_event;
cudaError_t result = cudaEventSynchronize(*cuda_event);
Contributor:

In most situations you'd want to use cudaStreamWaitEvent as opposed to this API, as it's much more efficient and doesn't unnecessarily block the CPU until things are done.
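(A hedged sketch of that pattern, assuming the consuming library owns a stream my_stream and that sync_event points at a cudaEvent_t as in the excerpt above:)

cudaEvent_t* event = (cudaEvent_t*)device_array->sync_event;
// Device-side wait: work queued on my_stream after this call will not run
// until the event fires, but the CPU continues immediately
cudaStreamWaitEvent(my_stream, *event, 0);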

Member Author (@paleolimbot):

I assume that is what a library doing GPU-to-GPU calculations would do... here I think it does need to be the blocking version (this is the method that is called before an arbitrary ArrowDeviceArray or a slice of it is copied back to the CPU).

Member:

To call cudaStreamWaitEvent you'd need to know what stream to wait on.

Member Author (@paleolimbot):

Yes, this function is intended to do the bad thing and block until it's safe to do CPU things. If there's a way to avoid the sync before copying back to the CPU it could be removed.

// specific language governing permissions and limitations
// under the License.

#include <cuda_runtime_api.h>
Contributor:

We'd likely be better off using the CUDA driver API here instead of the runtime API, as there are much stronger forward-compatibility guarantees as well as easier deployment (someone can have the driver installed but not the CUDA runtime, but not the reverse).

Member Author (@paleolimbot):

That does seem like a better fit (although it may require implementing some reference counting of our own).
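(A hedged sketch of what the driver-API flavor of the allocator hooks might look like, assuming cuInit() has already been called, a context is current, and the nanoarrow headers are included; all names here are illustrative:)

#include <errno.h>
#include <cuda.h>

static void ArrowDeviceCudaDriverAllocatorFree(struct ArrowBufferAllocator* allocator,
                                               uint8_t* ptr, int64_t old_size) {
  if (ptr != NULL) {
    cuMemFree((CUdeviceptr)ptr);
  }
}

static ArrowErrorCode ArrowDeviceCudaDriverAlloc(int64_t size_bytes, uint8_t** out) {
  CUdeviceptr dptr = 0;
  if (cuMemAlloc(&dptr, (size_t)size_bytes) != CUDA_SUCCESS) {
    return ENOMEM;
  }
  *out = (uint8_t*)dptr;
  return NANOARROW_OK;
}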

Comment on lines +86 to +87
// TODO: All these buffer copiers would benefit from cudaMemcpyAsync but there is
// no good way to incorporate that just yet
Contributor:

For what it's worth: this will likely be a blocker for most libraries / frameworks to be able to utilize things.

Member Author (@paleolimbot):

I don't think there's a technical limitation preventing ArrowDeviceArrayViewCopy() from returning after having kicked off all the buffer copies and populated the ArrowDeviceArray's sync_event... this is mostly a personal limitation (steep learning curve for me).
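(A hedged sketch of that shape; apart from sync_event, every name here is invented for illustration:)

// Kick off every buffer copy on one stream, then record an event that
// consumers can wait on instead of blocking here
cudaStream_t stream;
cudaStreamCreate(&stream);
for (int64_t i = 0; i < n_buffers; i++) {
  cudaMemcpyAsync(dst_ptrs[i], src_ptrs[i], (size_t)sizes[i],
                  cudaMemcpyHostToDevice, stream);
}

cudaEvent_t* event = (cudaEvent_t*)malloc(sizeof(cudaEvent_t));
cudaEventCreate(event);
cudaEventRecord(*event, stream);  // fires once all the copies above complete
device_array->sync_event = event;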

Comment on lines 251 to 255
} else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
           device_dst->device_type == ARROW_DEVICE_CUDA_HOST &&
           device_src->device_id == device_dst->device_id) {
  // Move
  return 0;
Contributor:

Should we handle the situations where the src is ARROW_DEVICE_CUDA_HOST and dst is ARROW_DEVICE_CUDA and vice versa?

Member Author (@paleolimbot):

I did see cuMemHostGetDevicePointer(), so I assume this is possible. I think it would require that ArrowDeviceArrayViewCopy() has a device-specific implementation (probably for the best anyway).

Contributor:

cuMemHostGetDevicePointer() gets a pointer to that pinned host memory that can be used from device code but doesn't actually copy any memory to device memory. As far as I know it can be used anywhere that device memory can be used, but obviously has different performance characteristics where that would likely be very unexpected.
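(In code, the aliasing being described is roughly the following, where pinned_host_ptr is an illustrative pointer to memory allocated with cuMemHostAlloc()/cudaHostAlloc():)

CUdeviceptr d_alias;
// No copy happens here: d_alias is a device-visible address for the same
// pinned host allocation, so device-side accesses cross the bus every time
cuMemHostGetDevicePointer(&d_alias, pinned_host_ptr, 0);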

@zeroshade (Member) left a comment:

First pass

Comment on lines +304 to +206
/// Creates a new buffer whose data member can be accessed by the GPU by
/// moving an existing buffer. If NANOARROW_OK is returned, src will have
/// been released or moved by the implementation and dst must be released by
/// the caller.
/// Implementations must check device_src and device_dst and return ENOTSUP if
/// not prepared to handle this operation.
ArrowErrorCode (*buffer_move)(struct ArrowDevice* device_src, struct ArrowBuffer* src,
                              struct ArrowDevice* device_dst, struct ArrowBuffer* dst);
Member:

Should we specify semantics for the src and dst devices, such as the src being CPU memory, etc.?

Member Author (@paleolimbot):

The idea was that an implementation might be able to handle a few directions, although it does result in verbose method implementations. For the CUDA case the generality is somewhat useful... it can also theoretically move a buffer from CUDA_HOST to CUDA, and it might be difficult to construct a method signature that captures that. That generality might also not be useful 🤷
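(A hedged skeleton of that multi-direction dispatch, with branch bodies elided; only the signature and the ENOTSUP contract come from the draft header quoted above:)

static ArrowErrorCode ArrowDeviceCudaBufferMove(struct ArrowDevice* device_src,
                                                struct ArrowBuffer* src,
                                                struct ArrowDevice* device_dst,
                                                struct ArrowBuffer* dst) {
  if (device_src->device_type == ARROW_DEVICE_CPU &&
      device_dst->device_type == ARROW_DEVICE_CUDA) {
    // copy into device memory, then release src
  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
             device_dst->device_type == ARROW_DEVICE_CUDA) {
    // pinned host memory: potentially zero-copy via the device pointer
  } else {
    return ENOTSUP;
  }
  return NANOARROW_OK;
}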

Comment on lines 313 to 217
/// \brief Copy a section of memory into a preallocated buffer
///
/// As opposed to the other buffer operations, this is designed to support
/// copying very small slices of memory.
/// Implementations must check device_src and device_dst and return ENOTSUP if
/// not prepared to handle this operation.
ArrowErrorCode (*buffer_copy)(struct ArrowDevice* device_src,
                              struct ArrowDeviceBufferView src,
                              struct ArrowDevice* device_dst,
                              struct ArrowDeviceBufferView dst);
Member:

Same question: should we put specifics as to the semantics of the device types? Should implementations have to check the device type every time for both source and destination, or have to implement multiple devices?

Comment on lines +279 to +281
static ArrowErrorCode ArrowDeviceBufferGetInt32(struct ArrowDevice* device,
                                                struct ArrowBufferView buffer_view,
                                                int64_t i, int32_t* out) {
Member:

Should there be an equivalent that doesn't do the copy?

Member Author (@paleolimbot):

I'm not sure I understand where that would be used? It's definitely suboptimal to issue copies in this way (but your suggestion of skipping validation and Keith's suggestion of leveraging async memcpy may be a workaround).

Member:

It would probably be useful for device code to be able to access the value of a specific index of the array without having to perform the copy. It also allows CPU code to find the address of the specific index (pointer into non-cpu memory) that can then be used for whatever is necessary on the device side without needing to copy the value.
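(A hedged one-liner of that no-copy accessor, using the buffer view's data union the way the replies below do:)

// Address of element i in non-CPU memory: computed on the host, but only
// safe to dereference from device code
const int32_t* device_addr = buffer_view.data.as_int32 + i;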

Contributor:

Definitely don't want to issue copies one by one like this even asynchronously, that would be really really bad performance wise and put significant pressure on the system via the GPU driver.

I would +1 @zeroshade's suggestion of skipping validation and generally anything that needs to introspect the data.

Member Author (@paleolimbot):

Ok, I updated the function name to ArrowDeviceArrayViewResolveBufferSizes() since that's what it's actually doing and added ArrowDeviceArrayViewSetArrayMinimal() that just sets array_view->buffers[i].size_bytes to -1 if it would require a copy to calculate.

For the case of "just get me the pointer value", I don't think there needs to be a function (array_view->buffers[i].data.as_int32 + some_logical_offset would do it).

For the case where we copy back to the CPU, I don't see a way around copying the last int32/int64 from the offsets buffer (or else there is no way to know how many bytes of the next buffer to copy). We can possibly mitigate the impact of that by asynchronously kicking off all the tiny copies at once?

Contributor:

Yeah, issuing two single-element copies to get the starting and ending offsets onto the CPU is necessary, but once we have those offsets we should just do pointer arithmetic to get a pointer and a size to feed into a copy call.

Ideally you want to issue the two individual element copies asynchronously (potentially on different streams so they can be overlapped even though they're tiny), synchronize the stream(s) since you need those values to use in host code, and then issue the actual data copy.
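(A hedged sketch of that sequence against the runtime API, with d_offsets/d_data/h_data and the two streams as illustrative names:)

int32_t begin, end;
// Two tiny single-element copies, issued asynchronously so they can overlap
cudaMemcpyAsync(&begin, d_offsets + offset, sizeof(int32_t),
                cudaMemcpyDeviceToHost, stream_a);
cudaMemcpyAsync(&end, d_offsets + offset + length, sizeof(int32_t),
                cudaMemcpyDeviceToHost, stream_b);
// Host code needs begin/end, so synchronize before using them
cudaStreamSynchronize(stream_a);
cudaStreamSynchronize(stream_b);
// Now copy exactly the section of the data buffer the slice refers to
cudaMemcpy(h_data, d_data + begin, (size_t)(end - begin), cudaMemcpyDeviceToHost);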

if (NANOARROW_DEVICE_WITH_CUDA)
find_package(CUDAToolkit REQUIRED)
set(NANOARROW_DEVICE_SOURCES_CUDA src/nanoarrow/nanoarrow_device_cuda.c)
set(NANOARROW_DEVICE_LIBS_CUDA CUDA::cudart)
Contributor:

If we're going to continue using the runtime, may want to use the static library instead: CUDA::cudart_static.

Would still recommend using the driver library though.

Member Author (@paleolimbot):

Driver library is a definite yes (just haven't gotten there yet).


return NANOARROW_OK;
}

static ArrowErrorCode ArrowDeviceBufferGetInt64(struct ArrowDevice* device,
Contributor:

+1 to the int32 function here

Member Author (@paleolimbot):

I think this is handled by array_view->buffer_views[i].data.as_int32 + some_index (which would get you the pointer to an element of a buffer).

Comment on lines 402 to 404
// Wait on device_array to synchronize with the CPU
NANOARROW_RETURN_NOT_OK(device->synchronize_event(ArrowDeviceCpu(), device,
                                                  device_array->sync_event, error));
Contributor:

Why do we need to synchronize on the event here? This function ultimately is just responsible for setting the pointers in the array view from the passed-in array, correct?

If so, synchronize guarantees that the data underneath the pointer is synchronized, but doesn't impact the pointers themselves at all.

Member Author (@paleolimbot):

I think I correctly separated out the case where it's needed (it is needed before copying to the CPU, correct? Or is that synchronization handled by cudaMemcpy()?).

Contributor:

All variants of cudaMemcpy are stream-ordered, so you can safely use them without synchronizing the stream. But the destination data is stream-ordered as well, so if you're going to operate on it from a different stream or from host code, you need to synchronize the stream in some kind of way.

Member Author (@paleolimbot):

It seems like we do need to do the bad thing and wait for a sync before calling cudaMemcpy() for the GPU -> CPU direction (although hopefully this is now isolated such that it won't get accidentally called by somebody who does not need it).

Contributor:

We should sync after calling the cudaMemcpy() as opposed to before it. Otherwise, you could in theory get into a situation where that device to host copy is asynchronous (if you have pinned host memory for example) and accessing it from the CPU without synchronization is a race condition.
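(That ordering in a hedged sketch; host_dst, device_src, and size_bytes are illustrative:)

// Device -> host copy; per the comment above, this can behave asynchronously
// when the destination is pinned host memory
cudaMemcpy(host_dst, device_src, (size_t)size_bytes, cudaMemcpyDeviceToHost);
// Synchronize afterwards so host code never races the copy
cudaStreamSynchronize(0);  // 0 = the default stream cudaMemcpy is ordered on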

Comment on lines 442 to 443
NANOARROW_RETURN_NOT_OK(ArrowDeviceBufferInit(device_src, buffer_view_src, device_dst,
                                              ArrowArrayBuffer(dst, i)));
Contributor:

If this buffer initialization is asynchronous, then we need to set a synchronization event somewhere, I think?

@paleolimbot (Member Author) commented Jun 22, 2023

Ok! There are definitely some holes in this implementation (notably around properly synchronizing memory copies). I'd propose that this PR get merged (and clearly marked as under development/experimental in the README), with some related changes grouped into follow-ups.

I think it's still an open question as to whether or not this particular extension will be used/will be useful...if there is no interest in using it before the next release it can always be excluded from the source release (like the Python bindings currently are) or moved back to a PR state.

@paleolimbot merged commit 086793a into apache:main Jun 22, 2023
@paleolimbot deleted the device-array-ext branch September 19, 2023