Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/CUDA_COPY: add multi-device support in cuda_copy #9645

Open
wants to merge 38 commits into
base: master
Choose a base branch
from

Conversation

Akshay-Venkatesh
Copy link
Contributor

@Akshay-Venkatesh Akshay-Venkatesh commented Jan 30, 2024

What/Why?

Allow a single UCP context to handle multiple CUDA devices for cuda_copy transport. This enables use cases under Legion/Realm, OpenACC, and MPI workloads that prefer 1:N process-to-GPU mapping than the default current 1:1 mapping.

How ?

CUDA stream and event resources which were previously tied to iface now are tied to each newly detected cuda device context. When resources are needed, context ID is looked up using a hashtable and appropriate resources are picked.

TODO

  1. Need a way to detect if cuda context is destroyed before destroying stream/event resources associated with that context (not going to cleanup resources and leave it to the OS to handle it)
  2. Need to check if stream bitmap is needed for flush operations and flush each individually using streamsync (removed)

ucs_memory_type_t src_type, ucs_memory_type_t dst_type)
{
CUstream *stream = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
CUstream *stream = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the NULL assignment necessary? Line 70 will always overwrite it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed. Will leave it uninitialized.

} else {
status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
if (UCS_OK != status) {
ucs_error("unable to get resources associated with cuda context");
return UCS_ERR_IO_ERROR;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block of code (lines 128-137) is repeated in put_short and get_short as well. Maybe put it in a function or macro?

UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
ucs_error("attempt to perform cuda memcpy without active context");
return UCS_ERR_IO_ERROR;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return error or attempt to set the context as we have the buffers? Though, it may not be in the scope of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope for this PR is for the thread to have set the right context. Will address this in a follow up PR.

static UCS_CLASS_INIT_FUNC(uct_cuda_copy_iface_t, uct_md_h md, uct_worker_h worker,
const uct_iface_params_t *params,
const uct_iface_config_t *tl_config)
void uct_cuda_copy_cleanup_per_ctx_rscs(uct_cuda_copy_per_ctx_rsc_t *ctx_rsc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this can be a static function.

UCS_BITMAP_CLEAR(&self->streams_to_sync);
}

ucs_status_t uct_cuda_copy_init_per_ctx_rscs(uct_cuda_copy_iface_t *iface,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this can be a static function.
Also, I think this function does not need the iface. The max_cuda_events value can be passed directly as an argument instead. If you decide to keep passing the iface, then let's add const.


return UCS_OK;
}

static UCS_CLASS_CLEANUP_FUNC(uct_cuda_copy_iface_t)
ucs_status_t uct_cuda_copy_get_ctx_rscs(uct_cuda_copy_iface_t *iface,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have _per_ in the name to be consistent with init_per_ctx_rscs and cleanup_per_ctx_rscs functions?

UCT_CUDADRV_FUNC_LOG_ERR(cuStreamDestroy(ctx_rsc->short_stream));
}

ucs_mpool_cleanup(&ctx_rsc->cuda_event_desc, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this mpool cleanup be called even if the ctx_rsc->cuda_ctx is not valid anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If context is not present, then context could've been destroyed by the user before ucp_context_destroy or MPI_Finalize.

* to push and pop the context associated with address (which should be
* non-NULL if we are at this point)*/
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
(CUdeviceptr)address);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we move cuCtxPopCurrent(&cuda_popped_ctx); to here? Because we want to pop the pushed context regardless of the success or failure of cuMemGetAddressRange

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. @brminich made the same suggestion too.

Comment on lines 277 to 289
/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
ucs_error("attempt to perform cuda memcpy without active context");
return UCS_ERR_IO_ERROR;
} else {
status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
if (UCS_OK != status) {
ucs_error("unable to get resources associated with cuda context");
return UCS_ERR_IO_ERROR;
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could be a common inline function to get current ctx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to inline function.


/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (current_ctx == NULL) {
if (ucs_unlikely(current_ctx == NULL)) {

ucs_memory_type_t src, dst;
ucs_mpool_params_t mp_params;
unsigned long long ctx_id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unsigned long long ctx_id;
unsigned long long ctx_id;

Comment on lines 377 to 391
/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
* to push and pop the context associated with address (which should be
* non-NULL if we are at this point)*/
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
(CUdeviceptr)address);
if (cu_err != CUDA_SUCCESS) {
cuCtxPopCurrent(&cuda_popped_ctx);
ucs_error("cuMemGetAddressRange(%p) error: %s", address,
uct_cuda_base_cu_get_error_string(cu_err));
return UCS_ERR_INVALID_ADDR;
}

cuCtxPopCurrent(&cuda_popped_ctx);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
* to push and pop the context associated with address (which should be
* non-NULL if we are at this point)*/
cuCtxPushCurrent(cuda_mem_ctx);
cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
(CUdeviceptr)address);
if (cu_err != CUDA_SUCCESS) {
cuCtxPopCurrent(&cuda_popped_ctx);
ucs_error("cuMemGetAddressRange(%p) error: %s", address,
uct_cuda_base_cu_get_error_string(cu_err));
return UCS_ERR_INVALID_ADDR;
}
cuCtxPopCurrent(&cuda_popped_ctx);
/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
* to push and pop the context associated with address (which should be
* non-NULL if we are at this point)*/
cuCtxPushCurrent(cuda_mem_ctx);
cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
(CUdeviceptr)address);
cuCtxPopCurrent(&cuda_popped_ctx);
if (cu_err != CUDA_SUCCESS) {
ucs_error("cuMemGetAddressRange(%p) error: %s", address,
uct_cuda_base_cu_get_error_string(cu_err));
return UCS_ERR_INVALID_ADDR;
}

Copy link
Contributor Author

@Akshay-Venkatesh Akshay-Venkatesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I'll make these changes today.

ucs_memory_type_t src_type, ucs_memory_type_t dst_type)
{
CUstream *stream = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
CUstream *stream = NULL;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed. Will leave it uninitialized.

UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
ucs_error("attempt to perform cuda memcpy without active context");
return UCS_ERR_IO_ERROR;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope for this PR is for the thread to have set the right context. Will address this in a follow up PR.

Comment on lines 277 to 289
/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
ucs_error("attempt to perform cuda memcpy without active context");
return UCS_ERR_IO_ERROR;
} else {
status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
if (UCS_OK != status) {
ucs_error("unable to get resources associated with cuda context");
return UCS_ERR_IO_ERROR;
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to inline function.

UCT_CUDADRV_FUNC_LOG_ERR(cuStreamDestroy(ctx_rsc->short_stream));
}

ucs_mpool_cleanup(&ctx_rsc->cuda_event_desc, 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If context is not present, then context could've been destroyed by the user before ucp_context_destroy or MPI_Finalize.

* to push and pop the context associated with address (which should be
* non-NULL if we are at this point)*/
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
(CUdeviceptr)address);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. @brminich made the same suggestion too.

@Akshay-Venkatesh Akshay-Venkatesh marked this pull request as ready for review February 22, 2024 20:52
@Akshay-Venkatesh
Copy link
Contributor Author

@brminich I see one of the commits had an extra colon and 2 commit style tests are failing because of that. Would it be ok to rebase? I can wait to do this until all the reviewers have had a chance to look at my comments and code changes.

cc @rakhmets @SeyedMir

@SeyedMir
Copy link
Contributor

@Akshay-Venkatesh Rebase is fine with me.

@brminich
Copy link
Contributor

@Akshay-Venkatesh, no problem from my side

Copy link
Contributor

@rakhmets rakhmets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebase is OK for me too.

@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/cuda-copy-multi-dev branch from bb7c190 to fb1d3be Compare February 26, 2024 19:02
@Akshay-Venkatesh
Copy link
Contributor Author

Akshay-Venkatesh commented Feb 28, 2024

@brminich @rakhmets @SeyedMir

FYI, in dd8b66d I had to remove all code that does EventDestroy or StreamDestroy as CUDA doesn't have a way to query if a give CUcontext has been destroyed or not and calling Stream/EventDestroy on streams/events whose context has been destroyed is potentially unsafe. For this reason we will have to leave it to the point when the process is cleaned up. This should be safe from UCX's viewpoint as all UCT resources are tied to some UCP context and there isn't a concern of reusing streams/events that haven't been cleaned up (as they are not global).

Also, it looks like cuCtxGetId is supported for CUDA >=12.0. Without context ID, we don't have a way to query which context we're trying to use and pick associated stream/event resources for transport operations. We cannot use CUcontext handle itself instead of context ID because we cannot assume that the handle returned by say cuCtxGetCurrent will always return the same handle as opposed to a handle that has the same properties. So it seems that multi-device support will need CUDA >= 12.0. We should discuss more about this.

} uct_cuda_copy_per_ctx_rsc_t;


KHASH_MAP_INIT_INT64(cuda_copy_ctx_rscs, struct uct_cuda_copy_per_ctx_rsc*);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can store uct_cuda_copy_per_ctx_rsc (not the pointer), then you would need to do alloc/free during put.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brminich Will incorporate this change.

CUcontext cuda_ctx;
unsigned long long ctx_id;
/* pool of cuda events to check completion of memcpy operations */
ucs_mpool_t cuda_event_desc;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need to have this mpool per context? Maybe one common mpool is enough?

Copy link
Contributor Author

@Akshay-Venkatesh Akshay-Venkatesh Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Event and stream resources are associated with a context. We do need an mpool for each context.

@rakhmets rakhmets force-pushed the topic/cuda-copy-multi-dev branch from aebede3 to af3206a Compare February 4, 2025 13:27
@@ -46,32 +46,31 @@ UCS_CLASS_DEFINE_DELETE_FUNC(uct_cuda_copy_ep_t, uct_ep_t);

ucs_status_t uct_cuda_copy_init_stream(CUstream *stream)
{
if (*stream != 0) {
if (*stream != NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also make this func inline?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlikely


status = UCT_CUDADRV_FUNC_LOG_ERR(
cuEventRecord(cuda_event->event, *stream));
status = UCT_CUDADRV_FUNC_LOG_ERR(cuEventRecord(cuda_event->event, stream));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that cuMemcpyAsync fails, but cuEventRecord succeeds?
status will b eoverwritten then

@@ -46,32 +46,31 @@ UCS_CLASS_DEFINE_DELETE_FUNC(uct_cuda_copy_ep_t, uct_ep_t);

ucs_status_t uct_cuda_copy_init_stream(CUstream *stream)
{
if (*stream != 0) {
if (*stream != NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlikely

Comment on lines 535 to 559
if (uct_cuda_base_context_match(cuda_context, self->cuda_context)) {

ucs_memory_type_for_each(src) {
ucs_memory_type_for_each(dst) {
stream = &self->queue_desc[src][dst].stream;
event_q = &self->queue_desc[src][dst].event_queue;

if (!ucs_queue_is_empty(event_q)) {
ucs_warn("stream destroyed but queue not empty");
}

if (*stream == 0) {
continue;
}

UCT_CUDADRV_FUNC_LOG_ERR(cuStreamDestroy(*stream));
}
}

if (self->short_stream) {
UCT_CUDADRV_FUNC_LOG_ERR(cuStreamDestroy(self->short_stream));
}
}
kh_foreach_value(&self->ctx_rscs, ctx_rsc, {
ucs_mpool_cleanup(&ctx_rsc.cuda_event_desc, 1);
});

ucs_mpool_cleanup(&self->cuda_event_desc, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to cleanup

Comment on lines 158 to 169
int num_devices;
ASSERT_EQ(cudaGetDeviceCount(&num_devices), cudaSuccess);

if (num_devices < 2) {
UCS_TEST_SKIP_R("less than two cuda devices available");
}

uct_p2p_rma_test::test_xfer(send, length, flags, mem_type);

int current_device;
ASSERT_EQ(cudaGetDevice(&current_device), cudaSuccess);
ASSERT_EQ(cudaSetDevice((current_device + 1) % num_devices), cudaSuccess);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead can we run xfer tests on all avail cuda devices? (up to 2)? instead of adding another test class
maybe even using c++ iterator

@rakhmets rakhmets added WIP-DNM Work in progress / Do not review and removed Ready for Review labels Mar 4, 2025
@rakhmets rakhmets force-pushed the topic/cuda-copy-multi-dev branch from e1ab8ce to 6786655 Compare March 6, 2025 19:33
}
#endif

status = uct_cuda_copy_ctx_create_validator(ctx, &ctx_rsc->validator);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you keep validator just in mpool_priv and initialize it after mpool creation?
why is it needed to also keep it in the context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed extra field.

brminich
brminich previously approved these changes Mar 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP-DNM Work in progress / Do not review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants