Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate NVSHMEM into WholeGraph #91

Merged

Conversation

chuangz0
Copy link
Contributor

Support gather and scatter operations with NVSHMEM when using the distributed embedding.

The way to use nvshmem is

#build command:  ./build.sh --enable-nvshmem ....


    global_comm, local_comm = wgth.init_torch_env_and_create_wm_comm(
        wgth.get_rank(),
        wgth.get_world_size(),
        wgth.get_local_rank(),
        wgth.get_local_size(),
        options.distributed_backend_type
    )

If distributed_backend_type is "nvshmem", then the gather and scatter operation for distributed wholememory created by global_comm will be implemented with nvshmem.
Distributed embedding with cache is not supported when using nvshmem.

@chuangz0 chuangz0 added feature request New feature or request non-breaking Introduces a non-breaking change labels Nov 15, 2023
@chuangz0 chuangz0 force-pushed the nvshmem_intergartion_sort_idx branch 2 times, most recently from fa6325a to cf273c4 Compare November 17, 2023 10:00
@BradReesWork BradReesWork added this to the 23.12 milestone Nov 17, 2023
cpp/include/wholememory/nvshmem_template.cuh Outdated Show resolved Hide resolved
Comment on lines 158 to 162
wholememory_error_code_t wholememory_communicator_set_preferred_distributed_backend(
wholememory_comm_t comm, wholememory_distributed_backend_t distributed_backend);

wholememory_distributed_backend_t wholememory_communicator_get_preferred_distributed_backend(
wholememory_comm_t comm);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by preferred here? Does this mean WholeMemory will try to set back-end based on the input argument but may end up setting the back-end to something else? If this is indeed the case, some documentation will help. If this functions returns on error code if the back-end can't be set as specified, then, better remove preferred from the function name to avoid confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change preferred_distributed_backend to distributed_backend

{
wholememory_error_code_t rc = WHOLEMEMORY_SUCCESS;
// TODO:
abort();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to use abort (is this std::abort?) over std::exit? std::abort implies forced exit due to some errors. If this function is to force termination, better rename this function to ..._global_abort(). Otherwise, why not call std::exit with the passed status?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a boostrap plugin for nvshmem, in the new commit, we refer to the way the mpi boostrap plugin in nvshmem, which use comm_abort() function of the comm.

Comment on lines +1183 to +1194
for (int i = 0; i < comm_->world_size; i++) {
size_t mem_size_of_this_rank_and_after = total_size_ - acc_size;
size_t mem_size_for_current_rank =
std::min(mem_size_of_this_rank_and_after, rank_partition_strategy_.partition_mem_stride);
acc_size += mem_size_for_current_rank;
uint64_t int_start_ptr =
reinterpret_cast<uint64_t>(nvshmem_ptr(nvshmem_memory_handle_.local_alloc_mem_ptr, i));
if (int_start_ptr == 0) continue;
if (int_ptr >= int_start_ptr && int_ptr < int_start_ptr + mem_size_for_current_rank) {
return true;
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have intention to modernize the codebase overtime, you may consider using std::any_of. This better documents the intention of this code (if any_of int_ptr satisfies the condition, return true, otherwise, return false).

@chuangz0 chuangz0 force-pushed the nvshmem_intergartion_sort_idx branch from ef5065d to 9e405ec Compare November 21, 2023 09:02
@chuangz0 chuangz0 requested a review from a team as a code owner November 21, 2023 09:02
@raydouglass raydouglass removed the request for review from a team November 27, 2023 19:36
@raydouglass
Copy link
Member

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@BradReesWork
Copy link
Member

/merge

1 similar comment
@BradReesWork
Copy link
Member

/merge

@raydouglass raydouglass merged commit 6b56361 into rapidsai:branch-23.12 Nov 27, 2023
43 checks passed
@raydouglass
Copy link
Member

I admin merged due to https://www.githubstatus.com/incidents/66vhjmd266r9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants