-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate NVSHMEM into WholeGraph #91
Integrate NVSHMEM into WholeGraph #91
Conversation
fa6325a
to
cf273c4
Compare
wholememory_error_code_t wholememory_communicator_set_preferred_distributed_backend( | ||
wholememory_comm_t comm, wholememory_distributed_backend_t distributed_backend); | ||
|
||
wholememory_distributed_backend_t wholememory_communicator_get_preferred_distributed_backend( | ||
wholememory_comm_t comm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by preferred
here? Does this mean WholeMemory will try to set back-end based on the input argument but may end up setting the back-end to something else? If this is indeed the case, some documentation will help. If this functions returns on error code if the back-end can't be set as specified, then, better remove preferred
from the function name to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change preferred_distributed_backend
to distributed_backend
{ | ||
wholememory_error_code_t rc = WHOLEMEMORY_SUCCESS; | ||
// TODO: | ||
abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to use abort
(is this std::abort
?) over std::exit
? std::abort
implies forced exit due to some errors. If this function is to force termination, better rename this function to ..._global_abort()
. Otherwise, why not call std::exit with the passed status
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a boostrap plugin for nvshmem, in the new commit, we refer to the way the mpi boostrap plugin in nvshmem, which use comm_abort()
function of the comm.
for (int i = 0; i < comm_->world_size; i++) { | ||
size_t mem_size_of_this_rank_and_after = total_size_ - acc_size; | ||
size_t mem_size_for_current_rank = | ||
std::min(mem_size_of_this_rank_and_after, rank_partition_strategy_.partition_mem_stride); | ||
acc_size += mem_size_for_current_rank; | ||
uint64_t int_start_ptr = | ||
reinterpret_cast<uint64_t>(nvshmem_ptr(nvshmem_memory_handle_.local_alloc_mem_ptr, i)); | ||
if (int_start_ptr == 0) continue; | ||
if (int_ptr >= int_start_ptr && int_ptr < int_start_ptr + mem_size_for_current_rank) { | ||
return true; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have intention to modernize the codebase overtime, you may consider using std::any_of
. This better documents the intention of this code (if any_of int_ptr
satisfies the condition, return true, otherwise, return false).
ef5065d
to
9e405ec
Compare
Removing |
/merge |
1 similar comment
/merge |
I admin merged due to https://www.githubstatus.com/incidents/66vhjmd266r9 |
Support
gather
andscatter
operations with NVSHMEM when using thedistributed
embedding.The way to use
nvshmem
isIf
distributed_backend_type
is"nvshmem"
, then thegather
andscatter
operation for distributed wholememory created byglobal_comm
will be implemented withnvshmem
.Distributed embedding with
cache
is not supported when using nvshmem.