Skip to content
This repository has been archived by the owner on Nov 25, 2024. It is now read-only.

support different entry size for different ranks #194

Merged

Conversation

zhuofan1123
Copy link
Contributor

Allow users to specify the entry size on each rank.

    node_feat_wm_embedding = wgth.create_embedding(
        ...
        embedding_entry_partition=[283071, 401722, 356680, 329221, 238065, 238060, 217897, 384313]
    )
  1. embedding_entry_partition[i] indicates the number of embedding entries stored on the rank i.
  2. If embedding_entry_partition is None, embedding will be partitioned equally.
  3. Only chunked device and distributed host/device are supported.

Copy link

copy-pr-bot bot commented Jul 22, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@zhuofan1123 zhuofan1123 force-pushed the support_different_sizes branch from 64cfb03 to 35f192b Compare July 23, 2024 06:29
Copy link
Contributor

@linhu-nv linhu-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work. Some comments are added in the codes.

@@ -478,15 +518,35 @@ def create_embedding_from_filelist(
)
total_file_size += file_size
total_entry_count = total_file_size // file_entry_size
if embedding_entry_partition is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can omit this because similar check will always be done in create_embedding()

@@ -283,8 +311,27 @@ def create_wholememory_tensor_from_filelist(
else:
sizes = [total_entry_count, last_dim_size]
strides = [last_dim_strides, 1]
if tensor_entry_partition is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, I think this could be omitted too, since we will do the check in create_wholememory_tensor()

cdef wholememory_error_code_t wholememory_tensor_get_entry_offsets(
size_t * entry_offsets, wholememory_tensor_t wholememory_tensor);

cdef wholememory_error_code_t wholememory_tensor_get_entry_partition(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps this function should be named as wholememory_tensor_get_entry_partition_sizes(), similar to that for wholememory embedding

if (mem_size_for_current_rank > 0) {
void* ptr = nvshmem_ptr(nvshmem_memory_handle_.local_alloc_mem_ptr, i);
if (ptr != nullptr) {
register_wholememory_vma_range_locked(ptr, mem_size_for_current_rank, handle_);
}
ptr = nvshmem_ptr(nvshmem_memory_handle_.local_alloc_mem_ptr, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why we need this line?


WHOLEMEMORY_RETURN_ON_FAIL(
wholememory_get_rank_partition_offsets(host_embedding_entry_offsets_ptr, wholememory_handle));
for (int i = 0; i < world_size + 1; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps each process only need to do check of local memory

host_embedding_entry_offsets_ptr[i] /= embedding_entry_size;
}

WM_CUDA_CHECK(cudaMemcpy(dev_embedding_entry_offsets_ptr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to put cudaMemcpy() into $stream, instead of default stream. There are many calls to this function.


size_t element_size = wholememory_dtype_get_element_size(wholememory_desc.dtype);
size_t embedding_entry_size = element_size * wholememory_desc.stride;
for (int i = 0; i < world_size + 1; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, maybe only check own rank is enough.


size_t element_size = wholememory_dtype_get_element_size(wholememory_desc.dtype);
size_t embedding_entry_size = element_size * wholememory_desc.stride;
for (int i = 0; i < world_size + 1; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only check one rank.

@@ -238,13 +238,16 @@ TEST_P(WholeMemoryEmbeddingParameterTests, EmbeddingGatherTest)
wholememory_tensor_description_t embedding_tensor_description;
wholememory_copy_matrix_desc_to_tensor(&embedding_tensor_description,
&params.embedding_description);

std::vector<size_t> rank_partition(world_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we should also allow "random" and "default" partition, just like pytest.

@zhuofan1123 zhuofan1123 force-pushed the support_different_sizes branch 4 times, most recently from 1b6dd93 to 83fd16c Compare July 30, 2024 05:55
@linhu-nv
Copy link
Contributor

Seems good to me.

@BradReesWork
Copy link
Member

/okay to test

@BradReesWork BradReesWork added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jul 30, 2024
@zhuofan1123 zhuofan1123 force-pushed the support_different_sizes branch from 03c4bf3 to e550bc3 Compare July 30, 2024 13:25
@BradReesWork
Copy link
Member

/okay to test

@zhuofan1123 zhuofan1123 force-pushed the support_different_sizes branch from e550bc3 to 624c565 Compare July 31, 2024 05:15
@BradReesWork
Copy link
Member

/okay to test

@zhuofan1123 zhuofan1123 force-pushed the support_different_sizes branch from 624c565 to e661c4f Compare August 2, 2024 08:30
@zhuofan1123 zhuofan1123 requested review from a team as code owners August 2, 2024 08:30
@zhuofan1123 zhuofan1123 requested a review from AyodeAwe August 2, 2024 08:30
@zhuofan1123 zhuofan1123 changed the base branch from branch-24.08 to branch-24.10 August 2, 2024 08:36
@BradReesWork
Copy link
Member

/merge

@BradReesWork
Copy link
Member

/okay to test

@rapids-bot rapids-bot bot merged commit 91b7dcd into rapidsai:branch-24.10 Aug 2, 2024
48 checks passed
@zhuofan1123 zhuofan1123 deleted the support_different_sizes branch August 26, 2024 03:49
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants