
Update and fix Vulkan soft_max and argsort implementations #7237

Merged: 2 commits merged into master on May 18, 2024

Conversation

@0cc4m (Collaborator) commented May 12, 2024

I updated Vulkan for the changes in #7192 and fixed a bug in the soft_max implementation. That allowed me to clean up some code that was only needed for the three-input-tensor soft_max op.

I also updated and fixed the argsort implementation. Now test-backend-ops fully passes for the Vulkan backend.
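For reference, the op the shader has to reproduce is a scaled, masked softmax over each row of logits. Below is a minimal CPU-side sketch of those semantics, assuming the common formulation softmax(scale * x + mask) per row; the names are illustrative only, not the ggml API:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference row softmax: scale the logits, add the additive mask
// (e.g. -INFINITY for masked positions), then normalize. Uses the
// usual max subtraction for numerical stability.
std::vector<float> soft_max_row(const std::vector<float> & x,
                                const std::vector<float> & mask,
                                float scale) {
    std::vector<float> t(x.size());
    float max_val = -INFINITY;
    for (std::size_t i = 0; i < x.size(); ++i) {
        t[i] = scale * x[i] + mask[i];     // scale logits, add mask
        max_val = std::max(max_val, t[i]); // track row max for stability
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < t.size(); ++i) {
        t[i] = std::exp(t[i] - max_val);   // numerically stable exponentiation
        sum += t[i];
    }
    for (float & v : t) {
        v /= sum;                          // row sums to 1
    }
    return t;
}

test-backend-ops compares each backend's output for ops like this against the CPU reference, which is why a full pass is a meaningful signal for the Vulkan backend.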

@0cc4m requested a review from ggerganov May 12, 2024 07:00
@mofosyne added the Vulkan and Review Complexity : High labels May 12, 2024
@Adriankhl (Contributor) commented May 12, 2024

Not sure if this is the right place to discuss this, but I am digging into issue #7130.

Here is the root cause:

Embedding computation always tries to allocate a zero-size buffer first.

Because of size += TENSOR_ALIGNMENT, size is always greater than 0 for the CPU backend (not sure if this is the correct behaviour, though), so the CPU backend can always allocate a buffer successfully.

llama.cpp/ggml-backend.c

Lines 625 to 631 in b228aba

GGML_CALL static ggml_backend_buffer_t ggml_backend_cpu_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    size += TENSOR_ALIGNMENT;   // malloc may return an address that is not aligned
    void * data = malloc(size); // TODO: use GGML_ALIGNED_MALLOC (move to ggml-impl.h)
    if (data == NULL) {
        fprintf(stderr, "%s: failed to allocate buffer of size %zu\n", __func__, size);
        return NULL;
    }

For the Vulkan backend, ptr is still nullptr here after ggml_vk_host_malloc if size is 0.

llama.cpp/ggml-vulkan.cpp

Lines 6031 to 6043 in b228aba

GGML_CALL static ggml_backend_buffer_t ggml_backend_vk_host_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
#ifdef GGML_VULKAN_DEBUG
    std::cerr << "ggml_backend_vk_host_buffer_type_alloc_buffer(" << size << ")" << std::endl;
#endif
    void * ptr = nullptr;
    try {
        ptr = ggml_vk_host_malloc(&vk_instance.contexts[0], size);
    } catch (vk::SystemError& e) {
        std::cerr << "ggml_vulkan: Failed to allocate pinned memory." << std::endl;
        std::cerr << "ggml_vulkan: " << e.what() << std::endl;
        // fallback to cpu buffer
        return ggml_backend_buft_alloc_buffer(ggml_backend_cpu_buffer_type(), size);
    }

And because ggml_vk_host_malloc completes without throwing an exception, the catch block is never entered, and the null pointer causes problems later on.

Should there be a null check here that throws an exception? Falling back to the CPU buffer actually works, despite the warning.
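A sketch of the guard being suggested, placed right after the try/catch above (hypothetical, not necessarily what was merged): treat a null pointer the same way as a vk::SystemError and take the existing CPU fallback path.

    // Hypothetical null check: ggml_vk_host_malloc can return nullptr
    // without throwing (e.g. for a zero-size request), so mirror the
    // catch block and fall back to a CPU buffer instead of wrapping a
    // null pointer in a backend buffer.
    if (ptr == nullptr) {
        std::cerr << "ggml_vulkan: host malloc returned null for size " << size << ", falling back to CPU buffer." << std::endl;
        return ggml_backend_buft_alloc_buffer(ggml_backend_cpu_buffer_type(), size);
    }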

@mofosyne added the bugfix label May 12, 2024
@ggerganov (Owner) left a comment

Might be a good idea before merging to run the 2 tests from #7192 and verify that the output is reasonable.

@Adriankhl (Contributor) replied, quoting his earlier analysis:

Nevermind, the issue is much deeper than this. Please ignore it here

@0cc4m merged commit c1b295e into master May 18, 2024 (60 checks passed)
@0cc4m deleted the 0cc4m/soft-max-fix branch May 18, 2024 06:11