sync : ggml (im2col, GPU conv, 32-bit arm compat) #4060

Merged
merged 1 commit into from
Nov 13, 2023

Conversation

ggerganov
Owner

@ggerganov ggerganov commented Nov 13, 2023

The main update here is the new implementation of the convolution ops ggml_conv_1d and ggml_conv_2d. They are now represented as a sequence of two other ops, ggml_im2col + ggml_mul_mat, which allows reusing the matrix multiplication code (more info: ggerganov/ggml#564).

The convolutions can now be offloaded to the GPU (both CUDA and Metal are supported), so with this change we should likely be able to support CLIP running fully on the GPU.
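
For readers unfamiliar with the im2col trick described above, here is a minimal sketch in plain C of a 1-D convolution expressed as an unfold step followed by a matrix product. It is not the ggml implementation: the function names (conv_1d_out_len, im2col_1d, mul_mat_vec, conv_1d_direct) are hypothetical, and the sketch assumes a single channel, float data, zero padding, and no kernel flip (cross-correlation). The direct loop is included only as a reference to check the two paths agree.

```c
#include <assert.h>

/* Output length of a 1-D convolution with `pad` zeros on each side.
   All names in this sketch are illustrative, not part of ggml. */
int conv_1d_out_len(int n_in, int k, int stride, int pad) {
    assert(k > 0 && stride > 0 && n_in + 2 * pad >= k);
    return (n_in + 2 * pad - k) / stride + 1;
}

/* im2col: unfold the signal into an n_out x k matrix (row-major).
   Row o holds the k input samples seen by output position o;
   samples that fall into the padding read as 0. */
void im2col_1d(const float *src, int n_in, int k, int stride, int pad,
               float *cols, int n_out) {
    for (int o = 0; o < n_out; ++o) {
        for (int t = 0; t < k; ++t) {
            int i = o * stride + t - pad;
            cols[o * k + t] = (i >= 0 && i < n_in) ? src[i] : 0.0f;
        }
    }
}

/* Matrix product step: dst[o] = sum over t of cols[o][t] * w[t].
   With several output channels this becomes a full matrix-matrix product,
   which is why an existing matmul routine can be reused. */
void mul_mat_vec(const float *cols, const float *w, int n_out, int k,
                 float *dst) {
    for (int o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (int t = 0; t < k; ++t) {
            acc += cols[o * k + t] * w[t];
        }
        dst[o] = acc;
    }
}

/* Reference: the same convolution written as a direct nested loop. */
void conv_1d_direct(const float *src, int n_in, const float *w, int k,
                    int stride, int pad, float *dst, int n_out) {
    for (int o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (int t = 0; t < k; ++t) {
            int i = o * stride + t - pad;
            if (i >= 0 && i < n_in) {
                acc += src[i] * w[t];
            }
        }
        dst[o] = acc;
    }
}
```

Calling im2col_1d and then mul_mat_vec on the same input should give the same result as conv_1d_direct. The cost of the unfold is extra memory (each input sample is copied up to k times), which is the usual trade-off for being able to hand the heavy lifting to an optimized matrix multiplication.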


🤖 Generated by Copilot at 9f72de7

Summary

🚀🐛🧹

This pull request adds and improves the im2col feature for half-precision tensors in the CUDA and Metal APIs, fixes various bugs and warnings in the code, and updates the public interface and the internal macros of the llama.cpp library. The im2col feature allows faster convolution operations on the GPU. The bug fixes and updates aim to enhance the correctness, efficiency, and compatibility of the library. The files affected by this pull request are ggml-cuda.cu, ggml-metal.m, ggml-metal.metal, ggml-quants.c, ggml.h, ggml-impl.h, and ggml-metal.h.

The pull request has many changes
To improve the ggml ranges
It adds im2col
And fixes some holes
In the macros and the half-precisions

Walkthrough

  • Implement im2col feature for half-precision tensors using CUDA and Metal
  • Support matrix-vector multiplication on half-precision tensors using CUDA and Metal
  • Fix bugs and improve performance in ggml_vec_dot_q2_K_q8_K, ggml_vec_dot_q3_K_q8_K, ggml_vec_dot_q4_K_q8_K, ggml_vec_dot_q5_K_q8_K, and ggml_vec_dot_q6_K_q8_K functions by using compatible types and functions for 64-bit architectures
  • Fix bug in ggml_cuda_op_mul_mat function by checking the batch sizes of the input tensors
  • Fix bug in ggml_metal_op_positional_encoding function by using the correct index for the n_orig_ctx parameter
  • Fix typo in ggml_cuda_op_mul_mat function by using the correct pointer for the input tensor
  • Fix compiler warning in ggml_metal_init function by casting the enum value to int
  • Remove redundant definitions of MIN and MAX macros from ggml-impl.h file
  • Remove redundant definitions of inline functions that emulate ARM NEON intrinsics from ggml-quants.c file
  • Increase the value of GGML_METAL_MAX_BUFFERS macro from 16 to 64 in ggml-metal.h file
  • Remove unused enum values from ggml_op enum in ggml.h file
  • Add GGML_ATTRIBUTE_FORMAT macro to ggml_metal_log function declaration in ggml-metal.m file
  • Add function name as the first argument to GGML_METAL_LOG_WARN macro in ggml-metal.m file
  • Add space between variable and attribute qualifier in kernel_mul_mv_f16_f32_1row function in ggml-metal.metal file
  • Add empty line to separate header inclusion section from the rest of the code in ggml-quants.c file
  • Add more conditions to #if directive that checks if immintrin.h header should be included in ggml-quants.c file
  • Add check for GGML_METAL_PATH_RESOURCES environment variable in ggml_metal_init function in ggml-metal.m file
  • Add padding to threadgroup memory length arguments in ggml_metal_op_softmax function in ggml-metal.m file
  • Comment out unused variable assignment in ggml-metal.m file

@ggerganov ggerganov changed the title from "ggml : sync (im2col, GPU conv, 32-bit arm compat)" to "sync : ggml (im2col, GPU conv, 32-bit arm compat)" Nov 13, 2023
@FSSRepo
Collaborator

FSSRepo commented Nov 13, 2023

support CLIP running fully on the GPU

Only the ggml_acc operation still needs CUDA and Metal kernels 💀 for that. Also, the CUDA implementation of ggml_repeat is slow, so there is still room to improve performance.

@ggerganov ggerganov merged commit 3d68f36 into master Nov 13, 2023
34 checks passed
@ggerganov ggerganov deleted the sync branch November 13, 2023 14:55
@ggerganov ggerganov mentioned this pull request Nov 13, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
@cebtenzzre cebtenzzre mentioned this pull request Nov 25, 2023
2 participants