sync : ggml (im2col, GPU conv, 32-bit arm compat) #4060
Merged
The main update here is the new implementation of the convolution ops `ggml_conv_1d` and `ggml_conv_2d`. They are now represented as a sequence of two other ops, `ggml_im2col` + `ggml_mul_mat`, which allows the matrix multiplication code to be reused (more info: ggerganov/ggml#564).

The convolutions can now be offloaded to the GPU (both CUDA and Metal are supported), so with this change we should likely be able to support CLIP running fully on the GPU.
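To illustrate the idea of expressing a convolution as im2col followed by a matrix multiplication, here is a minimal sketch in plain C. It does not use the ggml API: the function names `sketch_im2col_1d`, `sketch_mul_mat`, and `sketch_conv_1d` are hypothetical, and the sketch assumes a single channel, stride 1, no padding, and a kernel that is not flipped (cross-correlation).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of ggml.
 * Unrolls a 1D signal of n samples into a row-major (n - k + 1) x k matrix,
 * where row i holds the k input samples covered by the kernel at offset i. */
void sketch_im2col_1d(const float *src, size_t n, size_t k, float *cols) {
    assert(k >= 1 && k <= n);
    const size_t n_out = n - k + 1;
    for (size_t i = 0; i < n_out; i++) {
        for (size_t j = 0; j < k; j++) {
            cols[i * k + j] = src[i + j];
        }
    }
}

/* Hypothetical helper, not part of ggml.
 * Plain row-major matrix product: c[m x p] = a[m x q] * b[q x p]. */
void sketch_mul_mat(const float *a, const float *b, float *c,
                    size_t m, size_t q, size_t p) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < p; j++) {
            float sum = 0.0f;
            for (size_t t = 0; t < q; t++) {
                sum += a[i * q + t] * b[t * p + j];
            }
            c[i * p + j] = sum;
        }
    }
}

/* 1D convolution as im2col + matrix multiply.
 * cols must hold (n - k + 1) * k floats, out must hold n - k + 1 floats. */
void sketch_conv_1d(const float *src, size_t n,
                    const float *kernel, size_t k,
                    float *cols, float *out) {
    sketch_im2col_1d(src, n, k, cols);
    /* The kernel is treated as a k x 1 column vector. */
    sketch_mul_mat(cols, kernel, out, n - k + 1, k, 1);
}
```

Once the input is unrolled this way, the convolution itself is an ordinary matrix product, so any optimized matrix multiplication path can do the heavy lifting. The cost is the extra memory for the unrolled matrix.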
🤖 Generated by Copilot at 9f72de7
Summary
🚀🐛🧹
This pull request adds and improves the im2col feature for half-precision tensors in the CUDA and Metal APIs, fixes various bugs and warnings in the code, and updates the public interface and the internal macros of the llama.cpp library. The im2col feature allows faster convolution operations on the GPU. The bug fixes and updates aim to enhance the correctness, efficiency, and compatibility of the library. The files affected by this pull request are `ggml-cuda.cu`, `ggml-metal.m`, `ggml-metal.metal`, `ggml-quants.c`, `ggml.h`, `ggml-impl.h`, and `ggml-metal.h`.
Walkthrough
- […] `ggml_vec_dot_q2_K_q8_K`, `ggml_vec_dot_q3_K_q8_K`, `ggml_vec_dot_q4_K_q8_K`, `ggml_vec_dot_q5_K_q8_K`, and `ggml_vec_dot_q6_K_q8_K` functions by using compatible types and functions for 64-bit architectures
- […] `ggml_cuda_op_mul_mat` function by checking the batch sizes of the input tensors
- […] `ggml_metal_op_positional_encoding` function by using the correct index for the `n_orig_ctx` parameter
- […] `ggml_cuda_op_mul_mat` function by using the correct pointer for the input tensor
- […] `ggml_metal_init` function by casting the enum value to int
- […] `MIN` and `MAX` macros from the `ggml-impl.h` file
- […] `ggml-quants.c` file
- […] `GGML_METAL_MAX_BUFFERS` macro from 16 to 64 in the `ggml-metal.h` file
- […] `ggml_op` enum in the `ggml.h` file
- […] `GGML_ATTRIBUTE_FORMAT` macro to the `ggml_metal_log` function declaration in the `ggml-metal.m` file
- […] `GGML_METAL_LOG_WARN` macro in the `ggml-metal.m` file
- […] `kernel_mul_mv_f16_f32_1row` function in the `ggml-metal.metal` file
- […] `ggml-quants.c` file
- […] `#if` directive that checks if the `immintrin.h` header should be included in the `ggml-quants.c` file
- […] `GGML_METAL_PATH_RESOURCES` environment variable in the `ggml_metal_init` function in the `ggml-metal.m` file
- […] `ggml_metal_op_softmax` function in the `ggml-metal.m` file
- […] `ggml-metal.m` file
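
One walkthrough item above concerns attaching a format attribute macro to a logging function declaration. As a rough sketch of what such a macro typically does, the example below defines a printf-style format attribute for GCC-compatible compilers and applies it to a variadic function, so the compiler can check the format string against the arguments. The names `SKETCH_ATTRIBUTE_FORMAT` and `sketch_log_format` are hypothetical and are not the definitions used in ggml.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical macro for illustration only.
 * On GCC-compatible compilers it expands to the printf format attribute;
 * elsewhere it expands to nothing. */
#if defined(__GNUC__)
#define SKETCH_ATTRIBUTE_FORMAT(fmt_idx, first_arg_idx) \
    __attribute__((format(printf, fmt_idx, first_arg_idx)))
#else
#define SKETCH_ATTRIBUTE_FORMAT(fmt_idx, first_arg_idx)
#endif

/* Argument 3 is the format string; the variadic arguments start at 4. */
SKETCH_ATTRIBUTE_FORMAT(3, 4)
int sketch_log_format(char *buf, size_t size, const char *fmt, ...);

/* Formats a message into buf and returns the length vsnprintf reports. */
int sketch_log_format(char *buf, size_t size, const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    int n = vsnprintf(buf, size, fmt, args);
    va_end(args);
    return n;
}
```

With the attribute in place, a call such as `sketch_log_format(buf, sizeof buf, "%d", "text")` would draw a format warning from GCC or Clang when `-Wformat` is enabled, instead of failing silently at run time.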