Why do vision models need specific surgery files? #11139
-
I could be mistaken here, but I was under the impression that in the future it would be possible to convert a model which contains both a vision encoder and a language model into a single .gguf, and that once the new vision API is in place a single .gguf file could contain both. This is what I did for Llama 3.2 Vision in this branch. So perhaps, if the above is true, it might make sense to hold off on moving this to …
-
@bartowski1182 Sorry about misleading you on this. It turned out my assumptions were incorrect, so I think your idea of adding a …
-
To answer your original question @bartowski1182: the reason we need these "surgery" scripts is that, historically, the llava example was developed outside of the main llama.cpp and …
Of course, we can modify …
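For context, a surgery script mostly just partitions the checkpoint's state dict before conversion. Here is a minimal sketch of that idea, assuming a llava-style safetensors checkpoint where the projector tensors share an "mm_projector" prefix; the prefix and output file names are illustrative, not the exact upstream conventions:

```python
# Rough sketch of what a llava-style "surgery" step boils down to.
# The "mm_projector" prefix and output file names are assumptions.
import torch
from safetensors.torch import load_file

state = load_file("model.safetensors")

projector = {k: v for k, v in state.items() if "mm_projector" in k}
language  = {k: v for k, v in state.items() if "mm_projector" not in k}

torch.save(projector, "llava.projector")    # later converted to an mmproj gguf
torch.save(language, "model_text_only.pt")  # converted by convert_hf_to_gguf.py
```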
-
I would like to share my current understanding here. I agree that the model-conversion scripts can be merged. However, keeping two ggufs, one for the LLM and one for the vision part, makes it easier to quantize each at a different precision. This point is separate from the code-structure design. llama.cpp already supports the LLM part well across different precisions, but the multimodal model uses a separate network to extract features and produce the image embeddings, and that process is more sensitive to perturbations. I have observed that when an image is compressed into a smaller number of tokens, quantization has a much more visible impact on the vision part than on the LLM. This means that if the vision part and the LLM part are placed in one gguf: …
Of course, this is just an observation that is easy to overlook. If there is an elegant way to merge them, I also support using one gguf file.
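To make the trade-off concrete, here is a hedged sketch of the two-file split using the gguf-py package that ships with llama.cpp; the `v.` tensor prefix and the file names are assumptions for illustration. The vision file stays at F16, while the LLM file can later be quantized aggressively (e.g. with the llama-quantize tool):

```python
# Hedged sketch of the "two ggufs" argument: keep the vision tower in its
# own file at higher precision, quantize only the LLM file. The "v."
# prefix and file names are illustrative, not any converter's convention.
import numpy as np
from gguf import GGUFWriter

def write_split(tensors: dict[str, np.ndarray]) -> None:
    llm = GGUFWriter("llm.gguf", arch="llama")
    vis = GGUFWriter("mmproj.gguf", arch="clip")
    for name, data in tensors.items():
        if name.startswith("v."):                          # vision tower / projector
            vis.add_tensor(name, data.astype(np.float16))  # keep at F16
        else:
            llm.add_tensor(name, data)  # quantize this file later, e.g. to Q4_K_M
    for w in (llm, vis):
        w.write_header_to_file()
        w.write_kv_data_to_file()
        w.write_tensors_to_file()
        w.close()
```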
-
For example, take the recent Qwen2VL implementation. Everything within qwen2_vl_surgery.py is just Python code, so is there a reason it couldn't be added to convert_hf_to_gguf.py? We are already detecting that the architecture is qwen2vl, so it seems simple enough to move all the surgery code into that block and have both done in one pass, especially if we add an optional `--vision-adapter` param that's ignored for non-vision models and, for vision models, specifies that the vision adapter should also be produced. I ask because I was considering making the change, but if there's a specific reason it hasn't been done that way I won't bother; maybe I'm missing something.
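Roughly, that proposal might look something like this inside convert_hf_to_gguf.py. This is only a sketch: the `--vision-adapter` flag and the `vision_writer` helper are hypothetical (this thread's suggestion, not an existing option), and the hook names assume the converter's `Model` registry and `modify_tensors()` method:

```python
# Sketch only: --vision-adapter and vision_writer are hypothetical
# (this thread's proposal), layered onto convert_hf_to_gguf.py's
# Model registry and modify_tensors() hook.
parser.add_argument(
    "--vision-adapter", action="store_true",
    help="also emit the vision adapter gguf (ignored for text-only models)",
)

@Model.register("Qwen2VLForConditionalGeneration")
class Qwen2VLModel(Model):
    model_arch = gguf.MODEL_ARCH.QWEN2VL  # assuming such an arch enum exists

    def modify_tensors(self, data_torch, name, bid):
        if name.startswith("visual."):       # Qwen2-VL vision tower prefix
            if self.emit_vision_adapter:     # set from --vision-adapter
                # what qwen2_vl_surgery.py does today as a separate step
                self.vision_writer.add_tensor(name, data_torch.numpy())
            return []                        # keep vision out of the LLM gguf
        return [(self.map_tensor_name(name), data_torch)]
```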