Why do vision models need specific surgery files? #11139
-
I could be mistaken here, but I was under the impression that in the future it would be possible to convert a model which contains both a vision encoder and a language model into a single .gguf, and that once the new vision API is in place a single .gguf file could contain both. This is what I did for Llama 3.2 Vision in this branch. So perhaps, if the above is true, it might make sense to hold off on moving this to …
-
@bartowski1182 Sorry about misleading you on this. It turned out my assumptions were incorrect, so I think your idea of adding a …
-
To answer your original question @bartowski1182: the reason we need these "surgery" scripts is that, historically, the llava example was developed outside of the main llama.cpp and …
Of course, we can modify …
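For context, a surgery script mostly just partitions the checkpoint's state dict before conversion. Here is a minimal sketch of that idea, assuming a llava-style safetensors checkpoint where the projector tensors share an "mm_projector" prefix; the prefix and output file names are illustrative, not the exact upstream conventions:

```python
# Rough sketch of what a llava-style "surgery" step boils down to.
# The "mm_projector" prefix and output file names are assumptions.
import torch
from safetensors.torch import load_file

state = load_file("model.safetensors")

projector = {k: v for k, v in state.items() if "mm_projector" in k}
language  = {k: v for k, v in state.items() if "mm_projector" not in k}

torch.save(projector, "llava.projector")    # later converted to an mmproj gguf
torch.save(language, "model_text_only.pt")  # converted by convert_hf_to_gguf.py
```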
-
I would like to share my current understanding here. I agree that the model-conversion scripts can be merged. However, keeping two ggufs, one for the LLM and one for the vision part, makes it easier to quantize each at a different precision. This point is separate from the code-structure design. llama.cpp already supports the LLM part well across different precisions, but the multimodal model uses a separate network to extract features and produce the image embeddings, and that process is more sensitive to perturbations. I have observed that when an image is compressed into a smaller number of tokens, quantization has a much more visible impact on the vision part than on the LLM. This means that if the vision part and the LLM part are placed in one gguf: …
Of course, this is just an observation that is easy to overlook. If there is an elegant way to merge them, I also support using one gguf file.
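To make the trade-off concrete, here is a hedged sketch of the two-file split using the gguf-py package that ships with llama.cpp; the `v.` tensor prefix and the file names are assumptions for illustration. The vision file stays at F16, while the LLM file can later be quantized aggressively (e.g. with the llama-quantize tool):

```python
# Hedged sketch of the "two ggufs" argument: keep the vision tower in its
# own file at higher precision, quantize only the LLM file. The "v."
# prefix and file names are illustrative, not any converter's convention.
import numpy as np
from gguf import GGUFWriter

def write_split(tensors: dict[str, np.ndarray]) -> None:
    llm = GGUFWriter("llm.gguf", arch="llama")
    vis = GGUFWriter("mmproj.gguf", arch="clip")
    for name, data in tensors.items():
        if name.startswith("v."):                          # vision tower / projector
            vis.add_tensor(name, data.astype(np.float16))  # keep at F16
        else:
            llm.add_tensor(name, data)  # quantize this file later, e.g. to Q4_K_M
    for w in (llm, vis):
        w.write_header_to_file()
        w.write_kv_data_to_file()
        w.write_tensors_to_file()
        w.close()
```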
-
For example, take the recent Qwen2VL implementation. Everything within qwen2_vl_surgery.py is just Python code, so is there a reason it couldn't be added to convert_hf_to_gguf.py? We are already detecting that the architecture is qwen2vl, so it seems simple enough to move all the surgery code into that block and have both done in one pass, especially if we add an optional `--vision-adapter` param that's ignored for non-vision models and, for vision models, specifies that the vision adapter should also be produced. I ask because I was considering making the change, but if there's a specific reason it hasn't been done that way I won't bother; maybe I'm missing something.
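Roughly, that proposal might look something like this inside convert_hf_to_gguf.py. This is only a sketch: the `--vision-adapter` flag and the `vision_writer` helper are hypothetical (this thread's suggestion, not an existing option), and the hook names assume the converter's `Model` registry and `modify_tensors()` method:

```python
# Sketch only: --vision-adapter and vision_writer are hypothetical
# (this thread's proposal), layered onto convert_hf_to_gguf.py's
# Model registry and modify_tensors() hook.
parser.add_argument(
    "--vision-adapter", action="store_true",
    help="also emit the vision adapter gguf (ignored for text-only models)",
)

@Model.register("Qwen2VLForConditionalGeneration")
class Qwen2VLModel(Model):
    model_arch = gguf.MODEL_ARCH.QWEN2VL  # assuming such an arch enum exists

    def modify_tensors(self, data_torch, name, bid):
        if name.startswith("visual."):       # Qwen2-VL vision tower prefix
            if self.emit_vision_adapter:     # set from --vision-adapter
                # what qwen2_vl_surgery.py does today as a separate step
                self.vision_writer.add_tensor(name, data_torch.numpy())
            return []                        # keep vision out of the LLM gguf
        return [(self.map_tensor_name(name), data_torch)]
```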