Implement multimodal models (LLaVA) #3436
Some time ago I was playing with the idea of allowing images to be uploaded via Would it be helpful for testing if I make a PR with this change? The idea was to import images client side, in the browser, draw them on a hidden canvas and export them as PPM. This would allow such an image to be processed server side without relying on any external libraries/dependencies. I could add image upload to the Let me know if you are interested. |
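For context on why PPM needs no decoding library: a binary PPM (P6) file is a short ASCII header followed by raw 8-bit RGB bytes. A minimal sketch of the encoding step, written in Python rather than browser JavaScript; the function name `encode_ppm` is hypothetical and not part of this PR:

```python
def encode_ppm(width: int, height: int, rgb: bytes) -> bytes:
    """Encode raw 8-bit RGB pixels (row-major, 3 bytes per pixel) as binary PPM (P6).

    Hypothetical helper for illustration; not taken from llama.cpp.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if len(rgb) != width * height * 3:
        raise ValueError("expected width * height * 3 bytes of RGB data")
    # Header: magic number, dimensions, maximum channel value, then one newline.
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + rgb


# Usage: a 2x1 image with one red and one green pixel.
ppm = encode_ppm(2, 1, bytes([255, 0, 0, 0, 255, 0]))
```

A canvas hands back RGBA pixel data, so a browser-side version would also need to drop the alpha channel before writing the bytes.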
Thanks @staviq! We can work with images thanks to a single-header C library included in this branch (stb-image.h), but integration with the UI would be great once this PR matures. The CLIP inference code, copied from another repo of mine, seems to require some refactoring because the two repos use different versions of GGML. I'm currently trying to debug and fix it; once that's done, I can move faster and we can collaborate on integration with the UI. |
I completely missed stb is licensed under MIT, that's cool. No format shenanigans necessary then. Ok, take your time then, I'll wait until you feel comfortable for UI integration. |
Sorry for the delay here. There was an issue with evaluating embedding input that I needed to debug, and it was too painful to do so on my physical machine, which is slow at generation. I obtained a faster VM in the cloud and hope to move faster this weekend. |
This is now working with recently published LLaVA V1.5. The CLIP part consumes a huge amount of memory --I'll optimize it with |
@josephilome this shouldn't be that hard. I can implement it once the current implementation is optimized. |
There are still some tasks to do, but I think this is ready for testing / feedback / reviews. A pre-converted model can be found here. You need to download one of the ggml-model[f16|q5_k|q4_k].gguf models and the mmproj-model-f16.gguf (the image encoder). This two-file format lets us move faster right now, but we can think about a single-file format in the future. Also see the readme. I'll add more documentation, do code cleanup and address reviews this afternoon. Any feedback is welcome. |
@monatis Awesome stuff! I haven't had a detailed look or run tests yet, but looking at the progress, it's quite amazing to have something that can understand images. Looking forward to giving this a try! Just curious, how much of the total compute is done by CLIP? I.e. is it a bottleneck? |
Sorry for the late reply. As mentioned in this link: https://replicate.com/blog/how-to-prompt-llama |
Any plan to update the GGUF for LLaVA 1.6 ? |
oh they released them https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2 a few days ago i only saw the 1.6 preview in their hf space, but no mention of it anywhere else on the internet :) edit: blog post https://llava-vl.github.io/blog/2024-01-30-llava-1-6/ |
Even if you convert the safetensor file into torch .bin file you will get this error when trying to convert to GGUF
|
yup.. can confirm following #2948 doesn't yield valid llava-v1.6-mistral-7b-GGUF... any suggestions?
|
And that's the first one that fails (pretty much the first or second layer lmao) |
Looping in @haotian-liu and @cmp-nct in case they could help with Llava V1.6. |
I've got a hacked-up script that works for 1.6; I will share it shortly on a fork. Raw script (breaks LLaVA 1.5 support): llava1.6-surgery-hack.py
note: the location of the mmproj is different between 34b and 7b, probably best to do a search for all of the mmproj tensors, split them all out, save them, and resave each checkpoint without them |
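The approach in the note above (search every checkpoint for mmproj tensors, split them out, and resave each checkpoint without them) could be sketched roughly as follows. This is not the linked hack script: the function name, the `"mm_projector"` substring used as the marker, and the plain-dict stand-ins for loaded checkpoints are all assumptions made for illustration.

```python
from typing import Any, Dict, Tuple

Tensors = Dict[str, Any]  # tensor name -> tensor (any object in this sketch)


def split_projector_tensors(
    shards: Dict[str, Tensors], marker: str = "mm_projector"
) -> Tuple[Tensors, Dict[str, Tensors]]:
    """Collect projector tensors from every shard and return shards without them.

    `marker` is an assumed substring of projector tensor names; adjust it to
    whatever the actual checkpoint uses.
    """
    projector: Tensors = {}
    cleaned: Dict[str, Tensors] = {}
    for shard_name, tensors in shards.items():
        kept: Tensors = {}
        for name, value in tensors.items():
            if marker in name:
                projector[name] = value  # goes into the separate mmproj file
            else:
                kept[name] = value  # stays in the language-model checkpoint
        cleaned[shard_name] = kept
    return projector, cleaned


# Usage with placeholder values instead of real tensors.
shards = {
    "shard-1": {"model.layers.0.weight": 1, "model.mm_projector.0.weight": 2},
    "shard-2": {"model.mm_projector.2.bias": 3, "lm_head.weight": 4},
}
projector, cleaned = split_projector_tensors(shards)
```

Scanning every shard instead of a fixed one sidesteps the difference in projector location between the 34b and 7b checkpoints. Loading and saving the real files is left out here.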
I'm also halfway there but occupied with real-world stuff. I've created a draft pull request to use as a base for 1.6: #5267. Right now I am struggling with the new ViT. Even without the correct ViT I could already test llava-1.6, and despite not including the proper image manipulation and resolution, it is already very good. |
awesome! thanks @cjpais .. throwing into LMStudio for testing now |
Did it work in LM Studio? |
@BBC-Esq Yes! cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf working successfully in LMStudio. |
You guys move fast. I'm considering moving my stuff from ctranslate2 to llama.cpp, any good issues/discussions to see if you move that fast with whisper.cpp? |
bruh moment |
I'm using LLaVA. How do I modify the batch size to avoid this error?
tokenizer, model, image_processor, context_len = load_pretrained_model( |
You're almost certainly looking for https://github.com/haotian-liu/LLaVA. This is the llama.cpp repo. |
closes #3332
This is still WIP and highly experimental.
The work started in lmm.cpp, but it turned out that it can also be implemented in this repo, which I believe will be much simpler.
The plan is to perform surgery on LLaVA models and export:
llava executable.
Usage:
This will output the detailed description of the image.
Note: You can override the default textual prompt "Describe the image in detail." by adding -p "custom prompt comes here". Run ./bin/llava for other options.
Note: A lower temperature value like 0.1 is recommended. Add --temp 0.1 to your command to do so.
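If you drive the executable from a script, the two optional flags from the notes above can be appended programmatically. A minimal sketch, assuming a hypothetical helper `build_llava_command`; only `-p` and `--temp` come from this description, and the model and image options are passed through unchanged as whatever your usual command line contains:

```python
from typing import List, Optional, Sequence


def build_llava_command(
    binary: str,
    base_args: Sequence[str],
    prompt: Optional[str] = None,
    temp: Optional[float] = 0.1,
) -> List[str]:
    """Build an argument list for the llava executable (hypothetical helper).

    base_args holds the model/image options exactly as you would type them;
    this function only appends the optional -p and --temp flags.
    """
    cmd = [binary, *base_args]
    if prompt is not None:
        cmd += ["-p", prompt]  # overrides the default "Describe the image in detail."
    if temp is not None:
        cmd += ["--temp", str(temp)]  # a low temperature such as 0.1 is recommended
    return cmd


# Usage: "<model and image options>" is a placeholder, not a real flag.
cmd = build_llava_command(
    "./bin/llava", ["<model and image options>"], prompt="What is in this picture?"
)
```

The resulting list can be handed to `subprocess.run(cmd)` once the placeholder is replaced with real options.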