Replies: 8 comments 6 replies
-
For comparison:
-
Is there a GGUF for CogVLM?
-
Getting the 34B model running was a bit more work; I consider my time wasted. I went through the implementation quite a bit: the parameters look good and the embeddings appear to be OK.
I ran some more tests with Yi-VL-34B on easier images, and there it is performing better.
-
Too bad! :( Thanks for your effort! Then CogVLM is the best vision-language model that people can run locally so far?
-
Have you tried InternVL? It's also based on LLaVA, but doesn't work with llama.cpp. It sounds like they're working on a higher-resolution model.
-
Could using temperature 0 be an issue? I've found that base Yi gets nutty at temperature 0 (with longer output) and with default llama.cpp sampling parameters, but it really tightens up with a custom high-MinP config. I don't use it for text extraction, though.
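For reference, a rough sketch of what such a MinP-leaning run could look like using llama.cpp's common sampling flags; the values and paths here are illustrative, not the commenter's actual config, and flag support may vary by build:

```sh
# Illustrative only: moderate temperature plus a high --min-p to cut the
# low-probability tail, with --top-p and --top-k effectively disabled.
./llava-cli -m ggml-model-f16.gguf --mmproj mmproj-model-f16.gguf \
    --image license_demo.jpg -p "<prompt>" \
    --temp 0.7 --min-p 0.3 --top-p 1.0 --top-k 0 -n 500 -c 2048
```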
-
I tried to run your GGUF files with Docker, but it isn't working. Can you help me, please? Error log: /app/.devops/tools.sh: line 45: 6 Aborted (core dumped) ./server "$@"
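For what it's worth, that log shows the full image's tools.sh wrapper launching ./server. A hedged sketch of an invocation along those lines (image tag, ports, paths, and flags are illustrative; a core dump often just means the image's llama.cpp build is older than the GGUF format or the model path isn't actually mounted inside the container):

```sh
# Illustrative only: mount the model directory into the container and let
# tools.sh dispatch to the server; adjust tag and paths to your setup.
docker run -p 8080:8080 -v /path/to/models:/models \
    ghcr.io/ggerganov/llama.cpp:full \
    --server -m /models/Yi-VL-6B/ggml-model-f16.gguf \
    --mmproj /models/Yi-VL-6B/vit/mmproj-model-f16.gguf \
    --host 0.0.0.0 --port 8080 -c 2048
```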
-
So, how do you convert Yi-VL to a GGUF model? Just `python convert-hf-to-gguf.py /path/to/model`?
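For reference, a plain HF-to-GGUF conversion only covers the language-model part; the vision tower and projector go through the llava tooling. A hedged sketch of the usual recipe (script names follow llama.cpp's examples/llava README at the time; paths are illustrative, and the Yi-VL specifics landed with #5093):

```sh
# Illustrative only -- the usual llava-style conversion, with paths adapted for Yi-VL.
# 1) split the multimodal projector out of the HF checkpoint
python ./examples/llava/llava-surgery.py -m /path/to/Yi-VL-6B
# 2) convert the vision tower + projector into the mmproj GGUF
python ./examples/llava/convert-image-encoder-to-gguf.py \
    -m /path/to/Yi-VL-6B/vit \
    --llava-projector /path/to/Yi-VL-6B/llava.projector \
    --output-dir /path/to/Yi-VL-6B/vit
# 3) convert the language-model part as a regular llama-architecture model
python ./convert.py /path/to/Yi-VL-6B --outtype f16
```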
-
Well, their benchmarks claim they are almost at GPT-4V level, beating everything else by a mile.
They also claim that CogVLM is one of the worst, when it's actually the best next to GPT-4V, by far.
On the other hand, there are a few improvements in Yi-VL.
I've tested Yi-VL-6B and Yi-VL-34B.
PR Update: #5093
Update: GGUF models for both are at https://huggingface.co/cmp-nct
When used on "normal" photos, Yi-VL-34B produces quite good results, but I've had it break out of the finetune and ask questions as "Human".
I can't rule out that implementation issues remain; in the PR thread I've posted another sample response with two cats.
Overall, Yi-VL responds well to strong quantization: even at a ~3 bpw LLM quant I noticed no real degradation in quality, and running the visual tower quantized did not reduce quality either.
That's similar to other llava models.
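For context, a hedged sketch of how such a low-bit quant would typically be produced with llama.cpp's quantize tool (paths and the exact quant type are illustrative; ~3 bpw corresponds roughly to the Q3_K family):

```sh
# Illustrative only: quantize the language-model GGUF to a ~3-bit K-quant.
./quantize /models/Yi-VL-34B/ggml-model-f16.gguf \
    /models/Yi-VL-34B/ggml-model-q3_k_s.gguf Q3_K_S
# The mmproj / visual tower file is produced separately by the llava conversion
# scripts; quantizing it is a separate step not shown here.
```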
The famous driver's license OCR test follows:
![image](https://private-user-images.githubusercontent.com/78893154/298803907-1d19f74f-2589-46bd-aff8-d7fa0a0baa42.png)
PS Q:\llama.cpp\build> .\bin\Debug\llava-cli.exe -m Q:\models\llava\Yi-VL-6B\ggml-model-f16.gguf --mmproj Q:\models\llava\Yi-VL-6B\vit\mmproj-model-f16.gguf --image C:\temp\license_demo.jpg -p "This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。 \n\n### Human: <image>\nProvide a complete representation of what is in this image. Respond in JSON-pretty-print syntax for database insert.\n### Assistant:" -ngl 50 --temp 0 -n 500 -c 2048 -e
So that's certainly not CogVLM or GPT-4V level.
It's significantly dumber than ShareGPT4V-7B, but at the same time it extracted a LOT out of the image.
Still plenty of errors; CogVLM aces this test with two tiny errors, and GPT-4V has one tiny error.
I ran other tests on images that work quite well, though not flawlessly, with ShareGPT4V-7B and 13B.
Yi-VL-6B showed remarkably good detail detection, better than any other llava model, but alongside that it hallucinated heavily, more than I've seen anywhere else.
I'll follow up with a 34B test, likely tomorrow, since I have to download and quantize it first.
I expect a lot more hallucination and more intelligence at the same time; we'll see.