Replies: 8 comments 6 replies
-
For comparison:
-
Is there a GGUF for CogVLM?
-
Getting the 34B model running was a bit more work; I consider my time wasted. I went through the implementation quite a bit: the parameters look good and the embeddings appear to be OK.
I ran some more tests with Yi-VL-34B on easier images, and there it is performing better.
-
Too bad! :( Thanks for your effort! Then CogVLM is the best vision-language model that people can run locally so far?
-
Have you tried InternVL? It's also based on LLaVA, but doesn't work with llama.cpp. It sounds like they're working on a higher-resolution model.
-
Could using temperature 0 be an issue? I've found that base Yi gets nutty at temperature 0 (with longer output) and with default llama.cpp sampling parameters, but it really tightens up with a custom high-MinP config. I don't use it for text extraction, though.
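For reference, a rough sketch of what such a MinP-leaning run could look like using llama.cpp's common sampling flags; the values and paths here are illustrative, not the commenter's actual config, and flag support may vary by build:

```sh
# Illustrative only: moderate temperature plus a high --min-p to cut the
# low-probability tail, with --top-p and --top-k effectively disabled.
./llava-cli -m ggml-model-f16.gguf --mmproj mmproj-model-f16.gguf \
    --image license_demo.jpg -p "<prompt>" \
    --temp 0.7 --min-p 0.3 --top-p 1.0 --top-k 0 -n 500 -c 2048
```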
-
I tried to run your GGUF files with Docker, but it isn't working. Can you help me, please? Error log: /app/.devops/tools.sh: line 45: 6 Aborted (core dumped) ./server "$@"
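For what it's worth, that log shows the full image's tools.sh wrapper launching ./server. A hedged sketch of an invocation along those lines (image tag, ports, paths, and flags are illustrative; a core dump often just means the image's llama.cpp build is older than the GGUF format or the model path isn't actually mounted inside the container):

```sh
# Illustrative only: mount the model directory into the container and let
# tools.sh dispatch to the server; adjust tag and paths to your setup.
docker run -p 8080:8080 -v /path/to/models:/models \
    ghcr.io/ggerganov/llama.cpp:full \
    --server -m /models/Yi-VL-6B/ggml-model-f16.gguf \
    --mmproj /models/Yi-VL-6B/vit/mmproj-model-f16.gguf \
    --host 0.0.0.0 --port 8080 -c 2048
```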
-
So, how do you convert Yi-VL to a GGUF model? Just `python convert-hf-to-gguf.py /path/to/model`?
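For reference, a plain HF-to-GGUF conversion only covers the language-model part; the vision tower and projector go through the llava tooling. A hedged sketch of the usual recipe (script names follow llama.cpp's examples/llava README at the time; paths are illustrative, and the Yi-VL specifics landed with #5093):

```sh
# Illustrative only -- the usual llava-style conversion, with paths adapted for Yi-VL.
# 1) split the multimodal projector out of the HF checkpoint
python ./examples/llava/llava-surgery.py -m /path/to/Yi-VL-6B
# 2) convert the vision tower + projector into the mmproj GGUF
python ./examples/llava/convert-image-encoder-to-gguf.py \
    -m /path/to/Yi-VL-6B/vit \
    --llava-projector /path/to/Yi-VL-6B/llava.projector \
    --output-dir /path/to/Yi-VL-6B/vit
# 3) convert the language-model part as a regular llama-architecture model
python ./convert.py /path/to/Yi-VL-6B --outtype f16
```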
-
Well, their benchmarks claim they are almost at GPT-4V level, beating everything else by a mile.
They also claim that CogVLM is one of the worst, when it's actually the best next to GPT-4V, by far.
On the other hand, there are a few improvements in Yi-VL.
I've tested Yi-VL-6B and Yi-VL-34B.
PR Update: #5093
Update: GGUF models for both are at https://huggingface.co/cmp-nct
When used on "normal" photos, Yi-VL-34B produces quite good results, but I've had it break out of the finetune and ask questions as "Human".
I can't rule out that implementation issues remain; in the PR thread I've posted another sample response with two cats.
Overall, Yi-VL responds well to strong quantization: even at a ~3 bpw LLM quant I noticed no real degradation in quality, and running the visual tower quantized did not reduce quality either.
That's similar to other llava models.
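For context, a hedged sketch of how such a low-bit quant would typically be produced with llama.cpp's quantize tool (paths and the exact quant type are illustrative; ~3 bpw corresponds roughly to the Q3_K family):

```sh
# Illustrative only: quantize the language-model GGUF to a ~3-bit K-quant.
./quantize /models/Yi-VL-34B/ggml-model-f16.gguf \
    /models/Yi-VL-34B/ggml-model-q3_k_s.gguf Q3_K_S
# The mmproj / visual tower file is produced separately by the llava conversion
# scripts; quantizing it is a separate step not shown here.
```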
The famous driver's license OCR test follows:
![image](https://private-user-images.githubusercontent.com/78893154/298803907-1d19f74f-2589-46bd-aff8-d7fa0a0baa42.png)
PS Q:\llama.cpp\build> .\bin\Debug\llava-cli.exe -m Q:\models\llava\Yi-VL-6B\ggml-model-f16.gguf --mmproj Q:\models\llava\Yi-VL-6B\vit\mmproj-model-f16.gguf --image C:\temp\license_demo.jpg -p "This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。 \n\n### Human: <image>\nProvide a complete representation of what is in this image. Respond in JSON-pretty-print syntax for database insert.\n### Assistant:" -ngl 50 --temp 0 -n 500 -c 2048 -e
So that's certainly not CogVLM or GPT-4V level.
It's significantly dumber than ShareGPT4V-7B, but at the same time it extracted a LOT out of the image.
Still plenty of errors; CogVLM aces this test with two tiny errors, and GPT-4V has one tiny error.
I ran other tests on images that work quite well, though not flawlessly, with ShareGPT4V-7B and 13B.
Yi-VL-6B showed remarkably good detail detection, better than any other llava model, but alongside that it hallucinated heavily, more than I've seen anywhere else.
I'll follow up with a 34B test, likely tomorrow, since I have to download and quantize it first.
I expect a lot more hallucination and more intelligence at the same time; we'll see.