- New multimodal LLMs are coming out 🎉. Currently investigating Phi-3-Vision and MiniCPM-Llama3-V-2_5. If you have any thoughts, let me know.
- While the basic structure for multimodal integration is already in the code, I cannot find a suitable model to run it with. Most models' vision projectors operate at too low a resolution (~400 px), or the underlying LLM is too weak to be usable. The only open-source multimodal LLM that combines high enough resolution, strong enough reasoning, and good enough OCR seems to be OpenGVLab/InternVL, but that model is far too large to run on anything I have access to.
If new models come out that meet the above requirements, please let me know about them in this issue. Thanks!