- New multimodal LLMs are coming out 🎉. Currently investigating Phi-3-Vision and MiniCPM-Llama3-V-2_5. If you have any thoughts, let me know.
- While the basic structure for multimodal integration is already in the code, I cannot find a suitable model to run it with. Most models' vision projectors operate at too low a resolution (~400 px), or the underlying LLM is too weak to be usable. The only open-source multimodal LLM that combines high enough resolution, strong enough reasoning, and good enough OCR seems to be OpenGVLab/InternVL, but that model is far too large to run on anything I have access to.
If new models come out that meet the above requirements, please let me know about them in this issue. Thanks!