I can only find the vision encoder but cannot locate the text encoder. Does timm have an implementation of OpenCLIP's text transformer?

-
So the goal for timm wrt image-text models is to have a unified modelling interface whether a model was pretrained with supervision or with CLIP; that's why I remap the weights and support the image tower, since they're great for downstream image tasks. If you're after the text model, I don't really have anything text-related right now. While CLIP is simple in terms of the modelling, it's still a whole other class of models to take on, so right now OpenCLIP or transformers are the best PyTorch options. I help maintain OpenCLIP, and that's where many of the weights since the original OpenAI release have come from.
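For reference, a minimal sketch of the split described above: the CLIP image tower via timm (weights remapped into timm's standard ViT interface), and the full image + text model via OpenCLIP when you need the text encoder. The specific model names and pretrained tags below are illustrative and assume current timm and open_clip releases.

```python
import torch
import timm
import open_clip

# Image tower only, through timm: CLIP ViT-B/16 weights remapped into
# timm's standard ViT interface, useful for downstream image tasks.
image_model = timm.create_model(
    'vit_base_patch16_clip_224.openai',  # pretrained tag selects the CLIP weights
    pretrained=True,
    num_classes=0,  # drop the classifier head, return pooled features
)
image_model.eval()
img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    img_features = image_model(img)  # (1, 768) pooled image embedding

# Full image + text model, through OpenCLIP, for anything needing the text encoder.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='openai'
)
tokenizer = open_clip.get_tokenizer('ViT-B-16')
tokens = tokenizer(['a photo of a cat', 'a photo of a dog'])
with torch.no_grad():
    text_features = clip_model.encode_text(tokens)  # (2, 512) text embeddings
```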
-
Also, I'm working on another open source project right now that's document & screen (UI) focused. It's also image-text and will be using timm vision + transformers text, so moving text models here is a lower priority. You're more likely to see me supporting the long-envisioned scope of object detection / segmentation or video models before we see text here :)