I can only find the vision encoder but cannot locate the text encoder. Does timm have an implementation of OpenCLIP's text transformer?

-
So the goal for timm wrt image-text models is to have a unified modelling interface whether a model was pretrained with supervision or with CLIP; that's why I remap the weights and support the image tower, since they're great for downstream image tasks. If you're after the text model, I don't really have anything text-related right now. While CLIP is simple in terms of the modelling, it's still a whole other class of models to take on, so right now OpenCLIP or transformers are the best PyTorch options. I help maintain OpenCLIP, and that's where many of the weights since the original OpenAI release have come from.
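For reference, a minimal sketch of the split described above: the CLIP image tower via timm (weights remapped into timm's standard ViT interface), and the full image + text model via OpenCLIP when you need the text encoder. The specific model names and pretrained tags below are illustrative and assume current timm and open_clip releases.

```python
import torch
import timm
import open_clip

# Image tower only, through timm: CLIP ViT-B/16 weights remapped into
# timm's standard ViT interface, useful for downstream image tasks.
image_model = timm.create_model(
    'vit_base_patch16_clip_224.openai',  # pretrained tag selects the CLIP weights
    pretrained=True,
    num_classes=0,  # drop the classifier head, return pooled features
)
image_model.eval()
img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    img_features = image_model(img)  # (1, 768) pooled image embedding

# Full image + text model, through OpenCLIP, for anything needing the text encoder.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='openai'
)
tokenizer = open_clip.get_tokenizer('ViT-B-16')
tokens = tokenizer(['a photo of a cat', 'a photo of a dog'])
with torch.no_grad():
    text_features = clip_model.encode_text(tokens)  # (2, 512) text embeddings
```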
-
Also, I'm working on another open source project right now that's document & screen (UI) focused. It's also image-text and will be using timm vision + transformers text, so moving text models here is a lower priority. You're more likely to see me supporting the long-envisioned scope of object detection / segmentation or video models before we see text here :)