vision-and-language

Here are 190 public repositories matching this topic...

roboflow / maestro

Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL

transformers vqa objectdetection captioning fine-tuning multimodal vision-and-language phi-3-vision paligemma florence-2 qwen2-vl

Updated Aug 4, 2025
Python

om-ai-lab / OmAgent

Build multimodal language agents for fast prototyping and production

Updated Mar 19, 2025
Python

salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method

representation-learning weakly-supervised-learning image-text vision-and-language contrastive-learning

Updated Sep 20, 2022
Python

open-mmlab / Multimodal-GPT

Multimodal-GPT

transformer llama gpt flamingo multimodal vision-and-language gpt-4

Updated Jun 4, 2023
Python

dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

vision-and-language

Updated Apr 3, 2024
Python

om-ai-lab / OmDet

Real-time and accurate open-vocabulary end-to-end object detection

real-time computer-vision coco object-detection zero-shot vision-and-language lvis zero-shot-object-detection open-vocabulary

Updated Dec 18, 2024
Python

NVlabs / prismer

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

vqa image-captioning language-model multi-task-learning vision-and-language multi-modal-learning vision-language-model

Updated Jan 17, 2024
Python

microsoft / Oscar

Oscar and VinVL

vqa image-captioning oscar vision-and-language pre-training image-text-search vinvl

Updated Aug 28, 2023
Python

OFA-Sys / ONE-PEACE

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

representation-learning multimodal vision-and-language contrastive-loss vision-language vision-transformer foundation-models audio-language

Updated Oct 6, 2024
Python

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

image-captioning video-captioning visual-question-answering vision-and-language cross-modal-retrieval pretraining tden

Updated Feb 27, 2023
Python

mbzuai-oryx / groundingLMM

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

vision-and-language lmm foundation-models vision-language-model llm-agent

Updated Aug 5, 2025
Python

InternRobotics / PointLLM

[ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large Language Models to Understand Point Clouds

chatbot point-cloud llama representation-learning 3d multimodal vision-and-language gpt-4 foundation-models large-language-models objaverse pointllm

Updated May 22, 2025
Python

NVlabs / DoRA

[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation

deep-neural-networks deep-learning lora commonsense-reasoning vision-and-language large-language-models parameter-efficient-tuning instruction-tuning large-vision-language-models parameter-efficient-fine-tuning

Updated Oct 1, 2024
Python

ChenRocks / UNITER

Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

transformers pytorch vision-and-language pre-training

Updated Jun 30, 2021
Python

SkalskiP / top-cvpr-2025-papers

This repository is a curated collection of the most exciting and influential CVPR 2025 papers. 🔥 [Paper + Code + Demo]

computer-vision paper transformers object-detection image-segmentation cvpr multimodal vision-and-language vision-language-model cvpr2025

Updated Jun 16, 2025
Python

SkalskiP / top-cvpr-2024-papers

This repository is a curated collection of the most exciting and influential CVPR 2024 papers. 🔥 [Paper + Code + Demo]

computer-vision paper transformers object-detection image-segmentation cvpr vision-and-language cvpr2024

Updated Jun 2, 2025
Python

jayleicn / ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.

pytorch vqa vision-and-language video-retrieval video-question-answering cvpr2021