ColPali: Efficient Document Retrieval with Vision Language Models 👀

[Model card] [ViDoRe Leaderboard] [Demo] [Blog Post]

Associated Paper

This repository contains the code used for training the vision retrievers in the ColPali: Efficient Document Retrieval with Vision Language Models paper. In particular, it contains the code for training the ColPali model, which is a vision retriever based on the ColBERT architecture and the PaliGemma model.

Introduction

With our new model ColPali, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.
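To make the ColBERT-style scoring concrete, here is a minimal sketch of late-interaction (MaxSim) scoring in plain Python. The helper names (`dot`, `maxsim_score`) and the toy 2-dimensional vectors are illustrative assumptions, not part of the colpali-engine API; real embeddings are much larger and are scored in batches on the GPU.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors: List[Vector], doc_vectors: List[Vector]) -> float:
    """Late-interaction score: for each query vector, keep its best-matching
    document vector, then sum these maxima over the whole query."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query vectors
page_b = [[1.0, 0.0], [0.5, 0.0]]              # matches only the first one

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)  # 2.0 1.0
```

Because each query vector is matched independently, a page scores highly only if it contains a good match for every part of the query.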

ColPali replaces potentially complex and brittle layout recognition and OCR pipelines with a single model that takes into account both the textual and visual content (layout, charts, ...) of a document.

List of ColVision models

Model	Score on ViDoRe 🏆	License	Comments	Currently supported
vidore/colpali	81.3	Gemma	• Based on `google/paligemma-3b-mix-448`. • Checkpoint used in the ColPali paper.	❌
vidore/colpali-v1.1	81.5	Gemma	• Based on `google/paligemma-3b-mix-448`. • Fix right padding for queries.	✅
vidore/colpali-v1.2	83.9	Gemma	• Similar to `vidore/colpali-v1.1`.	✅
vidore/colpali-v1.3	84.8	Gemma	• Similar to `vidore/colpali-v1.2`. • Trained with a larger effective batch size of 256 for 3 epochs.	✅
vidore/colqwen2-v0.1	87.3	Apache 2.0	• Based on `Qwen/Qwen2-VL-2B-Instruct`. • Supports dynamic resolution. • Trained using 768 image patches per page and an effective batch size of 32.	✅
vidore/colqwen2-v1.0	89.3	Apache 2.0	• Similar to `vidore/colqwen2-v0.1`, but trained with more powerful GPUs and with a larger effective batch size (256).	✅
vidore/colSmol-256M	80.1	Apache 2.0	• Based on `HuggingFaceTB/SmolVLM-256M-Instruct`.	✅
vidore/colSmol-500M	82.3	Apache 2.0	• Based on `HuggingFaceTB/SmolVLM-500M-Instruct`.	✅

Setup

We used Python 3.11.6 and PyTorch 2.4 to train and test our models, but the codebase is compatible with Python >=3.9 and recent PyTorch versions. To install the package, run:

pip install colpali-engine

Warning

For ColPali versions above v1.0, make sure to install the colpali-engine package from source or with a version above v0.2.0.
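If you are unsure which version is installed, a small check along the following lines can help. This is a sketch using only the standard library; the helpers `parse_version` and `is_above` are hypothetical names introduced here, and the parsing is deliberately simplistic (it ignores suffixes such as `rc1` or `dev0`).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '0.3.1' into (0, 3, 1); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def is_above(installed: str, minimum: str = "0.2.0") -> bool:
    """True if the installed version is strictly above the minimum."""
    return parse_version(installed) > parse_version(minimum)


try:
    installed = version("colpali-engine")
    print(f"colpali-engine {installed}: recent enough = {is_above(installed)}")
except PackageNotFoundError:
    print("colpali-engine is not installed")
```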

Usage

Quick start

import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2, ColQwen2Processor

model_name = "vidore/colqwen2-v0.1"

model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
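
The snippet below sketches one way to turn such scores into a ranking. It assumes that `scores` is a 2-D tensor with one row per query and one column per image, so that `scores.tolist()` gives a list of rows; the `rank_pages` helper and the example numbers are hypothetical and are not real model output.

```python
from typing import List


def rank_pages(score_rows: List[List[float]], top_k: int = 1) -> List[List[int]]:
    """For each query (row), return the indices of the top_k highest-scoring pages."""
    rankings = []
    for row in score_rows:
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        rankings.append(order[:top_k])
    return rankings


# Hypothetical scores for 2 queries x 3 pages.
example_scores = [
    [12.4, 15.1, 9.8],
    [14.0, 11.2, 13.5],
]
best = rank_pages(example_scores, top_k=2)
print(best)  # [[1, 0], [0, 2]]
```

With the quick start above, this would be called as `rank_pages(scores.tolist())`, under the shape assumption stated earlier.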

Benchmarking

To benchmark ColPali on the ViDoRe leaderboard, use the vidore-benchmark package.

Interpretability with similarity maps

By superimposing the late interaction similarity maps on top of the original image, we can visualize the most salient image patches with respect to each term of the query, yielding interpretable insights into model focus zones.
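
As a rough illustration of the idea, the sketch below computes the similarity map for a single query token: one dot product per image patch, reshaped to the patch grid. The `similarity_map` function, the row-major patch ordering, and the toy values are assumptions made for this example; the library's own implementation works on batched tensors.

```python
from typing import List

Vector = List[float]


def similarity_map(
    token: Vector, patches: List[Vector], n_rows: int, n_cols: int
) -> List[List[float]]:
    """Dot product of one query-token embedding with every patch embedding,
    reshaped to an (n_rows, n_cols) grid (row-major patch order assumed)."""
    if len(patches) != n_rows * n_cols:
        raise ValueError("number of patches must equal n_rows * n_cols")
    flat = [sum(t * p for t, p in zip(token, patch)) for patch in patches]
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# One query token and a 2 x 2 grid of patch embeddings (toy values).
token = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
grid = similarity_map(token, patches, n_rows=2, n_cols=2)
print(grid)  # [[1.0, 0.0], [0.5, 0.0]]
```

Each cell of the grid can then be upsampled and overlaid on the page as a heatmap, one map per query token.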

To use the interpretability module, you need to install the colpali-engine[interpretability] package:

pip install "colpali-engine[interpretability]"

Then, after generating your embeddings with ColPali, use the following code to plot the similarity maps for each query token:

import torch
from PIL import Image

from colpali_engine.interpretability import (
    get_similarity_maps_from_embeddings,
    plot_all_similarity_maps,
)
from colpali_engine.models import ColPali, ColPaliProcessor
from colpali_engine.utils.torch_utils import get_torch_device

model_name = "vidore/colpali-v1.2"
device = get_torch_device("auto")

# Load the model
model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

# Load the processor
processor = ColPaliProcessor.from_pretrained(model_name)

# Load the image and query
image = Image.open("shift_kazakhstan.jpg")
query = "Quelle partie de la production pétrolière du Kazakhstan provient de champs en mer ?"

# Preprocess inputs
batch_images = processor.process_images([image]).to(device)
batch_queries = processor.process_queries([query]).to(device)

# Forward passes
with torch.no_grad():
    image_embeddings = model.forward(**batch_images)
    query_embeddings = model.forward(**batch_queries)

# Get the number of image patches
n_patches = processor.get_n_patches(image_size=image.size, patch_size=model.patch_size)

# Get the tensor mask to filter out the embeddings that are not related to the image
image_mask = processor.get_image_mask(batch_images)

# Generate the similarity maps
batched_similarity_maps = get_similarity_maps_from_embeddings(
    image_embeddings=image_embeddings,
    query_embeddings=query_embeddings,
    n_patches=n_patches,
    image_mask=image_mask,
)

# Get the similarity map for our (only) input image
similarity_maps = batched_similarity_maps[0]  # (query_length, n_patches_x, n_patches_y)

# Tokenize the query
query_tokens = processor.tokenizer.tokenize(query)

# Plot and save the similarity maps for each query token
plots = plot_all_similarity_maps(
    image=image,
    query_tokens=query_tokens,
    similarity_maps=similarity_maps,
)
for idx, (fig, ax) in enumerate(plots):
    fig.savefig(f"similarity_map_{idx}.png")

For a more detailed example, you can refer to the interpretability notebooks from the ColPali Cookbooks 👨🏻‍🍳 repository.

Token pooling

Token pooling is a CRUDE-compliant method (document addition/deletion-friendly) that reduces the sequence length of multi-vector embeddings. For ColPali, many image patches carry redundant information, e.g. white background patches. By pooling these patches together, we can reduce the number of embeddings while retaining most of the page's signal. Retrieval performance with hierarchical mean token pooling on image embeddings is reported in the ColPali paper. In our experiments, a pool factor of 3 offered the best trade-off: the total number of vectors is reduced by 66.7% while 97.8% of the original performance is maintained.
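
The following sketch illustrates the principle with a simplified greedy agglomerative clustering followed by mean pooling. It is a stand-in written for this explanation, not the library's implementation: the function names, the centroid-based cosine merging rule, and the target of `len(vectors) // pool_factor` clusters are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def pool_embeddings(vectors: List[Vector], pool_factor: int) -> List[Vector]:
    """Merge the most similar clusters until roughly len(vectors) / pool_factor
    remain, then return the mean vector of each cluster."""
    target = max(len(vectors) // pool_factor, 1)
    clusters = [[v] for v in vectors]
    while len(clusters) > target:
        centroids = [mean(c) for c in clusters]
        best_pair, best_sim = (0, 1), -2.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroids[i], centroids[j])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair  # j > i, so popping j leaves index i valid
        clusters[i].extend(clusters.pop(j))
    return [mean(c) for c in clusters]


# Six toy vectors forming two redundant groups.
vectors = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
pooled = pool_embeddings(vectors, pool_factor=3)
print(pooled)  # [[1.0, 0.0], [0.0, 1.0]]
```

Here six vectors are reduced to two, i.e. a reduction of 1 - 1/3 ≈ 66.7%, which is the arithmetic behind the figure quoted for a pool factor of 3.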

To use token pooling, you can use the HierarchicalTokenPooler class from the colpali-engine package:

import torch

from colpali_engine.compression.token_pooling import HierarchicalTokenPooler

# Dummy embeddings
list_embeddings = [
    torch.rand(10, 768),
    torch.rand(20, 768),
]

# Define the pooler with the desired level of compression
pooler = HierarchicalTokenPooler(pool_factor=2)

# Pool the embeddings
outputs = pooler.pool_embeddings(list_embeddings)

Training

To keep the package lightweight, only the essential dependencies are installed by default. To use the training script for ColPali, you must also install the training dependencies. You can do this using the following command:

pip install "colpali-engine[train]"

All the model configs used can be found in scripts/configs/ and rely on the configue package for straightforward configuration. They should be used with the train_colbert.py script.

Example 1: Local training

USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml

or using accelerate:

accelerate launch scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml

Example 2: Training on a SLURM cluster

sbatch --nodes=1 --cpus-per-task=16 --mem-per-cpu=32GB --time=20:00:00 --gres=gpu:1  -p gpua100 --job-name=colidefics --output=colidefics.out --error=colidefics.err --wrap="accelerate launch scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml"

sbatch --nodes=1  --time=5:00:00 -A cad15443 --gres=gpu:8  --constraint=MI250 --job-name=colpali --wrap="python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml"

Community Projects

Several community projects and resources have been developed around ColPali to facilitate its usage. Feel free to reach out if you want to add your project to this list!

Libraries 📚

Library Name	Description
Byaldi	`Byaldi` is RAGatouille's equivalent for ColPali, leveraging the `colpali-engine` package to facilitate indexing and storing embeddings.
PyVespa	`PyVespa` allows interaction with Vespa, a production-grade vector database, with detailed ColPali support.
Candle	Candle enables ColPali inference with an efficient ML framework for Rust.
EmbedAnything	`EmbedAnything` allows end-to-end ColPali inference with both Candle and ONNX backends.
DocAI	DocAI uses ColPali with GPT-4o and Langchain to extract structured information from documents.
VARAG	VARAG uses ColPali in a vision-only and a hybrid RAG pipeline.
ColBERT Live!	`ColBERT Live!` enables ColPali usage with vector databases supporting large datasets, compression, and non-vector predicates.
ColiVara	`ColiVara` is a retrieval API that allows you to store, search, and retrieve documents based on their visual embeddings. It is a web-first implementation of the ColPali paper using ColQwen2 as the LLM.
BentoML	Deploy ColPali easily with BentoML using this example repository. BentoML features adaptive batching and zero-copy I/O to minimize overhead.

Notebooks 📙

Notebook Title	Author & Link
ColPali Cookbooks	Tony's Cookbooks (ILLUIN) 🙋🏻
Vision RAG Tutorial	Manu's Vision Rag Tutorial (ILLUIN) 🙋🏻
ColPali (Byaldi) + Qwen2-VL for RAG	Merve's Notebook (HuggingFace 🤗)
Indexing ColPali with Qdrant	Daniel's Notebook (HuggingFace 🤗)
Weaviate Tutorial	Connor's ColPali POC (Weaviate)
Use ColPali for Multi-Modal Retrieval with Milvus	Milvus Documentation
Data Generation	Daniel's Notebook (HuggingFace 🤗)
Finance Report Analysis with ColPali and Gemini	Jaykumaran (LearnOpenCV)
Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs)	Sergio Paniego
Document Similarity Search with ColPali	Frank Sommers
End-to-end ColPali inference with EmbedAnything	Akshay Ballal (EmbedAnything)
ColiVara: A ColPali Retrieval API	A simple RAG Example
Multimodal RAG with Document Retrieval (ColPali), Vision Language Model (ColQwen2) and Amazon Nova	Suman's Notebook (AWS)

Other resources

📝 = blog post
📋 = PDF / slides
📹 = video

Title	Author & Link
State of AI report 2024	Nathan's report 📋
Technology Radar Volume 31 (October 2024)	thoughtworks's report 📋
LlamaIndex Webinar: ColPali - Efficient Document Retrieval with Vision Language Models	LlamaIndex's Youtube video 📹
PDF Retrieval with Vision Language Models	Jo's blog post #1 (Vespa) 📝
Scaling ColPali to billions of PDFs with Vespa	Jo's blog post #2 (Vespa) 📝
Neural Search Talks: ColPali (with Manuel Faysse)	Zeta Alpha's Podcast 📹
Multimodal Document RAG with Llama 3.2 Vision and ColQwen2	Zain's blog post (Together AI) 📝
ColPali: Document Retrieval with Vision Language Models	Antaripa Saha 📝
Minimalist diagrams explaining ColPali	Leonie's ColPali diagrams on X 📝
Multimodal RAG with ColPali and Gemini : Financial Report Analysis Application	Jaykumaran's blog post (LearnOpenCV) 📝
Implement Multimodal RAG with ColPali and Vision Language Model Groq(Llava) and Qwen2-VL	Plaban's blog post 📝
multimodal AI. open-source. in a nutshell.	Merve's Youtube video 📹
Remove Complexity from Your RAG Applications	Kyryl's blog post (KOML) 📝
Late interaction & efficient Multi-modal retrievers need more than a vector index	Ayush Chaurasia (LanceDB) 📝
Optimizing Document Retrieval with ColPali and Qdrant's Binary Quantization	Sabrina Aquino (Qdrant) 📹
Hands-On Multimodal Retrieval and Interpretability (ColQwen + Vespa)	Antaripa Saha 📝

Paper result reproduction

To reproduce the results from the paper, check out the v0.1.1 tag or install the corresponding colpali-engine package release using:

pip install colpali-engine==0.1.1

Citation

ColPali: Efficient Document Retrieval with Vision Language Models

Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)

@misc{faysse2024colpaliefficientdocumentretrieval,
      title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
      author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
      year={2024},
      eprint={2407.01449},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.01449}, 
}
