This is the implementation of the paper Iterated Learning Improves Compositionality in Large Vision-Language Models.
Inspired by cultural transmission theory in cognitive science, we design an iterated learning algorithm that improves compositionality in large vision-language models.
Please run the following commands to set up a fresh conda environment and install the required packages.
conda create -n clipenv python=3.8
conda activate clipenv
pip install -r requirements.txt
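Optionally, you can sanity-check the installation afterwards (this assumes PyTorch is among the packages listed in requirements.txt):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"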
To evaluate the model on compositionality benchmarks such as CREPE and SugarCREPE, you need to download their required data. Check out the corresponding repositories for details.
All the testing scripts are integrated in test.sh. To evaluate a model, simply run:
bash test.sh <model-type> <checkpoint-path> <task>
<model-type> can be fdt if you are evaluating codebook variants of the CLIP model (like the model we use), or clip if you are evaluating the CLIP baseline.
<checkpoint-path> is the folder that contains the model checkpoints.
<task> can be one of compositionality, retrieval, recognition, probing.
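For example, to evaluate an iterated-learning checkpoint on the compositionality benchmarks, a typical invocation looks like the following (the checkpoint folder output/cc3m_IL_6000 is an assumption based on the --output_path and --exp_name used in the training command below; point it to wherever your checkpoints are actually saved):
bash test.sh fdt output/cc3m_IL_6000 compositionality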
Note that we use clip-benchmark for evaluating recognition and retrieval. It will automatically download the required datasets into the data folder.
The pretrained model checkpoints can be found here.
First, to prepare the data for training, we recommend using publicly available image-text datasets, such as Conceptual Captions (CC3M), Conceptual 12M (CC12M), and LAION115M. img2dataset is a very convenient tool for downloading these large-scale datasets.
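For instance, a minimal img2dataset invocation for CC3M might look like the following. This is only a sketch: the annotation file name (cc3m_train.tsv), the column names, the output folder, and the output format are assumptions, so adjust them to match the annotation file you downloaded and the data format your training config expects.
img2dataset --url_list cc3m_train.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" \
    --output_format webdataset --output_folder data/cc3m \
    --processes_count 16 --thread_count 64 --image_size 256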
After preparing the data (CC3M in this example), to train a ViT-B/32 CLIP model using our iterated learning algorithm, please run
bash run.sh example/clip_fdt/train_solver.py \
--config example/clip_fdt/config_cc3m.yaml \
--output_path output \
--batch_size 256 \
--exp_name cc3m_IL_6000
This script assumes 4 GPUs on 1 node. You can modify the number of GPUs and nodes in run.sh.
To train a baseline CLIP model (also ViT-B/32), please run
bash run.sh example/clip/train_solver.py \
--config example/clip/config_cc3m.yaml \
--output_path output \
--batch_size 256 \
--exp_name baseline_clip
If you find this repository useful, please consider citing:
@inproceedings{zheng2024iterated,
title={Iterated learning improves compositionality in large vision-language models},
author={Zheng, Chenhao and Zhang, Jieyu and Kembhavi, Aniruddha and Krishna, Ranjay},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13785--13795},
year={2024}
}
Part of our code is adapted from the following repositories and sources. We thank the authors for making their code available.