This repo provides the PyTorch source code of our paper: Describing Differences in Image Sets with Natural Language (CVPR 2024 Oral). Check out the project page here!
How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning. This task takes in two image sets and outputs a natural language description that is more often true for one set than the other.
Here we provide a minimal example that describes the differences between two sets of images, where set A contains images of people practicing yoga in a mountainous setting and set B contains images of people meditating in a mountainous setting.
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Log in to your wandb account:
  ```bash
  wandb login
  ```
- Describe the differences:
  ```bash
  python main.py --config configs/example.yaml
  ```
After the run finishes, you should see the generated difference descriptions logged to wandb.
If you want to use VisDiff on your own datasets, follow these steps.
Convert your dataset to a CSV file with two required columns, `path` and `group_name`. An example CSV can be found in data/examples.csv.
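For instance, a CSV for the yoga/meditation example above might look like the sketch below (the file paths and group names are illustrative, not files that ship with the repo):

```csv
path,group_name
data/images/yoga_mountain_001.jpg,yoga_mountain
data/images/yoga_mountain_002.jpg,yoga_mountain
data/images/meditation_mountain_001.jpg,meditation_mountain
data/images/meditation_mountain_002.jpg,meditation_mountain
```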
To describe the differences between two datasets, we need a proposer and a ranker. The proposer randomly samples subsets of images to generate a set of candidate differences. The ranker then scores the salience and significance of each candidate.
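Conceptually, the propose-then-rank loop works roughly as in the sketch below; the object and method names are placeholders, and the real interfaces live in components/proposer.py and components/ranker.py.

```python
# Illustrative sketch of the propose-then-rank pipeline; the `proposer` and
# `ranker` objects are placeholders for the classes in components/.
import random

def describe_differences(set_a, set_b, proposer, ranker, num_rounds=3, subset_size=20):
    candidates = []
    for _ in range(num_rounds):
        # The proposer looks at small random subsets of each group and
        # generates candidate natural-language differences.
        sample_a = random.sample(set_a, min(subset_size, len(set_a)))
        sample_b = random.sample(set_b, min(subset_size, len(set_b)))
        candidates += proposer.propose(sample_a, sample_b)

    # The ranker scores how well each candidate separates the two full sets,
    # and the candidates are returned sorted from most to least salient.
    scored = [(ranker.score(c, set_a, set_b), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]
```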
We have implemented several proposers and rankers in components/proposer.py and components/ranker.py. To select which ones to use, edit the corresponding arguments in configs/base.yaml.
We put all the general arguments in configs/base.yaml and dataset-specific arguments in configs/example.yaml.
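As a rough illustration of how a dataset-specific config might divide responsibilities (the key names below are assumptions for illustration, not the repo's actual schema):

```yaml
# configs/example.yaml (illustrative sketch; actual keys may differ)
data:
  path: data/examples.csv        # CSV with `path` and `group_name` columns
  group1: yoga_mountain          # group_name of set A
  group2: meditation_mountain    # group_name of set B
proposer:
  name: vlm_proposer             # which proposer from components/proposer.py to use
ranker:
  name: clip_ranker              # which ranker from components/ranker.py to use
```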
We serve all the LLMs, VLMs, and CLIP models behind API servers for faster inference. Follow the instructions in serve/ to start these servers.
For example, if you use BLIP-2 + GPT-4 as the proposer and CLIP as the ranker, you need to start the following servers:
```bash
python serve/clip_server.py
python serve/vlm_server_blip.py
```
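The exact client interface depends on the implementations in serve/; as a hedged sketch, assuming the CLIP server exposes an HTTP endpoint that scores a text description against a batch of images (the URL, port, and payload fields below are assumptions, not the repo's actual API), a request might look like:

```python
import requests

# Hypothetical request to the CLIP server; the endpoint, port, and payload
# schema are assumptions for illustration only.
response = requests.post(
    "http://localhost:8000/score",
    json={
        "text": "people practicing yoga",
        "image_paths": ["data/images/yoga_mountain_001.jpg"],
    },
)
print(response.json())
```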
Finally, you can run the following command to describe the differences between two datasets:
```bash
python main.py --config configs/example.yaml
```
To evaluate our system, we collected VisDiffBench, a benchmark of 187 paired image sets with ground-truth difference descriptions (download link). To evaluate performance on VisDiffBench, we ask VisDiff to output a description for each paired set and compare it to the ground truth using a GPT-4 evaluator.
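As a hedged sketch of what such a GPT-4-based comparison can look like (the prompt wording and scoring scale are illustrative, not the exact evaluator from the paper), assuming access to the OpenAI chat API:

```python
from openai import OpenAI

client = OpenAI()

def gpt4_similarity(predicted: str, ground_truth: str) -> int:
    """Ask GPT-4 to rate how well a predicted difference matches the ground truth.

    The prompt and 0-2 scale below are illustrative; the paper's evaluator may
    use a different prompt and scoring scheme.
    """
    prompt = (
        "On a scale of 0 (unrelated), 1 (partially matches), 2 (matches), "
        "how well does the predicted difference describe the same concept as "
        f"the ground truth?\nPredicted: {predicted}\nGround truth: {ground_truth}\n"
        "Answer with a single number."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())
```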
VisDiffBench is collected from the following datasets:
- PairedImageSets (Collection Code)
- ImageNet-R
- ImageNet*
To evaluate VisDiff, you can run the scripts in sweeps/:
```bash
python sweeps/sweep_pairedimagesets.py
python sweeps/sweep_imagenet.py
```
For each application, we provide the corresponding code and usage instructions in the applications/ folder.
If you use this repo in your research, please cite it as follows:
```bibtex
@inproceedings{VisDiff,
  title={Describing Differences in Image Sets with Natural Language},
  author={Dunlap, Lisa and Zhang, Yuhui and Wang, Xiaohan and Zhong, Ruiqi and Darrell, Trevor and Steinhardt, Jacob and Gonzalez, Joseph E. and Yeung-Levy, Serena},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```