This repository hosts the code for Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World.
We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning in machine vision. It originates from the classical Bongard Problems (BPs): given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concept that is exclusively depicted by the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already poses a significant challenge to current few-shot reasoning algorithms. We further investigate to what extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs and by combining VLMs and LLMs in an interactive reasoning scheme. We even designed a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manages to close the human-machine gap, as the best learner achieves 64% accuracy while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.
We explore four families of approaches: (a) casting Bongard-OpenWorld into a standard "2-way, 6-shot" few-shot learning problem and tackling it with state-of-the-art few-shot learners built on pretrained image representations; (b) combining an LLM (reasoner) and a VLM (image captioner) in a single-round fashion, where the VLM simply captions each Bongard image and sends the captions to the LLM to solve the problem (see the sketch below); (c) extending (b) to multiple rounds, where the LLM also iteratively probes the VLM for more image details, yielding more condensed information for solving the problem; (d) a neuro-symbolic approach, where a VLM generates the initial captions and an LLM extracts visual concepts from them; these concepts are then updated through logical operations, leveraging the VLM's responses, until the problem is solved.
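As a rough illustration of the single-round scheme (b), the snippet below captions each image and hands all captions to an LLM in one prompt. It is a minimal sketch, not this repository's implementation: `caption_image` is a hypothetical placeholder for a captioner such as BLIP-2, and the prompt wording and model name are assumptions.

```python
# Minimal sketch of the single-round VLM + LLM pipeline (approach (b)).
# `caption_image`, the prompt wording, and the model name are illustrative
# placeholders, not the exact setup used in this repository.
from openai import OpenAI  # assumes the official openai Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_image(image_path: str) -> str:
    """Hypothetical wrapper around a captioner such as BLIP-2."""
    raise NotImplementedError


def solve_single_round(positive_paths, negative_paths, query_path):
    pos_caps = [caption_image(p) for p in positive_paths]
    neg_caps = [caption_image(p) for p in negative_paths]
    query_cap = caption_image(query_path)

    prompt = (
        "Positive images:\n" + "\n".join(f"- {c}" for c in pos_caps) +
        "\n\nNegative images:\n" + "\n".join(f"- {c}" for c in neg_caps) +
        f"\n\nQuery image: {query_cap}\n"
        "Induce the concept shared only by the positive images, then answer "
        "whether the query image is positive or negative."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```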
All the non-LLM models use a ConvNeXt-Base image encoder, and we experiment with different pretraining strategies: no pretraining at all (scratch), pretraining with ImageNet-1K labels (IN-1K), pretraining with full ImageNet-22K labels (IN-22K), and pretraining on the LAION-2B dataset (OpenCLIP). The LLM-based models instead use BLIP-x or ChatCaptioner captions as the image representations. For the auxiliary captioning task, the model is connected to the caption decoder of a pretrained BLIP-2-opt-6.7B model. *CS denotes commonsense. **Denotes fine-tuning ChatGPT for 5 epochs, using the ground-truth concepts from the Bongard-OpenWorld training set and the BLIP-2 captions as inputs; the fine-tuned model is evaluated on the test set. Note that no fine-tuning was done with InstructBLIP captions, as they led to a significant drop in ChatGPT's performance.
method | image representation | aux. task? | short concept | long concept | CS* concept | non-CS* concept | avg. |
---|---|---|---|---|---|---|---|
SNAIL | scratch | ✗ | 52.8 | 46.2 | 50.9 | 49.3 | 49.8 |
SNAIL | IN-1K | ✗ | 61.5 | 54.9 | 48.2 | 62.4 | 58.5 |
SNAIL | IN-22K | ✗ | 62.8 | 57.7 | 54.5 | 62.8 | 60.5 |
SNAIL | OpenCLIP | ✗ | 64.2 | 57.7 | 57.3 | 62.8 | 61.3 |
SNAIL | OpenCLIP | ✔ | 66.1 | 61.5 | 63.6 | 64.1 | 64.0 |
OpenFlamingo | OpenCLIP | N/A | 50.0 | 48.4 | 50.9 | 48.6 | 49.3 |
Otter | OpenCLIP | N/A | 49.3 | 49.3 | 48.9 | 49.4 | 49.3 |
ChatGPT | BLIP-2 | N/A | 60.6 | 56.6 | 55.5 | 60.0 | 58.8 |
ChatGPT | InstructBLIP | N/A | 52.1 | 50.6 | 48.1 | 52.7 | 51.4 |
ChatGPT | ChatCaptioner | N/A | 52.3 | 45.6 | 57.3 | 46.2 | 49.3 |
ChatGPT (Fine-tuned)** | BLIP-2 | N/A | 67.0 | 58.8 | 55.5 | 66.2 | 63.3 |
GPT-4 | BLIP-2 | N/A | 64.5 | 58.0 | 57.3 | 63.2 | 61.6 |
GPT-4 | InstructBLIP | N/A | 67.3 | 59.7 | 59.3 | 65.6 | 63.8 |
Neuro-Symbolic | InstructBLIP | N/A | 58.3 | 52.2 | 56.4 | 55.2 | 55.5 |
Human | N/A | N/A | 91.7 | 90.1 | 89.1 | 91.7 | 91.0 |
This codebase can be built from scratch on Ubuntu 20.04 with Python 3.10, PyTorch 1.13 and CUDA 11.7.
```bash
conda create -n bongard-ow python=3.10
conda activate bongard-ow
conda install pytorch=1.13 torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
```
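After installation, a quick sanity check (not part of this repository) can confirm the expected PyTorch and CUDA versions:

```python
# Quick environment sanity check; not part of this repository.
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expect 1.13.x
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (build):", torch.version.cuda)      # expect 11.7
```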
To download all images from their URLs, navigate to the root directory of Bongard-OpenWorld and run `scripts/crawl_images.py`:
```bash
cd Bongard-OpenWorld
python scripts/crawl_images.py
```
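Conceptually, the crawler just downloads every URL listed in the problem JSON into the corresponding problem folder. The sketch below is illustrative only; the `uid` and `urls` field names are assumptions, not necessarily the actual schema of `bongard_ow.json`.

```python
# Illustrative downloader sketch; the "uid" and "urls" keys are assumptions,
# not necessarily the actual schema of bongard_ow.json.
import json
import pathlib

import requests


def crawl(json_path="assets/data/bongard-ow/bongard_ow.json",
          out_root="assets/data/bongard-ow/images"):
    with open(json_path) as f:
        problems = json.load(f)
    for problem in problems:
        out_dir = pathlib.Path(out_root) / str(problem["uid"]).zfill(4)  # hypothetical problem id
        out_dir.mkdir(parents=True, exist_ok=True)
        for i, url in enumerate(problem["urls"]):  # hypothetical list of image URLs
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
                (out_dir / f"{i}.jpg").write_bytes(resp.content)
            except requests.RequestException:
                print(f"Skipping dead link: {url}")


if __name__ == "__main__":
    crawl()
```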
Please note that some links may be invalid due to the instability of the URLs. To ensure that the community can reproduce our results from scratch, we provide a backup of all the images, which you can download from Google Drive.
The images should be extracted to `assets/data/bongard-ow/images`, so that the file structure looks like:
```
assets
├── data
│   └── bongard-ow
│       ├── images
│       │   ├── 0000
│       │   ├── 0001
│       │   ├── ....
│       │   └── 1009
│       ├── bbox_data.pkl
│       ├── bongard_ow.json
│       ├── bongard_ow_train.json
│       ├── bongard_ow_val.json
│       └── bongard_ow_test.json
└── weights
```
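After extraction, a short script (again, not part of the repository) can verify that the layout above is in place:

```python
# Verify the dataset layout shown above; not part of this repository.
import json
import pathlib

root = pathlib.Path("assets/data/bongard-ow")
for split in ("train", "val", "test"):
    with open(root / f"bongard_ow_{split}.json") as f:
        problems = json.load(f)
    print(f"{split}: {len(problems)} problems")

image_dirs = [p for p in (root / "images").iterdir() if p.is_dir()]
print("image folders:", len(image_dirs), "(folders 0000-1009 per the tree above)")
```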
Please note that this repository only hosts the code for Bongard-OpenWorld. All images of Bongard-OpenWorld are crawled from Google Images and should not be considered part of the source code.
We do not claim ownership of any image in Bongard-OpenWorld. Therefore, we strongly recommend that you delete all images once you have finished benchmarking and evaluating all approaches.
To run the few-shot learners (approach (a)):

```bash
bash fewshot_learning.sh
```

To run the single-round VLM + LLM pipeline (approach (b)):

```bash
bash vlm+llm_single_round.sh
```

To run the multi-round VLM + LLM pipeline (approach (c)):

```bash
bash vlm+llm_multi_round.sh
```

Thanks to ChatCaptioner for its codebase.
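For intuition, the multi-round scheme boils down to a question-asking loop between the LLM and the VLM. The sketch below is illustrative only: `ask_llm` and `ask_vlm` are hypothetical helpers, not this repository's API, and the stopping convention is an assumption.

```python
# Illustrative sketch of the multi-round scheme (approach (c)): the LLM keeps
# probing the VLM with questions about the images until it is ready to answer.
# `ask_llm` and `ask_vlm` are hypothetical helpers, not this repository's API.

def ask_vlm(image_path: str, question: str) -> str:
    """Hypothetical VLM question-answering wrapper (e.g. around InstructBLIP)."""
    raise NotImplementedError


def ask_llm(dialogue: list[dict]) -> str:
    """Hypothetical LLM call returning either the next question or 'FINAL: <answer>'."""
    raise NotImplementedError


def solve_multi_round(image_paths: list[str], initial_captions: list[str],
                      max_rounds: int = 5) -> str:
    dialogue = [{
        "role": "user",
        "content": "Initial captions:\n" + "\n".join(initial_captions) +
                   "\nAsk one question at a time about the images; reply with "
                   "'FINAL: <answer>' once you can solve the Bongard problem.",
    }]
    for _ in range(max_rounds):
        reply = ask_llm(dialogue)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        # Relay the LLM's question to the VLM for every image, then report back.
        answers = [ask_vlm(p, reply) for p in image_paths]
        dialogue += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": "\n".join(answers)}]
    return ask_llm(dialogue + [{"role": "user",
                                "content": "Give your FINAL answer now."}])
```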
To run the neuro-symbolic approach (approach (d)):

```bash
bash neuro-symbolic.sh
```

To directly probe the VLMs (OpenFlamingo and Otter):

```bash
bash vlm.sh
```

Thanks to OpenFlamingo and Otter for their codebases.
- Code: Apache
- Data: CC BY-NC-SA 4.0