This repository hosts the code for Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World.
We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning in machine vision. It originates from the classical Bongard Problems (BPs): given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concept that is exclusively depicted by the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already poses a significant challenge to current few-shot reasoning algorithms. We further investigate to what extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs and by combining VLMs and LLMs in an interactive reasoning scheme. We even designed a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manages to close the human-machine gap, as the best learner achieves 64% accuracy while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.
We explore four families of approaches: (a) casting Bongard-OpenWorld into a standard "2-way, 6-shot" few-shot learning problem and tackling it with state-of-the-art few-shot learners built on pretrained image representations; (b) combining an LLM (reasoner) and a VLM (image captioner) in a single-round fashion, where the VLM simply captions each Bongard image and sends the captions to the LLM to solve the problem (see the sketch below); (c) extending (b) to multiple rounds, where the LLM also iteratively probes the VLM for more image details, yielding more condensed information for solving the problem; (d) a neuro-symbolic approach, where a VLM generates the initial captions and an LLM extracts visual concepts from them; these concepts are then updated through logical operations, leveraging the VLM's responses, until the problem is solved.
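As a rough illustration of the single-round scheme (b), the snippet below captions each image and hands all captions to an LLM in one prompt. It is a minimal sketch, not this repository's implementation: `caption_image` is a hypothetical placeholder for a captioner such as BLIP-2, and the prompt wording and model name are assumptions.

```python
# Minimal sketch of the single-round VLM + LLM pipeline (approach (b)).
# `caption_image`, the prompt wording, and the model name are illustrative
# placeholders, not the exact setup used in this repository.
from openai import OpenAI  # assumes the official openai Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_image(image_path: str) -> str:
    """Hypothetical wrapper around a captioner such as BLIP-2."""
    raise NotImplementedError


def solve_single_round(positive_paths, negative_paths, query_path):
    pos_caps = [caption_image(p) for p in positive_paths]
    neg_caps = [caption_image(p) for p in negative_paths]
    query_cap = caption_image(query_path)

    prompt = (
        "Positive images:\n" + "\n".join(f"- {c}" for c in pos_caps) +
        "\n\nNegative images:\n" + "\n".join(f"- {c}" for c in neg_caps) +
        f"\n\nQuery image: {query_cap}\n"
        "Induce the concept shared only by the positive images, then answer "
        "whether the query image is positive or negative."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```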
All the non-LLM models use a ConvNeXt-Base image encoder, and we experiment with different pretraining strategies: no pretraining at all (scratch), pretraining with ImageNet-1K labels (IN-1K), pretraining with full ImageNet-22K labels (IN-22K), and pretraining on the LAION-2B dataset (OpenCLIP). The LLM-based models instead use BLIP-x or ChatCaptioner captions as the image representations. For the auxiliary captioning task, the model is connected to the caption decoder of a pretrained BLIP-2-opt-6.7B model. *CS denotes commonsense. **Denotes fine-tuning ChatGPT for 5 epochs, using the ground-truth concepts from the Bongard-OpenWorld training set and the BLIP-2 captions as inputs; the fine-tuned model is evaluated on the test set. Note that no fine-tuning was done with InstructBLIP captions, as they led to a significant drop in ChatGPT's performance.
method | image representation | aux. task? | short concept | long concept | CS* concept | non-CS* concept | avg. |
---|---|---|---|---|---|---|---|
SNAIL | scratch | ✗ | 52.8 | 46.2 | 50.9 | 49.3 | 49.8 |
SNAIL | IN-1K | ✗ | 61.5 | 54.9 | 48.2 | 62.4 | 58.5 |
SNAIL | IN-22K | ✗ | 62.8 | 57.7 | 54.5 | 62.8 | 60.5 |
SNAIL | OpenCLIP | ✗ | 64.2 | 57.7 | 57.3 | 62.8 | 61.3 |
SNAIL | OpenCLIP | ✔ | 66.1 | 61.5 | 63.6 | 64.1 | 64.0 |
OpenFlamingo | OpenCLIP | N/A | 50.0 | 48.4 | 50.9 | 48.6 | 49.3 |
Otter | OpenCLIP | N/A | 49.3 | 49.3 | 48.9 | 49.4 | 49.3 |
ChatGPT | BLIP-2 | N/A | 60.6 | 56.6 | 55.5 | 60.0 | 58.8 |
ChatGPT | InstructBLIP | N/A | 52.1 | 50.6 | 48.1 | 52.7 | 51.4 |
ChatGPT | ChatCaptioner | N/A | 52.3 | 45.6 | 57.3 | 46.2 | 49.3 |
ChatGPT (Fine-tuned)** | BLIP-2 | N/A | 67.0 | 58.8 | 55.5 | 66.2 | 63.3 |
GPT-4 | BLIP-2 | N/A | 64.5 | 58.0 | 57.3 | 63.2 | 61.6 |
GPT-4 | InstructBLIP | N/A | 67.3 | 59.7 | 59.3 | 65.6 | 63.8 |
Neuro-Symbolic | InstructBLIP | N/A | 58.3 | 52.2 | 56.4 | 55.2 | 55.5 |
Human | N/A | N/A | 91.7 | 90.1 | 89.1 | 91.7 | 91.0 |
This codebase can be built from scratch on Ubuntu 20.04 with Python 3.10, PyTorch 1.13 and CUDA 11.7.
```bash
conda create -n bongard-ow python=3.10
conda activate bongard-ow
conda install pytorch=1.13 torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
```
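After installation, a quick sanity check (not part of this repository) can confirm the expected PyTorch and CUDA versions:

```python
# Quick environment sanity check; not part of this repository.
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expect 1.13.x
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (build):", torch.version.cuda)      # expect 11.7
```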
To download all images from their URLs, navigate to the root directory of Bongard-OpenWorld and run `scripts/crawl_images.py`:
```bash
cd Bongard-OpenWorld
python scripts/crawl_images.py
```
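Conceptually, the crawler just downloads every URL listed in the problem JSON into the corresponding problem folder. The sketch below is illustrative only; the `uid` and `urls` field names are assumptions, not necessarily the actual schema of `bongard_ow.json`.

```python
# Illustrative downloader sketch; the "uid" and "urls" keys are assumptions,
# not necessarily the actual schema of bongard_ow.json.
import json
import pathlib

import requests


def crawl(json_path="assets/data/bongard-ow/bongard_ow.json",
          out_root="assets/data/bongard-ow/images"):
    with open(json_path) as f:
        problems = json.load(f)
    for problem in problems:
        out_dir = pathlib.Path(out_root) / str(problem["uid"]).zfill(4)  # hypothetical problem id
        out_dir.mkdir(parents=True, exist_ok=True)
        for i, url in enumerate(problem["urls"]):  # hypothetical list of image URLs
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
                (out_dir / f"{i}.jpg").write_bytes(resp.content)
            except requests.RequestException:
                print(f"Skipping dead link: {url}")


if __name__ == "__main__":
    crawl()
```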
Please note that some links may be invalid due to the instability of the URLs. To ensure that the community can reproduce our results from scratch, we provide a backup of all the images, which you can download from Google Drive.
The images should be extracted to `assets/data/bongard-ow/images`, so that the file structure looks like:
```
assets
├── data
│   └── bongard-ow
│       ├── images
│       │   ├── 0000
│       │   ├── 0001
│       │   ├── ....
│       │   └── 1009
│       ├── bbox_data.pkl
│       ├── bongard_ow.json
│       ├── bongard_ow_train.json
│       ├── bongard_ow_val.json
│       └── bongard_ow_test.json
└── weights
```
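After extraction, a short script (again, not part of the repository) can verify that the layout above is in place:

```python
# Verify the dataset layout shown above; not part of this repository.
import json
import pathlib

root = pathlib.Path("assets/data/bongard-ow")
for split in ("train", "val", "test"):
    with open(root / f"bongard_ow_{split}.json") as f:
        problems = json.load(f)
    print(f"{split}: {len(problems)} problems")

image_dirs = [p for p in (root / "images").iterdir() if p.is_dir()]
print("image folders:", len(image_dirs), "(folders 0000-1009 per the tree above)")
```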
Please note that this repository only hosts the code for Bongard-OpenWorld. All images of Bongard-OpenWorld are crawled from Google Images and should not be considered part of the source code.
We do not claim ownership of any image in Bongard-OpenWorld. Therefore, we strongly recommend that you delete all images once you have finished benchmarking and evaluating all approaches.
To run the few-shot learners (approach (a)):

```bash
bash fewshot_learning.sh
```

To run the single-round VLM + LLM pipeline (approach (b)):

```bash
bash vlm+llm_single_round.sh
```

To run the multi-round VLM + LLM pipeline (approach (c)):

```bash
bash vlm+llm_multi_round.sh
```

Thanks to ChatCaptioner for its codebase.
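For intuition, the multi-round scheme boils down to a question-asking loop between the LLM and the VLM. The sketch below is illustrative only: `ask_llm` and `ask_vlm` are hypothetical helpers, not this repository's API, and the stopping convention is an assumption.

```python
# Illustrative sketch of the multi-round scheme (approach (c)): the LLM keeps
# probing the VLM with questions about the images until it is ready to answer.
# `ask_llm` and `ask_vlm` are hypothetical helpers, not this repository's API.

def ask_vlm(image_path: str, question: str) -> str:
    """Hypothetical VLM question-answering wrapper (e.g. around InstructBLIP)."""
    raise NotImplementedError


def ask_llm(dialogue: list[dict]) -> str:
    """Hypothetical LLM call returning either the next question or 'FINAL: <answer>'."""
    raise NotImplementedError


def solve_multi_round(image_paths: list[str], initial_captions: list[str],
                      max_rounds: int = 5) -> str:
    dialogue = [{
        "role": "user",
        "content": "Initial captions:\n" + "\n".join(initial_captions) +
                   "\nAsk one question at a time about the images; reply with "
                   "'FINAL: <answer>' once you can solve the Bongard problem.",
    }]
    for _ in range(max_rounds):
        reply = ask_llm(dialogue)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        # Relay the LLM's question to the VLM for every image, then report back.
        answers = [ask_vlm(p, reply) for p in image_paths]
        dialogue += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": "\n".join(answers)}]
    return ask_llm(dialogue + [{"role": "user",
                                "content": "Give your FINAL answer now."}])
```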
To run the neuro-symbolic approach (approach (d)):

```bash
bash neuro-symbolic.sh
```

To directly probe the VLMs (OpenFlamingo and Otter):

```bash
bash vlm.sh
```

Thanks to OpenFlamingo and Otter for their codebases.
- Code: Apache
- Data: CC BY-NC-SA 4.0