OVMR: Open-Vocabulary Recognition with Multi-Modal References

[Paper]

This repo is the official implementation of OVMR: Open-Vocabulary Recognition with Multi-Modal References.

Abstract

The challenge of open-vocabulary recognition is that the model has no clue about the new categories it is applied to. Existing works embed category cues into the model through few-shot fine-tuning or by providing textual descriptions to vision-language models. Few-shot fine-tuning with exemplar images is time-consuming and degrades the generalization capability. Textual descriptions can be ambiguous and fail to depict visual details. Our fine-tuning-free OVMR embeds multi-modal category clues into vision-language models with two plug-and-play modules.

Highlights

  • [2024.09.30] The code for open-vocabulary classification has been released.

OVMR Architecture

Environment

git clone https://github.com/Zehong-Ma/OVMR.git
cd OVMR
conda create -n ovmr python=3.10
conda activate ovmr
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements
cd ./Dassl.pytorch
pip install -e .
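
After the steps above, a quick check that the key packages are importable can catch a broken environment before training. The following is a minimal sketch, not part of the repository: the helper name `check_packages` is hypothetical, and it assumes that Dassl.pytorch installs under the module name `dassl`.

```python
import importlib.util


def check_packages(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names are assumptions based on the install commands above.
    for name, found in check_packages(["torch", "torchvision", "dassl"]).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

The check only locates the modules; it does not import them, so it will not detect a CUDA/driver mismatch.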

Train

Data

  • Download the ImageNet-21K dataset and extract the files.

  • Create a soft link to imagenet21k in the data folder:

    mkdir -p ./data/imagenet21k/
    ln -s /path/to/imagenet21k ./data/imagenet21k/images
  • Download imagenet21k-OVR from here. Put imagenet21k_OVR_classnames.txt into ./data/imagenet21k/ and shot_64-seed_1.pkl into ./data/imagenet21k/split_fewshot/. The resulting structure looks like:

    data
    |–– imagenet21k/
    |   |–– imagenet21k_OVR_classnames.txt
    |   |–– split_fewshot/
    |   |   |–– shot_64-seed_1.pkl
    |   |–– images/
    |   |   |–– n00004475
    |   |   |–– n00005787
    |   |   |–– ...
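
Before launching training, it can be useful to confirm that the layout above is in place. The snippet below is a small sketch under that assumption; the function `find_missing` and the constant `EXPECTED_TRAIN_PATHS` are hypothetical names introduced for illustration, and the paths are copied from the tree above.

```python
from pathlib import Path

# Paths taken from the layout described above, relative to the data folder.
EXPECTED_TRAIN_PATHS = [
    "imagenet21k/imagenet21k_OVR_classnames.txt",
    "imagenet21k/split_fewshot/shot_64-seed_1.pkl",
    "imagenet21k/images",
]


def find_missing(data_root, expected=EXPECTED_TRAIN_PATHS):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    # Path.exists() follows symlinks, so a dangling soft link counts as missing.
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = find_missing("./data")
    if missing:
        print("Missing entries:", *missing, sep="\n  ")
    else:
        print("Training data layout looks complete.")
```

Because a dangling link is reported as missing, this also flags an `images` soft link that points to a wrong ImageNet-21K location.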

Scripts

  • Set the GPU IDs in the training script to match the GPUs available on your machine.

  • Run the following script to reproduce our open-vocabulary classification results:

    sh train.sh

Inference

Data

  • Follow DATASETS.md to install the datasets and put them into the data folder. Thanks to CoOp for the awesome work.

  • If you have already downloaded these datasets, create soft links to them with:

    ln -s /root/path/to/CoOp/dataset/* ./data/
    
  • The final structure will look like:

    data
    |–– imagenet21k/
    |–– imagenet/
    |   |   |–– images/
    |   |   |–– train/ # contains 1,000 folders like n01440764, n01443537, etc.
    |   |   |–– val/
    |   |   |–– classnames.txt
    |–– caltech-101/
    |–– dtd/
    |–– eurosat/
    |–– ...
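
The soft-link step above can also be done per dataset from Python, which avoids errors when some entries already exist in the data folder (for example imagenet21k/). This is a hedged sketch rather than a script shipped with the repository; `link_datasets` is a hypothetical helper and the source path in the usage line is the placeholder from the command above.

```python
from pathlib import Path


def link_datasets(source_root, data_root="./data"):
    """Symlink every entry of source_root into data_root.

    Entries whose name already exists in data_root are left untouched.
    Returns the sorted list of names that were newly linked.
    """
    source = Path(source_root).resolve()
    target = Path(data_root)
    target.mkdir(parents=True, exist_ok=True)
    linked = []
    for entry in sorted(source.iterdir()):
        dest = target / entry.name
        # is_symlink() also catches dangling links that exists() would miss.
        if dest.exists() or dest.is_symlink():
            continue
        dest.symlink_to(entry, target_is_directory=entry.is_dir())
        linked.append(entry.name)
    return linked


# Example usage (placeholder path, adjust to your setup):
# link_datasets("/root/path/to/CoOp/dataset", "./data")
```

Running it a second time links nothing new, so it is safe to repeat after adding more datasets to the source folder.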
    

Scripts

  • The modefusion, vision, and multimodal represent the final fused classifier, the vision-based classifier, and the multi-modal classifier, respectively.

  • Get the open-vocabulary classification results in the prompt learning setup by running:

    sh eval.sh

Results

Open-Vocabulary Classification

  • Few-Shot Prompt Learning Methods

  • Traditional Few-shot Methods

Open-Vocabulary Detection

The Influence of the Number of Exemplar Images

Citation

@InProceedings{Ma_2024_CVPR,
    author    = {Ma, Zehong and Zhang, Shiliang and Wei, Longhui and Tian, Qi},
    title     = {OVMR: Open-Vocabulary Recognition with Multi-Modal References},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {16571-16581}
}