Official implementation of the paper "PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers", accepted as an Oral presentation at ECCV 2024.
[Oral] [🤗 Space] [Paper] [Supp.] [Arxiv] [🤗 Page]
Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts: they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks (CUB, PartImageNet and Oxford Flowers) and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and on the downstream classification task, showing that the strong inductive biases in self-supervised ViT models call for rethinking the geometric priors that can be used for unsupervised part discovery.
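To give an intuition for why a TV prior tolerates several connected components while still discouraging noisy maps, here is a minimal sketch in plain Python. It computes the anisotropic total variation (the sum of absolute differences between neighbouring cells) of a small binary grid. This is an illustration only, not the loss implemented in this repository: the function name `total_variation` and the toy grids are made up for this example, and the paper's formulation may differ.

```python
def total_variation(grid):
    """Anisotropic total variation of a 2D grid given as a list of rows."""
    rows, cols = len(grid), len(grid[0])
    # Absolute differences between vertically adjacent cells.
    vertical = sum(abs(grid[r + 1][c] - grid[r][c])
                   for r in range(rows - 1) for c in range(cols))
    # Absolute differences between horizontally adjacent cells.
    horizontal = sum(abs(grid[r][c + 1] - grid[r][c])
                     for r in range(rows) for c in range(cols - 1))
    return vertical + horizontal


# Hypothetical toy map: one part split into two separate blobs (8 active cells).
two_blobs = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
# The same number of active cells scattered as isolated dots.
scattered = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
]

if __name__ == "__main__":
    print("two blobs:", total_variation(two_blobs))
    print("scattered:", total_variation(scattered))
```

The two-blob map scores much lower than the scattered one even though both have the same active area, which is the sense in which TV penalises boundary length rather than the number or size of components.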
- The code has been updated to support the NABirds dataset. The corresponding evaluation metrics and pre-trained models have also been added.
- The models are available via torch hub. The details can be found in the model zoo file.
- PDiscoFormer has been accepted as an Oral presentation at ECCV 2024 🎉
- Models are now available via HuggingFace. Thanks to Niels Rogge and Merve Noyan.
To install the required packages, run the following command:
conda env create -f environment.yml
Otherwise, you can also individually install the following packages:
- PyTorch: Tested up to version 2.3. Please raise an issue if you face any problems with more recent versions.
- Colorcet
- Matplotlib
- OpenCV
- Pandas
- Scikit-Image
- Scikit-Learn
- TorchMetrics
- timm
- wandb: It is recommended to create an account and use it for tracking the experiments. Use the '--wandb' flag when running the training script to enable this feature.
- pycocotools
- pytopk
- huggingface-hub
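If you install the packages individually, a quick check like the sketch below can tell you which ones are still missing. The import names are assumptions: several differ from the package names in the list above (for example, OpenCV is usually imported as `cv2` and Scikit-Image as `skimage`), and the import name used for `pytopk` is a guess. The helper `missing_modules` is not part of this repository.

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages listed above.
REQUIRED_MODULES = [
    "torch", "colorcet", "matplotlib", "cv2", "pandas", "skimage",
    "sklearn", "torchmetrics", "timm", "wandb", "pycocotools",
    "pytopk", "huggingface_hub",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

The check only looks up whether each module can be located; it does not import the packages or verify their versions.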
The CUB-200-2011 dataset can be downloaded from here.
The folder structure should look like this:
CUB_200_2011
├── attributes
├── bounding_boxes.txt
├── classes.txt
├── images
├── image_class_labels.txt
├── images.txt
├── parts
├── README
└── train_test_split.txt
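Before training, it can help to confirm that the extracted folder matches the listing above. The following is a minimal sketch of such a check; the helper `missing_entries` and the constant `CUB_ENTRIES` are hypothetical names used for illustration and are not part of this repository.

```python
from pathlib import Path

# Entries copied from the folder listing above.
CUB_ENTRIES = [
    "attributes", "bounding_boxes.txt", "classes.txt", "images",
    "image_class_labels.txt", "images.txt", "parts", "README",
    "train_test_split.txt",
]


def missing_entries(root, expected=CUB_ENTRIES):
    """Return the expected files or folders that are absent under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]
```

For example, `missing_entries("CUB_200_2011")` returns an empty list when every entry is in place. Passing a different `expected` list lets the same check be reused for another dataset layout.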
The PartImageNet OOD dataset can be downloaded from here. After downloading it, use the pre-processing script (prepare_partimagenet_ood.py) together with the train-test split file (data_sets/train_test_split_pimagenet_ood.txt) to generate the annotation files required for training and evaluation. Run the pre-processing script as follows:
python prepare_partimagenet_ood.py --anno_path <path to train.json file> --output_dir <path to save the train and test json file> --train_test_split_file data_sets/train_test_split_pimagenet_ood.txt
The Oxford Flowers dataset is downloaded automatically by the training script with the required folder structure, except for the segmentation masks. If you want to evaluate foreground segmentation on this dataset, please download the segmentation masks from here. The final folder structure should look like this:
(root folder)
└── flowers-102 (folder containing the dataset created automatically by the training script)
    ├── segmim (folder containing the segmentation masks)
    ├── jpg
    ├── imagelabels.mat
    └── setid.mat
The dataset can be downloaded from here. No additional pre-processing is required.
The NABirds dataset can be downloaded from here. The experiments on this dataset do not appear in the paper because they were conducted after the paper was submitted. The folder structure should look like this (essentially the same as CUB, except that there is no attributes folder):
nabirds
├── bounding_boxes.txt
├── classes.txt
├── images
├── image_class_labels.txt
├── images.txt
├── parts
├── hierarchy.txt
├── README
└── train_test_split.txt
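As a quick sanity check on the download, the split file can be summarised with a few lines of Python. This sketch assumes that each non-empty line of train_test_split.txt has the form `<image_id> <is_training_image>`, with 1 marking a training image and 0 a test image, as in CUB; that format is an assumption here, and `count_split` is a hypothetical helper, not part of this repository.

```python
from collections import Counter
from pathlib import Path


def count_split(split_file):
    """Count training and test images listed in a train_test_split.txt file.

    Assumes each non-empty line is '<image_id> <is_training_image>',
    where the flag is 1 for training and 0 for test.
    """
    counts = Counter()
    for line in Path(split_file).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        _image_id, flag = line.split()
        counts["train" if flag == "1" else "test"] += 1
    return counts["train"], counts["test"]
```

Calling `count_split("nabirds/train_test_split.txt")` returns a `(train, test)` tuple that can be compared with the numbers given in the dataset's README.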
The details of running the training script can be found in the training instructions file.
The details of running the evaluation metrics for both classification and part discovery can be found in the evaluation instructions file.
The trained models can be found in the model zoo file.
Feel free to raise an issue if you face any problems with the code or have any questions about the paper.
If you find our work useful in your research, please consider citing:
@inproceedings{aniraj2024pdiscoformer,
title = {{PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers}},
author = {Aniraj, Ananthu and Dantas, Cassio F. and Ienco, Dino and Marcos, Diego},
booktitle = {{ECCV 2024 - 18th European Conference on Computer Vision}},
year = {2024},
publisher = {{Springer Nature Switzerland}},
series = {Lecture Notes in Computer Science},
volume = {15143},
pages = {256-272},
doi = {10.1007/978-3-031-73013-9\_15},
}