ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
[paper]
Abstract: Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC.
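As a rough conceptual sketch of the first point only (this is not the ITACLIP implementation; the exact fusion rule, layer choice, and attention variant are described in the paper), combining attention maps from intermediate ViT layers with the final layer's attention could look like the following, where the layer selection and the mixing weight `alpha` are placeholders:

```python
import torch

def fuse_attention_maps(last_attn, mid_attns, alpha=0.5):
    """Illustrative only: blend the final ViT block's attention map with
    attention maps gathered from selected intermediate blocks.

    last_attn : (B, heads, N, N) attention of the last block
    mid_attns : list of (B, heads, N, N) attentions from middle blocks
    alpha     : placeholder mixing weight, not a value from the paper
    """
    mid_mean = torch.stack(mid_attns, dim=0).mean(dim=0)  # average over middle layers
    return alpha * last_attn + (1.0 - alpha) * mid_mean   # weighted combination (illustrative)
```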
2024/11/18: Our paper and code are publicly available.
Our code is built on top of MMSegmentation. Please follow the instructions to install MMSegmentation. We used Python=3.9.17, torch=2.0.1, mmcv=2.1.0, and mmseg=1.2.2 in our experiments.
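One possible way to create such an environment is sketched below using conda and OpenMMLab's mim; the exact PyTorch wheel and CUDA pairing depends on your machine, so adapt these commands rather than treating them as the official setup script.

```bash
conda create -n itaclip python=3.9.17 -y
conda activate itaclip
pip install torch==2.0.1 torchvision        # choose the wheel matching your CUDA version
pip install -U openmim
mim install mmcv==2.1.0
pip install mmsegmentation==1.2.2
```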
We support four segmentation benchmarks: COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. For the dataset preparation, please follow the MMSeg Dataset Preparation document. The COCO-Object dataset can be derived from COCO-Stuff by running the following command:
python datasets/cvt_coco_object.py PATH_TO_COCO_STUFF164K -o PATH_TO_COCO164K
Additional datasets can be seamlessly integrated following the same dataset preparation document. Please modify the dataset (data_root) and class name (name_path) paths in the config files.
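For illustration, the relevant entries in a config file might look like the snippet below; the field names data_root and name_path follow the instructions above, while both paths are placeholders for your local setup.

```python
# Excerpt of a dataset config (illustrative placeholder paths)
data_root = 'PATH_TO_COCO_STUFF164K'        # dataset root prepared via MMSeg
name_path = './configs/cls_coco_stuff.txt'  # hypothetical file listing the class names
```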
For reproducibility, we provide the LLM-generated auxiliary texts. Please update the auxiliary path (auxiliary_text_path) in the config files. We also provide the definition and synonym generation scripts (llama3_definition_generation.py and llama3_synonym_generation.py). For the supported datasets, running these files is unnecessary, as we have already included the LLaMA-generated texts.
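For readers who want to generate auxiliary texts for a new dataset, a minimal sketch of definition generation with a LLaMA-3 model via Hugging Face transformers is shown below. This is not the provided llama3_definition_generation.py; the model ID, prompt, and decoding settings are assumptions.

```python
from transformers import pipeline

# Hypothetical setup: any instruction-tuned LLaMA-3 checkpoint you have access to.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

class_names = ["person", "bicycle", "traffic light"]  # placeholder class list
auxiliary_texts = {}
for name in class_names:
    prompt = f"Give a one-sentence visual definition of '{name}':"
    out = generator(prompt, max_new_tokens=40, do_sample=False)
    # The pipeline returns the prompt plus the completion; keep only the completion.
    auxiliary_texts[name] = out[0]["generated_text"][len(prompt):].strip()

print(auxiliary_texts)
```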
To evaluate ITACLIP on a dataset, run the following command, updating dataset_name accordingly:
python eval.py --config ./configs/cfg_{dataset_name}.py
To evaluate ITACLIP on a single image, run the demo.ipynb Jupyter Notebook.
With the default configurations, you should achieve the following results (mIoU).
| Dataset | mIoU |
|---|---|
| COCO-Stuff | 27.0 |
| COCO-Object | 37.7 |
| PASCAL VOC | 67.9 |
| PASCAL Context | 37.5 |
If you find our project helpful, please consider citing our work.
@article{aydin2024itaclip,
title={ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements},
author={Ayd{\i}n, M Arda and {\c{C}}{\i}rpar, Efe Mert and Abdinli, Elvin and Unal, Gozde and Sahin, Yusuf H},
journal={arXiv preprint arXiv:2411.12044},
year={2024}
}
This implementation builds upon CLIP, SCLIP, and MMSegmentation. We gratefully acknowledge their valuable contributions.