Official repository for the paper "Salient Mask-Guided Vision Transformer for Fine-Grained Classification",
accepted as a Full Paper to VISAPP '23 (part of VISIGRAPP '23).
Salient Mask-Guided Vision Transformer for Fine-Grained Classification
Dmitry Demidov, Muhammad Hamza Sharif, Aliakbar Abdurahimov, Hisham Cholakkal, Fahad Shahbaz Khan
Figure: Main architecture (left) and attention guiding, see Eq. 3 (right); blue, green, and red bars denote attention to salient patches.
In this work, we introduce a simple yet effective approach to improve the performance of the standard Vision Transformer architecture at FGVC. Our method, named Salient Mask-Guided Vision Transformer (SM-ViT), utilises a salient object detection module comprising an off-the-shelf saliency detector to produce a salient mask that focuses on the potentially discriminative foreground object regions in an image. This salient mask is then utilised within our ViT-like Salient Mask-Guided Encoder (SMGE) to boost the discriminability of the standard self-attention mechanism, thereby focusing on more distinguishable tokens.
Abstract: Fine-grained visual classification (FGVC) is a challenging computer vision problem, where the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative inter-class variances among visually similar classes. Recently, methods based on the Vision Transformer (ViT) have demonstrated notable achievements in FGVC, generally by employing the self-attention mechanism with additional resource-consuming techniques to distinguish potentially discriminative regions while disregarding the rest. However, such approaches may struggle to effectively focus on truly discriminative regions due to relying only on the inherent self-attention mechanism, resulting in the classification token likely aggregating global information from less-important background patches. Moreover, due to the scarcity of datapoints, classifiers may fail to find the most helpful inter-class distinguishing features, since other unrelated but distinctive background regions may be falsely recognised as being valuable. To this end, we introduce a simple yet effective Salient Mask-Guided Vision Transformer (SM-ViT), where the discriminability of the standard ViT's attention maps is boosted through salient masking of potentially discriminative foreground regions. Extensive experiments demonstrate that with the standard training procedure our SM-ViT achieves state-of-the-art performance on popular FGVC benchmarks among existing ViT-based approaches while requiring fewer resources and lower input image resolution.
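To make the attention-guiding idea concrete, below is a minimal, hypothetical PyTorch sketch. It is not the repository's implementation and not the exact formulation of Eq. 3: pre-softmax attention scores towards patch tokens covered by the salient mask are simply increased by a constant bias, so the classification token aggregates more information from foreground patches. The function name `mask_guided_attention` and the `boost` parameter are illustrative assumptions.

```python
import torch


def mask_guided_attention(q, k, v, salient_mask, boost=1.0):
    """Scaled dot-product attention with a constant bias added to the
    pre-softmax scores of salient patch tokens (illustrative sketch only).

    q, k, v:      (B, heads, 1 + N, head_dim), with the CLS token at index 0
    salient_mask: (B, N) binary mask, 1 for patches overlapping the salient object
    """
    scale = q.size(-1) ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale               # (B, H, 1+N, 1+N)

    # Keep the CLS-token column untouched, bias the columns of salient patches;
    # broadcast over heads and over all query positions.
    m = salient_mask.float()
    bias = torch.cat([torch.zeros_like(m[:, :1]), m], dim=1)  # (B, 1+N)
    scores = scores + boost * bias[:, None, None, :]

    attn = scores.softmax(dim=-1)
    return attn @ v                                           # (B, H, 1+N, head_dim)


# Example shapes for ViT-B/16 at 400x400: 12 heads, 625 patch tokens, head dim 64.
out = mask_guided_attention(torch.randn(2, 12, 626, 64),
                            torch.randn(2, 12, 626, 64),
                            torch.randn(2, 12, 626, 64),
                            torch.rand(2, 625) > 0.5)
```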
- We introduce a simple yet effective approach to improve the performance of the standard Vision Transformer architecture at FGVC.
- To the best of our knowledge, we are the first to explore the effective utilisation of saliency masks in order to extract more distinguishable information within the ViT encoder layers by boosting the discriminability of self-attention features for the FGVC task.
- Our extensive experiments on three popular FGVC datasets (Stanford Dogs, CUB, and NABirds) demonstrate that with the standard training procedure the proposed SM-ViT achieves state-of-the-art performance.
- An important advantage of our solution is its integrability: it can be fine-tuned on top of a ViT-based backbone or integrated into any Transformer-like architecture that leverages the standard self-attention mechanism (see the mask-preparation sketch after this list).
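As a hedged illustration of this integrability, the snippet below shows one possible way (an assumption on our part, not the repository's API; the name `saliency_to_patch_mask` and the default patch size are illustrative) to turn a pixel-level saliency map from an off-the-shelf detector into the patch-token-level binary mask that a ViT-B/16 encoder block would consume.

```python
import torch
import torch.nn.functional as F


def saliency_to_patch_mask(saliency, patch_size=16, threshold=0.5):
    """Convert a pixel-level saliency map into a binary patch-level mask.

    saliency: (B, 1, H, W) tensor in [0, 1] from an off-the-shelf saliency detector
    returns:  (B, num_patches) mask, 1 where a patch is mostly salient
    """
    # Average the saliency within each non-overlapping ViT patch, then threshold.
    patch_saliency = F.avg_pool2d(saliency, kernel_size=patch_size)  # (B, 1, H/ps, W/ps)
    return (patch_saliency.flatten(1) > threshold).float()           # (B, num_patches)


# Example: a 400x400 input with 16x16 patches gives a 25x25 = 625-token mask.
mask = saliency_to_patch_mask(torch.rand(2, 1, 400, 400))
print(mask.shape)  # torch.Size([2, 625])
```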
All models in our experiments are first initialised with the publicly available pre-trained ViT-B/16 weights and then fine-tuned on the corresponding datasets. The tables below report top-1 accuracy (%); a minimal, illustrative initialisation sketch is given after the tables.
| Model | Baseline | Input Size | St. Dogs | Weights | CUB-200 | Weights | NABirds | Weights |
|---|---|---|---|---|---|---|---|---|
| Vanilla ViT | ViT-B/16 | 448x448 | 91.4 | - | 90.6 | - | 89.6 | - |
| SM-ViT (ours) | ViT-B/16 | 400x400 | 92.3 | link | 91.6 | link | 90.2 | link |
| SM-ViT (ours) | ViT-B/16 | 448x448 | - | - | - | - | 90.5 | link |
| SM-ViT (ours) | ViT-B/16 | 560x560 | - | - | - | - | 90.7 | link |
| Model | Input Size | St. Dogs | Weights | CUB-200 | Weights | NABirds | Weights |
|---|---|---|---|---|---|---|---|
| SM-ViT + Advanced guiding | 400x400 | - | - | 91.7 | link | 90.7 | link |
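The actual training and evaluation scripts are documented in RUN.md below. Purely as a hedged sketch of the initialisation step described above (the timm checkpoint name, learning rate, and image size here are assumptions, not the repository's exact settings), loading publicly available pre-trained ViT-B/16 weights and resetting the classifier head could look like this:

```python
import timm
import torch

# Illustrative only: start from publicly available ImageNet-21k pre-trained
# ViT-B/16 weights and replace the head for a fine-grained dataset
# (e.g. CUB-200-2011 with 200 classes) before fine-tuning.
model = timm.create_model(
    "vit_base_patch16_224.augreg_in21k",  # assumed checkpoint name
    pretrained=True,
    num_classes=200,
    img_size=400,  # the SM-ViT results above use 400x400 inputs
)
optimizer = torch.optim.SGD(model.parameters(), lr=3e-2, momentum=0.9)
```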
For environment installation and pre-trained models preparation, please follow the instructions in INSTALL.md.
For datasets preparation, please follow the instructions in DATASET.md.
For training and evaluation, please follow the instructions in RUN.md.
- (Dec 20, 2022)
  - Repo description added (README.md).
- (Dec 30, 2022)
  - Pretrained models are released.
  - Code instructions added (INSTALL.md, DATASET.md, RUN.md).
- (Jan 09, 2023)
  - Training and evaluation code is released.
- (Soon)
  - Optimisation
If you use or refer to our approach (source code, trained models, or results) in your research, please consider citing:
@conference{demidov2022smvit,
author={Dmitry Demidov and Muhammad Hamza Sharif and Aliakbar Abdurahimov and Hisham Cholakkal and Fahad Shahbaz Khan},
title={Salient Mask-Guided Vision Transformer for Fine-Grained Classification},
booktitle={Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP},
year={2023},
pages={27-38},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011611100003417},
isbn={978-989-758-634-7},
issn={2184-4321},
}
If you have a question or suggestion, please create an issue or contact us at dmitry.demidov@mbzuai.ac.ae.
Our code is partially based on the ViT-pytorch, U2N, and FFVT repositories, and we thank the corresponding authors for releasing their code. If you use our derived code, please consider giving credit to these works as well.