Yuehao Song1 , Xinggang Wang1 📧 , Jingfeng Yao1 , Wenyu Liu1 , Jinglin Zhang2 , Xiangmin Xu3
1 Huazhong University of Science and Technology, 2 Shandong University, 3 South China University of Technology
(📧) corresponding author.
arXiv Preprint (arXiv:2403.12778)
- Mar. 25th, 2024: We release an initial version of ViTGaze.
- Mar. 19th, 2024: We released our paper on arXiv. Code/Models are coming soon. Please stay tuned! ☕️
Inspired by the remarkable success of pre-trained plain Vision Transformers (ViTs), we introduce ViTGaze, a novel single-modality gaze following framework. In contrast to previous methods, it builds a brand new gaze following framework mainly upon a powerful encoder (the decoder accounts for less than 1% of the parameters). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement on AUC, 5.1% improvement on AP) and performance comparable to multi-modality methods with 59% fewer parameters.
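As a minimal, hedged illustration of this insight (not code from the ViTGaze repository), the sketch below extracts per-head self-attention maps from a ViT block and averages the attention that tokens inside a person's head region pay to every scene token, yielding a spatial interaction map per head. The token indices, tensor shapes, and 14x14 patch grid are illustrative assumptions only.

```python
# Sketch only: turning ViT self-attention into human-scene interaction maps.
# Shapes, indices, and the 14x14 grid are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

B, N, H, D = 1, 196, 6, 64            # batch, tokens (14x14 patches), heads, head dim
q = torch.randn(B, H, N, D)            # queries from one ViT block
k = torch.randn(B, H, N, D)            # keys from the same block

# Full token-to-token attention: (B, H, N, N)
attn = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)

# Suppose these patch indices fall inside the person's head bounding box
head_token_idx = torch.tensor([30, 31, 44, 45])

# Average the attention the head tokens pay to every scene token,
# then reshape into a 14x14 spatial "interaction" map per attention head.
interaction = attn[:, :, head_token_idx, :].mean(dim=2)   # (B, H, N)
interaction_map = interaction.reshape(B, H, 14, 14)
print(interaction_map.shape)  # torch.Size([1, 6, 14, 14])
```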
Results from the ViTGaze paper
Results on GazeFollow:

| AUC | Avg. Dist. | Min. Dist. |
|---|---|---|
| 0.949 | 0.105 | 0.047 |

Results on VideoAttentionTarget:

| AUC | Dist. | AP |
|---|---|---|
| 0.938 | 0.102 | 0.905 |
Corresponding checkpoints are released:
- GazeFollow: GoogleDrive
- VideoAttentionTarget: GoogleDrive
ViTGaze is based on detectron2. We use the efficient multi-head attention implemented in the xFormers library.
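For reference, here is a minimal sketch of calling xFormers' memory-efficient attention; the tensor shapes and dtype are illustrative assumptions and the snippet is not an excerpt from the ViTGaze code base.

```python
# Sketch of xformers.ops.memory_efficient_attention (assumed usage, not
# ViTGaze code). Requires a CUDA device; shapes are illustrative.
import torch
from xformers.ops import memory_efficient_attention

B, N, H, D = 2, 1024, 6, 64  # batch, tokens, heads, head dim
q = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)

# Equivalent to scaled-dot-product attention, computed without
# materializing the full N x N attention matrix.
out = memory_efficient_attention(q, k, v)  # (B, N, H, D)
print(out.shape)
```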
If you find ViTGaze useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
@article{vitgaze,
title={ViTGaze: Gaze Following with Interaction Features in Vision Transformers},
author={Yuehao Song and Xinggang Wang and Jingfeng Yao and Wenyu Liu and Jinglin Zhang and Xiangmin Xu},
journal={arXiv preprint arXiv:2403.12778},
year={2024}
}