English | 简体中文
A curated list and summary of Vision Transformer papers.
You can open the mind-map source file with any mind-mapping software; if you only want to browse it as an image, you can download the high-resolution mind-map picture instead.
Only representative algorithms are listed in each category.
Chinese blog posts:
- Must-Read Vision Transformer Series: Image Classification Survey (Part 1): Overview
- Must-Read Vision Transformer Series: Image Classification Survey (Part 2): Attention-based
- Must-Read Vision Transformer Series: Image Classification Survey (Part 3): MLP, ConvMixer, and Architecture Analysis
- [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
- [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
Image to Token approaches include:
- Non-overlapping Patch Embedding
- Overlapping Patch Embedding (both styles are sketched after this list)
- [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
- [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
- [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
- [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
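The two Image-to-Token styles above differ only in whether neighbouring patches share pixels. A minimal PyTorch sketch (toy sizes and illustrative names, not any paper's reference code): kernel == stride gives the non-overlapping ViT-style patchify, while a kernel larger than the stride gives an overlapping stem in the spirit of PVTv2/ResT.

```python
# Minimal sketch: turn an image into a token sequence with one strided convolution.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4, overlap=False):
        super().__init__()
        if overlap:
            # Overlapping: kernel larger than stride, neighbouring patches share pixels.
            self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=2 * patch_size - 1,
                                  stride=patch_size, padding=patch_size - 1)
        else:
            # Non-overlapping: kernel == stride, the original ViT patchify.
            self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                                  stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

tokens = PatchEmbed(overlap=True)(torch.randn(1, 3, 224, 224))
print(tokens.shape)                            # torch.Size([1, 3136, 96])
```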
Token to Token approaches include:
- Fixed-sampling-window tokenization
- Dynamic-sampling tokenization
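For the fixed-sampling-window style, a minimal sketch of a T2T-like "soft split" (illustrative function name and sizes, not the official T2T-ViT code): existing tokens are re-grouped with an overlapping unfold window and fused into a shorter sequence. A dynamic-sampling method such as PS-ViT would instead predict where to sample.

```python
# Minimal sketch: re-tokenize an existing token sequence with a fixed sampling window.
import torch
import torch.nn as nn

def soft_split(tokens, h, w, kernel=3, stride=2, padding=1):
    # tokens: (B, N, C) with N == h * w
    b, n, c = tokens.shape
    x = tokens.transpose(1, 2).reshape(b, c, h, w)            # back to a 2-D grid
    x = nn.Unfold(kernel, stride=stride, padding=padding)(x)  # (B, C*k*k, N_new)
    return x.transpose(1, 2)                                  # (B, N_new, C*k*k)

x = torch.randn(2, 56 * 56, 64)
y = soft_split(x, 56, 56)        # each new token aggregates a 3x3 window of old tokens
print(y.shape)                   # torch.Size([2, 784, 576])
```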
Explicit positional encoding includes:
- Absolute positional encoding
- Relative positional encoding (both schemes are sketched after this list)
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
- [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
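A minimal sketch of the two explicit schemes (toy sizes; the relative part is written for a 1-D sequence to keep the indexing short, whereas Swin-style models use a 2-D variant): an absolute embedding is added to the tokens once at the input, while a relative bias is added to every attention logit.

```python
# Minimal sketch: absolute position embedding vs. relative position bias.
import torch
import torch.nn as nn

N, D, H = 49, 96, 3                      # tokens, channels, heads (toy sizes)

# Absolute: one learnable embedding per position, added once at the input.
abs_pos = nn.Parameter(torch.zeros(1, N, D))
tokens = torch.randn(2, N, D) + abs_pos

# Relative: one learnable bias per (head, relative offset), added to every logit.
rel_table = nn.Parameter(torch.zeros(H, 2 * N - 1))                 # offsets -(N-1)..N-1
idx = torch.arange(N)[:, None] - torch.arange(N)[None, :] + N - 1   # (N, N) offset index
attn_logits = torch.randn(2, H, N, N)
attn_logits = attn_logits + rel_table[:, idx]                       # broadcast over batch
print(attn_logits.shape)                                            # torch.Size([2, 3, 49, 49])
```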
Implicit positional encoding includes:
- [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
- [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
- [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
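A minimal sketch of a CPVT-style positional encoding generator (illustrative class name, not the official code): position information is injected implicitly by a zero-padded depthwise convolution over the token grid, so the encoding adapts to the input resolution instead of being a fixed table.

```python
# Minimal sketch: conditional positional encoding via a depthwise conv (PEG-style).
import torch
import torch.nn as nn

class PEG(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        # Depthwise conv; the zero padding is what leaks absolute position information.
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, h, w):                 # tokens: (B, N, C), N == h * w
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.proj(x) + x                         # conv output acts as the encoding
        return x.flatten(2).transpose(1, 2)

out = PEG(96)(torch.randn(2, 56 * 56, 96), 56, 56)
print(out.shape)                                     # torch.Size([2, 3136, 96])
```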
Global-attention-only methods include:
- Standard multi-head attention module
- [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
- Reducing the computational cost of global attention (see the sketch after this list)
- [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
- [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
- [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
- [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
- [MViT] Multiscale Vision Transformers (2021.4) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
- Generalized linear attention
- [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]
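As one concrete example of reducing the cost of global attention, a minimal PVT-style spatial-reduction attention sketch (toy sizes, illustrative names, not the official code): keys and values are computed from a downsampled token grid, so the attention matrix shrinks from N x N to N x (N / r^2).

```python
# Minimal sketch: spatial-reduction attention (keys/values from a downsampled grid).
import torch
import torch.nn as nn

class SRAttention(nn.Module):
    def __init__(self, dim=96, heads=3, sr_ratio=4):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)  # downsample
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):                        # x: (B, N, C), N == h * w
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.heads, c // self.heads).transpose(1, 2)
        # Reduce the token grid before producing keys and values.
        x_ = self.sr(x.transpose(1, 2).reshape(b, c, h, w)).flatten(2).transpose(1, 2)
        kv = self.kv(x_).reshape(b, -1, 2, self.heads, c // self.heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

y = SRAttention()(torch.randn(2, 56 * 56, 96), 56, 56)
print(y.shape)                                         # torch.Size([2, 3136, 96])
```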
Methods introducing additional local attention include:
- Local window computation patterns (see the sketch after this list)
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
- [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
- [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
- [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
- [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
- [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
- [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
- Introducing convolutional local inductive bias
- Sparse attention
- [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]
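For the local-window pattern, a minimal sketch of Swin-style window partitioning (illustrative helper names): attention runs independently inside non-overlapping windows, so its cost grows linearly with the number of windows; shifted windows, shuffling, or messenger tokens are then different ways of exchanging information across windows.

```python
# Minimal sketch: partition a feature map into non-overlapping windows and back.
import torch

def window_partition(x, ws):
    # x: (B, H, W, C) feature map -> (B * num_windows, ws * ws, C) token groups
    b, h, w, c = x.shape
    x = x.reshape(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

def window_reverse(windows, ws, h, w):
    # Inverse of window_partition: stitch the windows back into a feature map.
    b = windows.shape[0] // ((h // ws) * (w // ws))
    x = windows.reshape(b, h // ws, w // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, -1)

feat = torch.randn(2, 56, 56, 96)
wins = window_partition(feat, 7)       # attention would run per 7x7 window here
print(wins.shape)                      # torch.Size([128, 49, 96])
assert torch.equal(window_reverse(wins, 7, 56, 56), feat)
```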
Methods that improve performance through the local feature extraction ability of convolutions include:
- [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
- [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
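A minimal sketch of one common way to do this, in the spirit of CeiT/LocalViT (illustrative names, not the official code): a depthwise convolution is inserted between the two linear layers of the feed-forward network so that neighbouring tokens exchange information locally.

```python
# Minimal sketch: feed-forward network with a depthwise conv for local mixing.
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    def __init__(self, dim=96, hidden=384):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # local mixing
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):                     # x: (B, N, C), N == h * w
        b, n, _ = x.shape
        x = self.act(self.fc1(x))                   # (B, N, hidden)
        x = x.transpose(1, 2).reshape(b, -1, h, w)  # to a 2-D grid for the conv
        x = self.act(self.dw(x)).flatten(2).transpose(1, 2)
        return self.fc2(x)

y = ConvFFN()(torch.randn(2, 56 * 56, 96), 56, 56)
print(y.shape)                                      # torch.Size([2, 3136, 96])
```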
- Pre Normalization
- Post Normalization
- [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
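A minimal sketch of the two placements (toy sizes; note that Swin V2 actually uses a res-post-norm variant that normalizes the branch output before the residual add, which differs slightly from the classic post-norm shown here):

```python
# Minimal sketch: pre-norm (x + f(norm(x))) vs. post-norm (norm(x + f(x))) blocks.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=96, pre_norm=True):
        super().__init__()
        self.pre_norm = pre_norm
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=3, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        if self.pre_norm:      # normalize before each sub-layer
            y = self.norm1(x)
            x = x + self.attn(y, y, y, need_weights=False)[0]
            x = x + self.mlp(self.norm2(x))
        else:                  # normalize after the residual addition
            x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm2(x + self.mlp(x))
        return x

print(Block(pre_norm=False)(torch.randn(2, 49, 96)).shape)   # torch.Size([2, 49, 96])
```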
- Class Tokens
- Average Pooling
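A minimal sketch of the two readout choices (toy sizes, illustrative names): either a learnable class token is prepended and its final state is classified, or the patch tokens are simply averaged at the end.

```python
# Minimal sketch: class-token readout vs. average-pooling readout.
import torch
import torch.nn as nn

B, N, D, num_classes = 2, 196, 96, 1000
patch_tokens = torch.randn(B, N, D)                 # encoder output (toy values)
head = nn.Linear(D, num_classes)

# (a) Class token: prepended to the sequence, processed by the blocks, then read out.
cls_token = nn.Parameter(torch.zeros(1, 1, D)).expand(B, -1, -1)
seq = torch.cat([cls_token, patch_tokens], dim=1)   # (B, 1 + N, D) goes through the blocks
logits_cls = head(seq[:, 0])                        # read out position 0

# (b) Average pooling: no extra token, just pool the patch tokens.
logits_avg = head(patch_tokens.mean(dim=1))
print(logits_cls.shape, logits_avg.shape)           # torch.Size([2, 1000]) for both
```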
(1) How to output multi-scale feature maps
- Patch Merging (see the sketch after this list)
- [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
- [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
- [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
- [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
- Pooling Attention
- [MViT] Multiscale Vision Transformers (2021.4) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
- Dilated Convolution
- [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
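A minimal sketch of Swin-style Patch Merging, the first of the three options above (illustrative names; pooling attention and dilated convolutions reach multi-scale maps by other means): every 2x2 neighbourhood of tokens is concatenated and linearly reduced, halving the resolution and doubling the channels.

```python
# Minimal sketch: merge 2x2 token neighbourhoods to build a feature pyramid stage.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                           # x: (B, H, W, C)
        x0, x1 = x[:, 0::2, 0::2], x[:, 1::2, 0::2]
        x2, x3 = x[:, 0::2, 1::2], x[:, 1::2, 1::2]
        x = torch.cat([x0, x1, x2, x3], dim=-1)     # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))         # (B, H/2, W/2, 2C)

print(PatchMerging(96)(torch.randn(2, 56, 56, 96)).shape)  # torch.Size([2, 28, 28, 192])
```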
(2) How to train deeper Transformers
- [CaiT] Going deeper with Image Transformers (2021.3) [Paper]
- [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]
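One concrete trick from this line of work is CaiT's LayerScale; a minimal sketch (illustrative names, not the official code): each residual branch is multiplied by a learnable per-channel vector initialised near zero, which keeps very deep stacks of blocks trainable.

```python
# Minimal sketch: LayerScale-style scaling of a residual branch.
import torch
import torch.nn as nn

class LayerScaleBranch(nn.Module):
    def __init__(self, dim, branch, init_value=1e-5):
        super().__init__()
        self.branch = branch                                # attention or MLP sub-layer
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return x + self.gamma * self.branch(x)              # scaled residual update

mlp = nn.Sequential(nn.LayerNorm(96), nn.Linear(96, 384), nn.GELU(), nn.Linear(384, 96))
block = LayerScaleBranch(96, mlp)
print(block(torch.randn(2, 49, 96)).shape)                  # torch.Size([2, 49, 96])
```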
- [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]
- [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]
- [gMLP] Pay Attention to MLPs (2021.5) [Paper]
- [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]
- [ConvMixer] Patches Are All You Need [Paper]
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
- [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
- [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]
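The MLP models above (MLP-Mixer, ResMLP, gMLP, CycleMLP) replace self-attention with MLPs that mix information across tokens and across channels. A minimal Mixer-style block as a sketch (toy sizes, illustrative names, not the official code):

```python
# Minimal sketch: token-mixing MLP (over the transposed sequence) + channel-mixing MLP.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_tokens=196, dim=96, token_hidden=128, channel_hidden=384):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(num_tokens, token_hidden), nn.GELU(),
                                       nn.Linear(token_hidden, num_tokens))
        self.channel_mlp = nn.Sequential(nn.Linear(dim, channel_hidden), nn.GELU(),
                                         nn.Linear(channel_hidden, dim))

    def forward(self, x):                                    # x: (B, N, C)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

print(MixerBlock()(torch.randn(2, 196, 96)).shape)           # torch.Size([2, 196, 96])
```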
Paper list (in chronological order):
- [Transformer] Attention is All You Need (NIPS 2017-2017.06) [Paper]
- [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
- [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
- [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]
- [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]
- [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
- [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
- [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
- [CaiT] Going deeper with Image Transformers (2021.3) [Paper]
- [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
- [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
- [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
- [MViT] Multiscale Vision Transformers (2021.4) [Paper]
- [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
- [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
- [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]
- [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]
- [gMLP] Pay Attention to MLPs (2021.5) [Paper]
- [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
- [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
- [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
- [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]
- [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
- [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
- [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
- [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]
- [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
- [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
- [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
- [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
- [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
- [ConvMixer] Patches Are All You Need [Paper]
- [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]
This list will be continuously updated. PRs are welcome!