Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are inherently causal and therefore ill-suited to non-causal visual data, Vision Mamba (ViM) methods still demonstrate strong performance on tasks such as image classification and object detection. Recent studies have revealed a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method that, for the first time, introduces key design principles of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet achieves competitive results on image classification and high-resolution dense prediction tasks.
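The sketch below illustrates the kind of architecture described above: linear-complexity separable self-attention blocks stacked with plain strided-convolution down-sampling layers. It is a minimal, hypothetical prototype for illustration only; the module names, hyper-parameters, and the omission of the Mamba-inspired components are assumptions, not the released VMINet implementation.

```python
# Minimal sketch (not the released code): separable self-attention stacked with
# basic strided-conv down-sampling, as in the prototype described above.
import torch
import torch.nn as nn


class SeparableSelfAttention(nn.Module):
    """Linear-complexity attention: token-wise scores form a single context
    vector, which is then broadcast back to every token."""
    def __init__(self, dim):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)   # one context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                     # x: (B, N, C)
        scores = self.to_scores(x).softmax(dim=1)                       # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)    # (B, 1, C)
        out = torch.relu(self.to_value(x)) * context                    # broadcast to all tokens
        return self.proj(out)


class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = SeparableSelfAttention(dim)

    def forward(self, x):
        return x + self.attn(self.norm(x))    # pre-norm residual block


class Stage(nn.Module):
    """Strided-conv down-sampling followed by a stack of attention blocks."""
    def __init__(self, in_dim, out_dim, depth):
        super().__init__()
        self.down = nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1)
        self.blocks = nn.Sequential(*[Block(out_dim) for _ in range(depth)])

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.down(x)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)      # to (B, N, C) tokens
        x = self.blocks(x)
        return x.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    stage = Stage(in_dim=3, out_dim=64, depth=2)
    print(stage(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
```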
torch>=1.7.0; torchvision>=0.8.0; pyyaml; timm==0.6.13; einops; fvcore; h5py
python3 -m torch.distributed.launch --nproc_per_node=3 train_imagenet.py --data {path-to-imagenet} --model {vminet-variants} -b 256 --lr 1e-3 --weight-decay 0.025 --aa rand-m1-mstd0.5-inc1 --cutmix 0.2 --color-jitter 0. --drop-path 0.
Model | Top-1 Acc. (%) | Ckpt | Logs |
---|---|---|---|
VMINet-Ti | 70.7 | ckpt | log |
VMINet-XS | 78.6 | ckpt | log |
VMINet-S | 80.5 | ckpt | log |
VMINet-B | retraining | - | - |
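For reference, a pretrained variant can be loaded and evaluated roughly as sketched below. This assumes the models are registered with timm; the identifier `vminet_s` and the checkpoint filename `vminet_s.ckpt` are placeholders, so consult the released code for the actual names.

```python
# Hedged usage sketch: load a pretrained VMINet variant for inference.
# 'vminet_s' and 'vminet_s.ckpt' are assumed names, not confirmed by this repository.
import torch
import timm

model = timm.create_model("vminet_s", num_classes=1000)
state = torch.load("vminet_s.ckpt", map_location="cpu")
# Checkpoints often wrap weights under a 'state_dict' key; fall back to the raw dict otherwise.
model.load_state_dict(state.get("state_dict", state))
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print("predicted class:", logits.argmax(dim=1).item())
```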
This project was developed with reference to the source code of StarNet; we thank the authors for that excellent work.
The majority of VMINet is licensed under the Apache License 2.0.