MAE-Lite (IJCV 2025)
News | Introduction | Getting Started | Main Results | Citation | Acknowledge
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training
Jin Gao, Shubo Lin, Shaoru Wang*, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
IJCV 2025
A Closer Look at Self-Supervised Lightweight Vision Transformers
Shaoru Wang, Jin Gao*, Zeming Li, Xiaoqin Zhang, Weiming Hu
ICML 2023
- 2024.12: Our extended version is accepted by IJCV 2025!
- 2023.5: Code & models are released!
- 2023.4: Our paper is accepted by ICML 2023!
- 2022.5: The initial version of our paper was published on arXiv.
MAE-Lite focuses on exploring the pre-training of lightweight Vision Transformers (ViTs). This repo provides the code and models for the studies in the above papers.
- We provide advanced pre-training (based on MAE) and fine-tuning recipes for lightweight ViTs and demonstrate that even a vanilla lightweight ViT (e.g., ViT-Tiny) beats most previous SOTA ConvNets and ViT derivatives with delicate architecture designs. We achieve 79.0% top-1 accuracy on ImageNet with a vanilla ViT-Tiny (5.7M parameters).
- We provide code for the transfer evaluation of pre-trained models on several classification tasks (e.g., Oxford 102 Flower, Oxford-IIIT Pet, FGVC Aircraft, CIFAR) and on COCO detection tasks (based on ViTDet). We find that self-supervised pre-trained ViTs perform worse than supervised pre-trained ones on data-insufficient downstream tasks.
- We provide code for the analysis tools used in the paper to examine the layer representations and the attention distance & entropy of ViTs (a brief sketch of these metrics follows this list).
- We provide code and models for our proposed knowledge distillation method for MAE-based pre-trained lightweight ViTs, which shows superiority in the transfer evaluation on data-insufficient classification tasks and dense prediction tasks.
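As a rough illustration of the two attention statistics analyzed in the paper (a sketch under our own conventions, not the released analysis code): the mean attention distance is the attention-weighted average spatial distance between a query patch and all key patches, and the attention entropy measures how concentrated each query's attention distribution is.

```python
# A sketch of mean attention distance and attention entropy for one layer.
# Assumes `attn` holds attention probabilities of shape (B, heads, N, N) over
# N = grid_size**2 patch tokens, with the class token already removed.
import torch

def attention_stats(attn: torch.Tensor, grid_size: int, patch_size: int = 16):
    ys = torch.arange(grid_size).repeat_interleave(grid_size)
    xs = torch.arange(grid_size).repeat(grid_size)
    coords = torch.stack([ys, xs], dim=-1).float() * patch_size    # (N, 2) pixel coordinates
    dist = torch.cdist(coords, coords)                             # (N, N) pairwise distances
    mean_dist = (attn * dist).sum(-1).mean(dim=(0, 2))             # per-head mean attention distance
    entropy = -(attn * attn.clamp_min(1e-12).log()).sum(-1).mean(dim=(0, 2))  # per-head attention entropy
    return mean_dist, entropy
```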
Update (2025.02.28)
- We provide benchmarks for more masked image modeling (MIM) pre-training methods (BEiT, BootMAE, MaskFeat) on lightweight ViTs and evaluate their transferability to downstream tasks.
- We provide code and models for our decoupled distillation method during pre-training and its transfer to more dense prediction tasks, including detection, tracking, and semantic segmentation. This enables SOTA performance on ADE20K segmentation (42.8% mIoU) and LaSOT tracking (66.1% AUC) in the lightweight regime; the latter even surpasses all current SOTA lightweight CPU-realtime trackers.
- We extend our distillation method to hierarchical ViTs (Swin and Hiera), which validates its generalizability and effectiveness following our observation-analysis-solution flow.
Set up the conda environment:
```bash
# Create environment
conda create -n mae-lite python=3.7 -y
conda activate mae-lite
# Install PyTorch
conda install pytorch==1.9.0 torchvision==0.10.0 -c pytorch -y
# Clone MAE-Lite
git clone https://github.com/wangsr126/mae-lite.git
cd mae-lite
# Install other requirements
pip3 install -r requirements.txt
python3 setup.py build develop --user
```
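Optionally, verify the installation with a quick sanity check (not part of the repo's tooling):

```python
# Optional sanity check: confirm the expected versions and CUDA visibility.
import torch
import torchvision

print(torch.__version__)          # expect 1.9.0
print(torchvision.__version__)    # expect 0.10.0
print(torch.cuda.is_available())  # should be True on a GPU machine
```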
Prepare the ImageNet data in `<BASE_FOLDER>/data/imagenet/imagenet_train` and `<BASE_FOLDER>/data/imagenet/imagenet_val`.
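The snippet below is a quick way to check the folders; it assumes the standard ImageFolder-style layout (one sub-directory per class), which is the common ImageNet preparation, so adjust if your copy differs.

```python
# Sanity-check the ImageNet folders, assuming the standard ImageFolder layout
# (one sub-directory per class). Replace <BASE_FOLDER> with your actual path.
from torchvision.datasets import ImageFolder

train_set = ImageFolder("<BASE_FOLDER>/data/imagenet/imagenet_train")
val_set = ImageFolder("<BASE_FOLDER>/data/imagenet/imagenet_val")
print(len(train_set.classes), len(train_set))  # expect 1000 classes, ~1.28M images
print(len(val_set.classes), len(val_set))      # expect 1000 classes, 50,000 images
```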
To pre-train ViT-Tiny with our recommended MAE recipe:
```bash
# batch size 4096 on 8 GPUs:
cd projects/mae_lite
ssl_train -b 4096 -d 0-7 -e 400 -f mae_lite_exp.py --amp \
--exp-options exp_name=mae_lite/mae_tiny_400e
```
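For orientation, MAE pre-training masks a large fraction of the input patches and trains the model to reconstruct them from the visible ones. The sketch below illustrates only the random masking step and assumes the original MAE default of a 75% mask ratio; the hyper-parameters actually used here are defined in `mae_lite_exp.py`.

```python
# A minimal sketch of MAE-style random masking (assumed 75% mask ratio, as in
# the original MAE); not the repo's implementation.
import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """x: (B, N, D) patch embeddings. Returns the visible tokens and a binary
    mask (0 = kept, 1 = masked) over the N patch positions."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)              # one random score per token
    ids_shuffle = torch.argsort(noise, dim=1)              # random permutation of tokens
    ids_keep = ids_shuffle[:, :len_keep]                   # indices of visible tokens
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=x.device)
    mask.scatter_(1, ids_keep, 0.0)                        # mark kept tokens with 0
    return x_visible, mask
```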
Please download the pre-trained models, e.g., download MAE-Tiny to `<BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar`.
To fine-tune with the improved recipe:
```bash
# batch size 1024 on 8 GPUs:
cd projects/eval_tools
ssl_train -b 1024 -d 0-7 -e 300 -f finetuning_exp.py --amp \
[--ckpt <checkpoint-path>] --exp-options pretrain_exp_name=mae_lite/mae_tiny_400e
```
`<checkpoint-path>`: if set to `<BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar`, it will be loaded as the initialization; if not set, the checkpoint at `<BASE_FOLDER>/outputs/mae_lite/mae_tiny_400e/last_epoch_ckpt.pth.tar` will be loaded automatically.
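Conceptually, "loaded as initialization" means the pre-trained encoder weights initialize the ViT backbone while the classification head is trained from scratch; `finetuning_exp.py` handles this. The sketch below is only an illustration, and the checkpoint key names (`model`, `module.` prefixes) are assumptions that may not match the actual file format.

```python
# Hypothetical illustration of initializing a ViT-Tiny from an MAE checkpoint;
# key names are assumptions, not the repo's actual checkpoint format.
import timm
import torch

model = timm.create_model("vit_tiny_patch16_224", num_classes=1000)
ckpt = torch.load("<BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar", map_location="cpu")
state = ckpt.get("model", ckpt)                                  # assumed key
state = {k.replace("module.", ""): v for k, v in state.items()}  # strip DDP prefixes if any
msg = model.load_state_dict(state, strict=False)                 # head stays randomly initialized
print(msg.missing_keys, msg.unexpected_keys)
```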
To evaluate the fine-tuned models, first download MAE-Tiny-FT to `<BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar`:
```bash
# batch size 1024 on 1 GPU:
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_eval
```
You should get "Top1: 77.978" if everything is set up correctly.
Download MAE-Tiny-FT-RPE to `<BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar`:
```bash
# batch size 1024 on 1 GPU:
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval
```
You should get "Top1: 79.002" if everything is set up correctly.
Download MAE-Tiny-Distill-D²-FT-RPE to `<BASE_FOLDER>/checkpoints/mae_tiny_distill_d2_400e_ft_rpe_1000e.pth.tar`:
```bash
# batch size 1024 on 1 GPU:
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_distill_d2_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval qv_bias=False
```
You should get "Top1: 79.444" if everything is set up correctly.
- Distillation: please refer to DISTILL.md.
- Transfer evaluation on classification tasks: please refer to TRANSFER.md.
- Detection: please refer to DETECTION.md.
- Tracking: please refer to TRACKING.md.
- Semantic segmentation: please refer to SEGMENTATION.md.
- MoCo-v3 pre-training: please refer to MOCOV3.md.
- Visualization and analysis tools: please refer to VISUAL.md.
| pre-train code | pre-train epochs | fine-tune recipe | fine-tune epochs | accuracy | ckpt |
|---|---|---|---|---|---|
| - | - | impr. | 300 | 75.8 | link |
| mae_lite | 400 | - | - | - | link |
| | | impr. | 300 | 78.0 | link |
| | | impr.+RPE | 1000 | 79.0 | link |
| mae_lite_distill | 400 | - | - | - | link |
| | | impr. | 300 | 78.4 | link |
| mae_lite_d2_distill | 400 | - | - | - | link |
| | | impr. | 300 | 78.7 | link |
| | | impr.+RPE | 1000 | 79.4 | link |
Please cite the following papers if this repo helps your research:
```bibtex
@article{wang2023closer,
  title={A Closer Look at Self-Supervised Lightweight Vision Transformers},
  author={Shaoru Wang and Jin Gao and Zeming Li and Xiaoqin Zhang and Weiming Hu},
  journal={arXiv preprint arXiv:2205.14443},
  year={2023},
}

@article{gao2025experimental,
  title={An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training},
  author={Jin Gao and Shubo Lin and Shaoru Wang and Yutong Kou and Zeming Li and Liang Li and Congxuan Zhang and Xiaoqin Zhang and Yizheng Wang and Weiming Hu},
  journal={International Journal of Computer Vision},
  year={2025},
  doi={10.1007/s11263-024-02327-w},
  publisher={Springer}
}
```
We thank the authors of timm, MAE, and MoCo-v3 for their code implementations.
This repo is released under the Apache 2.0 license. Please see the LICENSE file for more information.