Yunpeng Qu1,2 | Kun Yuan2 | Jinhua Hao2 | Kai Zhao2 | Qizhi Xie1,2 | Ming Sun2 | Chao Zhou2
1Tsinghua University, 2Kuaishou Technology.
Image Super-Resolution (ISR) has seen significant progress with the introduction of remarkable generative models. However, challenges such as the trade-off between fidelity and realism, as well as computational complexity, have limited their application. Building upon the tremendous success of autoregressive models in the language domain, we propose **VARSR**, a novel visual autoregressive ISR framework based on next-scale prediction. To effectively integrate and preserve semantic information from low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures, and a diffusion refiner is used to model the quantization residual for pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR generates high-fidelity, high-realism images more efficiently than diffusion-based methods.
```shell
# git clone this repository
git clone https://github.com/qyp2000/VARSR.git
cd VARSR

# create an environment with python >= 3.9
conda create -n varsr python=3.9
conda activate varsr
pip install -r requirements.txt
```
- Download the VARSR and VQVAE models from and put them into `checkpoints/`.
- Prepare testing LR images in `testset`, e.g., `testset/{folder_path}/LR`.
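The `testset/{folder_path}/LR` layout above can be prepared with a small helper. This is an illustrative sketch, not part of the repository: `prepare_testset` and the `demo` folder name are our own, only the directory structure comes from this README.

```python
import shutil
from pathlib import Path


def prepare_testset(src_dir: str, testset_root: str = "testset",
                    folder_name: str = "demo") -> Path:
    """Copy a flat folder of LR images into the testset/{folder_path}/LR layout."""
    lr_dir = Path(testset_root) / folder_name / "LR"
    lr_dir.mkdir(parents=True, exist_ok=True)
    for img in sorted(Path(src_dir).iterdir()):
        # keep only common image formats; skip stray files
        if img.suffix.lower() in {".png", ".jpg", ".jpeg"}:
            shutil.copy(img, lr_dir / img.name)
    return lr_dir
```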
- To generate standard 512*512 images:

```shell
python test_varsr.py
```

You can modify the parameters to fit your specific needs, such as `cfg`, which is set to 6.0 by default.
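The `cfg` value is a classifier-free guidance scale. As a point of reference, standard classifier-free guidance combines a conditional and an unconditional prediction as `uncond + cfg * (cond - uncond)`; the sketch below shows this generic formula only, not VARSR's actual implementation.

```python
import numpy as np


def apply_cfg(cond_logits: np.ndarray, uncond_logits: np.ndarray,
              cfg: float = 6.0) -> np.ndarray:
    """Generic classifier-free guidance: push the prediction away from the
    unconditional one; cfg=1.0 recovers the purely conditional prediction."""
    return uncond_logits + cfg * (cond_logits - uncond_logits)
```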
- To generate high-resolution images:

```shell
python test_tile.py
```

You can modify the parameters to fit your specific needs, such as `cfg` (set to 7.0 by default) and the super-resolution scale (set to 4.0 by default).
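`test_tile.py` handles large outputs by processing the image in overlapping tiles. A minimal sketch of how overlapping tile offsets can be computed along one axis follows; the tile size and overlap values here are illustrative assumptions, not the script's actual defaults.

```python
def tile_coords(size: int, tile: int = 512, overlap: int = 64) -> list:
    """Top-left offsets of overlapping tiles covering a 1-D extent of `size` px."""
    if size <= tile:
        return [0]  # one tile covers everything
    stride = tile - overlap
    coords = list(range(0, size - tile, stride))
    # final tile is aligned to the image border so nothing is missed
    coords.append(size - tile)
    return coords
```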
- Download the VQVAE model from and put it into `checkpoints/`.
- Download the pretrained original VAR models from VAR and put them into `checkpoints/`. You can also use the C2I VARSR pretrained on our large-scale dataset, which can be downloaded from.
- Prepare your own training images in `trainset`, e.g., `trainset/{folder_path}`. You can also put your negative samples into `trainset_neg`, e.g., `trainset_neg/{folder_path}`. Further changes to the dataset path can be made in `dataloader/localdataset_lpm.py`.
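Before launching training, it can help to sanity-check the layout by listing the images under `trainset/{folder_path}`. This helper is illustrative only and is not part of `dataloader/localdataset_lpm.py`.

```python
from pathlib import Path


def list_training_images(root: str = "trainset") -> list:
    """Recursively collect image paths under trainset/{folder_path}."""
    exts = {".png", ".jpg", ".jpeg"}
    return sorted(p for p in Path(root).rglob("*")
                  if p.suffix.lower() in exts)
```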
```shell
torchrun --nproc-per-node=8 train.py --depth=24 --batch_size=4 --ep=5 --fp16=1 --tblr=5e-5 --alng=1e-4 --wpe=0.01 --wandb_flag=True --fuse=0 --exp_name='VARSR'
```
You can modify the parameters in `utils/arg_util.py` to fit your specific needs, such as the `batch_size` and the `learning_rate`.
We also provide pretrained Class-to-Image (C2I) model weights and inference code to contribute more to the academic community.

```shell
python test_C2I.py
```

Our dataset contains 3830 semantic categories, and you can adjust `classes` to generate images corresponding to each category.
If our work is useful for your research, please consider citing it and giving us a star ⭐:

```bibtex
@article{qu2025visual,
  title={Visual Autoregressive Modeling for Image Super-Resolution},
  author={Qu, Yunpeng and Yuan, Kun and Hao, Jinhua and Zhao, Kai and Xie, Qizhi and Sun, Ming and Zhou, Chao},
  journal={arXiv preprint arXiv:2501.18993},
  year={2025}
}
```
Please feel free to contact: qyp21@mails.tsinghua.edu.cn. I am happy to communicate with you and will maintain this repository in my free time.
Some code is borrowed from VAR, MAR and HART. Thanks for their excellent work.