Official implementation of ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining (AAAI 2024)


ViTEraser (AAAI 2024)

This is the official implementation of ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining (AAAI 2024). ViTEraser revisits the conventional single-step, one-stage framework and improves it with Vision Transformers (ViTs) for feature modeling and the proposed SegMIM pretraining. The frameworks of ViTEraser and SegMIM are shown below.

[Figures: ViTEraser framework; SegMIM framework]

Todo List

  • Inference code and model weights
  • ViTEraser training code
  • SegMIM pre-training code


We recommend using Anaconda to manage environments. Run the following commands to install dependencies.

conda create -n viteraser python=3.7 -y
conda activate viteraser
pip install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url
git clone
cd ViTEraser
pip install -r requirements.txt


1. Text Removal Dataset

  • SCUT-EnsText [paper]:

    1. Download the training and testing sets of SCUT-EnsText at link.
    2. Rename all_images and all_labels folders to image and label, respectively.
    3. Generate text masks:
      # Generating masks for the training set of SCUT-EnsText
      python tools/ \
        --data_root data/TextErase/SCUT-EnsText/train    
      # Generating masks for the testing set of SCUT-EnsText
      # Masks are not used for inference; they only keep the data structure identical to the training set.
      python tools/ \
        --data_root data/TextErase/SCUT-EnsText/test
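The folder renaming in step 2 can also be scripted. The following is a minimal sketch, assuming each split directory (for example data/TextErase/SCUT-EnsText/train) contains the all_images and all_labels folders as downloaded. The function name rename_split_folders is hypothetical and is not part of this repository.

```python
from pathlib import Path

# Folder names from step 2: downloaded name -> name expected by the repository.
RENAMES = {"all_images": "image", "all_labels": "label"}


def rename_split_folders(split_dir):
    """Rename the downloaded folders inside one split directory.

    Returns the new names of the folders that were renamed. Folders that
    are missing or already renamed are skipped, so the call can be repeated.
    (Hypothetical helper, not provided by the ViTEraser repository.)
    """
    split_dir = Path(split_dir)
    renamed = []
    for old_name, new_name in RENAMES.items():
        src, dst = split_dir / old_name, split_dir / new_name
        if src.is_dir() and not dst.exists():
            src.rename(dst)
            renamed.append(new_name)
    return renamed
```

Calling rename_split_folders("data/TextErase/SCUT-EnsText/train") and then the same for the test split would cover both sets before the mask generation step.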

2. SegMIM Pretraining Datasets

(optional, only required by SegMIM pretraining)

Please arrange the above datasets in the data folder according to the file structure below.

data
├─TextErase
│  └─SCUT-EnsText
│     ├─train
│     │  ├─image
│     │  ├─label
│     │  └─mask
│     └─test
│        ├─image
│        ├─label
│        └─mask
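Before training or inference, it may help to confirm that the layout above is in place. The following is a small sketch, assuming the data root is data/TextErase as used in the commands of this README. The function name missing_dataset_dirs is hypothetical.

```python
from pathlib import Path

SPLITS = ("train", "test")
SUBFOLDERS = ("image", "label", "mask")


def missing_dataset_dirs(data_root="data/TextErase"):
    """Return the expected SCUT-EnsText directories that do not exist.

    An empty list means every split has its image, label, and mask folders.
    (Hypothetical helper, not provided by the ViTEraser repository.)
    """
    base = Path(data_root) / "SCUT-EnsText"
    expected = [base / split / sub for split in SPLITS for sub in SUBFOLDERS]
    return [str(path) for path in expected if not path.is_dir()]
```

An empty result suggests the structure matches; any returned path points to a folder that still needs to be created, renamed, or generated.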


Download links for the pretrained ViTEraser weights are provided in the following table.

Name            | BaiduNetDisk | GoogleDrive
ViTEraser-Tiny  | link         | link
ViTEraser-Small | link         | link
ViTEraser-Base  | link         | link


An example inference command for ViTEraser-Tiny is:

python -m torch.distributed.launch \
        --master_port=3151 \
        --nproc_per_node 1 \
        --use_env \ \
        --eval \
        --data_root data/TextErase/ \
        --val_dataset scutens_test \
        --batch_size 1 \
        --encoder swinv2 \
        --decoder swinv2 \
        --pred_mask false \
        --intermediate_erase false \
        --swin_enc_embed_dim 96 \
        --swin_enc_depths 2 2 6 2 \
        --swin_enc_num_heads 3 6 12 24 \
        --swin_enc_window_size 16 \
        --swin_dec_depths 2 6 2 2 2 \
        --swin_dec_num_heads 24 12 6 3 2 \
        --swin_dec_window_size 16 \
        --output_dir path/to/save/output/ \
        --resume path/to/weights/

The arguments that differ across ViTEraser scales are listed below:

Argument             | Tiny        | Small       | Base
swin_enc_embed_dim   | 96          | 96          | 128
swin_enc_depths      | 2 2 6 2     | 2 2 18 2    | 2 2 18 2
swin_enc_num_heads   | 3 6 12 24   | 3 6 12 24   | 4 8 16 32
swin_enc_window_size | 16          | 16          | 8
swin_dec_depths      | 2 6 2 2 2   | 2 18 2 2 2  | 2 18 2 2 2
swin_dec_num_heads   | 24 12 6 3 2 | 24 12 6 3 2 | 32 16 8 4 2
swin_dec_window_size | 16          | 8           | 8
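To avoid copying values by hand when switching scales, the table above can be encoded as a lookup. The following is a minimal sketch: the values are taken from the table, while the names SCALE_ARGS and scale_cli_args are hypothetical and not part of this repository.

```python
# Scale-specific arguments, copied from the table above.
SCALE_ARGS = {
    "Tiny": {
        "swin_enc_embed_dim": [96],
        "swin_enc_depths": [2, 2, 6, 2],
        "swin_enc_num_heads": [3, 6, 12, 24],
        "swin_enc_window_size": [16],
        "swin_dec_depths": [2, 6, 2, 2, 2],
        "swin_dec_num_heads": [24, 12, 6, 3, 2],
        "swin_dec_window_size": [16],
    },
    "Small": {
        "swin_enc_embed_dim": [96],
        "swin_enc_depths": [2, 2, 18, 2],
        "swin_enc_num_heads": [3, 6, 12, 24],
        "swin_enc_window_size": [16],
        "swin_dec_depths": [2, 18, 2, 2, 2],
        "swin_dec_num_heads": [24, 12, 6, 3, 2],
        "swin_dec_window_size": [8],
    },
    "Base": {
        "swin_enc_embed_dim": [128],
        "swin_enc_depths": [2, 2, 18, 2],
        "swin_enc_num_heads": [4, 8, 16, 32],
        "swin_enc_window_size": [8],
        "swin_dec_depths": [2, 18, 2, 2, 2],
        "swin_dec_num_heads": [32, 16, 8, 4, 2],
        "swin_dec_window_size": [8],
    },
}


def scale_cli_args(scale):
    """Return the scale-specific flags as a flat list of strings.

    (Hypothetical helper, not provided by the ViTEraser repository.)
    """
    args = []
    for name, values in SCALE_ARGS[scale].items():
        args.append("--" + name)
        args.extend(str(value) for value in values)
    return args
```

For example, " ".join(scale_cli_args("Base")) gives the flags to substitute for the corresponding Tiny values in the inference command.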


The commands for calculating metrics are:

python eval/ \
    --gt_path data/TextErase/SCUT-EnsText/test/label/ \
    --target_path path/to/model/output/

python -m pytorch_fid \
    data/TextErase/SCUT-EnsText/test/label/ \
    path/to/model/output/ \
    --device cuda:0
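Both commands above compare a ground-truth folder with a model output folder. The following is a small pre-check sketch that reports images present in one folder but not the other. It assumes, without confirmation from this README, that outputs share file names (ignoring extension) with the ground-truth images. The function name unmatched_images and the list of suffixes are hypothetical.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the actual files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def _image_stems(folder):
    """Collect file names without extension for the images in a folder."""
    return {
        path.stem
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    }


def unmatched_images(gt_path, target_path):
    """Return (missing_from_target, extra_in_target) as sorted name lists.

    (Hypothetical helper, not provided by the ViTEraser repository.)
    """
    gt = _image_stems(gt_path)
    target = _image_stems(target_path)
    return sorted(gt - target), sorted(target - gt)
```

Two empty lists suggest that every ground-truth image has a corresponding output, which is worth confirming before interpreting the reported metrics.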

ViTEraser Training

1. Training without SegMIM pretraining

  • Download the ImageNet-pretrained weights of Swin Transformer V2 (Tiny: download link, Small: download link, Base: download link, originally released at repo).
  • Download the ImageNet-pretrained weights of VGG-16 (download link, originally released by PyTorch).
  • Put the pretrained weights into the pretrained folder.
  • Run the example scripts in the scripts/viteraser-training-wosegmim folder. For instance, run the following command to train ViTEraser-Tiny without SegMIM pretraining.
bash scripts/viteraser-training-wosegmim/

2. Training with SegMIM pretraining

  • Download the SegMIM pretraining weights for ViTEraser-Tiny (download link), ViTEraser-Small (download link), or ViTEraser-Base (download link).
  • Download the ImageNet-pretrained weights of VGG-16 (download link, originally released by PyTorch).
  • Put the pretrained weights into the pretrained folder.
  • Run the example scripts in the scripts/viteraser-training-withsegmim folder. For instance, run the following command to train ViTEraser-Tiny with SegMIM pretraining.
bash scripts/viteraser-training-withsegmim/

SegMIM Pretraining

  • Download the ImageNet-pretrained weights of Swin Transformer V2 (Tiny: download link, Small: download link, Base: download link, originally released at repo) into the pretrained folder.
  • Run the example scripts in the scripts/segmim folder. For instance, run the following command to perform SegMIM pretraining of ViTEraser-Tiny.
# end-to-end encoder-decoder pretraining
bash scripts/segmim/

# standalone encoder finetuning
bash scripts/segmim/


  title={ViTEraser: Harnessing the power of vision transformers for scene text removal with SegMIM pretraining},
  author={Peng, Dezhi and Liu, Chongyu and Liu, Yuliang and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},


This repository can only be used for non-commercial research purposes.

For commercial use, please contact Prof. Lianwen Jin (

Copyright 2024, Deep Learning and Vision Computing Lab, South China University of Technology.

