ViTEraser (AAAI 2024)

The official implementation of ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining (AAAI 2024). ViTEraser revisits the conventional single-step, one-stage framework and improves it with ViTs for feature modeling and with the proposed SegMIM pretraining. The frameworks of ViTEraser and SegMIM are shown below.

[Figures: the ViTEraser framework and the SegMIM pretraining scheme]

Todo List

  • Inference code and model weights
  • ViTEraser training code
  • SegMIM pre-training code

Environment

We recommend using Anaconda to manage environments. Run the following commands to install dependencies.

conda create -n viteraser python=3.7 -y
conda activate viteraser
pip install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
git clone https://github.com/shannanyinxiang/ViTEraser.git
cd ViTEraser
pip install -r requirements.txt
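
A quick sanity check that the pinned build is usable (assuming the viteraser environment is active and a CUDA 11.1-capable GPU is present):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# expected: 1.8.2 (or 1.8.2+cu111) and True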

Datasets

1. Text Removal Dataset

  • SCUT-EnsText [paper]:

    1. Download the training and testing sets of SCUT-EnsText at link.
    2. Rename the all_images and all_labels folders to image and label, respectively (a shell sketch follows this list).
    3. Generate text masks:
      # Generating masks for the training set of SCUT-EnsText
      python tools/generate_mask.py \
        --data_root data/TextErase/SCUT-EnsText/train    
    
      # Generating masks for the testing set of SCUT-EnsText
      # Masks are not used for inference; just keep the same data structure as in training.
      python tools/generate_mask.py \
        --data_root data/TextErase/SCUT-EnsText/test
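
A minimal shell sketch of the renaming in step 2, assuming the downloaded archives were extracted to data/TextErase/SCUT-EnsText/train and data/TextErase/SCUT-EnsText/test:

# rename the extracted folders to the names expected above (paths are assumptions)
for split in train test; do
  mv data/TextErase/SCUT-EnsText/${split}/all_images data/TextErase/SCUT-EnsText/${split}/image
  mv data/TextErase/SCUT-EnsText/${split}/all_labels data/TextErase/SCUT-EnsText/${split}/label
done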
    

2. SegMIM Pretraining Datasets

(optional, only required by SegMIM pretraining)

Please prepare the SegMIM pretraining datasets (ArT, ICDAR2013, ICDAR2015, LSVT, MLT2017, ReCTS, and TextOCR), together with SCUT-EnsText, in the data folder following the file structure below.

data
├─TextErase
│  └─SCUT-EnsText
│     ├─train
│     │  ├─image
│     │  ├─label
│     │  └─mask
│     └─test
│        ├─image
│        ├─label
│        └─mask
└─SegMIMDatasets
   ├─ArT
   ├─ICDAR2013
   ├─ICDAR2015
   ├─LSVT
   ├─MLT2017
   ├─ReCTS
   └─TextOCR
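
To verify the layout before training, a small check (paths follow the tree above; matching file counts across image, label, and mask per split is an assumption based on the mask-generation step):

find data/TextErase/SCUT-EnsText -maxdepth 2 -type d
# the three folders of each split should report the same count
ls data/TextErase/SCUT-EnsText/train/image | wc -l
ls data/TextErase/SCUT-EnsText/train/label | wc -l
ls data/TextErase/SCUT-EnsText/train/mask  | wc -l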

Models

Download links for the pre-trained ViTEraser weights are provided in the following table.

| Name            | BaiduNetDisk | GoogleDrive |
|-----------------|--------------|-------------|
| ViTEraser-Tiny  | link         | link        |
| ViTEraser-Small | link         | link        |
| ViTEraser-Base  | link         | link        |

Inference

An example command for inference with ViTEraser-Tiny is:

CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch \
        --master_port=3151 \
        --nproc_per_node 1 \
        --use_env \
        main.py \
        --eval \
        --data_root data/TextErase/ \
        --val_dataset scutens_test \
        --batch_size 1 \
        --encoder swinv2 \
        --decoder swinv2 \
        --pred_mask false \
        --intermediate_erase false \
        --swin_enc_embed_dim 96 \
        --swin_enc_depths 2 2 6 2 \
        --swin_enc_num_heads 3 6 12 24 \
        --swin_enc_window_size 16 \
        --swin_dec_depths 2 6 2 2 2 \
        --swin_dec_num_heads 24 12 6 3 2 \
        --swin_dec_window_size 16 \
        --output_dir path/to/save/output/ \
        --resume path/to/weights/

The argument changes for the different scales of ViTEraser are as follows:

| Argument             | Tiny        | Small       | Base        |
|----------------------|-------------|-------------|-------------|
| swin_enc_embed_dim   | 96          | 96          | 128         |
| swin_enc_depths      | 2 2 6 2     | 2 2 18 2    | 2 2 18 2    |
| swin_enc_num_heads   | 3 6 12 24   | 3 6 12 24   | 4 8 16 32   |
| swin_enc_window_size | 16          | 16          | 8           |
| swin_dec_depths      | 2 6 2 2 2   | 2 18 2 2 2  | 2 18 2 2 2  |
| swin_dec_num_heads   | 24 12 6 3 2 | 24 12 6 3 2 | 32 16 8 4 2 |
| swin_dec_window_size | 16          | 8           | 8           |
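
For example, to run inference with ViTEraser-Base instead of ViTEraser-Tiny, only the Swin arguments of the command above change, per the table (a sketch; every other flag stays the same):

# replace the corresponding flags of the Tiny command with:
--swin_enc_embed_dim 128 \
--swin_enc_depths 2 2 18 2 \
--swin_enc_num_heads 4 8 16 32 \
--swin_enc_window_size 8 \
--swin_dec_depths 2 18 2 2 2 \
--swin_dec_num_heads 32 16 8 4 2 \
--swin_dec_window_size 8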

Evaluation

The commands for calculating the evaluation metrics are as follows; the first computes the image-quality scores and the second computes FID:

python eval/evaluation.py \
    --gt_path data/TextErase/SCUT-EnsText/test/label/ \
    --target_path path/to/model/output/

python -m pytorch_fid \
    data/TextErase/SCUT-EnsText/test/label/ \
    path/to/model/output/ \
    --device cuda:0

ViTEraser Training

1. Training without SegMIM pretraining

  • Download the ImageNet-pretrained weights of Swin Transformer V2 (Tiny: download link, Small: download link, Base: download link, originally released at repo).
  • Download the ImageNet-pretrained weights of VGG-16 (download link, originally released by PyTorch).
  • Put the pretrained weights into the pretrained folder (a placement sketch follows these steps).
  • Run the example scripts in the scripts/viteraser-training-wosegmim folder. For instance, run the following command to train ViTEraser-Tiny without SegMIM pretraining.
bash scripts/viteraser-training-wosegmim/viteraser-tiny-train.sh
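
A sketch of the weight-placement step above; the checkpoint filenames are assumptions (use whatever names the downloads carry):

mkdir -p pretrained
# hypothetical filenames for the Swin V2 Tiny and VGG-16 checkpoints
mv ~/Downloads/swinv2_tiny_patch4_window16_256.pth pretrained/
mv ~/Downloads/vgg16-397923af.pth pretrained/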

2. Training with SegMIM pretraining

  • Download the SegMIM pretraining weights for ViTEraser-Tiny (download link), ViTEraser-Small (download link), or ViTEraser-Base (download link).
  • Download the ImageNet-pretrained weights of VGG-16 (download link, originally released by PyTorch).
  • Put the pretrained weights into the pretrained folder.
  • Run the example scripts in the scripts/viteraser-training-withsegmim folder. For instance, run the following command to train ViTEraser-Tiny with SegMIM pretraining.
bash scripts/viteraser-training-withsegmim/viteraser-tiny-train-withsegmim.sh

SegMIM Pretraining

  • Download the ImageNet-pretrained weights of Swin Transformer V2 (Tiny: download link, Small: download link, Base: download link, originally released at repo) into the pretrained folder.
  • Run the example scripts in the scripts/segmim folder. For instance, run the following command to perform SegMIM pretraining of ViTEraser-Tiny.
# end-to-end encoder-decoder pretraining
bash scripts/segmim/viteraser-tiny-segmim.sh

# standalone encoder finetuning
bash scripts/segmim/viteraser-tiny-encoder-finetune.sh
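
Assuming the two stages run in the listed order (end-to-end pretraining first, then encoder finetuning; the ordering is an assumption based on the script comments), they can be chained:

bash scripts/segmim/viteraser-tiny-segmim.sh && \
bash scripts/segmim/viteraser-tiny-encoder-finetune.sh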

Citation

@inproceedings{peng2024viteraser,
  title={ViTEraser: Harnessing the power of vision transformers for scene text removal with SegMIM pretraining},
  author={Peng, Dezhi and Liu, Chongyu and Liu, Yuliang and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={5},
  pages={4468--4477},
  year={2024}
}

Copyright

This repository may only be used for non-commercial research purposes.

For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).

Copyright 2024, Deep Learning and Vision Computing Lab, South China University of Technology.
