Haoran Wei*, Youyang Yin*, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang
Accurate copying is the first step to visual o1!
- [2024/12/31]🔥🔥🔥 The paper is now available on arXiv.
- [2024/12/24]🔥🔥🔥 We release Slow Perception! The paper can be found here temporarily; we will submit it to arXiv once the appendix is complete.
- The codebase is built on GOT-OCR2.0. If you have already installed the GOT environment, you can reuse that conda environment.
- Clone this repository and navigate to the Slow-Perception-master folder
git clone https://github.com/Ucas-HaoranWei/Slow-Perception.git
cd 'Slow-Perception-master'
- Install Package
conda create -n sp python=3.10 -y
conda activate sp
pip install -e .
- Install Flash-Attention
pip install ninja
pip install flash-attn --no-build-isolation
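After the install steps, a quick import check can confirm the environment is usable before downloading weights. The sketch below is only an illustration: the module list (`torch`, `flash_attn`, `deepspeed`) is an assumption based on the commands in this README, not a list taken from the repository's requirements.

```python
import importlib.util

# Assumed module names: flash-attn installs as `flash_attn`; `torch` and
# `deepspeed` are guesses based on the commands used in this README.
REQUIRED = ["torch", "flash_attn", "deepspeed"]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All assumed dependencies are importable.")
```

Run it inside the activated `sp` environment; any name it reports may need to be installed separately.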
- Download the SP-1/weights.zip to Slow-Perception-master
unzip weights.zip
- We provide the baseline and 4-length perceptual ruler weights.
- Download SP-1/train_sp1.zip and all SP-1/*.json files to Slow-Perception-master for training.
unzip train_sp1.zip
- Download the SP-1/benchmarks.zip to Slow-Perception-master for eval.
unzip benchmarks.zip
Note: The folder hierarchy is as follows:
--Slow-Perception-master
--SP-1
--SP
--...
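To confirm the downloads were unzipped into the right place, a small check like the following may help. It is a sketch, not part of the repository: the paths in `EXPECTED` are the ones that appear in the commands of this README, and the helper name `missing_entries` is hypothetical.

```python
from pathlib import Path

# Paths referenced by the evaluation commands in this README.
EXPECTED = [
    "SP/demo/run_jihe_parsing.py",
    "SP-1/weights/4ruler",
    "SP-1/benchmarks/val_set",
]

def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]

if __name__ == "__main__":
    # Run from inside the Slow-Perception-master folder.
    missing = missing_entries(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```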
python3 SP/demo/run_jihe_parsing.py --model-name SP-1/weights/4ruler/ --image-file SP-1/benchmarks/val_set/
python3 calculate_f1.py
If you want to input a single image:
python3 SP/demo/run_jihe_parsing.py --model-name SP-1/weights/4ruler/ --image-file results/jihe_demo.jpg
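This README does not describe how `calculate_f1.py` matches predictions to ground truth. As a reminder of what an F1 score measures, here is a generic sketch over sets of items; the function name and the exact-match rule are assumptions for illustration, and the actual script may score differently.

```python
def f1_score(predicted, reference):
    """Return (precision, recall, F1) for two collections of hashable items.

    Items count as correct only on exact match; this is an illustrative rule,
    not necessarily the one used by calculate_f1.py.
    """
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Hypothetical line segments named by their endpoints.
    predicted = [("A", "B"), ("B", "C"), ("A", "D")]
    reference = [("A", "B"), ("B", "C"), ("C", "D")]
    print(f1_score(predicted, reference))
```

In this toy case two of three predicted segments are correct and two of three reference segments are recovered, so precision, recall, and F1 are all 2/3.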
- Download the GOT weights.
deepspeed SP/train/train_SP.py \
--deepspeed zero_config/zero2.json \
--model_name_or_path /GOT_weights/ \
--freeze_vision_tower False \
--freeze_lm_model False \
--vision_select_layer -2 \
--use_im_start_end True \
--fp16 True \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--weight_decay 0. \
--warmup_ratio 0.003 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 4096 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--report_to none \
--per_device_train_batch_size 2 \
--num_train_epochs 2 \
--learning_rate 3e-5 \
--datasets SP-1 \
--output_dir jihe_sp_4ruler/
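When adapting the command to a different number of GPUs, it can help to work out the effective batch size implied by the flags. The sketch below assumes the usual data-parallel relationship (per-device batch × accumulation steps × GPU count) and assumes warmup steps are the warmup ratio times the total optimizer steps, rounded up. The GPU count and total step count are hypothetical inputs; this README does not state them.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * num_gpus

def warmup_steps(total_steps, warmup_ratio):
    """Warmup length in optimizer steps, rounded up (assumed rounding rule)."""
    return math.ceil(total_steps * warmup_ratio)

if __name__ == "__main__":
    # Values from the command above; 8 GPUs is a hypothetical setup.
    print(effective_batch_size(per_device_batch=2, grad_accum_steps=2, num_gpus=8))
```

With the flags above and a hypothetical 8 GPUs, each optimizer step sees 2 × 2 × 8 = 32 samples; if you change the GPU count, adjusting `--gradient_accumulation_steps` is one way to keep that number fixed.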
Don't hesitate to contact me by email, weihaoran18@mails.ucas.ac.cn, if you have any questions.
- GOT-OCR2.0: the codebase we built upon!
@article{wei2024slow,
title={Slow Perception: Let's Perceive Geometric Figures Step-by-step},
author={Wei, Haoran and Yin, Youyang and Li, Yumeng and Wang, Jia and Zhao, Liang and Sun, Jianjian and Ge, Zheng and Zhang, Xiangyu},
journal={arXiv preprint arXiv:2412.20631},
year={2024}
}
@article{wei2024general,
title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
journal={arXiv preprint arXiv:2409.01704},
year={2024}
}