- The Hugging Face version of Vary-tiny may have issues that make the loss hard to converge when training for multiple epochs.
- Many users are interested in the training data of Vary.
- [2024/9/03] 🔥🔥🔥 We release GOT-OCR2.0, a strong and comprehensive OCR model.
- [2024/4/21] 🔥🔥🔥 For OneChart, we have released the web demo on the Project Page. Have fun!!
- [2024/4/21] 🔥🔥🔥 We present the Vary-tiny LAVIS codebase and the Vary-600k dataset!!!
Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only.
- Clone this repository and navigate to its LAVIS-main folder
```shell
git clone https://github.com/Ucas-HaoranWei/Vary-tiny-600k.git
cd Vary-tiny-600k/LAVIS-main
```
- Install the package
```shell
pip install -e .
```
- Prepare the pretrained weights and data
- Download the OPT-125M weights here and the SAM-b weights here
- Download Vary-600k here (extraction code: "vary")
- Prepare the directories as follows:
- Train on a single machine with 8 GPUs:
```shell
python -m torch.distributed.run --nproc_per_node=8 --master_port=29501 train.py --cfg-path lavis/projects/varytiny/train/pretrain.yaml
```
or, on multiple machines:
```shell
python -m torch.distributed.run --master_addr xxx --master_port xxx --node_rank xxx --nnodes xxx --nproc_per_node xxx train.py --cfg-path lavis/projects/varytiny/train/pretrain.yaml
```
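The multi-machine flags can be sanity-checked before launching. The sketch below is hypothetical: the helper names and the example address `10.0.0.1` are illustrative and not part of this repository. It assembles the same command for one node and, assuming every node passes the same `--nproc_per_node`, lists the global ranks that node hosts (global rank = node_rank × nproc_per_node + local rank).

```python
# Hypothetical helper, not part of this repo: builds the multi-machine launch
# command for one node and reports which global ranks that node will host.
import shlex

CFG = "lavis/projects/varytiny/train/pretrain.yaml"


def build_launch_cmd(master_addr: str, master_port: int, node_rank: int,
                     nnodes: int, nproc_per_node: int) -> list[str]:
    """Return the torch.distributed.run argument list for one node."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return [
        "python", "-m", "torch.distributed.run",
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "--node_rank", str(node_rank),
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(nproc_per_node),
        "train.py", "--cfg-path", CFG,
    ]


def global_ranks(node_rank: int, nproc_per_node: int) -> range:
    """Global ranks on this node, assuming all nodes launch the same
    number of processes."""
    start = node_rank * nproc_per_node
    return range(start, start + nproc_per_node)


if __name__ == "__main__":
    # Example: the second machine (node_rank=1) of a 2x8-GPU job.
    nnodes, nproc = 2, 8
    cmd = build_launch_cmd("10.0.0.1", 29501, node_rank=1,
                           nnodes=nnodes, nproc_per_node=nproc)
    print(shlex.join(cmd))
    print("world size:", nnodes * nproc)
    print("ranks on this node:", list(global_ranks(1, nproc)))
```

Run the printed command on each machine with its own `--node_rank`; `--master_addr` and `--master_port` must be identical on all nodes.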
If training goes smoothly, your loss at the end of each epoch should be similar to the following (2×8 H800):
- Set the "pretrained" and "finetuned" paths to your checkpoints in `LAVIS-main/lavis/configs/models/varytiny/varytiny_inference.yaml`, for example:
- Then run inference:
```shell
python tests/models/test_varytiny.py --image-file xxx.jpg
```
- We also provide the weights of a Vary-tiny model we trained from scratch on Vary-600k: Vary-tiny-600k.pth (extraction code: "Vary"). You can use them to run inference directly.
- Vary-600k is a PDF image-text pair dataset with about 300K English and 300K Chinese pages.
- The text is extracted with Fitz (PyMuPDF), and a BERT model is used to merge sentences within each paragraph. Paragraphs are separated by "<lb>" rather than "\n", because this codebase uses "\n" as the EOS token of opt-125m.
- You can use Vary-600k for pretraining, warm-up, and similar stages.
- Note that Vary-600k is only a subset of the pretraining data used in the original Vary.
- Download Vary-600k here. Code: "Vary"
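Because paragraphs are joined with "<lb>", page text can be split back into paragraphs for inspection. The following is a minimal sketch that works on a plain page string only; the helper names are hypothetical, and no particular on-disk annotation format for Vary-600k is assumed.

```python
# Minimal sketch for inspecting Vary-600k page text. Helper names are
# hypothetical; only the "<lb>" paragraph separator and the use of a newline
# as opt-125m's EOS in this codebase come from the notes above.
PARAGRAPH_SEP = "<lb>"  # separates paragraphs in Vary-600k text
EOS = "\n"              # used as the EOS of opt-125m in this codebase


def split_paragraphs(text: str) -> list[str]:
    """Split one page's text into paragraphs, dropping empty pieces."""
    return [p.strip() for p in text.split(PARAGRAPH_SEP) if p.strip()]


def to_readable(text: str) -> str:
    """Join paragraphs with blank lines, for human inspection only
    (the result contains newlines, so do not use it as a training target)."""
    return "\n\n".join(split_paragraphs(text))


def is_safe_target(text: str) -> bool:
    """True if the text has no raw newline that could be confused with EOS."""
    return EOS not in text


if __name__ == "__main__":
    page = "Vary scales up the vision vocabulary.<lb>It targets fine-grained perception such as OCR."
    print(split_paragraphs(page))
    print(is_safe_target(page))               # True
    print(is_safe_target(to_readable(page)))  # False
```

Keeping "<lb>" in the training targets and converting it only for display avoids any collision with the EOS token.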
- LAVIS: the codebase we built upon!
If you find our work useful in your research, please consider citing Vary:
```bibtex
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
@article{wei2024small,
  title={Small Language Model Meets with Reinforced Vision Vocabulary},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yu, En and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2401.12503},
  year={2024}
}
```