Implementation of the paper *Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark*.
- 2025 Jan 22: Fixed an issue in the training dataset construction logic.
- 2024 Dec 10: Added support for offline inference with vLLM.
- 2024 Nov 28: Released code for evaluation and fine-tuning. Code for training set construction is coming soon.
- 2024 Nov 25: Paper available on arXiv. Code coming soon.
| Item | Repository |
| --- | --- |
| Distilled Model | 🤗 [DataHammer/T2I_Eval-MiniCPM-V-2_6](https://huggingface.co/DataHammer/T2I_Eval-MiniCPM-V-2_6) |
| Benchmark Dataset | 🤗 [DataHammer/T2I-Eval-Bench](https://huggingface.co/datasets/DataHammer/T2I-Eval-Bench) |
| Training Dataset | 🤗 [DataHammer/T2I-Eval-Distill-GPT-4o](https://huggingface.co/datasets/DataHammer/T2I-Eval-Distill-GPT-4o) |
| Methods | Manual-1 (ρ) | Manual-1 (τ) | Manual-2 (ρ) | Manual-2 (τ) | Manual-3 (ρ) | Manual-3 (τ) | Manual-Avg. (ρ) | Manual-Avg. (τ) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Average Score of Human Annotators** | | | | | | | | |
| | 0.9511 | 0.8807 | 0.9452 | 0.8686 | 0.9513 | 0.8793 | - | - |
| **Traditional Methods** | | | | | | | | |
| | -0.1183 | -0.0871 | -0.1000 | -0.0724 | -0.0897 | -0.0685 | -0.1231 | -0.0862 |
| | -0.1206 | -0.0898 | -0.0882 | -0.0644 | -0.1025 | -0.0732 | -0.1244 | -0.0856 |
| | -0.1284 | -0.0953 | -0.1230 | -0.0897 | -0.1308 | -0.0973 | -0.1382 | -0.0968 |
| | 0.1532 | 0.1078 | 0.1725 | 0.1210 | 0.1227 | 0.0855 | 0.1505 | 0.1016 |
| | 0.2278 | 0.1588 | 0.2280 | 0.1617 | 0.2134 | 0.1477 | 0.2152 | 0.1423 |
| | 0.4171 | 0.3065 | 0.3712 | 0.2690 | 0.4134 | 0.3030 | 0.4046 | 0.2839 |
| **LLM/MLLM-Based Methods** | | | | | | | | |
| | 0.3009 | 0.2212 | 0.2697 | 0.2012 | 0.3299 | 0.2497 | 0.3096 | 0.2228 |
| | 0.4742 | 0.3790 | 0.4204 | 0.3339 | 0.4562 | 0.3652 | 0.4582 | 0.3512 |
| | 0.4815 | 0.3891 | 0.4382 | 0.3502 | 0.4721 | 0.3827 | 0.4704 | 0.3655 |
| | 0.4984 | 0.3768 | 0.4864 | 0.3619 | 0.5118 | 0.3854 | 0.5116 | 0.3712 |
| | 0.2834 | 0.2251 | 0.2814 | 0.2231 | 0.3016 | 0.2422 | 0.2941 | 0.2250 |
| | 0.4906 | 0.3878 | 0.4869 | 0.3836 | 0.4889 | 0.3899 | 0.5101 | 0.3897 |
| | 0.5522 | 0.4283 | 0.5306 | 0.4101 | 0.5170 | 0.4024 | 0.5545 | 0.4170 |
| **Closed-Source MLLM with Our Pipeline** | | | | | | | | |
| | 0.5437 | 0.4302 | 0.5355 | 0.4214 | 0.5138 | 0.4061 | 0.5566 | 0.4285 |
| **Our Fine-tuned Open-source MLLM** | | | | | | | | |
| | 0.5334 | 0.4192 | 0.5946 | 0.4644 | 0.5537 | 0.4348 | 0.5802 | 0.4409 |
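Each pair of columns reports the correlation between a method's scores and one human annotator (or their average). Below is a minimal sketch of how such rank correlations are computed, assuming the two columns per annotator are Spearman's ρ and Kendall's τ; the score lists are illustrative placeholders, not data from the benchmark:

```python
from scipy.stats import spearmanr, kendalltau

# Placeholder values: an automatic evaluator's scores vs. one human
# annotator's ratings for the same set of generated images.
metric_scores = [3.2, 1.5, 4.8, 2.9, 4.1]
human_scores = [3.0, 2.0, 5.0, 2.5, 4.5]

rho, _ = spearmanr(metric_scores, human_scores)   # Spearman's rho
tau, _ = kendalltau(metric_scores, human_scores)  # Kendall's tau
print(f"Spearman rho: {rho:.4f}, Kendall tau: {tau:.4f}")
```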
```shell
conda create -n t2i_eval python=3.10 -y
conda activate t2i_eval
git clone https://github.com/maziao/T2I-Eval.git
cd T2I-Eval
pip install -r requirements.txt
```
To speed up inference, install vLLM following the official guide:

```shell
pip install vllm
```
[Optional] If you want to fine-tune your own evaluation MLLM, please set up SWIFT as follows:

```shell
pip install 'ms-swift[all]==2.6.1' -U
```
- [Optional] Enable `hf_transfer` for faster downloads:

  ```shell
  pip install hf_transfer
  export HF_HUB_ENABLE_HF_TRANSFER=1
  ```
- Download the MiniCPM-V-2.6 checkpoint (NOTE: you need to agree to share your contact information to access this model):

  ```shell
  # if the MiniCPM-V-2.6 checkpoint has not been downloaded yet
  mkdir -p models/minicpm-v-2_6/original
  huggingface-cli download openbmb/MiniCPM-V-2_6 \
      --local-dir models/minicpm-v-2_6/original \
      --local-dir-use-symlinks False

  # if the MiniCPM-V-2.6 checkpoint has already been downloaded
  mkdir -p models/minicpm-v-2_6
  ln -s PATH_TO_CHECKPOINT models/minicpm-v-2_6/original
  ```
- Download the LoRA adapter for MiniCPM-V-2.6 and merge it with the original checkpoint:

  ```shell
  mkdir -p models/minicpm-v-2_6/LoRA
  huggingface-cli download DataHammer/T2I_Eval-MiniCPM-V-2_6 \
      --local-dir models/minicpm-v-2_6/LoRA \
      --local-dir-use-symlinks False

  # merge the LoRA adapter with the original checkpoint
  bash scripts/merge_adapter.sh GPU_ID # replace GPU_ID with the ID of an available GPU
  ```
- Download the benchmark dataset:

  ```shell
  mkdir -p data/test
  huggingface-cli download DataHammer/T2I-Eval-Bench \
      --repo-type dataset \
      --local-dir data/test \
      --local-dir-use-symlinks False
  tar -zxvf data/test/images.tar.gz --directory=data/test
  ```
- [Optional] Download the training dataset for fine-tuning:

  ```shell
  mkdir -p data/train
  huggingface-cli download DataHammer/T2I-Eval-Distill-GPT-4o \
      --repo-type dataset \
      --local-dir data/train \
      --local-dir-use-symlinks False
  tar -zxvf data/train/t2i-eval-distill-gpt-4o.json.tar.gz --directory=data/train
  tar -zxvf data/train/images/COCO.tar.gz --directory=data/train/images
  tar -zxvf data/train/images/LLM.tar.gz --directory=data/train/images
  tar -zxvf data/train/images/reference.tar.gz --directory=data/train/images
  ```
- Deployment:

  ```shell
  bash scripts/deploy.sh GPU_ID HOST_ADDR PORT # e.g. 0 localhost 65535
  ```
- Inference:

  ```shell
  # online inference via the deployed server
  bash scripts/inference.sh HOST_ADDR PORT # e.g. localhost 65535

  # offline inference with vLLM
  bash scripts/inference_offline.sh GPU_ID # replace GPU_ID
  ```
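Once the server is running, you can also query it directly from Python. Below is a minimal sketch, assuming `scripts/deploy.sh` exposes a vLLM OpenAI-compatible endpoint on the given host and port; the prompt is illustrative, not taken from the repository:

```python
from openai import OpenAI

# Assumption: scripts/deploy.sh starts a vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:65535/v1", api_key="EMPTY")

# Ask the server which model it serves instead of hard-coding a path.
model_id = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Evaluate the quality of this generated image."}],  # illustrative prompt
)
print(response.choices[0].message.content)
```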
For example, when using MiniCPM-V-2.6, you need to create a new class that inherits from the base dataset constructor class and implement the following two methods (a hypothetical sketch is given after the list):

- `replace_image_placeholder`: replace the original image placeholders with the ones used by the model you are about to fine-tune.
- `conv_template`: construct a single training sample (a conversation) from the given corpus. Each corpus entry follows this format:

  ```
  {
      "gt_image": str,
      "query": str,
      "response": str,
      "history": [
          (query_1, response_1),
          ...
      ],
      "ref_image": str | None,
      "id": str | None
  }
  ```
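For illustration, such a subclass might look like the following sketch. The base-class interface, placeholder tokens, and output schema here are assumptions made for exposition (check `build_dataset.py` for the actual names), not the repository's real code:

```python
# Hypothetical sketch; class names, tokens, and schema are assumptions.
class BaseDatasetConstructor:
    """Stand-in for the repository's base dataset constructor class."""

    def replace_image_placeholder(self, text: str) -> str:
        raise NotImplementedError

    def conv_template(self, sample: dict) -> dict:
        raise NotImplementedError


class MiniCPMV26DatasetConstructor(BaseDatasetConstructor):
    # Placeholder token expected by the target model (assumption).
    IMAGE_PLACEHOLDER = "(<image>./</image>)"

    def replace_image_placeholder(self, text: str) -> str:
        # Swap the corpus-level placeholder for the model-specific token.
        return text.replace("<image>", self.IMAGE_PLACEHOLDER)

    def conv_template(self, sample: dict) -> dict:
        # Build one conversation from a corpus entry of the format shown
        # above: history turns first, then the final query/response pair.
        messages = []
        for query, response in sample.get("history") or []:
            messages.append({"role": "user", "content": self.replace_image_placeholder(query)})
            messages.append({"role": "assistant", "content": response})
        messages.append({"role": "user", "content": self.replace_image_placeholder(sample["query"])})
        messages.append({"role": "assistant", "content": sample["response"]})

        images = [sample["gt_image"]]
        if sample.get("ref_image"):
            images.append(sample["ref_image"])
        return {"id": sample.get("id"), "images": images, "messages": messages}
```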
- Convert the training data into the target format:

  ```shell
  mkdir -p data/train/minicpm-v-2_6/raw
  python build_dataset.py \
      --data-files data/train/t2i-eval-distill-gpt-4o.json \
      --image-dir $(pwd)/data/train \
      --output-dir data/train/minicpm-v-2_6/raw
  ```
- Make an index file `data/train/minicpm-v-2_6/index.txt` for the sub-tasks:

  ```
  extract.json 8
  appearance-multi-stage_1.json 1
  appearance-multi-stage_2.json 1
  intrinsic-multi-stage_1.json 2
  intrinsic-multi-stage_2.json 2
  relationship-multi-stage_1.json 3
  relationship-multi-stage_2.json 3
  summarize-multi-stage_1.json 2
  summarize-multi-stage_2.json 2
  ```
- Merge the training samples of all sub-tasks:

  ```shell
  python merge_dataset.py \
      --data-dir data/train/minicpm-v-2_6/raw/t2i-eval-distill-gpt-4o \
      --index-file data/train/minicpm-v-2_6/index.txt \
      --output-file data/train/minicpm-v-2_6/train.json \
      --rebalance
  ```
- Fine-tune with LoRA:

  ```shell
  # data parallel on 4 NVIDIA A100-SXM4-80GB GPUs
  bash scripts/finetune_lora.sh 0,1,2,3 4
  ```
If you find this repository useful for your research, please cite our paper:
```bibtex
@misc{tu2024automaticevaluationtexttoimagegeneration,
      title={Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark},
      author={Rong-Cheng Tu and Zi-Ao Ma and Tian Lan and Yuehao Zhao and Heyan Huang and Xian-Ling Mao},
      year={2024},
      eprint={2411.15488},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.15488},
}
```