For all NLP ability evaluations except MMLU, we directly employ the benchmark toolbox lm-evaluation-harness. Follow their guidelines to install the toolbox, then run the commands below for a complete NLP evaluation (MMLU excluded).
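If you have not installed the harness yet, a typical from-source install looks like this (see the lm-evaluation-harness README for the authoritative steps):
# Clone and install lm-evaluation-harness from source
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .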
For finetuned MLLMs:
export ft_model_path=PahaII/MM-LLaMA-2-7B-ft
cd ../../eval_scripts
bash eval_nlp.sh /your/path/to/lm-evaluation-harness $ft_model_path
For lora-tuned MLLMs:
export base_model_path=meta-llama/Llama-2-7b-chat-hf
export lora_model=lora-repo-id-hf  # e.g. PahaII/MM-LLaMA-2-7B-lora
cd ../../eval_scripts
bash eval_nlp.sh /your/path/to/lm-evaluation-harness $base_model_path $lora_model
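If you prefer to run a single task directly rather than the whole suite, recent harness versions expose an lm_eval CLI; the task name and flags below are illustrative, and the peft= argument loads a LoRA adapter on top of the base model:
# Illustrative single-task run with a LoRA adapter (adjust task and batch size)
lm_eval --model hf \
    --model_args pretrained=$base_model_path,peft=$lora_model \
    --tasks hellaswag \
    --batch_size 8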
For MMLU evaluation, first download the data here.
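Assuming the link points to the standard MMLU release from Hendrycks et al., the download can be done roughly as follows (URL and target directory are illustrative):
# Fetch and unpack the MMLU data tarball (assumed to be the standard release)
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar -xf data.tar && mv data /path/to/mmlu_data
Then evaluate the model: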
export ft_model_path=PahaII/MM-LLaMA-2-7B-ft
export base_model_path=meta-llama/Llama-2-7b-hf
export lora_model=PahaII/MM-LLaMA-2-7B-lora
cd ../../eval_scripts
## For finetuned MLLMs
bash eval_mmlu.sh /path/to/mmlu_data $ft_model_path
## For lora-tuned MLLMs
bash eval_mmlu.sh /path/to/mmlu_data $base_model_path $lora_model
This directory contains end-to-end pipelines for evaluating our trained MLLMs on seven multi-modal benchmarks (including two tasks with corrupted images). This document introduces the evaluation pipelines and the data download guides. First, define the model paths:
export ft_model_path=PahaII/MM-LLaMA-2-7B-ft
export base_model_path=meta-llama/Llama-2-7b-hf
export lora_model=PahaII/MM-LLaMA-2-7B-lora
cd ../../eval_scripts
For VQAv2, download the Val images and Val annotations for evaluation.
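Assuming the standard VQAv2 validation split is meant, the downloads look roughly like this (verify the URLs on visualqa.org before use):
# COCO val2014 images plus VQAv2 val questions and annotations (assumed official URLs)
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip && unzip v2_Questions_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip && unzip v2_Annotations_Val_mscoco.zip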
## For finetuned MLLMs
bash eval_vqav2.sh /path/to/image /path/to/vqa_question /path/to/vqa_annotation $ft_model_path
## For lora-tuned MLLMs
bash eval_vqav2.sh /path/to/image /path/to/vqa_question /path/to/vqa_annotation $base_model_path $lora_model
To evaluate on corrupted images, simply replace the image directory path (/path/to/image) with the directory containing the corrupted images.
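For example (the corrupted-image directory name is illustrative):
## Same pipeline, pointed at a directory of corrupted images
bash eval_vqav2.sh /path/to/corrupted_images /path/to/vqa_question /path/to/vqa_annotation $ft_model_path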
Download the post-processed data from here. The resulting MSCOCO Captioning data should look like this:
.
└── ./mscoco/
    ├── mscoco_train.json    # Contains the training set text captions of MSCOCO
    ├── mscoco_val.json      # Contains the validation set text captions of MSCOCO
    ├── mscoco_test.json     # Contains the test set text captions of MSCOCO
    └── test_images/         # Contains the test set images of MSCOCO
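A quick sanity check that the layout matches (paths are illustrative; the same check applies to Flickr30K below):
# Verify the caption files exist and peek at the test images
ls ./mscoco/mscoco_{train,val,test}.json
ls ./mscoco/test_images | head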
## For finetuned MLLMs
bash eval_mscoco.sh /path/to/test_images /path/to/mscoco_test.json /path/to/mscoco_train.json $ft_model_path
## For lora-tuned MLLMs
bash eval_mscoco.sh /path/to/test_images /path/to/mscoco_test.json /path/to/mscoco_train.json $base_model_path $lora_model
Download the post-processed data from here. The resulting Flickr30K Captioning data should look like this (note that the eval_mscoco.sh script below is reused for Flickr30K):
.
└── ./flickr30k/
    ├── flickr30k_train.json    # Contains the training set text captions of Flickr30K
    ├── flickr30k_val.json      # Contains the validation set text captions of Flickr30K
    ├── flickr30k_test.json     # Contains the test set text captions of Flickr30K
    └── test_images/            # Contains the test set images of Flickr30K
## For finetuned MLLMs
bash eval_mscoco.sh /path/to/test_images /path/to/flickr30k_test.json /path/to/flickr30k_train.json $ft_model_path
## For lora-tuned MLLMs
bash eval_mscoco.sh /path/to/test_images /path/to/flickr30k_test.json /path/to/flickr30k_train.json $base_model_path $lora_model
Download the MME dataset following the instructions here.
## For finetuned MLLMs
bash eval_mme.sh /path/to/mme $ft_model_path
## For lora-tuned MLLMs
bash eval_mme.sh /path/to/mme $base_model_path $lora_model
Download and post-process the POPE dataset following the instructions here.
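Assuming POPE's usual setup (COCO val2014 images plus the annotation files shipped in the POPE repository), the data can be fetched roughly like this; the repository layout is an assumption, so defer to the linked instructions:
# COCO val2014 images, which POPE evaluates against
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip -d /path/to/mscoco
# POPE annotation files (assumed to live under output/coco/ in the repo)
git clone https://github.com/RUCAIBox/POPE
mkdir -p /path/to/pope/annotations
cp POPE/output/coco/coco_pope_*.json /path/to/pope/annotations/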
## For finetuned MLLMs
bash eval_pope.sh /path/to/mscoco/val2014 /path/to/pope/annotations $ft_model_path
## For lora-tuned MLLMs
bash eval_pope.sh /path/to/mscoco/val2014 /path/to/pope/annotations $base_model_path $lora_model