MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities (ICML 2024)

MM-Vet v2

[Paper] [Download Dataset] [Dataset on Hugging Face] [Leaderboard] [Online Evaluator]

2024/08/02: 🔥 🔥 We release MM-Vet v2, an extension of MM-Vet that adds a new vision-language capability called "image-text sequence understanding" and expands the evaluation set while maintaining high quality.

2024/03/17: 🔥 🔥 We release inference scripts for Qwen-VL and Claude. Qwen-VL-Max and Claude 3 Opus achieve 66.6% and 58.1%, respectively.

2023/12/23: 🔥 🔥 We release inference scripts for GPT-4V and Gemini. Gemini Pro Vision achieves a score of 64.3%.

2023/10/24: 🔥 🔥 We evaluate GPT-4V on MM-Vet and observe that it achieves a score of 67.7%, outperforming other methods by a large margin (20%). However, it still falls well short of the full mark (100%), indicating the need for further efforts to improve the integrated capabilities of LMMs. See the leaderboard, the updated paper, and the GPT-4V prediction examples.

2023/10/07: 🔥 🔥 We released the MM-Vet leaderboard on paperswithcode.com, where you can conveniently add your model results. Note that the date on the leaderboard refers to the model date rather than the paper date, because some improved model versions are released after the paper.

In this repo, we offer the data and evaluator of MM-Vet, proposed in our paper "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities". The code is under the Apache 2.0 license, and the dataset is under the CC BY-NC 4.0 license.

Figure 1: Unlike conventional VL benchmarks, which require only one or two capabilities, MM-Vet focuses on the integration of different core VL capabilities, including recognition, OCR, knowledge, language generation, spatial awareness, and math.

Evaluate your model on MM-Vet

Step 0: Install the openai package with pip install "openai>=1" (the quotes keep the shell from treating >= as a redirect) and get access to the GPT-4/GPT-3.5 API. If you do not have access, you can try the MM-Vet online evaluator Hugging Face Space (but you may have to wait a long time, depending on the number of users).

Step 1: Download the MM-Vet data here and unzip it with unzip mm-vet.zip.

Step 2: Run inference with your model on MM-Vet and save the model outputs in a JSON file like llava_llama2_13b_chat.json, or simply use llava_llama2_13b_chat.json as an example to evaluate. We also release inference scripts for GPT-4V and Gemini.

image_detail=high # or auto, low; refer to https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding

python inference/gpt4v.py --mmvet_path /path/to/mm-vet --image_detail ${image_detail}

python inference/gemini_vision.py --mmvet_path /path/to/mm-vet
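For a model without a released inference script, Step 2 could be sketched roughly as follows. This is only an illustration under assumptions that this README does not confirm: that the unzipped data folder contains an mm-vet.json metadata file whose entries have imagename and question fields, that images live in an images subfolder, and that the results file is a flat JSON object mapping each sample id to the model's answer string. The answer_question function is a hypothetical placeholder for your own model call. Compare your output against results/llava_llama2_13b_chat.json before evaluating.

```python
import json
from pathlib import Path


def answer_question(image_path: Path, question: str) -> str:
    """Hypothetical placeholder: replace with your model's inference call."""
    return ""


def run_inference(mmvet_path: str, result_file: str, answer_fn=answer_question) -> dict:
    """Run answer_fn on every sample and save {sample_id: answer} as JSON."""
    root = Path(mmvet_path)
    # Assumed layout: <mmvet_path>/mm-vet.json and <mmvet_path>/images/
    with open(root / "mm-vet.json", encoding="utf-8") as f:
        samples = json.load(f)

    results = {}
    for sample_id, sample in samples.items():
        image_path = root / "images" / sample["imagename"]
        results[sample_id] = answer_fn(image_path, sample["question"])

    # Create the output folder if needed, then write the results file.
    out = Path(result_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    return results
```

Under the same assumptions, calling run_inference("/path/to/mm-vet", "results/my_model.json", answer_fn=my_model_fn) would produce a file that can be passed to the evaluator via --result_file.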

Step 3: Run git clone https://github.com/yuweihao/MM-Vet.git && cd MM-Vet, then run the LLM-based evaluator in mm-vet_evaluator.ipynb or mm-vet_evaluator.py (thanks to @HireTheHero for arranging it into a py version).

python mm-vet_evaluator.py --mmvet_path /path/to/mm-vet --result_file results/llava_llama2_13b_chat.json

If you cannot access GPT-4 (gpt-4-0613), you can upload your model outputs (a JSON file) to the MM-Vet online evaluator Hugging Face Space to get the grading results.
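Because each sample is tagged with the capabilities it requires (see the samples below), graded results can be summarized per capability as well as overall. The following is a minimal sketch of such an aggregation, not the repo's evaluator: it assumes each sample has already been graded with a score in [0, 1], and the function name, input dictionaries, and short capability tags are hypothetical.

```python
from collections import defaultdict


def aggregate_scores(scores: dict, capabilities: dict) -> dict:
    """Average per-sample scores overall and per capability, as percentages.

    scores:       {sample_id: score in [0, 1]}         (assumed format)
    capabilities: {sample_id: [capability tag, ...]}   (assumed format)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sample_id, score in scores.items():
        # A sample contributes to every capability it requires.
        for cap in capabilities[sample_id]:
            totals[cap] += score
            counts[cap] += 1

    summary = {cap: 100.0 * totals[cap] / counts[cap] for cap in totals}
    summary["total"] = 100.0 * sum(scores.values()) / len(scores) if scores else 0.0
    return summary


if __name__ == "__main__":
    demo_scores = {"a": 1.0, "b": 0.5}
    demo_caps = {"a": ["rec"], "b": ["rec", "ocr"]}
    print(aggregate_scores(demo_scores, demo_caps))
```

A sample that requires several capabilities counts toward each of them, so the per-capability averages overlap and do not sum to the total.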

Citation

@inproceedings{yu2024mm,
  title={Mm-vet: Evaluating large multimodal models for integrated capabilities},
  author={Yu, Weihao and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Kevin and Liu, Zicheng and Wang, Xinchao and Wang, Lijuan},
  booktitle={International conference on machine learning},
  year={2024},
  organization={PMLR}
}

GPT-4V Prediction Examples

About running Bard

Please refer to these two files: inference_bard.sh and inference_bard.py.

Some samples on MM-Vet

Q: What occasions would someone use this meme?

GT: This meme, commonly known as "Screaming Panda," is typically used to express shock, surprise, or fear. It could be used in response to a startling or unexpected event, or to convey a sense of panic or alarm. Some possible occasions where someone might use this meme include:

Reacting to a jump scare in a horror movie
Responding to a surprising plot twist in a TV show or book
Expressing shock at a news headline or current event
Conveying fear or anxiety about an upcoming deadline or exam
Showing surprise at an unexpected outcome in a sports game or other competition.

Required capabilities: Recognition, knowledge, language generation

Q: How many tomatoes are there?

GT: 5

Required capabilities: Recognition

Q: What is located to the right of the shampoo?

GT: conditioner

Required capabilities: OCR, spatial awareness

Q: Which room is bigger, the double garage or the living room?

GT: double garage

Required capabilities: OCR, spatial awareness, math

Q: On the right desk, what is to the left of the laptop?

GT: table lamp <OR> desk lamp

Required capabilities: Recognition, spatial awareness

Q: What are all the scene text in the image?

GT: 5:30PM<AND>88%<AND>Mario Kart 8 Deluxe<AND>MARIO KART 8 DELUXE<AND>SUPER MARIO ODYSSEY<AND>THE LEGEND OF ZELDA<AND>BREATH OF WILD<AND>Options<AND>Start

Required capabilities: OCR

Q: How many gallons of supreme gasoline can I get with $50?

GT: 13.6 <OR> 13.7

Required capabilities: OCR, math

Q: In which country was this photo taken?

GT: Australia

Required capabilities: Recognition, knowledge

Q: Can you explain this meme?

GT: This meme is a humorous take on procrastination and the tendency to delay tasks until a specific time. The person in the meme plans to do something at 8 o'clock, but when they miss that deadline by a few minutes, they decide to wait until 9 o'clock instead. The image of Kermit the Frog lying in bed represents the person's laziness and lack of motivation to complete the task.

Required capabilities: Recognition, OCR, knowledge, language generation

Q: The graph below shows the long-term international migration, UK, 1999-2008.

Summarize the information by selecting and reporting the main features, and make comparisons where relevant.

You should write at least 150 words.

GT: The chart gives information about UK immigration, emigration and net migration between 1999 and 2008.

Both immigration and emigration rates rose over the period shown, but the figures for immigration were significantly higher. Net migration peaked in 2004 and 2007.

In 1999, over 450,000 people came to live in the UK, while the number of people who emigrated stood at just under 300,000. The figure for net migration was around 160,000, and it remained at a similar level until 2003. From 1999 to 2004, the immigration rate rose by nearly 150,000 people, but there was a much smaller rise in emigration. Net migration peaked at almost 250,000 people in 2004.

After 2004, the rate of immigration remained high, but the number of people emigrating fluctuated. Emigration fell suddenly in 2007, before peaking at about 420,000 people in 2008. As a result, the net migration figure rose to around 240,000 in 2007, but fell back to around 160,000 in 2008.

Required capabilities: Recognition, OCR, language generation, spatial awareness

Q: Which car is on the parking spot 33?

GT: no <OR> empty

Required capabilities: Recognition, OCR, spatial awareness

Q: Is this apple organic?

GT: yes

Required capabilities: Recognition, OCR

Q: Which are producers in this food web?

GT: Phytoplankton <AND> Seaweed

Required capabilities: OCR, knowledge, spatial awareness

Q: Is the person bigger than the car?

GT: no

Required capabilities: Recognition, knowledge, spatial awareness

Q: The table below gives information about the underground railway systems in six cities.

Summarise the information by selecting and reporting the main features, and make comparisons where relevant.

You should write at least 150 words.

GT: The table shows data about the underground rail networks in six major cities.

The table compares the six networks in terms of their age, size and the number of people who use them each year. It is clear that the three oldest underground systems are larger and serve significantly more passengers than the newer systems.

The London underground is the oldest system, having opened in 1863. It is also the largest system, with 394 kilometres of route. The second largest system, in Paris, is only about half the size of the London underground, with 199 kilometres of route. However, it serves more people per year. While only third in terms of size, the Tokyo system is easily the most used, with 1927 million passengers per year.

Of the three newer networks, the Washington DC underground is the most extensive, with 126 kilometres of route, compared to only 11 kilometres and 28 kilometres for the Kyoto and Los Angeles systems. The Los Angeles network is the newest, having opened in 2001, while the Kyoto network is the smallest and serves only 45 million passengers per year.

Required capabilities: OCR, language generation, spatial awareness

Q: What will the girl on the right write on the board?

GT: 14

Required capabilities: Recognition, OCR, spatial awareness, math

More samples are shown here.
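Several ground truths above use <AND> and <OR> markers. Judging from the samples, <OR> appears to separate acceptable alternatives (e.g. "table lamp <OR> desk lamp") and <AND> appears to separate items that are all expected (e.g. "Phytoplankton <AND> Seaweed"); this reading is an interpretation of the samples rather than something this README states. The sketch below only splits a ground-truth string on these markers for inspection. The function name and return structure are hypothetical, and the actual evaluator is LLM-based rather than string-matching.

```python
def parse_ground_truth(gt: str) -> dict:
    """Split a ground-truth string on <AND> / <OR> markers.

    Returns a dict with:
      mode:  "all" (joined by <AND>), "any" (joined by <OR>), or "single"
      items: the individual answer strings, whitespace-stripped
    """
    if "<AND>" in gt:
        return {"mode": "all", "items": [p.strip() for p in gt.split("<AND>")]}
    if "<OR>" in gt:
        return {"mode": "any", "items": [p.strip() for p in gt.split("<OR>")]}
    return {"mode": "single", "items": [gt.strip()]}


if __name__ == "__main__":
    for gt in ["table lamp <OR> desk lamp", "Phytoplankton <AND> Seaweed", "5"]:
        print(parse_ground_truth(gt))
```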
