[Paper] [Download Dataset] [Dataset on Hugging Face] [Leaderboard] [Online Evaluator]
2024/08/02: 🔥🔥 We release MM-Vet v2, an extension of MM-Vet that adds a new vision-language capability called "image-text sequence understanding" and expands the evaluation set while maintaining high quality.
2024/03/17: 🔥🔥 We release inference scripts for Qwen-VL and Claude. Qwen-VL-Max and Claude 3 Opus achieve 66.6% and 58.1%, respectively.
2023/12/23: 🔥🔥 We release inference scripts for GPT-4V and Gemini. Gemini Pro Vision achieves a score of 64.3%.
2023/10/24: 🔥🔥 We evaluate GPT-4V on MM-Vet and observe that it achieves a score of 67.7%, outperforming other methods by a large margin (20%). However, it is still far from the full mark (100%), indicating that further effort is needed to improve the integrated capabilities of LMMs. See the leaderboard, updated paper, and GPT-4V prediction examples.
2023/10/07: 🔥🔥 We released the MM-Vet leaderboard on paperswithcode.com, where you can conveniently add your model results. Note that the date listed there is the model date rather than the paper date, because some improved model versions are released after the paper.
In this repo, we offer the data and evaluator of MM-Vet, proposed in our paper "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities". The code is under the Apache 2.0 license, and the dataset is under the CC BY-NC 4.0 license.
Figure 1: Unlike conventional VL benchmarks, which require only one or two capabilities, MM-Vet focuses on the integration of different core VL capabilities, including recognition, OCR, knowledge, language generation, spatial awareness, and math.
Step 0: Install the openai package with pip install "openai>=1" (the quotes stop the shell from treating > as a redirect) and get access to the GPT-4/GPT-3.5 API. If you do not have access, you can try the MM-Vet online evaluator Hugging Face Space (but you may have to wait a long time, depending on the number of users).
Step 1: Download the MM-Vet data here and unzip it with unzip mm-vet.zip.
Step 2: Run inference with your model on MM-Vet and save the model outputs in JSON, like llava_llama2_13b_chat.json, or just use llava_llama2_13b_chat.json as an example to evaluate. We also release inference scripts for GPT-4V and Gemini.
image_detail=high  # or auto, low; see https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding
python inference/gpt4v.py --mmvet_path /path/to/mm-vet --image_detail ${image_detail}
python inference/gemini_vision.py --mmvet_path /path/to/mm-vet
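For a model without a provided inference script, Step 2 comes down to looping over the questions and writing one answer per sample to a JSON file. The sketch below is a minimal illustration, not the repo's own script. It assumes the unzipped folder contains mm-vet.json mapping sample ids to records with "imagename" and "question" fields, that images sit in an images subfolder, and that the result file maps each sample id to the model's answer string; check these assumptions against llava_llama2_13b_chat.json before relying on it. `answer_question` and `run_inference` are hypothetical names.

```python
import json
import os


def answer_question(image_path, question):
    """Hypothetical placeholder: replace with a call to your own model."""
    return "placeholder answer"


def run_inference(mmvet_path, result_file, model_fn=answer_question):
    # Assumed layout: <mmvet_path>/mm-vet.json and <mmvet_path>/images/
    with open(os.path.join(mmvet_path, "mm-vet.json"), encoding="utf-8") as f:
        samples = json.load(f)

    # Assumed result format: {sample_id: answer_string}
    results = {}
    for sample_id, sample in samples.items():
        image_path = os.path.join(mmvet_path, "images", sample["imagename"])
        results[sample_id] = model_fn(image_path, sample["question"])

    # Create the output folder if the path includes one
    out_dir = os.path.dirname(result_file)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    with open(result_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=4, ensure_ascii=False)
    return results
```

Calling `run_inference("/path/to/mm-vet", "results/my_model.json")` with your own `model_fn` would then produce a file that can be passed to the evaluator via --result_file, provided the assumed format holds.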
Step 3: git clone https://github.com/yuweihao/MM-Vet.git && cd MM-Vet, then run the LLM-based evaluator in mm-vet_evaluator.ipynb or mm-vet_evaluator.py (thanks to @HireTheHero for arranging it into a .py version).
python mm-vet_evaluator.py --mmvet_path /path/to/mm-vet --result_file results/llava_llama2_13b_chat.json
If you cannot access GPT-4 (gpt-4-0613), you can upload your model outputs (JSON file) to the MM-Vet online evaluator Hugging Face Space to get the grading results.
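Once each sample has a grade, the results can be summarized overall and per capability. The sketch below is an illustration rather than the evaluator's own code: it assumes each sample receives a score between 0 and 1 and that a sample counts toward every capability it requires. The function name and input shapes are hypothetical.

```python
from collections import defaultdict


def summarize_scores(scores, capabilities):
    """Average per-sample scores overall and per capability, as percentages.

    scores: sample id -> score in [0, 1] (assumed grader output)
    capabilities: sample id -> list of required capability names
    """
    if not scores:
        raise ValueError("no scores to summarize")

    # Collect the scores of all samples that require each capability
    per_cap = defaultdict(list)
    for sample_id, score in scores.items():
        for cap in capabilities[sample_id]:
            per_cap[cap].append(score)

    summary = {"total": 100 * sum(scores.values()) / len(scores)}
    for cap, values in per_cap.items():
        summary[cap] = 100 * sum(values) / len(values)
    return summary
```

Because a sample requiring several capabilities contributes to each of them, the per-capability numbers do not average to the total.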
@inproceedings{yu2024mm,
title={Mm-vet: Evaluating large multimodal models for integrated capabilities},
author={Yu, Weihao and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Kevin and Liu, Zicheng and Wang, Xinchao and Wang, Lijuan},
booktitle={International conference on machine learning},
year={2024},
organization={PMLR}
}
Please refer to these two files: inference_bard.sh and inference_bard.py.
Q: What occasions would someone use this meme?
GT: This meme, commonly known as "Screaming Panda," is typically used to express shock, surprise, or fear. It could be used in response to a startling or unexpected event, or to convey a sense of panic or alarm. Some possible occasions where someone might use this meme include:
- Reacting to a jump scare in a horror movie
- Responding to a surprising plot twist in a TV show or book
- Expressing shock at a news headline or current event
- Conveying fear or anxiety about an upcoming deadline or exam
- Showing surprise at an unexpected outcome in a sports game or other competition.
Required capabilities: Recognition, knowledge, language generation
Q: How many tomatoes are there?
GT: 5
Required capabilities: Recognition
Q: What is located to the right of the shampoo?
GT: conditioner
Required capabilities: OCR, spatial awareness
Q: Which room is bigger, the double garage or the living room?
GT: double garage
Required capabilities: OCR, spatial awareness, math
Q: On the right desk, what is to the left of the laptop?
GT: table lamp <OR> desk lamp
Required capabilities: Recognition, spatial awareness
Q: What are all the scene text in the image?
GT: 5:30PM<AND>88%<AND>Mario Kart 8 Deluxe<AND>MARIO KART 8 DELUXE<AND>SUPER MARIO ODYSSEY<AND>THE LEGEND OF ZELDA<AND>BREATH OF WILD<AND>Options<AND>Start
Required capabilities: OCR
Q: How many gallons of supreme gasoline can I get with $50?
GT: 13.6 <OR> 13.7
Required capabilities: OCR, math
Q: In which country was this photo taken?
GT: Australia
Required capabilities: Recognition, knowledge
Q: Can you explain this meme?
GT: This meme is a humorous take on procrastination and the tendency to delay tasks until a specific time. The person in the meme plans to do something at 8 o'clock, but when they miss that deadline by a few minutes, they decide to wait until 9 o'clock instead. The image of Kermit the Frog lying in bed represents the person's laziness and lack of motivation to complete the task.
Required capabilities: Recognition, OCR, knowledge, language generation
Q: The graph below shows the long-term international migration, UK, 1999-2008.
Summarize the information by selecting and reporting the main features, and make comparisons where relevant.
You should write at least 150 words.
GT: The chart gives information about UK immigration, emigration and net migration between 1999 and 2008.
Both immigration and emigration rates rose over the period shown, but the figures for immigration were significantly higher. Net migration peaked in 2004 and 2007.
In 1999, over 450,000 people came to live in the UK, while the number of people who emigrated stood at just under 300,000. The figure for net migration was around 160,000, and it remained at a similar level until 2003. From 1999 to 2004, the immigration rate rose by nearly 150,000 people, but there was a much smaller rise in emigration. Net migration peaked at almost 250,000 people in 2004.
After 2004, the rate of immigration remained high, but the number of people emigrating fluctuated. Emigration fell suddenly in 2007, before peaking at about 420,000 people in 2008. As a result, the net migration figure rose to around 240,000 in 2007, but fell back to around 160,000 in 2008.
Required capabilities: Recognition, OCR, language generation, spatial awareness
Q: Which car is on the parking spot 33?
GT: no <OR> empty
Required capabilities: Recognition, OCR, spatial awareness
Q: Is this apple organic?
GT: yes
Required capabilities: Recognition, OCR
Q: Which are producers in this food web?
GT: Phytoplankton <AND> Seaweed
Required capabilities: OCR, knowledge, spatial awareness
Q: Is the person bigger than the car?
GT: no
Required capabilities: Recognition, knowledge, spatial awareness
Q: The table below gives information about the underground railway systems in six cities.
Summarise the information by selecting and reporting the main features, and make comparisons where relevant.
You should write at least 150 words.
GT: The table shows data about the underground rail networks in six major cities.
The table compares the six networks in terms of their age, size and the number of people who use them each year. It is clear that the three oldest underground systems are larger and serve significantly more passengers than the newer systems.
The London underground is the oldest system, having opened in 1863. It is also the largest system, with 394 kilometres of route. The second largest system, in Paris, is only about half the size of the London underground, with 199 kilometres of route. However, it serves more people per year. While only third in terms of size, the Tokyo system is easily the most used, with 1927 million passengers per year.
Of the three newer networks, the Washington DC underground is the most extensive, with 126 kilometres of route, compared to only 11 kilometres and 28 kilometres for the Kyoto and Los Angeles systems. The Los Angeles network is the newest, having opened in 2001, while the Kyoto network is the smallest and serves only 45 million passengers per year.
Required capabilities: OCR, language generation, spatial awareness
Q: What will the girl on the right write on the board?
GT: 14
Required capabilities: Recognition, OCR, spatial awareness, math
More samples are shown here.
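Several ground truths above use <OR> and <AND> markers. Reading <OR> as separating acceptable alternatives and <AND> as joining items that must all appear, a ground truth can be parsed as sketched below. This reading, the precedence (split on <AND> first, then on <OR>), and both function names are assumptions for illustration; actual grading is done by the LLM-based evaluator, and the substring check here is only a rough sanity check.

```python
def parse_ground_truth(gt):
    """Split a ground truth into required parts, each a list of alternatives."""
    return [
        [alt.strip() for alt in part.split("<OR>")]
        for part in gt.split("<AND>")
    ]


def loosely_matches(prediction, gt):
    """Case-insensitive substring check; a sanity check, not the MM-Vet metric."""
    pred = prediction.lower()
    return all(
        any(alt.lower() in pred for alt in alternatives)
        for alternatives in parse_ground_truth(gt)
    )


print(parse_ground_truth("table lamp <OR> desk lamp"))
print(loosely_matches("It is a desk lamp.", "table lamp <OR> desk lamp"))
```

A check like this can flag obviously wrong short answers before spending API calls, but it cannot judge the long free-form answers, which is why the benchmark relies on an LLM grader.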