
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models (ICLR 2025)

Authors: Kim Sung-Bin*, Oh Hyun-Bin*, JungMok Lee, Arda Senocak, Joon Son Chung, Tae-Hyun Oh
(* denotes equal contribution.)


This repository contains the official dataset for the ICLR 2025 paper, "AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models". We introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs.


Abstract: Following the success of Large Language Models (LLMs), expanding their boundaries to new modalities represents a significant paradigm shift in multimodal understanding. Human perception is inherently multimodal, relying not only on text but also on auditory and visual cues for a complete understanding of the world. In recognition of this fact, audio-visual LLMs have recently emerged. Despite promising developments, the lack of dedicated benchmarks poses challenges for understanding and evaluating models. In this work, we show that audio-visual LLMs struggle to discern subtle relationships between audio and visual signals, leading to hallucinations, underscoring the need for reliable benchmarks. To address this, we introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs. Our benchmark includes tests for assessing hallucinations, as well as the cross-modal matching and reasoning abilities of these models. Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities, due to their limited capacity to perceive complex multimodal signals and their relationships. Additionally, we demonstrate that simple training with our AVHBench improves the robustness of audio-visual LLMs against hallucinations.

Leaderboard

Audio-driven Video Hallucination

Rank Model Acc. (↑) Precision (↑) Recall (↑) F1 (↑) Yes (%)
🥇1st AVHModel-Align-FT 83.9 - - - -
🥈2nd Gemini-Flash 83.3 85.7 81.0 83.7 47.3
🥉3rd Video-SALMONN 78.1 74.9 84.5 79.4 56.4
4th Video-LLaMA2 75.2 73.6 78.7 76.1 53.6
5th PandaGPT 58.5 55.3 91.1 68.8 82.3
6th OneLLM 53.7 58.6 64.8 49.8 63.1
7th ChatBridge 52.9 70.9 52.9 48.9 77.6
8th ImageBind-LLM 50.3 50.2 87.1 63.7 86.7
9th Video-LLaMA 50.1 50.1 100 66.7 99.9
10th X-InstructBLIP 18.1 16.0 15.0 15.5 46.9

Video-driven Audio Hallucination

Rank Model Acc. (↑) Precision (↑) Recall (↑) F1 (↑) Yes (%)
🥇1st AVHModel-Align-FT 77.3 - - - -
🥈2nd Video-LLaMA2 74.2 69.4 86.6 77.0 62.4
🥉3rd Video-SALMONN 65.2 62.3 76.9 68.8 61.7
4th Gemini-Flash 63.0 57.9 94.7 71.9 81.7
5th PandaGPT 61.3 57.4 86.6 69.1 75.5
6th Video-LLaMA 50.2 50.2 100 66.9 100
7th ImageBind-LLM 50.0 50.0 99.3 66.5 99.3
8th OneLLM 44.3 50.2 39.4 49.8 55.0
9th ChatBridge 32.8 60.0 32.8 39.8 14.8
10th X-InstructBLIP 16.3 14.5 38.5 21.1 77.0
  • Ranked by Accuracy
  • AVHModel-Align-FT refers to our final model presented in the fourth row of Table 4 in the main paper.
  • Last update: April 4th, 2025
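
For reference, the leaderboard columns are standard binary classification statistics computed over the models' yes/no answers, with "Yes" treated as the positive class and "Yes (%)" measuring how often a model answers "Yes". The Python sketch below is our own illustration of how these numbers can be computed; it is not the official evaluation script:

    def binary_metrics(preds, labels):
        # preds and labels: lists of "Yes"/"No" strings; "Yes" is the positive class.
        tp = sum(p == "Yes" and y == "Yes" for p, y in zip(preds, labels))
        fp = sum(p == "Yes" and y == "No" for p, y in zip(preds, labels))
        fn = sum(p == "No" and y == "Yes" for p, y in zip(preds, labels))
        tn = sum(p == "No" and y == "No" for p, y in zip(preds, labels))
        acc = 100 * (tp + tn) / len(labels)
        precision = 100 * tp / (tp + fp) if tp + fp else 0.0
        recall = 100 * tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        yes_rate = 100 * (tp + fp) / len(labels)  # "Yes (%)": share of "Yes" predictions
        return acc, precision, recall, f1, yes_rate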

Download the AVHBench Dataset

At this time, we provide a subset of AVHBench, which includes both real and synthetic (swapped) video samples.

  1. Download the AVHBench dataset (videos | QA)
  2. Details of each file in the dataset
    • {video_id}.mp4: a real video sample sourced from VALOR and AudioCaps.

    • {QA}.json: all the question-and-answer pairs for the video; each entry contains the following metadata: (1) video ID, (2) type of hallucination task, (3) input text prompt, and (4) ground-truth label. An example of the JSON file is shown below, followed by a minimal loading sketch:

      [
         {
           "video_id": "00191",
           "task": "Video-driven Audio Hallucination",
           "text": "Is the sleeping man making sound in the audio?",
           "label": "Yes"
         },
         {
           "video_id": "00191",
           "task": "Video-driven Audio Hallucination",
           "text": "Is the couch making sound in the audio?",
           "label": "No"
         },
         {
           "video_id": "00191",
           "task": "Audio-driven Video Hallucination",
           "text": "Is the sleeping man visible in the video?",
           "label": "Yes"
         },
         {
           "video_id": "00191",
           "task": "Audio-driven Video Hallucination",
           "text": "Is the person walking visible in the video?",
           "label": "No"
         },
         {
           "video_id": "00191",
           "task": "AV Matching",
           "text": "Are the contexts of audio and visual content matching?",
           "label": "Yes"
         },
         {
           "video_id": "00191",
           "task": "AV Captioning",
           "text": "Describe what you see and hear in a single sentence.",
           "label": "A young man with no shirt on is laying in bed, with footsteps walking on a hard surface followed by a person snoring."
         }
      ]
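
For convenience, here is a minimal Python sketch for loading the QA file and grouping the questions by task. The file and folder names (qa.json, videos/) are assumptions; adjust them to match your local copy of the dataset:

    import json
    from collections import defaultdict

    # Load the QA annotations (file name is an assumption; use your downloaded file).
    with open("qa.json", "r") as f:
        qa_pairs = json.load(f)

    # Group entries by hallucination task; each entry's video is {video_id}.mp4.
    by_task = defaultdict(list)
    for item in qa_pairs:
        video_path = f"videos/{item['video_id']}.mp4"  # hypothetical folder layout
        by_task[item["task"]].append((video_path, item["text"], item["label"]))

    for task, items in by_task.items():
        print(f"{task}: {len(items)} question(s)")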
      

Download the AVHModel-Align-FT Model's Checkpoints

You can find the checkpoints for running inference with the model at the link below:
https://drive.google.com/drive/folders/18DqvDXVGNr3lFmU9U-Xw2btzfOaozfPv?usp=sharing
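
One way to fetch the checkpoint folder programmatically is with the third-party gdown package; the sketch below is an illustration (the output directory name is an assumption):

    import gdown  # pip install gdown

    # Download the full checkpoint folder from Google Drive (URL from above).
    url = "https://drive.google.com/drive/folders/18DqvDXVGNr3lFmU9U-Xw2btzfOaozfPv?usp=sharing"
    gdown.download_folder(url, output="avhmodel_align_ft_ckpt", quiet=False)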

The model and inference code are based on a modified version of AffectGPT.

Citation

If you use AVHBench in a research paper, please cite our work as follows:

@article{sung2024avhbench,
  title={AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models},
  author={Sung-Bin, Kim and Hyun-Bin, Oh and Lee, JungMok and Senocak, Arda and Chung, Joon Son and Oh, Tae-Hyun},
  journal={arXiv preprint arXiv:2410.18325},
  year={2024}
}

Acknowledgement

We are grateful to the following awesome projects that AVHBench builds upon:

  • GPT-4: Language Models are Few-Shot Learners
  • Recognize Anything Model: Visual Tagging Models for Dataset Construction Pipeline
  • VALOR: VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
  • AudioCaps: AudioCaps: Generating Captions for Audios in the Wild
