Youngsun Lim *, Hojun Choi *, Hyunjung Shim
(* indicates equal contributions)
Graduate School of Artificial Intelligence, KAIST, Republic of Korea
{youngsun_ai, hchoi256, kateshim}@kaist.ac.kr
This is the official implementation of I-HallA v1.0.
Despite the huge success of text-to-image (TTI) generation models, existing studies seldom consider whether generated images accurately represent factual information. In this paper, we define the problem of image hallucination as the failure of generated images to accurately depict factual information. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), an automatic evaluation metric that measures the factuality of generated images through visual question answering (VQA), and I-HallA v1.0, a curated benchmark dataset. We develop a three-stage pipeline that generates curated question-answer pairs using multiple GPT-4 Omni-based agents with human judgments. Our evaluation protocols measure image hallucination by testing whether images from existing text-to-image models can correctly answer these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across 9 categories with varying levels of difficulty and 1,000 questions covering 9 compositions. We evaluate 5 different text-to-image models using I-HallA and demonstrate that these state-of-the-art models often fail to accurately convey factual information. Additionally, we establish the validity of our evaluation method through human evaluation, yielding a Spearman's correlation of 0.95. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate text-to-image generation models.
- [x] [2024.12.16] The official code has been released!
- [x] [2024.12.10] Our paper has been accepted to AAAI 2025!
- [x] [2024.09.19] Our paper is now available! You can find the paper here.
Clone and build the repo:
git clone https://github.com/hchoi256/I-HallA-v1.0.git
cd I-HallA-v1.0
pip install -r requirements.txt
- Prepare the image data for the I-HallA v1.0 benchmark under `models/`.
- Obtain the caption data for the I-HallA v1.0 benchmark from `captions.xlsx`.
- Obtain the QA sets for the history domain of the I-HallA v1.0 benchmark from `LV_reasoning_history.xlsx`.
- Obtain the QA sets for the science domain of the I-HallA v1.0 benchmark from `LV_reasoning_science.xlsx`.
- Obtain the reasoning data for the history domain of the I-HallA v1.0 benchmark from `GPT4o_QA_history_mod_cois.xlsx`.
- Obtain the reasoning data for the science domain of the I-HallA v1.0 benchmark from `GPT4o_QA_science_mod_cois.xlsx`.

Then, put them under `data/`.
The data structure looks like:
```
data/
├── LV_reasoning_history.xlsx
├── LV_reasoning_science.xlsx
├── GPT4o_QA_science_mod_cois.xlsx
├── GPT4o_QA_history_mod_cois.xlsx
├── captions.xlsx
└── models
    ├── dalle-3
    │   ├── history
    │   │   ├── normal
    │   │   └── weird
    │   └── science
    │       ├── normal
    │       └── weird
    ├── sd-v1-4
    │   ├── history
    │   │   └── weird
    │   └── science
    │       └── weird
    ├── sd-v1-5
    │   ├── history
    │   │   └── weird
    │   └── science
    │       └── weird
    ├── sd-v2-0
    │   ├── history
    │   │   └── weird
    │   └── science
    │       └── weird
    └── sd-xl
        ├── history
        │   └── weird
        └── science
            └── weird
```
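If you want to sanity-check your local copy before running anything, a small verification script like the one below can help. This is an illustrative sketch, not part of the repository; it only encodes the layout shown above (SD models ship only `weird` images, `dalle-3` ships both `normal` and `weird`).

```python
# Illustrative sanity check for the expected data layout (not part of the repository).
from pathlib import Path

DATA = Path("data")
XLSX_FILES = [
    "LV_reasoning_history.xlsx", "LV_reasoning_science.xlsx",
    "GPT4o_QA_history_mod_cois.xlsx", "GPT4o_QA_science_mod_cois.xlsx",
    "captions.xlsx",
]
MODELS = ["dalle-3", "sd-v1-4", "sd-v1-5", "sd-v2-0", "sd-xl"]

missing = [str(DATA / f) for f in XLSX_FILES if not (DATA / f).exists()]
for model in MODELS:
    for domain in ("history", "science"):
        # Only dalle-3 provides both "normal" and "weird" images; SD models provide "weird" only.
        image_types = ("normal", "weird") if model == "dalle-3" else ("weird",)
        for image_type in image_types:
            folder = DATA / "models" / model / domain / image_type
            if not folder.is_dir():
                missing.append(str(folder))

print("All files in place!" if not missing else f"Missing: {missing}")
```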
The following is a quick start guide for evaluating the Image Hallucination scores of five different TTI generation models. Our method requires the GPT-4o API for evaluation.
- Enter your `API_KEY` in `run.py`:

  ```python
  YOUR_API_KEY = "[YOUR_API_KEY]"
  ```

- Specify the type of TTI generation model to evaluate in `EvaluationAgent.py`:

  ```python
  payload = {
      "model": "[YOUR_MODEL]",
      "messages": [
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": ""},
              ],
          }
      ],
  }
  ```

  ```python
  self.model_version = "YOUR_TTI_MODEL"  # Options: "dalle-3", "sd-v1-4", "sd-v1-5", "sd-v2-0", "sd-xl"
  self.image_type = "weird"
  ```

  To evaluate factual images collected from textbooks, use the following settings:

  ```python
  self.model_version = "dalle-3"
  self.image_type = "normal"
  ```

- Choose a `<CATEGORY>` to evaluate in `run.py` and execute the following command:

  ```bash
  python run.py --category <CATEGORY> --agent_name EvaluationAgent
  ```
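Conceptually, the score reported by `EvaluationAgent.py` boils down to QA accuracy: a TTI model scores well if the GPT-4o evaluator can answer the curated questions correctly from its generated images. The sketch below only illustrates this idea; the field names and the `answer_question` helper are assumptions, not the repository's actual interface.

```python
# Conceptual sketch of I-HallA scoring: the score of a TTI model is the fraction of
# curated questions answered correctly from its generated images.
# `qa_pairs` and `answer_question` are hypothetical placeholders, not the repo's API.
from typing import Callable, Dict, List

def i_halla_score(qa_pairs: List[Dict[str, str]],
                  answer_question: Callable[[str, str], str]) -> float:
    """qa_pairs: [{"image_path": ..., "question": ..., "answer": ...}, ...]
    answer_question(image_path, question) -> the evaluator's predicted answer."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_question(qa["image_path"], qa["question"]).strip().lower()
        == qa["answer"].strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```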
In addition to the five TTI generation models we evaluated, our benchmark also supports the evaluation of other TTI models.
To do this, follow these steps:
- Load the `captions.xlsx` file and input the prompts for each domain into your TTI model. Use the outputs to build the data structure described above (see the sketch after this list).
- Follow the instructions in the Quick Start guide and update the TTI model name to the appropriate one for your evaluation.
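For example, generating images with a Hugging Face diffusion model and saving them into the expected layout might look like the following sketch. The spreadsheet column names (`domain`, `caption`) and the file-naming convention are assumptions; adapt them to the actual contents of `captions.xlsx` and to your own model.

```python
# Illustrative sketch: generate images for the benchmark prompts with a custom TTI model
# and save them into the directory layout expected by the evaluation code.
# Column names and the file-naming convention are assumptions; check captions.xlsx.
from pathlib import Path

import pandas as pd
from diffusers import StableDiffusionPipeline

MODEL_NAME = "my-tti-model"  # folder name to use under data/models/
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

captions = pd.read_excel("data/captions.xlsx")
for idx, row in captions.iterrows():
    domain = row["domain"]   # e.g., "science" or "history" (assumed column)
    prompt = row["caption"]  # the prompt text (assumed column)
    out_dir = Path("data/models") / MODEL_NAME / domain / "weird"
    out_dir.mkdir(parents=True, exist_ok=True)
    image = pipe(prompt).images[0]
    image.save(out_dir / f"{idx}.png")
```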
Our method uses a multi-agent design to create a benchmark for Image Hallucination evaluation.
Set up and run the agents using the following command:
```bash
python run.py --category <CATEGORY> --agent_name <AGENT>
```
- `ImageAgent.py`: The VLM determines whether an input image is "normal" or "weird", with reasoning.
- `ReasoningAgent.py`: The VLM provides the reasoning behind its "normal" or "weird" determination for the input image.
- `CategoryAgent.py`: The VLM determines the category for the given captions.
- `CoIAgent.py`: The VLM identifies the Composition of Interest (CoI) for the given captions.
The data extracted by these agents is aggregated and then used by the following agents:
- `QAAgent.py`: Generates the question-answer (QA) sets.
- `EvaluationAgent.py`: Performs the evaluation process.
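Each agent ultimately sends an image plus an instruction to the GPT-4o chat completions endpoint, using a payload shaped like the one shown in the Quick Start. The standalone sketch below only illustrates that kind of call; the prompt text, file path, and model string are illustrative, not the agents' actual prompts.

```python
# Minimal sketch of the kind of GPT-4o call the agents make: send one image plus an
# instruction to the chat completions endpoint. Prompt text and paths are illustrative.
import base64

import requests

YOUR_API_KEY = "[YOUR_API_KEY]"

def ask_gpt4o(image_path: str, instruction: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
    headers = {"Authorization": f"Bearer {YOUR_API_KEY}",
               "Content-Type": "application/json"}
    response = requests.post("https://api.openai.com/v1/chat/completions",
                             headers=headers, json=payload)
    return response.json()["choices"][0]["message"]["content"]

# e.g., an ImageAgent-style query (path and wording are illustrative):
# print(ask_gpt4o("data/models/dalle-3/science/weird/0.png",
#                 'Is this image "normal" or "weird"? Explain your reasoning.'))
```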
We provide an easy interface to perform VQA (Visual Question Answering) inference.
The supported image-to-text models include:
- `blip2-opt-2.7b`
- `instructblip-vicuna-7b`
- `llava-v1.6-34b-hf`
You can also add your own VQA model by creating a file at `vlms/<YOUR_MODEL>.py` (a sketch is shown below).
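The sketch below shows what such a file could look like, using BLIP-2 from Hugging Face `transformers` as a stand-in. The QA-loading step and column names are assumptions; mirror the existing scripts in `vlms/` for the actual benchmark format.

```python
# Illustrative sketch of a custom VQA script (e.g., vlms/my_model.py) based on BLIP-2.
# The QA loading step and column names are placeholders; mirror the existing vlms/ scripts.
import pandas as pd
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def answer(image_path: str, question: str) -> str:
    """Answer one question about one image with BLIP-2."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=f"Question: {question} Answer:",
                       return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

if __name__ == "__main__":
    qa = pd.read_excel("data/LV_reasoning_science.xlsx")  # column names below are assumptions
    for _, row in qa.iterrows():
        print(answer(row["image_path"], row["question"]))
```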
- Follow the Quick Start guide for data preparation and proper directory settings in the Python files.

- Run the following command to execute inference:

  ```bash
  python vlms/<VQA_MODEL>.py
  ```

  Replace `<VQA_MODEL>` with the name of your desired VQA model, such as `blip2`, `instructblip`, or `llava2`.
If you find our work helpful, please cite us:
@inproceedings{ihalla,
  author    = {Youngsun Lim and
               Hojun Choi and
               Hyunjung Shim},
  title     = {Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering},
  booktitle = {AAAI},
  year      = {2025},
}
We sincerely thank the developers and contributors of the tools and models we build on, including GPT-4o, for their invaluable contributions.