Youngsun Lim *, Hojun Choi *, Hyunjung Shim
(* indicates equal contributions)
Graduate School of Artificial Intelligence, KAIST, Republic of Korea
{youngsun_ai, hchoi256, kateshim}@kaist.ac.kr
This is the official implementation of I-HallA v1.0.
Despite the huge success of text-to-image (TTI) generation models, existing studies seldom consider whether generated images accurately represent factual information. In this paper, we define the problem of image hallucination as the failure of generated images to accurately depict factual information. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), an automatic evaluation metric that measures the factuality of generated images through visual question answering (VQA), and I-HallA v1.0, a curated benchmark dataset. We develop a three-stage pipeline that generates curated question-answer pairs using multiple GPT-4 Omni-based agents with human judgments. Our evaluation protocols measure image hallucination by testing whether images from existing text-to-image models can correctly answer these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across 9 categories with varying levels of difficulty and 1,000 questions covering 9 compositions. We evaluate 5 different text-to-image models using I-HallA and demonstrate that these state-of-the-art models often fail to accurately convey factual information. Additionally, we establish the validity of our evaluation method through human evaluation, yielding a Spearman's correlation of 0.95. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate text-to-image generation models.
- [x] [2024.12.16] The official code has been released!
- [x] [2024.12.10] Our paper has been accepted to AAAI 2025!
- [x] [2024.09.19] Our paper is now available! You can find the paper here.
Clone and build the repo:
git clone https://github.com/hchoi256/I-HallA-v1.0.git
cd I-HallA-v1.0
pip install -r requirements.txt
- Prepare the image data for the I-HallA v1.0 benchmark under `models/`.
- Obtain the caption data for the I-HallA v1.0 benchmark from `captions.xlsx`.
- Obtain the QA sets for the history domain of the I-HallA v1.0 benchmark from `LV_reasoning_history.xlsx`.
- Obtain the QA sets for the science domain of the I-HallA v1.0 benchmark from `LV_reasoning_science.xlsx`.
- Obtain the reasoning data for the history domain of the I-HallA v1.0 benchmark from `GPT4o_QA_history_mod_cois.xlsx`.
- Obtain the reasoning data for the science domain of the I-HallA v1.0 benchmark from `GPT4o_QA_science_mod_cois.xlsx`.

Then, put them under `data/`.
The data structure looks like:
```
data/
├── LV_reasoning_history.xlsx
├── LV_reasoning_science.xlsx
├── GPT4o_QA_science_mod_cois.xlsx
├── GPT4o_QA_history_mod_cois.xlsx
├── captions.xlsx
└── models
    ├── dalle-3
    │   ├── history
    │   │   ├── normal
    │   │   └── weird
    │   └── science
    │       ├── normal
    │       └── weird
    ├── sd-v1-4
    │   ├── history
    │   │   └── weird
    │   └── science
    │       └── weird
    ├── sd-v1-5
    │   ├── history
    │   │   └── weird
    │   └── science
    │       └── weird
    ├── sd-v2-0
    │   ├── history
    │   │   └── weird
    │   └── science
    │       └── weird
    └── sd-xl
        ├── history
        │   └── weird
        └── science
            └── weird
```
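If you want to sanity-check your local copy before running anything, a small verification script like the one below can help. This is an illustrative sketch, not part of the repository; it only encodes the layout shown above (SD models ship only `weird` images, `dalle-3` ships both `normal` and `weird`).

```python
# Illustrative sanity check for the expected data layout (not part of the repository).
from pathlib import Path

DATA = Path("data")
XLSX_FILES = [
    "LV_reasoning_history.xlsx", "LV_reasoning_science.xlsx",
    "GPT4o_QA_history_mod_cois.xlsx", "GPT4o_QA_science_mod_cois.xlsx",
    "captions.xlsx",
]
MODELS = ["dalle-3", "sd-v1-4", "sd-v1-5", "sd-v2-0", "sd-xl"]

missing = [str(DATA / f) for f in XLSX_FILES if not (DATA / f).exists()]
for model in MODELS:
    for domain in ("history", "science"):
        # Only dalle-3 provides both "normal" and "weird" images; SD models provide "weird" only.
        image_types = ("normal", "weird") if model == "dalle-3" else ("weird",)
        for image_type in image_types:
            folder = DATA / "models" / model / domain / image_type
            if not folder.is_dir():
                missing.append(str(folder))

print("All files in place!" if not missing else f"Missing: {missing}")
```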
The following is a quick start guide for evaluating the Image Hallucination scores of five different TTI generation models. Our method requires the GPT-4o API for evaluation.
- Enter your `API_KEY` in `run.py`:

  ```python
  YOUR_API_KEY = "[YOUR_API_KEY]"
  ```

- Specify the type of TTI generation model to evaluate in `EvaluationAgent.py`:

  ```python
  payload = {
      "model": "[YOUR_MODEL]",
      "messages": [
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": ""},
              ],
          }
      ],
  }
  ```

  ```python
  self.model_version = "YOUR_TTI_MODEL"  # Options: "dalle-3", "sd-v1-4", "sd-v1-5", "sd-v2-0", "sd-xl"
  self.image_type = "weird"
  ```

  To evaluate factual images collected from textbooks, use the following settings:

  ```python
  self.model_version = "dalle-3"
  self.image_type = "normal"
  ```

- Choose a `<CATEGORY>` to evaluate in `run.py` and execute the following command:

  ```bash
  python run.py --category <CATEGORY> --agent_name EvaluationAgent
  ```
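Conceptually, the score reported by `EvaluationAgent.py` boils down to QA accuracy: a TTI model scores well if the GPT-4o evaluator can answer the curated questions correctly from its generated images. The sketch below only illustrates this idea; the field names and the `answer_question` helper are assumptions, not the repository's actual interface.

```python
# Conceptual sketch of I-HallA scoring: the score of a TTI model is the fraction of
# curated questions answered correctly from its generated images.
# `qa_pairs` and `answer_question` are hypothetical placeholders, not the repo's API.
from typing import Callable, Dict, List

def i_halla_score(qa_pairs: List[Dict[str, str]],
                  answer_question: Callable[[str, str], str]) -> float:
    """qa_pairs: [{"image_path": ..., "question": ..., "answer": ...}, ...]
    answer_question(image_path, question) -> the evaluator's predicted answer."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_question(qa["image_path"], qa["question"]).strip().lower()
        == qa["answer"].strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```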
In addition to the five TTI generation models we evaluated, our benchmark also supports the evaluation of other TTI models.
To do this, follow these steps:
- Load the `captions.xlsx` file and input the prompts for each domain into your TTI model. Use the outputs to build the data structure described above (see the sketch after this list).
- Follow the instructions in the Quick Start guide and update the TTI model name to the appropriate one for your evaluation.
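For example, generating images with a Hugging Face diffusion model and saving them into the expected layout might look like the following sketch. The spreadsheet column names (`domain`, `caption`) and the file-naming convention are assumptions; adapt them to the actual contents of `captions.xlsx` and to your own model.

```python
# Illustrative sketch: generate images for the benchmark prompts with a custom TTI model
# and save them into the directory layout expected by the evaluation code.
# Column names and the file-naming convention are assumptions; check captions.xlsx.
from pathlib import Path

import pandas as pd
from diffusers import StableDiffusionPipeline

MODEL_NAME = "my-tti-model"  # folder name to use under data/models/
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

captions = pd.read_excel("data/captions.xlsx")
for idx, row in captions.iterrows():
    domain = row["domain"]   # e.g., "science" or "history" (assumed column)
    prompt = row["caption"]  # the prompt text (assumed column)
    out_dir = Path("data/models") / MODEL_NAME / domain / "weird"
    out_dir.mkdir(parents=True, exist_ok=True)
    image = pipe(prompt).images[0]
    image.save(out_dir / f"{idx}.png")
```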
Our method uses a multi-agent design to create a benchmark for Image Hallucination evaluation.
Set up and run the agents using the following command:
```bash
python run.py --category <CATEGORY> --agent_name <AGENT>
```
- `ImageAgent.py`: The VLM determines whether an input image is "normal" or "weird", with reasoning.
- `ReasoningAgent.py`: The VLM provides the reasoning behind its "normal" or "weird" determination for the input image.
- `CategoryAgent.py`: The VLM determines the category for the given captions.
- `CoIAgent.py`: The VLM identifies the Composition of Interest (CoI) for the given captions.
The data extracted by these agents is aggregated and then used by the following agents:
- `QAAgent.py`: Generates the question-answer (QA) sets.
- `EvaluationAgent.py`: Performs the evaluation process.
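Each agent ultimately sends an image plus an instruction to the GPT-4o chat completions endpoint, using a payload shaped like the one shown in the Quick Start. The standalone sketch below only illustrates that kind of call; the prompt text, file path, and model string are illustrative, not the agents' actual prompts.

```python
# Minimal sketch of the kind of GPT-4o call the agents make: send one image plus an
# instruction to the chat completions endpoint. Prompt text and paths are illustrative.
import base64

import requests

YOUR_API_KEY = "[YOUR_API_KEY]"

def ask_gpt4o(image_path: str, instruction: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
    headers = {"Authorization": f"Bearer {YOUR_API_KEY}",
               "Content-Type": "application/json"}
    response = requests.post("https://api.openai.com/v1/chat/completions",
                             headers=headers, json=payload)
    return response.json()["choices"][0]["message"]["content"]

# e.g., an ImageAgent-style query (path and wording are illustrative):
# print(ask_gpt4o("data/models/dalle-3/science/weird/0.png",
#                 'Is this image "normal" or "weird"? Explain your reasoning.'))
```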
We provide an easy interface to perform VQA (Visual Question Answering) inference.
The supported image-to-text models include:
- `blip2-opt-2.7b`
- `instructblip-vicuna-7b`
- `llava-v1.6-34b-hf`
You can also add your own VQA model by creating a file at `vlms/<YOUR_MODEL>.py` (a sketch is shown below).
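The sketch below shows what such a file could look like, using BLIP-2 from Hugging Face `transformers` as a stand-in. The QA-loading step and column names are assumptions; mirror the existing scripts in `vlms/` for the actual benchmark format.

```python
# Illustrative sketch of a custom VQA script (e.g., vlms/my_model.py) based on BLIP-2.
# The QA loading step and column names are placeholders; mirror the existing vlms/ scripts.
import pandas as pd
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def answer(image_path: str, question: str) -> str:
    """Answer one question about one image with BLIP-2."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=f"Question: {question} Answer:",
                       return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

if __name__ == "__main__":
    qa = pd.read_excel("data/LV_reasoning_science.xlsx")  # column names below are assumptions
    for _, row in qa.iterrows():
        print(answer(row["image_path"], row["question"]))
```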
- Follow the Quick Start guide for data preparation and proper directory settings in the Python files.

- Run the following command to execute inference:

  ```bash
  python vlms/<VQA_MODEL>.py
  ```

  Replace `<VQA_MODEL>` with the name of your desired VQA model, such as `blip2`, `instructblip`, or `llava2`.
If you find our work helpful, please cite us:
@inproceedings{ihalla,
  author    = {Youngsun Lim and
               Hojun Choi and
               Hyunjung Shim},
  title     = {Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering},
  booktitle = {AAAI},
  year      = {2025},
}
We sincerely thank the developers and contributors of the tools and models we build on, including GPT-4o, for their invaluable contributions.