Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering

Youngsun Lim *, Hojun Choi *, Hyunjung Shim
(* indicates equal contributions)
Graduate School of Artificial Intelligence, KAIST, Republic of Korea
{youngsun_ai, hchoi256, kateshim}@kaist.ac.kr
This is the official implementation of I-HallA v1.0.




I-HallA-v1.0

Despite the huge success of text-to-image (TTI) generation models, existing studies seldom consider whether generated images accurately represent factual information. In this paper, we define the problem of image hallucination as the failure of generated images to accurately depict factual information. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), an automatic evaluation metric that measures the factuality of generated images through visual question answering (VQA), and I-HallA v1.0, a curated benchmark dataset. We develop a three-stage pipeline that generates curated question-answer pairs using multiple GPT-4 Omni-based agents with human judgments. Our evaluation protocols measure image hallucination by testing whether images from existing text-to-image models can correctly answer these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across 9 categories with varying levels of difficulty and 1,000 questions covering 9 compositions. We evaluate 5 different text-to-image models using I-HallA and demonstrate that these state-of-the-art models often fail to accurately convey factual information. Additionally, we establish the validity of our evaluation method through human evaluation, yielding a Spearman's correlation of 0.95. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate text-to-image generation models.


Updates

  • [✅] [2024.12.16] 👨‍💻 The official code has been released!
  • [✅] [2024.12.10] 🎉 Our paper has been accepted to AAAI 2025!
  • [✅] [2024.09.19] 📄 Our paper is now available! You can find the paper here.



Installation

Clone the repo and install the dependencies:

git clone https://github.com/hchoi256/I-HallA-v1.0.git
cd I-HallA-v1.0
pip install -r requirements.txt

Data Preparation (We have included links to each of our benchmark datasets.)

Then, download the datasets and put them under data/. The data structure looks like:

data/
├── LV_reasoning_history.xlsx
├── LV_reasoning_science.xlsx
├── GPT4o_QA_science_mod_cois.xlsx
├── GPT4o_QA_history_mod_cois.xlsx
├── captions.xlsx
├── models
│   ├── dalle-3
│   │   ├── history
│   │   │   ├── normal
│   │   │   ├── weird
│   │   ├── science
│   │   │   ├── normal
│   │   │   ├── weird
│   ├── sd-v1-4
│   │   ├── history
│   │   │   ├── weird
│   │   ├── science
│   │   │   ├── weird
│   ├── sd-v1-5
│   │   ├── history
│   │   │   ├── weird
│   │   ├── science
│   │   │   ├── weird
│   ├── sd-v2-0
│   │   ├── history
│   │   │   ├── weird
│   │   ├── science
│   │   │   ├── weird
│   ├── sd-xl
│   │   ├── history
│   │   │   ├── weird
│   │   ├── science
│   │   │   ├── weird
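
Before running anything, you can optionally sanity-check that the layout above is in place. The following snippet is a minimal sketch and not part of the repository; the file and folder names are taken directly from the tree above:

# check_data_layout.py -- optional helper, not part of this repository
from pathlib import Path

DATA = Path("data")
EXPECTED_FILES = [
    "LV_reasoning_history.xlsx",
    "LV_reasoning_science.xlsx",
    "GPT4o_QA_science_mod_cois.xlsx",
    "GPT4o_QA_history_mod_cois.xlsx",
    "captions.xlsx",
]
MODELS = ["dalle-3", "sd-v1-4", "sd-v1-5", "sd-v2-0", "sd-xl"]
DOMAINS = ["history", "science"]

missing = [str(DATA / f) for f in EXPECTED_FILES if not (DATA / f).exists()]
for model in MODELS:
    for domain in DOMAINS:
        # Only dalle-3 ships both "normal" (textbook) and "weird" (generated) splits.
        for split in (["normal", "weird"] if model == "dalle-3" else ["weird"]):
            folder = DATA / "models" / model / domain / split
            if not folder.is_dir():
                missing.append(str(folder))

print("Benchmark layout looks complete." if not missing else f"Missing entries: {missing}")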

Quick Start

The following is a quick start guide for evaluating the Image Hallucination scores of five different TTI generation models. Our method requires the GPT-4o API for evaluation.

Steps:

  1. Enter your API_KEY in run.py:

    YOUR_API_KEY = "[YOUR_API_KEY]"
  2. Specify the type of TTI generation model to evaluate in EvaluationAgent.py:

    payload = {
        "model": "[YOUR_MODEL]",  # GPT model used for the VQA-based evaluation (e.g., "gpt-4o")
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": ""
                    },
                ],
            }
        ],
    }
    self.model_version = "YOUR_TTI_MODEL"  # Options: "dalle-3", "sd-v1-4", "sd-v1-5", "sd-v2-0", "sd-xl"
    self.image_type = "weird"  # "weird" evaluates generated images; "normal" evaluates textbook images

    To evaluate factual images collected from textbooks, use the following settings:

    self.model_version = "dalle-3"
    self.image_type = "normal"
  3. Choose a <CATEGORY> to evaluate in run.py and execute the following command:

    python run.py --category <CATEGORY> --agent_name EvaluationAgent
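
For example, assuming the two domains present under data/ (science and history) are the valid category names:

python run.py --category science --agent_name EvaluationAgent
python run.py --category history --agent_name EvaluationAgent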

Evaluating Other TTI Models

In addition to the five TTI generation models we evaluated, our benchmark also supports the evaluation of other TTI models.
To do this, follow these steps:

  1. Load the captions.xlsx file and feed the prompts for each domain to your TTI model.
    Save the outputs following the data structure described in the Data Preparation section (a generation sketch follows this list).

  2. Follow the Quick Start guide and set self.model_version in EvaluationAgent.py to the folder name you used for your model.
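
The following is a minimal sketch of step 1 for a diffusers-based model. The column names ("domain", "id", "caption") and the output file naming are assumptions for illustration; adjust them to match the actual captions.xlsx and your pipeline:

# generate_images.py -- illustrative sketch; column names and file naming are assumptions
from pathlib import Path
import pandas as pd
import torch
from diffusers import StableDiffusionPipeline

MODEL_NAME = "my-tti-model"  # folder name to use under data/models/
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

df = pd.read_excel("data/captions.xlsx")  # assumed columns: "domain", "id", "caption"
for _, row in df.iterrows():
    out_dir = Path("data/models") / MODEL_NAME / row["domain"] / "weird"
    out_dir.mkdir(parents=True, exist_ok=True)
    image = pipe(row["caption"]).images[0]
    image.save(out_dir / f"{row['id']}.png")

Once the images are in place, set self.model_version to "my-tti-model" (or whatever folder name you chose) and run the EvaluationAgent as in the Quick Start.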


I-HallA v1.0 Benchmark

Our method uses a multi-agent design to create a benchmark for Image Hallucination evaluation.
Set up and run each agent using the following command (an example sequence covering all three stages is shown after the stage descriptions):

python run.py --category <CATEGORY> --agent_name <AGENT>

1. Data Reasoning Extraction Stage:

  • ImageAgent.py: The VLM determines whether an input image is "normal" or "weird" with reasoning.
    • ReasoningAgent.py: The VLM provides the reasoning behind its "normal" or "weird" determination for the input image.
  • CategoryAgent.py: The VLM determines the category for the given captions.
  • CoIAgent.py: The VLM identifies the Composition of Interest for the given captions.

The data extracted by these agents is aggregated and integrated into the benchmark spreadsheets under data/ described in the Data Preparation section.

2. QA Sets Generation Stage:

  • QAAgent.py: Generates the Question-Answer (QA) sets.

3. Evaluation Stage:

  • EvaluationAgent.py: Performs the evaluation process.
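
For example, to build and evaluate the science split end-to-end, the agents can be invoked in stage order. The agent names mirror the Python file names (without .py), and the exact ordering below is our reading of the stage descriptions rather than a prescribed script:

python run.py --category science --agent_name ImageAgent
python run.py --category science --agent_name ReasoningAgent
python run.py --category science --agent_name CategoryAgent
python run.py --category science --agent_name CoIAgent
python run.py --category science --agent_name QAAgent
python run.py --category science --agent_name EvaluationAgent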

VQA Modules

We provide an easy interface to perform VQA (Visual Question Answering) inference.
The supported image-to-text models include:

  • blip2-opt-2.7b
  • instructblip-vicuna-7b
  • llava-v1.6-34b-hf

You can also add your own VQA model by creating a file at vlms/<YOUR_MODEL>.py (a minimal sketch is provided at the end of this section).

Steps to Perform Inference:

  1. Follow the Quick Start guide for data preparation and proper directory settings in the Python files.

  2. Run the following command to execute inference:

    python vlms/<VQA_MODEL>.py

Replace <VQA_MODEL> with the name of your desired VQA model, such as blip2, instructblip, or llava2.
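
If you want to plug in your own model, the sketch below shows one possible shape for vlms/<YOUR_MODEL>.py, using the blip2-opt-2.7b checkpoint listed above via Hugging Face transformers. The question loading and scoring logic of this repository is omitted, so treat it as a starting point rather than the project's actual interface:

# vlms/my_vqa_model.py -- illustrative sketch, not the repository's actual interface
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def answer(image_path: str, question: str) -> str:
    """Answer a single benchmark question about one image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=f"Question: {question} Answer:",
                       return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=20)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

if __name__ == "__main__":
    # Hypothetical image path and question, for illustration only.
    print(answer("data/models/sd-xl/science/weird/example.png",
                 "Does the image show the Earth orbiting the Sun?"))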


Citation

If you find our work helpful, please cite us:

@inproceedings{ihalla,
  author    = {Youngsun Lim and
               Hojun Choi and
               Hyunjung Shim},
  title     = {Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering},
  booktitle = {AAAI},
  year      = {2025},
}

Acknowledgement

We would like to sincerely thank all contributors, including those behind GPT-4o, for their invaluable contributions.