
shitsukan-eval

Evaluating Model (LLM / LVLM) Alignment with Human Perception

📄 Paper (Coming Soon)   |   🚀 Project Page   |   🤗 Dataset


We evaluate the alignment of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) with human perception, focusing on the Japanese concept of shitsukan.
Shitsukan refers to the sensory experience evoked when perceiving an object, a concept that is inherently vague and highly subjective.
We created a new dataset of shitsukan terms recalled by individuals in response to images of specified objects. We also designed benchmark tasks to evaluate the shitsukan recognition capabilities of LLMs and LVLMs.


This library is experimental and under active development. Breaking changes may be introduced in future releases to improve its usability and performance.

Table of Contents

  • Supported Models
  • Usage
  • Citation
  • Acknowledgement

Supported Models

The currently supported API LLMs/LVLMs are as follows:
  • GPT-4o
  • (🚧 Here: Add description for this repo 🚧)

The currently supported Huggingface LLMs are as follows:
  • Llama 2
  • ELYZA-japanese-Llama-2
  • Qwen 2
  • Llama 3
  • Llama-3-Swallow
  • Gemma 2
  • Qwen 2.5
  • LLM-jp-3
  • (🚧 Here: Add description for this repo 🚧)

The currently supported Huggingface LVLMs are as follows:
  • LLaVA-1.5
  • Idefics2
  • LLaVA-NeXT (LLaVA-1.6)
  • Llama-3.2-Vision
  • LLaVA-OneVision
  • Idefics3
  • Qwen2-VL
  • Molmo
  • (🚧 Here: Add description for this repo 🚧)

The currently supported vLLM LLMs are as follows:
  • (🚧 Here: Add description for this repo 🚧)

The currently supported vLLM LVLMs are as follows:
  • Qwen2-VL
  • (🚧 Here: Add description for this repo 🚧)

Usage

    1. Build Environment

    cd $HOME
    git clone git@github.com:<ANONYMOUS>/shitsukan-eval
    cd $HOME/shitsukan-eval
    uv python install 3.11
    uv python pin 3.11
    uv sync --no-dev
    uv sync --dev --no-build-isolation
    
    # for developers
    # uv run pre-commit install
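
    To confirm that the environment uses the pinned interpreter, an optional quick check (this only verifies the Python version, nothing project-specific):

    # Should report Python 3.11.x from the project environment
    uv run python --version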

    2. Data Preparation

    # Prepare COCO 2017 images
    mkdir -p $HOME/data/images
    cd $HOME/data/images
    wget http://images.cocodataset.org/zips/train2017.zip
    wget http://images.cocodataset.org/zips/val2017.zip
    unzip train2017.zip
    unzip val2017.zip
    
    # Prepare our Shitsukan datasets
    mkdir -p  $HOME/shitsukan-eval/data
    cd $HOME/shitsukan-eval/data
    git lfs install
    git clone https://huggingface.co/datasets/<ANONYMOUS>/Shitsukan
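
    The --image-dir argument described under Run Evaluation resolves the COCO images at <image-dir>/images/coco2017/train2017 and <image-dir>/images/coco2017/val2017. A minimal sketch for arranging the folders downloaded above into that layout, assuming you will pass --image-dir "$HOME/data" (adjust paths to your own setup):

    # Move the unzipped COCO folders under an images/coco2017/ subdirectory
    mkdir -p $HOME/data/images/coco2017
    mv $HOME/data/images/train2017 $HOME/data/images/coco2017/
    mv $HOME/data/images/val2017 $HOME/data/images/coco2017/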

    3. Run Evaluation

    The following command evaluates the specified model on the specified task in shitsukan-eval.

    export CUDA_VISIBLE_DEVICES=0
    uv run python -m shitsukan_eval \
        --model "<model_name_or_path>" \
        --model-type "<model_type>" \
        --tasks "<task_name>" \
        --sub-tasks "<sub-task_name>" \
        --lang "<lang>" \
        --image-dir "<base-image_path>" \
        --save-dir outputs \
        --verbose
    Explanation of the available arguments (a complete example invocation follows the list):
    • --model (str): The name or path of the model to evaluate. (e.g., "Qwen/Qwen2-VL-7B-Instruct")
    • --model-type (str): The backend used to run the specified model.
      • Model types that can be specified: "api", "hf", "vllm"
    • --tasks (str): The task name to evaluate.
      • Tasks that can be specified: "perception", "commonsense", "taxonomic"
    • --sub-tasks (List[str]): List of sub-tasks to run within the specified task.
      • With --tasks "perception" --lang "ja", the sub-tasks that can be specified are: "generation", "selection"
      • With --tasks "perception" --lang "en", the sub-tasks that can be specified are: "selection"
      • With --tasks "commonsense" --lang "ja", the sub-tasks that can be specified are: "generation", "classification"
      • With --tasks "commonsense" --lang "en", no sub-tasks are available
      • With --tasks "taxonomic" --lang "ja", the sub-tasks that can be specified are: "a_b_classification", "yes_no_classification", "multiple_choice_classification"
      • With --tasks "taxonomic" --lang "en", no sub-tasks are available
    • --lang (str): Language to use for the evaluation (default: "ja").
      • Language that can be specified: "ja", "en"
    • --image-dir (Optional[str]): Directory where input images are stored (optional).
      • If you specify --image-dir="data", the evaluation script looks for the COCO 2017 images at data/images/coco2017/train2017/*.png and data/images/coco2017/val2017/*.png during execution. If you have not yet prepared the COCO 2017 images, download them in advance as described in the Data Preparation step above.
    • --save-dir (str): Directory where evaluation results will be saved.
    • -v, --verbose (bool): If set, print detailed information during processing.
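
    For example, a complete invocation that evaluates Qwen2-VL (loaded from Huggingface) on the Japanese perception generation sub-task, assuming the COCO images and the Shitsukan dataset have been prepared as described above:

    export CUDA_VISIBLE_DEVICES=0
    uv run python -m shitsukan_eval \
        --model "Qwen/Qwen2-VL-7B-Instruct" \
        --model-type "hf" \
        --tasks "perception" \
        --sub-tasks "generation" \
        --lang "ja" \
        --image-dir "data" \
        --save-dir outputs \
        --verbose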

    [!NOTE] The configuration files for each task are located at shitsukan_eval/tasks/{task}/{sub_task}/{task}_{sub_task}_{lang}.yaml.
    If you want to modify the settings, edit the corresponding YAML file.
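
    For example, following that template, the Japanese perception generation sub-task would be configured by:

    shitsukan_eval/tasks/perception/generation/perception_generation_ja.yaml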

    Citation

    @inproceedings{shiono-etal-2025-evaluating,
        title = "Evaluating Model Alignment with Human Perception: A Study on Shitsukan in {LLM}s and {LVLM}s",
        author = "Shiono, Daiki  and
          Brassard, Ana  and
          Ishizuki, Yukiko  and
          Suzuki, Jun",
        editor = "Rambow, Owen  and
          Wanner, Leo  and
          Apidianaki, Marianna  and
          Al-Khalifa, Hend  and
          Eugenio, Barbara Di  and
          Schockaert, Steven",
        booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
        month = jan,
        year = "2025",
        address = "Abu Dhabi, UAE",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2025.coling-main.757/",
        pages = "11428--11444",
    }

    Acknowledgement

    (🚧 Here: Add description for this repo 🚧)
