JMultiWOZ: Japanese Multi-Domain Wizard-of-Oz Dataset

This repo contains JMultiWOZ, a Japanese Multi-Domain Wizard-of-Oz Dataset, and code for benchmarking task-oriented dialogue models on the dataset. More details about the dataset can be found in our LREC-COLING 2024 paper:

JMultiWOZ: A Large-Scale Japanese Multi-Domain Task-Oriented Dialogue Dataset [Paper]

Requirements

Python 3.9+

pip install -r requirements.txt

Dataset

Prepare the dataset by unzipping the dataset file JMultiWOZ_1.0.zip in the dataset directory:

cd dataset
unzip JMultiWOZ_1.0.zip

Detailed information about the dataset can be found in dataset/README.md.

We also provide the dataset on Hugging Face 🤗 Datasets Hub. See the dataset card for more details.

Dialogue Models

We provide the implementation of the following models as used in our paper:

Model	README link	Description
T5	tod_models/t5/README.md	T5 model fine-tuned on JMultiWOZ
LLM	tod_models/llm/README.md	Zero-shot and few-shot learning with OpenAI LLM API

Note that to use the models with OpenAI API, you need to set an environment variable OPENAI_API_KEY before running the scripts as described in the following sections.

Automatic Evaluation

1. Run inference to generate responses

Use inference.py to generate responses for the test set. Below is an example command using LLM (OpenAI's gpt-4) in a few-shot setting. See more options by running python inference.py --help.

Note

Exact command we used can be found in run_scripts/run_inference_*.sh.

export OPENAI_API_KEY="<YOUR_API_KEY>"

python inference.py \
    --tod_model_type "openai-fs" \
    --model_name_or_path "gpt-4" \
    --max_output_length 256 \
    --output_dir "output/gpt4-fs-olen256" \
    --task_name "e2e" \
    --resume_last_run \
    --world_size 1

Generated dialogue states and responses will be saved in output/gpt4-fs-olen256/e2e.inference.json.

2. Evaluate the generated responses

Use evaluate.py to evaluate the generated responses. Below is an example command evaluating the output of the above inference. See more options by running python evaluate.py --help.

Note

Exact command we used can be found in run_scripts/run_evaluate.sh.

python evaluate.py \
    --dataset_dpath "dataset/JMultiWOZ_1.0" \
    --inference_output_dpath "output/gpt4-fs-olen256" \
    --task_name "e2e"

Resulted scores will be saved in output/gpt4-fs-olen256/e2e.scores_summary.json.

End-to-End generation results reported in our paper

Model	JGA	Slot F1	BLEU
T5-base	0.59	0.95	39.7
T5-large	0.77	0.98	46.4
GPT-3.5 zero-shot	0.16	0.70	5.40
GPT-3.5 few-shot	0.25	0.82	12.18
GPT-4 zero-shot	0.31	0.86	8.72
GPT-4 few-shot	0.36	0.89	14.94

Human Evaluation

We provide a web UI for human evaluation. The web UI allows you to interact with the models and judge the dialogue quality.

1. Run evaluation web UI

export OPENAI_API_KEY="<YOUR_API_KEY>" # Required if you use models with OpenAI API

python human_eval.py \
    --tod_model_names gpt4-fs gpt3.5-fs \
    --threading_httpd

To interact with specific models and user goals setting, you can use --task_ids_fpath option, which is a path to a JSON file containing a list of task IDs. See task_ids.json for example.

2. Open the web page in your browser

For local access: http://localhost:8080/dialogue
For remote access: http://your-server-ip/dialogue

If you use --task_ids_fpath option, you can specify the task ID by adding ?task_id=<task_id> to the URL (e.g., http://localhost:8080/dialogue?task_id=task_B).

Results reported in our paper

Model	# Participants	Success	Turn	Satisfaction
T5-base	38	55.26	11.21	3.05
T5-large	40	65.00	10.93	3.55
GPT-3.5 few-shot	41	24.39	10.37	2.15
GPT-4 few-shot	42	57.14	9.64	3.33

Citation

@inproceedings{ohashi-lrec2024-jmultiwoz,
    title = "JMultiWOZ: A Large-Scale Japanese Multi-Domain Task-Oriented Dialogue Dataset",
    author = "Ohashi, Atsumoto and Hirai, Ryu and Iizuka, Shinya and Higashinaka, Ryuichiro",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation",
    year = "2024",
    url = "",
    pages = "",
}

@inproceedings{ohashi-nlp2023-jmultiwoz,
    title = "JMultiWOZ: 日本語タスク指向型対話データセットの構築",
    author = "大橋, 厚元 and 平井, 龍 and 飯塚, 慎也 and 東中, 竜一郎",
    booktitle = "言語処理学会第29回年次大会 ",
    year = "2023",
    url = "https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/Q12-1.pdf",
    pages = "3093--3098",
}

@inproceedings{ohashi-nlp2024-jmultiwoz,
    title = "JMultiWOZ に対する対話状態アノテーションの付与と対話システムの実装評価",
    author = "大橋, 厚元 and 平井, 龍 and 飯塚, 慎也 and 東中, 竜一郎",
    booktitle = "言語処理学会第30回年次大会 ",
    year = "2024",
    url = "https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/B10-5.pdf",
    pages = "2744--2749",
}

Aknowledgement

This work was supported by JST Moonshot R&D Grant number JPMJMS2011.

License

The JMultiWOZ dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Models trained using the dataset are not considered copies or direct derivatives of the dataset itself, and thus are not required to comply with the terms of this license.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JMultiWOZ: Japanese Multi-Domain Wizard-of-Oz Dataset

Requirements

Dataset

Dialogue Models

Automatic Evaluation

1. Run inference to generate responses

2. Evaluate the generated responses

End-to-End generation results reported in our paper

Human Evaluation

1. Run evaluation web UI

2. Open the web page in your browser

Results reported in our paper

Citation

Aknowledgement

License

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
assets		assets
dataset		dataset
run_scripts		run_scripts
tod_models		tod_models
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
evaluate.py		evaluate.py
human_eval.py		human_eval.py
inference.py		inference.py
requirements.txt		requirements.txt
task_ids.json		task_ids.json

License

nu-dialogue/jmultiwoz

Folders and files

Latest commit

History

Repository files navigation

JMultiWOZ: Japanese Multi-Domain Wizard-of-Oz Dataset

Requirements

Dataset

Dialogue Models

Automatic Evaluation

1. Run inference to generate responses

2. Evaluate the generated responses

End-to-End generation results reported in our paper

Human Evaluation

1. Run evaluation web UI

2. Open the web page in your browser

Results reported in our paper

Citation

Aknowledgement

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages