Implementation for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models", including datasets, prompts, and model outputs.
To use the benchmark, please refer to the Hugging Face Dataset, which hosts the data source and the evaluation script.
DebugBench is a Large Language Model (LLM) debugging benchmark introduced in the paper "DebugBench: Evaluating Debugging Capability of Large Language Models" [url]. We collect code snippets from the LeetCode community and implant bugs into source data with GPT-4.
- It consists of 4,253 instances.
- It covers four major bug categories and 18 minor types.
- It includes C++, Java, and Python instances.
- It contains three difficulty levels: easy, medium, and hard.
- All the instances were released after June 2022.
- Please refer to the article [url] for more details.
This repository contains the implementation for benchmark construction and evaluation.
- benchmark directory contains the 51 JSON shards of the benchmark, split by language and bug type.
- dataset_construction directory contains the implementation for bug implantation to solution code via LLMs.
- evaluation directory contains the implementation for evaluating the debugging capabilities of LLMs with API.
- evalution_result directory contains the model output of gpt-4-0613, gpt-3.5-turbo-0613 and CodeLlama-34b-instruct under different scenarios.
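As a quick sanity check on a local copy of the shards, the sketch below tallies instances per group. It is a minimal illustration, not part of the repository: the function name tally_shards is hypothetical, and it assumes that each shard is a JSON array of objects and that the objects carry fields named language and category. Adjust the keys to match the actual schema.

```python
import json
from collections import Counter
from pathlib import Path


def tally_shards(shard_dir, keys=("language", "category")):
    """Count instances across all *.json shards in shard_dir, grouped by keys.

    Assumptions (not confirmed by the repository): every shard is a JSON
    array of objects, and the field names in `keys` exist in those objects.
    Missing fields are counted under "unknown".
    """
    counts = Counter()
    for path in sorted(Path(shard_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records = json.load(f)
        for record in records:
            group = tuple(record.get(k, "unknown") for k in keys)
            counts[group] += 1
    return counts
```

Calling tally_shards("benchmark") from the repository root and summing the returned counts gives a figure that can be compared against the instance count stated in the overview, provided the assumptions above hold.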
More elements will be added to the repository soon.
Please cite the paper and star the repo if you use DebugBench and find it helpful.
Feel free to contact trc20@mails.tsinghua.edu.cn or open an issue if you have any questions.
@misc{tian2024debugbench,
title={DebugBench: Evaluating Debugging Capability of Large Language Models},
author={Runchu Tian and Yining Ye and Yujia Qin and Xin Cong and Yankai Lin and Zhiyuan Liu and Maosong Sun},
year={2024},
eprint={2401.04621},
archivePrefix={arXiv},
primaryClass={cs.SE}
}