Introducing SWE-bench Verified | OpenAI #933
Labels
AI-Agents
Autonomous AI agents using LLMs
code-generation
code generation models and tools like Copilot and Aider
data-validation
Validating data structures and formats
github
gh tools like cli, Actions, Issues, Pages
llm-benchmarks
testing and benchmarking large language models
openai
OpenAI APIs, LLMs, Recipes and Evals
python
Python code, tools, info
software-engineering
Best practices for software engineering
Introducing SWE-bench Verified | OpenAI
Snippet
"Introducing SWE-bench Verified
We're releasing a human-validated subset of SWE-bench that more reliably evaluates AI models' ability to solve real-world software issues.
Download SWE-bench Verified"
Background on SWE-bench
Each sample in the SWE-bench test set is created from a resolved GitHub issue in one of 12 open-source Python repositories on GitHub. Each sample has an associated pull request (PR), which includes both the solution code and unit tests to verify code correctness. These unit tests fail before the solution code in the PR is added, but pass afterwards, and are therefore called FAIL_TO_PASS tests. Each sample also has associated PASS_TO_PASS tests, which pass both before and after the PR is merged, and are used to check that existing unrelated functionality in the codebase has not been broken by the PR.
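As a concrete illustration, the public SWE-bench test set can be inspected with the Hugging Face datasets library. The dataset name and field names below follow the public release, but treat them as assumptions and check the dataset card for the exact schema.

```python
# Minimal sketch of inspecting one SWE-bench sample (assumes the `datasets`
# library and the public princeton-nlp/SWE-bench dataset; field names may
# differ between releases -- check the dataset card).
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
sample = ds[0]

print(sample["repo"])               # one of the 12 open-source Python repositories
print(sample["problem_statement"])  # original GitHub issue text shown to the agent
print(sample["patch"])              # gold solution code from the merged PR (hidden from the agent)
print(sample["test_patch"])         # unit tests added by the PR (also hidden)
print(sample["FAIL_TO_PASS"])       # tests that fail before the PR and pass after
print(sample["PASS_TO_PASS"])       # tests that pass both before and after the PR
```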
For each sample in SWE-bench, agents are provided with the original text from the GitHub issue, known as the problem statement, and are given access to the codebase. Given these, agents must edit the files in the codebase to resolve the issue. The tests are not shown to the agent.
A proposed edit is evaluated by running both the FAIL_TO_PASS and PASS_TO_PASS tests. If the FAIL_TO_PASS tests pass, this means the edit solves the issue. If the PASS_TO_PASS tests pass, then the edit has not inadvertently broken unrelated sections of the codebase. Both sets of tests are required to pass for the edit to fully resolve the original GitHub issue.
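In other words, resolution is a conjunction over the two test sets. The helper below is a hypothetical sketch of that criterion, not the official harness, which actually applies the patch and runs the tests inside per-sample environments.

```python
# Hypothetical sketch of the resolution criterion (not the official harness):
# an edit resolves an issue only if every FAIL_TO_PASS test now passes
# (the issue is fixed) and every PASS_TO_PASS test still passes (nothing broke).
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    nothing_broken = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and nothing_broken
```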
Adapting SWE-bench as a Preparedness Evaluation
Given the potential relevance of SWE-bench for the Preparedness Framework, we aimed to improve the robustness and reliability of the benchmark. We identified three major areas for improvement: unit tests that are sometimes overly specific and can unfairly reject valid solutions, problem statements that are underspecified and leave the desired behavior ambiguous, and development environments that are difficult to set up reliably for evaluation.
SWE-bench Verified
To address these issues, we launched a human annotation campaign with professional software developers to screen each sample of the SWE-bench test set for appropriately scoped unit tests and well-specified issue descriptions.
Together with the authors of SWE-bench, we are releasing SWE-bench Verified: a subset of the original test set from SWE-bench, consisting of 500 samples verified to be non-problematic by our human annotators. This version supersedes the original SWE-bench and SWE-bench Lite test sets. Additionally, we are releasing our human annotations for all SWE-bench test samples.
We also collaborated with the SWE-bench authors to develop a new evaluation harness for SWE-bench, which uses containerized Docker environments to make evaluating on SWE-bench easier and more reliable.
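For reference, the containerized harness is driven from a predictions file. The module path and flag names below are assumptions based on the open-source swebench package and may differ across harness versions.

```python
# Assumed invocation of the open-source swebench Docker-based harness;
# module path and flag names are taken from the public package and should be
# verified against the harness README for your version.
import subprocess

subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Verified",
        "--predictions_path", "predictions.jsonl",  # model-generated patches
        "--max_workers", "4",                       # parallel Docker builds/runs
        "--run_id", "verified-eval",                # label for this evaluation run
    ],
    check=True,
)
```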
On SWE-bench Verified, GPT-4o resolves 33.2% of samples, with the best performing open-source scaffold, Agentless, doubling its previous score of 16% on SWE-bench.
Our Approach
We worked with 93 software developers experienced in Python to manually screen SWE-bench samples for quality. We annotated 1,699 randomly selected samples from the SWE-bench test set to produce SWE-bench Verified, and the following analysis is based on these 1,699 annotated samples.
We annotate each sample to capture two things: whether the problem statement is well-specified enough for the issue to be understood and solved, and whether the FAIL_TO_PASS unit tests are appropriately scoped, so that they do not unfairly reject otherwise valid solutions.
Additionally, we rate the difficulty of each sample by having annotators estimate how long it would take for a developer to decide upon and implement the solution, assuming the sample is non-problematic. Finally, we provide a freeform input option to flag any other major issues with the sample.
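A hypothetical annotation record covering these criteria might look like the following; the field names and rating scales are illustrative, not the schema of the released annotations.

```python
# Illustrative-only annotation record; the released annotations may use a
# different schema. Severity ratings and difficulty buckets mirror the
# criteria described above.
from dataclasses import dataclass

@dataclass
class SampleAnnotation:
    instance_id: str            # SWE-bench sample being screened
    issue_underspecified: int   # severity rating: how underspecified the problem statement is
    tests_too_strict: int       # severity rating: risk that valid solutions fail the unit tests
    estimated_fix_time: str     # e.g. "<15 min", "15 min - 1 hour", "1-4 hours", ">4 hours"
    other_issues: str = ""      # freeform notes on any other major problems
```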
Annotation Results
We see that 38.3% of samples were flagged for underspecified problem statements, and 61.1% were flagged for unit tests that may unfairly mark valid solutions as incorrect. Overall, our annotation process resulted in 68.3% of SWE-bench samples being filtered out due to underspecification, unfair unit tests, or other issues.
Performance on SWE-bench Verified
We found that GPT-4o's performance on the best-performing scaffold reaches 33.2% on SWE-bench Verified, more than doubling its score of 16% on the original SWE-bench. In general, this validates our initial suspicion that the original SWE-bench dataset underestimates agent abilities.
Discussion & Limitations
We believe in an empirical and scientific approach to tracking and protecting against catastrophic risk. Building and continually improving evaluations is a key element of this work. There remains much to be done, and we're eager to see more work from the community in contributing valuable benchmarks like SWE-bench.
Data downloads
Authors
Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, Aleksander Madry
Acknowledgements
We're grateful to Carlos Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan for developing the original SWE-bench benchmark; the Preparedness team for supporting this work; Tao Lin, who initially pointed out many of these issues; Ian Kivlichan and Sarah Schwettmann for feedback on an earlier version of this manuscript; and the many human annotators who helped create SWE-bench Verified.
Suggested labels
None